Causal Inference: The Mixtape

Figure 11. They are subclassification, exact matching, and approximate matching.1 Subclassification is a method of satisfying the backdoor criterion by weighting differences in means by strata-specific weights. The expression of this normalized estimator is shown here: most software packages have programs that will estimate the sample analog of these inverse probability weighted parameters using the second method with normalized weights. Then E(Yi | Xi, Di = 0) can be estimated by matching. Now let's think for a second about what Hoekstra is finding. This is also because in the face of strong trends in the running variable, sample-size requirements get even larger. Iacus et al. [2012] introduced a kind of exact matching called coarsened exact matching (CEM). Sometimes we know that randomization occurred only conditional on some observable characteristics. For RDD to be valid in your study, there must not be an observable discontinuous change in the average values of reasonably chosen covariates around the cutoff. The following Monte Carlo simulation will estimate OLS on a sample of data 1,000 times. The authors used the same non-experimental control group data sets from the CPS and PSID as Lalonde [1986] did. In this path, unobserved ability affects both which jobs people get and their earnings. For instance: where the intercept is redefined to absorb E(u). Even controlling for I, there still exist spurious correlations between D and Y due to the D ← B → Y backdoor path. In fact, pictures are the comparative advantage of RDD. 
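As a sketch of the subclassification idea, weighting stratum-specific differences in means by each stratum's share of the sample, here is a minimal Python example (the book's own examples use Stata; the strata and outcomes below are hypothetical):

```python
# Subclassification estimator: weight each stratum's difference in means
# by the stratum's share of the sample (hypothetical data).
data = [
    # (stratum, treated, outcome)
    ("young", 1, 10), ("young", 0, 8), ("young", 0, 6),
    ("old",   1, 20), ("old",   1, 22), ("old",  0, 19),
]

strata = {s for s, _, _ in data}
n = len(data)
ate = 0.0
for s in strata:
    treated = [y for g, d, y in data if g == s and d == 1]
    control = [y for g, d, y in data if g == s and d == 0]
    share = sum(1 for g, _, _ in data if g == s) / n
    ate += share * (sum(treated) / len(treated) - sum(control) / len(control))
print(round(ate, 3))  # → 2.5
```

Each stratum contributes its own treated-minus-control gap, and the strata weights are the sample shares, which is what closes the backdoor through the stratifying variable when CIA holds within strata.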
Instead, we use nearest-neighbor matching, which is simply going to match each treatment unit to the control group unit whose covariate value is nearest to that of the treatment group unit itself. Notice that it is straightforward because the estimator is linear in the parameters. History is a sequence of observable, factual events, one after another. But how do we answer the causal question without independence (i.e., randomization)? The term before the vertical bar is the same, but the term after the vertical bar is different. And again, note that the notation here is population concepts. Well, the average incentive in her experiment was worth about a day's wage. Crump et al. There's always prediction error, in other words, with any estimator, but OLS is the least worst. which is slightly higher than the unadjusted ATE of 3.25. In observational data involving human beings, it almost always will be different from the ATE, and that's because individuals will be endogenously sorting into some treatment based on the gains they expect from it. 28 Hat tip to Ben Chidmi, who helped create this simulation in Stata. Let's write that number 2 down and do another permutation, by which I mean, let's shuffle the treatment assignment again. But, when we condition on p(X), the propensity score, notice that D and X are statistically independent. The therapist records their mental health on a scale of 0 to 20. The conditional independence assumption allows us to make the following substitution, and the same for the other term. I recommend the package rddensity,11 which you can install for R as well.12 These packages are based on Cattaneo et al. A classical question in labor economics is whether college education increases earnings. Fisher's sharp null was the assertion that every single unit had a treatment effect of zero, which leads to an easy statement that the ATE is also zero. For instance, there are two units in the non-trainees group with an age of 30, and their earnings are 10 and 18. 
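Nearest-neighbor matching on a single covariate can be sketched in a few lines; when several controls are equally close (like the two 30-year-olds earning 10 and 18), their outcomes are averaged, mirroring the example in the text (ages and earnings here are hypothetical; the book does this in Stata):

```python
# Nearest-neighbor matching on one covariate (age): impute each treated
# unit's counterfactual from the closest control unit(s), averaging when
# several controls are equally close (hypothetical data).
treated = [(25, 12.0), (30, 14.0)]                  # (age, earnings)
control = [(24, 9.0), (30, 10.0), (30, 18.0), (40, 11.0)]

att = 0.0
for age_t, y_t in treated:
    dmin = min(abs(a - age_t) for a, _ in control)
    matches = [y for a, y in control if abs(a - age_t) == dmin]
    att += y_t - sum(matches) / len(matches)
att /= len(treated)
print(att)  # → 1.5
```

The 30-year-old treated unit's counterfactual is the average of 10 and 18, i.e., 14, exactly the tie-handling the text describes.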
Here that would be the mortality rate of smokers in the population: 29 per 100,000. That is, 0 < Pr(D = 1 | X) < 1. At c0, the assignment variable X no longer has a direct effect on Y. Using our sampling metaphor, then, the distribution of the coefficients is probably larger than we thought. 8 Counterfactual reasoning can be helpful, but it can also be harmful, particularly when it is the source of regret. He does this by estimating the causal effect of attending the state flagship university on earnings. Or maybe he used the second hand on his watch to assign surgery to them: if it was between 1 and 30 seconds, he gave them surgery, and if it was between 31 and 60 seconds, he gave them chemotherapy.18 In other words, let's say that he chose some method for assigning treatment that did not depend on the values of potential outcomes under either state of the world. Changes in hospitalizations [Card et al., 2008]. The mean of the coefficients was around 1.998, which was very close to the true effect of 2 (hard-coded into the data-generating process). We've introduced the potential outcomes notation and used it to define various types of causal effects. And if we can close all of the otherwise open backdoor paths, then we can isolate the causal effect of D on Y using one of the research designs and identification strategies discussed in this book. This, again, is the heart and soul of the RDD. She uses least squares as her primary model, represented in columns 1-5. 
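A minimal version of the Monte Carlo exercise described above, repeatedly estimating OLS on freshly simulated data and averaging the slope estimates, can be written as follows (the book's version runs 1,000 replications in Stata; the parameters below match the population model used later, with a true slope of 2):

```python
# Monte Carlo check that OLS is unbiased: simulate y = 3 + 2*x + u many
# times and average the slope estimates across replications.
import random

random.seed(1)
reps, n = 200, 500
slopes = []
for _ in range(reps):
    x = [random.gauss(0, 3) for _ in range(n)]   # x ~ Normal(0, 9)
    u = [random.gauss(0, 6) for _ in range(n)]   # u ~ Normal(0, 36)
    y = [3 + 2 * xi + ui for xi, ui in zip(x, u)]
    xbar, ybar = sum(x) / n, sum(y) / n
    b = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / \
        sum((xi - xbar) ** 2 for xi in x)
    slopes.append(b)
print(sum(slopes) / reps)  # close to the true slope of 2
```

Any single draw's slope misses 2 by sampling error, but the average across replications sits right on the true effect, which is what unbiasedness means.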
To maintain this fiction, let's assume that there exists the perfect doctor who knows each person's potential outcomes and chooses whichever treatment maximizes a person's post-treatment life span.12 In other words, the doctor chooses to put a patient in surgery or chemotherapy depending on whichever treatment has the longer post-treatment life span. The only plausible explanation, it was argued, was that the polio vaccine caused a reduction in the risk of polio. The rule used to get β̂ = 0.082 is unbiased (if we believe that u is unrelated to schooling), not the actual estimate itself. 2. We will at some point be missing values, in other words, for those K categories. How does one choose surgery independent of the expected gains of the surgery? What makes a collider so special? Table 39. 5. How realistic is independence in observational data? This shows up in a regression as well. There are a variety of explicit assumptions buried in this graph that must hold in order for the methods we will review later to recover any average causal effect. Leverage. Thus the simple difference-in-mean outcomes estimator is biased unless those second and third terms zero out. This is what sharp means: it means literally that no single unit has a treatment effect. They show up constantly. Jump around! Common support means we can estimate both terms. Mean ages, years [Cochran, 1968]. And you are free to call foul on this assumption if you think that background factors affect both schooling and the child's own productivity, which itself should affect wages. Causal inference is not solved with more data, as I argue in the next chapter. 
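With full potential outcomes in hand (a fiction only simulation allows), the decomposition of the simple difference in mean outcomes (SDO) into the ATE plus selection bias plus heterogeneous-treatment-effect bias can be checked numerically. The potential outcomes below are hypothetical, and treatment is assigned by a "perfect doctor" who always picks the larger one:

```python
# Decompose SDO = ATE + selection bias + (1 - pi)*(ATT - ATU) with
# hypothetical potential outcomes and perfect-doctor assignment.
units = [(7, 1), (5, 8), (5, 2), (6, 10)]        # (y1, y0) per patient
D = [1 if y1 >= y0 else 0 for y1, y0 in units]   # doctor picks the max

def mean(v):
    return sum(v) / len(v)

ate = mean([y1 - y0 for y1, y0 in units])
ey1_t = mean([y1 for (y1, y0), d in zip(units, D) if d == 1])
ey0_t = mean([y0 for (y1, y0), d in zip(units, D) if d == 1])
ey0_c = mean([y0 for (y1, y0), d in zip(units, D) if d == 0])
ey1_c = mean([y1 for (y1, y0), d in zip(units, D) if d == 0])
pi = mean(D)                                      # share treated

sdo = ey1_t - ey0_c
selection_bias = ey0_t - ey0_c
att, atu = ey1_t - ey0_t, ey1_c - ey0_c
het_bias = (1 - pi) * (att - atu)
assert abs(sdo - (ate + selection_bias + het_bias)) < 1e-9
print(sdo, ate)  # the SDO (-3.0) has the wrong sign relative to the ATE (0.5)
```

Here the decomposition holds exactly, and the perfect doctor's sorting makes the naive comparison negative even though the true ATE is positive, exactly the bias the second and third terms capture.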
An accessible, contemporary introduction to the methods for determining cause and effect in the social sciences. Causal inference encompasses the tools that allow social scientists to determine what causes what. That is, an exogenous shock to P (i.e., dropping Democrats into the district) does nothing to equilibrium policies. The challenge in this type of question should be easy to see. The backdoor path is D ← X → Y. When we do that in Stata, we get an ATT of $1,725 with p < 0.05. Rather, the simple difference in outcomes is nothing more than a number. Given σ̂², we can now estimate sd(β̂1) and sd(β̂0). The regression anatomy theorem is based on earlier work by Frisch and Waugh [1933] and Lovell [1963].23 I find the theorem more intuitive when I think through a specific example and offer up some data visualization. A nonlinear data-generating process could easily yield false positives if we do not handle the specification carefully. But sometimes there exists a confounder that is unobserved, and when there is, we represent its direct edges with dashed lines. This time the X has two arrows pointing to it, not away from it. Practical questions about causation have been a preoccupation of economists for several centuries. Suppose we measure the size of the mistake, for each i, by squaring it. I'll show you: let's walk through both the regression output that I've reproduced in Table 8 as well as a nice visualization of the slope parameters in what I'll call the short bivariate regression and the longer multivariate regression. But it can be estimated. But now we need to build out this epistemological justification so as to capture the inherent uncertainty in the sampling process itself. Between 1971 and 1982, the RAND Corporation conducted a large-scale randomized experiment studying the causal effect of health-care insurance on health-care utilization. An illustration of this identifying assumption is in Figure 33. 
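The regression anatomy (Frisch-Waugh-Lovell) theorem can be verified numerically: the multivariate slope on x1 equals the bivariate slope of y on the residual from regressing x1 on the other covariate. A self-contained sketch with simulated data (pure Python, no packages; coefficients are made up):

```python
# Regression anatomy check: in y = b0 + b1*x1 + b2*x2 + e, the slope b1
# equals the slope of y on the residual from regressing x1 on x2.
import random

random.seed(2)
n = 2000
x2 = [random.gauss(0, 1) for _ in range(n)]
x1 = [0.5 * v + random.gauss(0, 1) for v in x2]   # x1 correlated with x2
y = [1 + 2 * a - 1 * b + random.gauss(0, 1) for a, b in zip(x1, x2)]

def demean(v):
    m = sum(v) / len(v)
    return [vi - m for vi in v]

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

dx1, dx2, dy = demean(x1), demean(x2), demean(y)

# Multivariate slope on x1 via the two-regressor closed form.
b1 = (dot(dx1, dy) * dot(dx2, dx2) - dot(dx2, dy) * dot(dx1, dx2)) / \
     (dot(dx1, dx1) * dot(dx2, dx2) - dot(dx1, dx2) ** 2)

# Anatomy: residualize x1 on x2, then regress y on that residual.
g = dot(dx1, dx2) / dot(dx2, dx2)
resid = [a - g * b for a, b in zip(dx1, dx2)]
b1_fwl = dot(resid, dy) / dot(resid, resid)
assert abs(b1 - b1_fwl) < 1e-9
```

The two numbers agree to machine precision, which is the sense in which "controlling for" x2 is the same as regressing on the part of x1 that x2 cannot explain.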
I can't think of a more well-known and widely accepted causal theory, in fact. How realistic is it that the variance in the errors is the same for all slices of the explanatory variable, x? Authors used this to construct an estimate of age in quarters at date of interview. As Lee and Lemieux [2010] note, allowing different functions on both sides of the discontinuity should be the main results in an RDD paper. Theorem: Sampling variance of OLS. In doing so, she addresses the over-rejection problem that we saw earlier when discussing clustering in the probability and regression chapter. But so what? Standard errors are in parentheses. That is not the source of uncertainty in randomization inference, though. The existence of two causal pathways is contained within the correlation between D and Y. Let's look at a second DAG, which is subtly different from the first. The DAG is actually telling two stories. I consider Knox et al. The sample space is the set of all the possible outcomes of a random process. Causal Inference: The Mixtape, by Scott Cunningham. ISBN 9780300251685, 584 pages. Sometimes the discrepancies are small, sometimes zero, sometimes large. Estimated p-value using different number of trials. 16 This all works if we match on the propensity score and then calculate differences in means. This method implicitly achieves distributional balance between the treatment and control in terms of that known, observable confounder. Conclusion. If the expected potential outcomes are not jumping at c0, then there necessarily are no competing interventions occurring at c0. Furthermore, maybe by the same logic, cigarette smoking has such a low mortality rate because cigarette smokers are younger on average. [2004] present a model, which I've simplified. The OLS estimator is still β̂ = E[XX′]⁻¹E[XY]. Recall that σ² = E(u²). 
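As a small worked example of the sample analogs of these population concepts, here is bivariate OLS with the homoskedastic standard error se(β̂1) = sqrt(σ̂²/SSTx), where σ̂² = SSR/(n−2) (the data points are made up for illustration):

```python
# Bivariate OLS slope, intercept, and homoskedastic standard error
# computed from first principles (hypothetical data).
x = [1, 2, 3, 4, 5, 6]
y = [2, 4, 5, 8, 9, 13]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
sst_x = sum((xi - xbar) ** 2 for xi in x)
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sst_x
b0 = ybar - b1 * xbar
resid = [yi - b0 - b1 * xi for xi, yi in zip(x, y)]
sigma2 = sum(e ** 2 for e in resid) / (n - 2)   # sigma^2-hat = SSR/(n-2)
se_b1 = (sigma2 / sst_x) ** 0.5
print(b1, se_b1)
```

The degrees-of-freedom correction (n − 2) is what makes σ̂² unbiased for σ² = E(u²) under the homoskedasticity assumption discussed above.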
This kind of scaling-up issue is of common concern when one considers extrapolating from the experimental design to the large-scale implementation of an intervention in some population. It's a very popular method, particularly in the medical sciences, of addressing selection on observables, and it has gained some use among economists as well [Dehejia and Wahba, 2002]. We continue doing that for all units, always moving the control group unit with the closest value on X to fill in the missing counterfactual for each treatment unit. It is weighting treatment and control units according to the inverse of the propensity score, which is causing units with very small values of the propensity score to blow up and become unusually influential in the calculation of ATT. We say that the two groups are not exchangeable because the covariate is not balanced. The way that you read this table is each cell shows the average treatment effect for the 65-year-old population that complies with the treatment. I encourage you to read these papers more closely when choosing which bootstrap is suitable for your question. In which case, we just need to be careful what we are and are not defining as the treatment. Public concern about police officers systematically discriminating against minorities has reached a breaking point and led to the emergence of the Black Lives Matter movement. Figure 14. Note: Top left: Non-star sample scatter plot of beauty (vertical axis) and talent (horizontal axis). Let the pairs be (xi, yi): i = 1, 2, . . . , n. This can be written as follows: since ei is uncorrelated with any independent variable, it is also uncorrelated with xki. Notice that the two variables are independent, random draws from the standard normal distribution, creating an oblong data cloud. Let's begin with a simple DAG to illustrate a few basic ideas. 
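One common response to the blow-up problem described above is trimming units with extreme propensity scores before computing the normalized IPW estimate of the ATT. A minimal sketch (the propensity scores, outcomes, and the [0.1, 0.9] trimming rule are all made up for illustration; the book's applications do this in Stata):

```python
# Normalized IPW for the ATT, trimming extreme propensity scores.
# Each record: (propensity score, treated flag, outcome).
sample = [
    (0.9, 1, 10), (0.6, 1, 9), (0.5, 0, 6),
    (0.4, 0, 7), (0.99, 0, 30),   # extreme score: weight would explode
]

trimmed = [(p, d, y) for p, d, y in sample if 0.1 <= p <= 0.9]
treat_mean = (sum(y for _, d, y in trimmed if d == 1)
              / sum(1 for _, d, _ in trimmed if d == 1))
# Controls get weight p/(1-p); normalize by the sum of weights.
w = [(p / (1 - p), y) for p, d, y in trimmed if d == 0]
control_mean = sum(wi * y for wi, y in w) / sum(wi for wi, _ in w)
att = treat_mean - control_mean
print(round(att, 3))
```

Without the trimming step, the control unit with a 0.99 propensity score would carry a weight of 99 and dominate the weighted control mean, which is exactly the influence problem the text describes.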
The common support assumption requires that for each stratum, there exist observations in both the treatment and control group, but as you can see, there are not any 12-year-old male passengers in first class. Data are drawn from the 1999-2003 NHIS, and for each characteristic, authors show the incidence rate at age 63-64 and the change at age 65 based on a version of the CK equations that include a quadratic in age, fully interacted with a post-65 dummy as well as controls for gender, education, race/ethnicity, region, and sample year. They produced a study that revisited Fryer's question and in my opinion both yielded new clues as to the role of racial bias in police use of force and the challenges of using administrative data sources to do so. The form of this fuzzy RDD reduced form is as follows: as in the sharp RDD case, one can allow the smooth function to be different on both sides of the discontinuity by interacting Zi with the running variable. This isn't completely bad news, because the unbiasedness of our regressions based on repeated sampling never depended on assuming anything about the variance of the errors. If you go to the website and type 2901 choose 2222, you get a number of combinations that is hundreds of digits long (truncated here). However many matches M that we find, we would assign the average outcome as the counterfactual for the treatment group unit. Figure 22. This would be true if in the population u and x are correlated. 
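The count of possible treatment assignments referenced above is just a binomial coefficient, which Python computes exactly with arbitrary-precision integers, so there is no need for a website:

```python
# Exact count of ways to choose 2222 treated units out of 2901.
import math

combos = math.comb(2901, 2222)
print(len(str(combos)))  # number of digits in the count
```

This is why exact randomization inference enumerates all assignments only in tiny samples; with thousands of units, one samples permutations instead.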
This is the OLS intercept estimate because it is calculated using sample averages. 4 A truly hilarious assumption, but this is just illustrative. Once I have all those, I have a better sense of where my problems are. Patient 7, for instance, lives only one additional year post-surgery versus ten additional years post-chemo. A contemporary debate could help illustrate what I mean. Notice the jump at the discontinuity in the outcome, which I've labeled the LATE, or local average treatment effect. Introduction: Causal Inference as a Comparison of Potential Outcomes. Causal inference refers to an intellectual discipline that considers the assumptions, study designs, and estimation strategies that allow researchers to draw causal conclusions based on data. This advice has since become common practice in the empirical literature. We could just plot the regression coefficient in a scatter plot showing all i pairs of data; the slope coefficient would be the best linear fit of the data for this data cloud. But because this will get incredibly laborious, let's just focus on a few of them. Angrist and Pischke [2009] argue that linear regression may be useful even if the underlying CEF itself is not linear, because regression is a good approximation of the CEF. The authors comment on what might be going on: Figure 36. Notes 1 This brief history will focus on the development of the potential outcomes model. The NSW was a temporary employment program designed to help disadvantaged workers lacking basic job skills move into the labor market by giving them work experience and counseling in a sheltered environment. "Causation versus correlation has been the basis of arguments, economic and otherwise, since the beginning of time." As noted earlier, we write the CEF as E(yi | xi). Note that the CEF is explicitly a function of xi. 
Because we do not have overlap, or common support, we must rely on extrapolation, which means we are comparing units with different values of the running variable. And how might we visualize this coefficient given that there are six dimensions to the data? Then we can calculate the ATT. Nearest-neighbor matching. Sharp RDD is where treatment is a deterministic function of the running variable X.6 An example might be Medicare enrollment, which happens sharply at age 65, excluding disability situations. It was not always so popular, though. We just stacked the data, which doesn't affect the estimator itself. In the second state of the world (sometimes called the counterfactual state of the world), that same man takes nothing for his headache and one hour later reports the severity of his headache. The issue of outliers also leads us to consider a test statistic that uses ranks rather than differences. Let's say we are estimating the causal effect of returns to schooling. We use the notation Y1 and Y0, respectively, for these two states of the world. The share who saw a doctor went up slightly, as did the share who stayed at a hospital. [2015] suggest an alternative assumption which has implications for inference. They ultimately are the ones who can give you the data if it is not public use, so don't be a jerk.5 But on to the picture. I've created a data set like this. Without data, one cannot study the question of whether police shoot minorities more than they shoot whites. Think of kernel regression as a weighted regression restricted to a window (hence local). Independence does not imply that E[Y1 | D = 1] − E[Y0 | D = 0] = 0. This gives us the age-adjusted mortality rate for the treatment group. Matching methods are an important member of the causal inference arsenal. To help you understand randomization inference, let's break it down into a few methodological steps. 
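Those steps, shuffle the treatment assignment, recompute the test statistic, and compare it to the observed one, look like this in a small example (hypothetical data; with only six units we can enumerate all 20 assignments exactly rather than sampling permutations):

```python
# Randomization inference under Fisher's sharp null: permute treatment
# labels, recompute |difference in means|, and count assignments at
# least as extreme as the observed one (hypothetical data).
import itertools

y = [8, 10, 12, 6, 5, 7]          # outcomes
d = [1, 1, 1, 0, 0, 0]            # observed treatment assignment

def stat(assign):
    t = [yi for yi, di in zip(y, assign) if di == 1]
    c = [yi for yi, di in zip(y, assign) if di == 0]
    return abs(sum(t) / len(t) - sum(c) / len(c))

observed = stat(d)
perms = set(itertools.permutations(d))   # all 20 ways to treat 3 of 6
extreme = sum(stat(p) >= observed for p in perms)
p_value = extreme / len(perms)
print(observed, p_value)
```

Under the sharp null every unit's outcome is unchanged by reassignment, so the permutation distribution of the statistic is known exactly, and the p-value is simply the share of assignments at least as extreme as the one we observed.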
One, doing so allows her to include a variety of controls that can reduce the residual variance and thus improve the precision of her estimates. Is it random? Yule used his regression to crank out the correlation between out-relief and pauperism, from which he concluded that public assistance increased pauper growth rates. If we want to estimate the average causal effect of family size on labor supply, then we need two things. Those were both ATT estimators. For instance, Fryer [2019] notes that the Houston data was based on arrest narratives that ranged from two to one hundred pages in length. What would you do if you were in room B? They are as follows: 1. Jump up, jump up, and get down! In order to estimate a causal effect when there is a confounder, we need (1) CIA and (2) the probability of treatment to be between 0 and 1 for each stratum. But note, since this does not adjust for observable confounders age and gender, it is a biased estimate of the ATE. We can think of this as getting close to the population variance of x, as n gets large. The joint density is defined as fxy(u,t). Note that ATE in this example is equal to 0.6. This assumption stipulates that our population error term, u, has the same variance given any value of the explanatory variable, x. All the administrative data is conditional on a stop. Which is right? To wonder how life would be different had one single event been different is to indulge in counterfactual reasoning, and counterfactuals are not realized in history because they are hypothetical states of the world. So the authors aim to randomize Dt using an RDD, which I'll now discuss in detail. There's no magic here, just least squares. This is not surprising given the strong negative selection into treatment. 
Let's use the absolute value of the simple difference in mean outcomes on the normalized rank. To calculate the exact p-value, we would simply conduct the same randomization process as earlier, only instead of calculating the simple difference in mean outcomes, we would calculate the absolute value of the simple difference in mean rank. What is the average death rate for pipe smokers without subclassification? Table 29. 10 Lalonde [1986] lists several studies that discuss the findings from the program. Card et al. But remember, every matching solution to a causality problem requires a credible belief that the backdoor criterion can be achieved by conditioning on some matrix X, or what we've called CIA. Note, using DAG notation, this simply means that we have the following DAG: where D is smoking, Y is mortality, and A is age of the smoker. Colliders, when they are left alone, always close a specific backdoor path. Karl Marx was interested in the transition of society from capitalism to socialism [Needleman and Needleman, 1969]. We usually define M to be small, like M = 2. As can be seen in Figure 9, the least squares estimate has a narrower spread than that of the estimates when the data isn't clustered. That means a student with 1240 had a lower chance of getting in than a student with 1250. What if Bruce Wayne's parents had never been murdered? When there are such spillovers, though, such as when we are working with social network data, we will need to use models that can explicitly account for such SUTVA violations, such as that of Goldsmith-Pinkham and Imbens [2013]. The National Supported Work Demonstration (NSW) job-training program was operated by the Manpower Demonstration Research Corp. (MDRC) in the mid-1970s. In other words, the cutoff is endogenous. 
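The rank-based statistic described above can be computed as follows (hypothetical outcomes with one extreme value; in the full procedure you would then permute the treatment labels exactly as before, using this statistic in place of the raw difference in means):

```python
# Rank-based statistic: replace outcomes by their ranks, then take the
# absolute difference in mean ranks between treatment and control.
# Robust to outliers (note the 120); hypothetical data, no ties.
y = [8, 10, 120, 6, 5, 7]
d = [1, 1, 1, 0, 0, 0]
ranks = [sorted(y).index(v) + 1 for v in y]   # 1 = smallest
rt = [r for r, di in zip(ranks, d) if di == 1]
rc = [r for r, di in zip(ranks, d) if di == 0]
stat = abs(sum(rt) / len(rt) - sum(rc) / len(rc))
print(stat)  # → 3.0
```

The outlier of 120 would dominate a raw difference in means, but as a rank it contributes only a 6, which is why ranks tame outliers.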
But the purpose here is primarily to show its robustness under different ways of generating those precious p-values, as well as provide you with a map for programming this yourself and for having an arguably separate intuitive way of thinking about significance itself. Despite Campbell's many efforts to advocate for its usefulness and understand its properties, RDD did not catch on beyond a few doctoral students and a handful of papers here and there. But the inclusion of controls has other value. The cutoff is endogenous to factors that independently cause potential outcomes to shift. But if we are interested in the causal effect of a single variable, R2 is irrelevant. Two were public-use data sets: the New York City Stop and Frisk database and the Police-Public Contact Survey. Many analysts have argued that unequal insurance coverage contributes to disparities in health-care utilization and health outcomes across socioeconomic status. So let's say we regress Y onto D, our discrimination variable. Then combining this new assumption, E(u | x) = E(u) (the nontrivial assumption to make), with E(u) = 0 (the normalization and trivial assumption), you get the following new assumption: Equation 2.28 is called the zero conditional mean assumption and is a key identifying assumption in regression models. These universities are often environments of higher research, with more resources and strongly positive peer effects. For now, let's look at the treatment parameter under both assumptions. We then use the two population restrictions that we discussed earlier to obtain estimating equations for β0 and β1. And there are designs where the probability of treatment discontinuously increases at the cutoff. An example would be age thresholds used for policy, such as when a person turns 18 years old and faces more severe penalties for crime. The second data set is hospital discharge records for California, Florida, and New York. 
15 See Angrist and Pischke [2009], 80-81. Think of the backdoor path like this: sometimes when D takes on different values, Y takes on different values because D causes Y. Since identification in an RDD is a limiting case, we are technically only identifying an average causal effect for those units at the cutoff. This leads therefore to mismatching on the covariates, which introduces bias. A positive residual indicates that the regression line (and hence, the predicted values) underestimates the true value of yi. Table 9. As we've been saying, this leads us to the variance and ultimately to the standard deviation. In that case, we'd fit the regression model: since f(Xi) is counterfactual for values of Xi > c0, how will we model the nonlinearity? It is common to hear that once occupation or other characteristics of a job are conditioned on, the wage disparity between genders disappears or gets smaller. Approximately 20% of non-elderly adults in the United States lacked insurance in 2005. Erin's work partly focuses on gender discrimination. Not only were his estimates usually very different in magnitude, but his results were almost always the wrong sign! There are several critical empirical challenges in studying racial biases in police use of force, though. Hence the imbalance measure is bounded between 0 and 1. Directed Acyclic Graphs. Everyday it rains, so everyday the pain / Went ignored and I'm sure ignorance was to blame / But life is a chain, cause and effected. A similar large-scale randomized experiment occurred in economics in the 1970s. But each effect size is only about half the size of the true effect. If the null is a continuous density through the cutoff, then bunching in the density at the cutoff is a sign that someone is moving over to the cutoff, probably to take advantage of the rewards that await there. 
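A bare-bones sharp RDD estimate, fitting separate linear trends on each side of the cutoff as Lee and Lemieux recommend, can be sketched as follows (simulated data with a made-up true jump of 2 at the cutoff; real applications use local polynomial routines such as the Cattaneo et al. packages rather than a global fit like this):

```python
# Sharp RDD sketch: fit separate lines left and right of the cutoff c0
# and estimate the jump as the difference in intercepts at c0.
import random

random.seed(3)
c0 = 0.0
data = []
for _ in range(400):
    x = random.uniform(-1, 1)
    jump = 2.0 if x >= c0 else 0.0            # true discontinuity of 2
    data.append((x, 1.0 + 0.5 * x + jump + random.gauss(0, 0.3)))

def fit(points):
    """OLS of y on (x - c0); returns (intercept at c0, slope)."""
    xs = [x - c0 for x, _ in points]
    ys = [y for _, y in points]
    xb, yb = sum(xs) / len(xs), sum(ys) / len(ys)
    b = sum((xi - xb) * (yi - yb) for xi, yi in zip(xs, ys)) / \
        sum((xi - xb) ** 2 for xi in xs)
    return yb - b * xb, b

a_right, _ = fit([p for p in data if p[0] >= c0])
a_left, _ = fit([p for p in data if p[0] < c0])
late = a_right - a_left
print(round(late, 2))  # close to the true jump of 2
```

Allowing the slope (and, in practice, higher-order terms) to differ on each side is what prevents a trend on one side from contaminating the intercept on the other.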
Then take conditional expectations: E(yi | xi) = E(β0 | xi) + E(β1xi | xi) + E(εi | xi) = β0 + β1xi, since E(εi | xi) = 0 after we pass the conditional expectation through. By exploiting institutional knowledge about how students were accepted (and subsequently enrolled) into the state flagship university, Hoekstra was able to craft an ingenious natural experiment. This kind of adjustment raises a question: which variable(s) should we use for adjustment? Graphical representation of bivariate regression from y on x. That's no doubt a strange concept to imagine, so I have a funny illustration to clarify what I mean. Consider the following DAG: same as before, U is a noncollider along the backdoor path from D to Y, but unlike before, U is unobserved to the researcher. But in this regression anatomy exercise, I hope to give a different interpretation of what you're doing when you in fact control for variables in a regression. The second part of the theorem states that εi is uncorrelated with any function of xi. Rendering selection bias impotent, the procedure is capable of recovering average treatment effects for a given subpopulation of units. The racial difference survives conditioning on 125 baseline characteristics, encounter characteristics, civilian behavior, precinct, and year fixed effects. Oftentimes they will describe a process whereby a running variable is used for treatment assignment, but they won't call it that. The sensitivity of inverse probability weighting to extreme values of the propensity score has led some researchers to propose an alternative that can handle extremes a bit better. We'll call the state of the world where no treatment occurred the control state. Second, we need for the number of kids, X, to be randomly assigned for a given set of race and age. 16 Why do I say gains? 
Propensity score matching takes those necessary covariates, estimates a maximum likelihood model of the conditional probability of treatment (usually a logit or probit so as to ensure that the fitted values are bounded between 0 and 1), and uses the predicted values from that estimation to collapse those covariates into a single scalar called the propensity score. The estimator, β̂, varies across samples and is the random outcome: before we collect our data, we do not know what β̂ will be. It had been closed because colliders close backdoor paths, but since we conditioned on it, we actually opened it instead. You can never let the fundamental problem of causal inference get away from you: we never know a causal effect. But the post-treatment difference in average earnings was between $798 and $886.12 Table 33 also shows the results he got when he used the non-experimental data as the comparison group. And if you have satisfied the backdoor criterion, then you have in effect isolated some causal effect. We have the following population model: where x ~ Normal(0, 9) and u ~ Normal(0, 36). If you know the value of Xi for unit i, then you know treatment assignment for unit i with certainty. Let's look at an example in Stata using its popular automobile data set. The sample versions of both ATE and ATT are obtained by a two-step estimation procedure. First, there are the shared background factors, B. Insofar as very close races represent exogenous assignments of a party's victory, which I'll discuss below, then we can use these close elections to identify the causal effect of the winner on a variety of outcomes. Note, the woman performs the experiment by selecting four cups. But most likely, family size isn't random, because so many people choose the number of children to have in their family, instead of, say, flipping a coin. But u without the hat is the error term, and it is by definition unobserved by the researcher. 
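The book's applications estimate the propensity score with a canned logit routine in Stata. As a minimal sketch of what that first step does, here is a tiny logistic regression fit by gradient ascent on simulated data (the coefficients, sample size, and learning rate are all made up); the payoff is that the fitted values are strictly between 0 and 1, unlike a linear probability model:

```python
# Toy logit propensity score: fit Pr(D=1|x) = 1/(1+exp(-(b0+b1*x)))
# by gradient ascent on the log-likelihood (simulated data).
import math
import random

random.seed(4)
n = 200
x = [random.gauss(0, 1) for _ in range(n)]
d = [1 if random.random() < 1 / (1 + math.exp(-(0.5 + xi))) else 0
     for xi in x]

b0, b1 = 0.0, 0.0
for _ in range(1500):                      # ascend the concave likelihood
    g0 = g1 = 0.0
    for xi, di in zip(x, d):
        p = 1 / (1 + math.exp(-(b0 + b1 * xi)))
        g0 += di - p
        g1 += (di - p) * xi
    b0 += 0.05 * g0 / n
    b1 += 0.05 * g1 / n

pscores = [1 / (1 + math.exp(-(b0 + b1 * xi))) for xi in x]
assert all(0 < p < 1 for p in pscores)     # bounded, as a logit guarantees
```

These fitted values are the single scalar that the matching or weighting step then operates on.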
As you can see from Figure 16, these two populations not only have different means (Table 28); the entire distribution of age across the samples is different. Consider now the term E[eifi]. Since fi is a linear combination of all the independent variables with the exception of xki, it must be that E[eifi] = 0. This sort of event is unlikely to occur naturally in nature, and it is almost certainly caused by either sorting or rounding. 16 This section is a review of traditional econometrics pedagogy. Respondents were also given the opportunity to purchase condoms. When the number of matching covariates is more than one, we need a new definition of distance to measure closeness. The second property is the CEF prediction property. But we'd still have that nasty selection bias screwing things up. The problem isn't caused by assuming heterogeneity either. Furthermore, being a child made you more likely to be in first class and made you more likely to survive. There are two different languages for saying the same thing. In the first state of the world (sometimes called the actual state of the world), a man takes aspirin for his headache and one hour later reports the severity of his headache. In all of these, though, there is some running variable X that, upon reaching a cutoff c0, flips the likelihood of receiving some treatment. Design-based uncertainty is a reflection of not knowing which values would have occurred had some intervention been different in the counterfactual. You can see a visual representation of this in Figure 6, where the multivariate slope is negative. So let's go. All of these can be implemented in Stata or R. Iacus et al. And that transition to Medicare occurs sharply at age 65, the threshold for Medicare eligibility. These may even allow for bandwidths to vary left and right of the cutoff. The baseline difference in real earnings between the two groups was negligible. 
I won't review every treatment effect the authors calculated, but I will note that they are all positive and similar in magnitude to what they found in columns 1 and 2 using only the experimental data. Well, sure: those two individual students are likely very different. But what if the window cannot be narrowed enough? This is the value we predict for yi given that x = xi. Then the ATE = 0, which was Neyman's idea. There are several ways of measuring imbalance, but here we focus on the L1(f,g) measure, which is L1(f,g) = (1/2) Σs |fs − gs|, summing over the multivariate strata s, where f and g record the relative frequencies for the treatment and control group units. Imagine that you work for a homeless shelter with a cognitive behavioral therapy (CBT) program for treating mental illness and substance abuse. We will use these a lot. Investigating the CPS for discontinuities at age 65 [Card et al., 2008]. Section 4 outlines a general methodology to guide problems of causal inference: Define, Assume, Identify, and Estimate, with each step benefiting from the tools developed in Section 3. In his full model, blacks are 21 percent more likely than whites to be involved in an interaction with police in which a weapon is drawn (which is statistically significant). 17 This is actually where economics is helpful in my opinion. This is often done by finding a couple of units with comparable propensity scores from the control unit donor pool within some ad hoc chosen radius distance of the treated unit's own propensity score. Then we can calculate the mortality rate for some treatment group (cigarette smokers) by strata (here, that is age). We don't know what would have happened had one event changed because we are missing data on the counterfactual outcome.8 Potential outcomes exist ex ante as a set of possibilities, but once a decision is made, all but one outcome disappears.9 To make this concrete, let's introduce some notation and more specific concepts. The second line uses the definition of conditional expectation.
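The L1 imbalance measure is just half the sum of absolute differences between the strata-level relative frequencies of the two groups; a sketch (the strata labels are hypothetical):

```python
from collections import Counter

def l1_imbalance(treated_strata, control_strata):
    """L1(f, g): half the summed absolute differences between the
    relative frequencies of the treatment group (f) and control group (g)
    across strata. 0 means perfect balance; 1 means no overlap at all."""
    f = Counter(treated_strata)
    g = Counter(control_strata)
    n_t, n_c = len(treated_strata), len(control_strata)
    strata = set(f) | set(g)
    return 0.5 * sum(abs(f[s] / n_t - g[s] / n_c) for s in strata)

# Identical strata distributions -> 0; completely disjoint strata -> 1.
print(l1_imbalance(["a", "a", "b"], ["a", "a", "b"]))  # → 0.0
print(l1_imbalance(["a", "a"], ["b", "b"]))            # → 1.0
```

In CEM the strata come from the coarsened covariates, so L1 can be compared before and after matching to see how much balance improved.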
Then we can apply bias-correction methods to minimize the size of the bias. Insofar as there is positive selection into the state flagship school, we might expect individuals with higher observed and unobserved ability to sort into the state flagship school. Family earnings may itself affect the child's future earnings through bequests and other transfers, as well as external investments in the child's productivity. And with a rejection threshold of, for instance, 0.05, a randomization inference test will falsely reject the sharp null less than 5 percent of the time. For instance, say that we are matching on continuous age and continuous income. Replication exercise. 13 The reason that the ATU is negative is because the treatment here is the surgery, which did not perform as well as the chemotherapy the untreated units received. The visualization of a discontinuous jump at zero in earnings isn't as compelling as the prior figure, so Hoekstra conducts hypothesis tests to determine if the means of the groups just below and just above the cutoff are the same. Despite Campbell's many efforts to advocate for its usefulness and understand its properties, RDD did not catch on beyond a few doctoral students and a handful of papers here and there. First, we begin with covariates X and make a copy called X*. I didn't major in economics, for instance. In the previous example, X was observed. Table 41. Ex ante, voters expect the candidate to choose some policy and they expect the candidate to win with probability P(xe,ye), where xe and ye are the policies chosen by Democrats and Republicans, respectively. Table 42. Table 31. From this simple definition of a treatment effect come three different parameters that are often of interest to researchers. Measures of access to care just before 65 and estimated discontinuities at age 65. Well, the same goes for the density.
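The logic of randomization inference under Fisher's sharp null can be sketched directly: holding outcomes fixed (the sharp null says no unit's outcome would change), reshuffle which units are labeled treated, recompute the test statistic for every assignment, and count how often the permuted statistic is at least as large as the observed one. The outcomes below are invented for illustration:

```python
import itertools

def ri_p_value(outcomes, treat_idx):
    """Exact randomization-inference p-value for Fisher's sharp null.

    Under the sharp null, every unit's outcome is fixed regardless of
    treatment, so we can enumerate all possible treatment assignments of
    the same size and compare their difference-in-means statistics to
    the observed one."""
    n = len(outcomes)
    k = len(treat_idx)

    def diff_in_means(idx):
        t = [outcomes[i] for i in idx]
        c = [outcomes[i] for i in range(n) if i not in idx]
        return sum(t) / len(t) - sum(c) / len(c)

    observed = diff_in_means(set(treat_idx))
    perms = [diff_in_means(set(p)) for p in itertools.combinations(range(n), k)]
    return sum(1 for d in perms if d >= observed) / len(perms)

# Hypothetical outcomes; units 0 and 1 were the actual treated units.
print(ri_p_value([10, 9, 3, 2, 4, 1], [0, 1]))
```

With six units and two treated there are only 15 possible assignments, so the permutation distribution is exact; with larger samples one would sample permutations instead of enumerating them.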
Once he makes that treatment assignment, the doctor observes their posttreatment actual outcome according to the switching equation mentioned earlier. (567) For evidence to be so dependent on just a few observations creates some doubt about the clarity of our work, so what are our alternatives? The marginal densities are gy(t) and gx(u). Suppose health insurance coverage can be summarized by two dummy variables: (any coverage) and (generous insurance). Estimation using local and global least squares regressions. I've reproduced a figure from an interesting study on mortality rates for different types of causes [Carpenter and Dobkin, 2009]. Similarly, the way that we obtained our estimates yields a second property: the sample covariance (and therefore the sample correlation) between the explanatory variables and the residuals is always zero (see Table 6). Neyman, on the other hand, started from the other direction and asserted that there was no average treatment effect, not that each unit had a zero treatment effect. By adding E(yi | xi) − E(yi | xi) = 0 to the right side, we get the desired decomposition. I personally find this easier to follow with simpler notation. This concept of spread in repeated sampling is probably the most useful thing to keep in mind as we move through this section. Formally, it is captured by the variance of the estimator. When I was first learning this material, I always had an unusually hard time wrapping my head around it. Pay close attention to precisely how individual units get assigned to the program. Subclassification is a method of satisfying the backdoor criterion by weighting differences in means by strata-specific weights. But then things change starting in 1999. In a widely cited and very influential study, Lee and Card [2008] suggested that researchers should cluster their standard errors by the running variable. Well, if the CEF is linear, then the linear CEF theorem states that the population regression is equal to that linear CEF. Trimming on the propensity score, in effect, helped balance the sample.
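The claim that the sample covariance between the regressors and the residuals is always zero is mechanical, not statistical, so it is easy to verify numerically on any data set. A sketch with made-up numbers:

```python
def ols_fit(xs, ys):
    # Simple OLS by the usual closed-form formulas; returns (b0, b1).
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b1 = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
         sum((x - mx) ** 2 for x in xs)
    return my - b1 * mx, b1

def residuals(xs, ys):
    b0, b1 = ols_fit(xs, ys)
    return [y - (b0 + b1 * x) for x, y in zip(xs, ys)]

# Any data will do: the normal equations force both sums below to zero.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 8.1, 9.7]
e = residuals(xs, ys)
print(round(sum(e), 10))                             # residuals sum to ≈ 0
print(round(sum(x * r for x, r in zip(xs, e)), 10))  # orthogonal to x: ≈ 0
```

This is exactly why the residual û is not the error u: the residual is constructed to satisfy these orthogonality conditions in the sample, whatever the true error term looks like.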
He collected over fifty experimental (lab and field) articles from the American Economic Association's flagship journals: American Economic Review, American Economic Journal: Applied, and American Economic Journal: Economic Policy. Let me explain with a DAG. Then we create one stratum per unique observation of X and place each observation in a stratum. Note that the total bias is made up of the bias associated with each individual unit i. So let's work with a new DAG. Each row in this data set is a particular location in England (e.g., Chelsea, Strand). For reasons we will see, it is useful to report the standard errors below the corresponding coefficient, usually in parentheses. The fitted values from this regression would then be used in a second stage. Whereas the residual will appear in the data set once generated from a few steps of regression and manipulation, the error term will never appear in the data set. That is, the expected potential outcomes change smoothly as a function of the running variable through the cutoff. Economists have long maintained that unobserved ability both determines how much schooling a child gets and directly affects the child's future earnings, insofar as intelligence and motivation can influence careers. Why? It was effectively a coin flip which side of the cutoff someone would be on for a small enough window around the cutoff. So next we use subclassification weighting to control for these confounders. While the potential outcomes notation goes back to Splawa-Neyman [1923], it got a big lift in the broader social sciences with Rubin [1974].7 As of this book's writing, potential outcomes is more or less the lingua franca for thinking about and expressing causal statements, and we probably owe Rubin [1974] for that as much as anyone. Patients in room A will receive the life-saving treatment, and patients in room B will knowingly receive nothing.
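Subclassification weighting, as described above, can be sketched as follows. The strata, treatment flags, and outcomes are invented for illustration; weighting strata-specific differences by the share of treated units in each stratum targets the ATT (weighting by overall stratum shares would instead target the ATE):

```python
from collections import defaultdict

def subclassification_att(data):
    """ATT by subclassification: within each stratum of X, take the
    treated-minus-control difference in mean outcomes, then weight each
    stratum-specific difference by that stratum's share of treated units.

    data: list of (stratum, treated_flag, outcome) tuples.
    Strata lacking either treated or control units are skipped, which
    in a real application would call for recomputing the weights.
    """
    by_stratum = defaultdict(lambda: {"t": [], "c": []})
    for s, d, y in data:
        by_stratum[s]["t" if d else "c"].append(y)

    n_treated = sum(1 for _, d, _ in data if d)
    att = 0.0
    for cell in by_stratum.values():
        if cell["t"] and cell["c"]:
            diff = (sum(cell["t"]) / len(cell["t"])
                    - sum(cell["c"]) / len(cell["c"]))
            att += diff * len(cell["t"]) / n_treated
    return att

# Hypothetical two-strata example (ages 30 and 40):
data = [(30, 1, 12), (30, 0, 10), (30, 0, 18),
        (40, 1, 20), (40, 1, 22), (40, 0, 15)]
print(subclassification_att(data))
```

Here the age-30 stratum contributes a difference of −2 with weight 1/3 and the age-40 stratum a difference of 6 with weight 2/3, so the weighted ATT is 10/3.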
If there exists a neighborhood around the cutoff where this randomization-type condition holds, then the assumption may be viewed as an approximation of a randomized experiment around the cutoff. As can be seen in Figure 8, about 95% of the 95% confidence intervals contain the true value of β1, which is zero. 25 Almost certainly a ridiculous assumption, but stick with me. We only have information on observed outcomes based on the switching equation. This yields the total effect of discrimination as the weighted sum of both the direct effect of discrimination on earnings and the mediated effect of discrimination on earnings through occupational sorting. Matching on a single covariate is straightforward because distance is measured in terms of the covariate's own values. In recent decades, inferring causal relations from purely observational data, known as the task of causal discovery, has drawn much attention in machine learning, philosophy, statistics, and computer science. Figure 28. But specifically, we need a lot of data around the discontinuities, which itself implies that the data sets useful for RDD are likely very large. In the sharp RDD, treatment was determined when Xi ≥ c0. In CEM, this ex ante choice is the coarsening decision. Let's review graphical models, one of Pearl's contributions to the theory of causal inference.2

Introduction to DAG Notation

Using directed acyclic graphical (DAG) notation requires some upfront statements. Note: Sample includes individuals who tested for HIV and have demographic data. And in instances of ties, we simply take the average over all tied units. The residuals from this regression were then averaged for each applicant, with the resulting average residual earnings measure being used to implement a partialled-out future earnings variable according to the Frisch-Waugh-Lovell theorem.
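A sharp-RDD estimate of the jump at the cutoff, using separate linear fits on each side within a bandwidth, can be sketched as follows. The data-generating process, cutoff, and bandwidth here are all illustrative; real applications would use the bandwidth-selection and estimation routines in packages like the ones mentioned in the text:

```python
import random

def fit_line(xs, ys):
    # Simple OLS line fit; returns (intercept, slope).
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b1 = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
         sum((x - mx) ** 2 for x in xs)
    return my - b1 * mx, b1

def sharp_rdd(xs, ys, cutoff, bandwidth):
    """Estimate the discontinuity at the cutoff with separate local
    linear regressions on each side, using only observations within
    the bandwidth (the same bandwidth on both sides, for simplicity)."""
    left = [(x, y) for x, y in zip(xs, ys) if cutoff - bandwidth <= x < cutoff]
    right = [(x, y) for x, y in zip(xs, ys) if cutoff <= x <= cutoff + bandwidth]
    a_l, b_l = fit_line([x for x, _ in left], [y for _, y in left])
    a_r, b_r = fit_line([x for x, _ in right], [y for _, y in right])
    # Jump = limit from the right minus limit from the left, at the cutoff.
    return (a_r + b_r * cutoff) - (a_l + b_l * cutoff)

# Simulated sharp design: treatment switches on at X >= 50 with effect 2.
rng = random.Random(0)
xs = [rng.uniform(0, 100) for _ in range(4000)]
ys = [0.05 * x + (2 if x >= 50 else 0) + rng.gauss(0, 0.5) for x in xs]
print(round(sharp_rdd(xs, ys, cutoff=50, bandwidth=10), 2))  # close to the true jump of 2
```

Note how much data this needs: even with 4,000 observations, only the roughly 800 inside the bandwidth actually identify the jump, which is why the text stresses that useful RDD data sets are large.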