1. Background

Difference-in-differences (DiD) is a quasi-experimental design that uses longitudinal data from treatment and control groups to construct an appropriate counterfactual for estimating a causal effect. DiD is typically used to estimate the effect of a specific intervention or treatment D_i (such as the passage of a law, the enactment of a policy, or a large-scale program implementation) by comparing the changes in outcomes Y_i (e.g. wages, health) over time between a population that is enrolled in a program (the treatment group) and a population that is not (the control group), as in:

\[ Y_{it} = \alpha_i + \lambda_t + \rho D_{it} + X_{it}'\beta + \epsilon_{it} \]

where α_i are individual fixed effects (characteristics of individuals that do not change over time), λ_t are time fixed effects, X_it are time-varying covariates such as individuals' age, and ϵ_it is an error term. Individuals and time are indexed by i and t, respectively. If the fixed effects are correlated with D_it, then estimating this regression by OLS without controlling for the fixed effects yields biased estimates.

To see the effect of a treatment, we would like to know the difference between a person's outcome in a world in which she received the treatment and in one in which she did not. Of course, only one of these is ever observable in practice. Therefore, we look for people with the same pre-treatment trends in the outcome. Suppose we have two periods t = 0, 1 and two groups s = A, B. Then, under the assumption that in the absence of treatment the trends in the treatment and control groups would have continued as before, we can estimate the treatment effect as:

\[ \rho = \left(E[Y_{ist} \mid s=A, t=1] - E[Y_{ist} \mid s=A, t=0]\right) - \left(E[Y_{ist} \mid s=B, t=1] - E[Y_{ist} \mid s=B, t=0]\right) \]

Graphically, this looks like two parallel pre-treatment trend lines for the two groups, with the treatment group's line jumping by ρ after the treatment.

You can calculate these means by hand: obtain the mean outcome of group A in both periods and take their difference, do the same for group B, and then the difference of these two differences is the treatment effect.
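As a quick sketch of this two-by-two calculation with made-up numbers (not from any real dataset):

```r
# Hypothetical group means: group A is treated, group B is the control
mean_A0 <- 10   # treatment group, pre-period
mean_A1 <- 15   # treatment group, post-period
mean_B0 <- 8    # control group, pre-period
mean_B1 <- 11   # control group, post-period

# Difference in differences: (A after - A before) - (B after - B before)
rho <- (mean_A1 - mean_A0) - (mean_B1 - mean_B0)
rho  # 5 - 3 = 2
```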

However, it is more convenient to do this in a regression framework, because this allows you to include control variables and to obtain standard errors for inference.

To do this, you can follow either of two equivalent strategies. Generate a treatment group dummy treat_i which is equal to 1 if a person is in group A (the treatment group) and 0 otherwise, generate a time dummy time_t which is equal to 1 if t = 1 and 0 otherwise, and then run the regression

\[ Y_{it} = \beta_1 + \beta_2 \, \text{treat}_i + \beta_3 \, \text{time}_t + \rho \, (\text{treat}_i \cdot \text{time}_t) + \epsilon_{it} \]

Alternatively, generate a single dummy T_it which equals 1 if a person is in the treatment group and the time period is the post-treatment period, and 0 otherwise. Then run the regression

\[ Y_{ist} = \gamma_s + \lambda_t + \rho T_{it} + \epsilon_{ist} \]

where γ_s are group fixed effects (dummies for each group) and λ_t are time fixed effects (dummies for each period). The two regressions give you the same results for two periods and two groups. The second equation is more general, though, as it extends easily to multiple groups and time periods. In either case, this is how you can estimate the difference-in-differences parameter in a way that lets you include control variables (left out of the equations above to avoid clutter, but you can simply include them) and obtain standard errors for inference.
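To see that the two strategies agree, here is a minimal sketch on simulated data (all numbers are made up): with two groups and two periods, the coefficient on the interaction term and the coefficient on the single treatment dummy are identical.

```r
set.seed(1)
n     <- 200
treat <- rep(c(0, 1), each = n / 2)   # group dummy: 1 = treatment group
time  <- rep(c(0, 1), times = n / 2)  # period dummy: 1 = post-treatment
y     <- 1 + 0.5 * treat + 0.3 * time + 2 * treat * time + rnorm(n)

fit1 <- lm(y ~ treat + time + treat:time)  # first strategy: interaction term
Tit  <- treat * time                       # second strategy: single dummy
fit2 <- lm(y ~ treat + time + Tit)

coef(fit1)["treat:time"]  # both recover the same estimate of rho
coef(fit2)["Tit"]
```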

2. Difference-in-Difference Estimation

In this tutorial, we will reproduce the famous minimum wage study of Card and Krueger (1994), a study known primarily for its use of an explicit counterfactual for estimation. Card and Krueger (1994) were interested in the effect of minimum wages on employment. Their strategy was a simple DiD between two neighboring states: New Jersey was set to raise its state minimum wage from $4.25 to $5.05, while neighboring Pennsylvania's minimum wage stayed at $4.25. They surveyed about 400 fast-food stores in New Jersey and Pennsylvania, both before and after the minimum wage increase, to measure the outcomes they cared about (i.e., employment).

The data ck1994.csv is based on the original data provided by Card and Krueger (1994).

The dataset contains the following variables: id (store identifier), emptot (full-time-equivalent employment), postperiod (1 for the post-increase survey wave), nj (1 for New Jersey stores), wage (wage in $ per hour), and chain and ownership indicators (control_burgerking, control_kfc, control_roys, control_co_owned).

First, we need to load packages and data:

library(magrittr) 
library(dplyr)
library(plotly)
library(pastecs)
ck1994 <- read.csv("ck1994.csv")  # assumes the file is in the working directory

Descriptive statistics: let's compute summary statistics that describe the main characteristics of the variables in the dataset.

stat.desc(ck1994, basic=F)
##                        id     emptot postperiod         nj       wage
## median       2.055000e+02 20.0000000 0.50000000 1.00000000 5.00000000
## mean         2.055000e+02 21.0265113 0.50000000 0.80731707 4.80571245
## SE.mean      4.135709e+00  0.3344007 0.01747141 0.01378167 0.01284082
## CI.mean.0.95 8.117838e+00  0.6564153 0.03429402 0.02705155 0.02520675
## var          1.402535e+04 88.7881393 0.25030525 0.15574615 0.12844663
## std.dev      1.184287e+02  9.4227458 0.50030516 0.39464687 0.35839452
## coef.var     5.762953e-01  0.4481364 1.00061031 0.48883751 0.07457677
##              control_co_owned control_burgerking control_kfc control_roys
## median             0.00000000         0.00000000  0.00000000   0.00000000
## mean               0.34390244         0.41707317  0.19512195   0.24146341
## SE.mean            0.01659816         0.01722944  0.01384765   0.01495450
## CI.mean.0.95       0.03257994         0.03381906  0.02718107   0.02935367
## var                0.22590905         0.24341999  0.15724113   0.18338247
## std.dev            0.47529891         0.49337612  0.39653642   0.42823180
## coef.var           1.38207483         1.18294859  2.03224916   1.77348522

2.1 Create Figures

We start by reproducing the figures shown in the paper.

Figure 1: Wage Before

st_wage_before_nj <- na.omit(ck1994$wage[ck1994$nj == 1 & ck1994$postperiod==0])
st_wage_before_pa <- na.omit(ck1994$wage[ck1994$nj == 0 & ck1994$postperiod==0])

Make a grouped bar plot with Plotly. First set up the bins and axes:

xbins <- list(start = 4.20, end = 5.60, size = 0.1)
xaxis <- list(tickvals = seq(4.25, 5.55, 0.1))
yaxis <- list(range = c(0, 50))

Plotly histogram

p <- plot_ly(alpha = 0.6) %>%
  add_histogram(x = st_wage_before_nj, xbins = xbins, histnorm = "percent", name = "Wage Before (New Jersey)") %>%
  add_histogram(x = st_wage_before_pa, xbins = xbins, histnorm = "percent", name = "Wage Before (Pennsylvania)") %>%
  layout(barmode = "group", title = "February 1992",
         xaxis = c(xaxis, list(title = "Wage in $ per hour")), yaxis = yaxis)
p

Figure 2: Wage After

st_wage_after_nj <- na.omit(ck1994$wage[ck1994$nj == 1 & ck1994$postperiod==1])
st_wage_after_pa <- na.omit(ck1994$wage[ck1994$nj == 0 & ck1994$postperiod==1])

Plotly histogram

p <- plot_ly(alpha = 0.6) %>%
  add_histogram(x = st_wage_after_nj, xbins = xbins, histnorm = "percent", name = "Wage After (New Jersey)") %>%
  add_histogram(x = st_wage_after_pa, xbins = xbins, histnorm = "percent", name = "Wage After (Pennsylvania)") %>%
  layout(barmode = "group", title = "November 1992",
         xaxis = c(xaxis, list(title = "Wage in $ per hour")), yaxis = yaxis)
p

2.2 DiD by hand

Compute the four means needed for the DiD calculation:

FTE employment in New Jersey after a minimum wage increase

emptot_nj_after = mean(ck1994$emptot[ck1994$nj == 1 & ck1994$postperiod == 1], na.rm = TRUE)
emptot_nj_after
## [1] 21.02743

FTE employment in New Jersey before a minimum wage increase

emptot_nj_before = mean(ck1994$emptot[ck1994$nj == 1 & ck1994$postperiod == 0], na.rm = TRUE)
emptot_nj_before
## [1] 20.43941

FTE employment in Pennsylvania after a minimum wage increase

emptot_pa_after = mean(ck1994$emptot[ck1994$nj == 0 & ck1994$postperiod == 1], na.rm = TRUE)
emptot_pa_after
## [1] 21.16558

FTE employment in Pennsylvania before a minimum wage increase

emptot_pa_before = mean(ck1994$emptot[ck1994$nj == 0 & ck1994$postperiod == 0], na.rm = TRUE)
emptot_pa_before
## [1] 23.33117

The effect of a minimum wage increase on employment:

did = (emptot_nj_after-emptot_nj_before) - (emptot_pa_after-emptot_pa_before)
did
## [1] 2.753606

2.3 A simple DiD Regression

Now we will run a regression to obtain the difference-in-differences estimate of the effect of a minimum wage increase on employment, using observations in New Jersey as the treatment group. This is exactly the same as what we did manually above, now using ordinary least squares. The regression equation is as follows:

\[ \text{emptot}_{it} = \beta_1 + \beta_2 \, \text{postperiod}_t + \beta_3 \, \text{nj}_i + \rho \, (\text{postperiod}_t \cdot \text{nj}_i) + \epsilon_{it} \]

ols <- lm(emptot ~ postperiod * nj, data = ck1994)
summary(ols)
## 
## Call:
## lm(formula = emptot ~ postperiod * nj, data = ck1994)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -21.166  -6.439  -1.027   4.473  64.561 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     23.331      1.072  21.767   <2e-16 ***
## postperiod      -2.166      1.516  -1.429   0.1535    
## nj              -2.892      1.194  -2.423   0.0156 *  
## postperiod:nj    2.754      1.688   1.631   0.1033    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.406 on 790 degrees of freedom
##   (26 observations deleted due to missingness)
## Multiple R-squared:  0.007401,   Adjusted R-squared:  0.003632 
## F-statistic: 1.964 on 3 and 790 DF,  p-value: 0.118

To be clear, the coefficient on the interaction term postperiod:nj is the value we are interesteded in. Its estimate (2.754) matches the value calculated manually above.

2.4 DiD Regression with Controls

Next, we add the chain and ownership dummies as control variables:

ols1 <- lm(emptot ~ postperiod * nj + control_burgerking + control_kfc + control_roys + control_co_owned, data = ck1994)
summary(ols1)
## 
## Call:
## lm(formula = emptot ~ postperiod * nj + control_burgerking + 
##     control_kfc + control_roys + control_co_owned, data = ck1994)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -24.196  -5.075  -1.196   3.272  64.219 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         24.8875     1.2044  20.663   <2e-16 ***
## postperiod          -2.2236     1.3677  -1.626   0.1044    
## nj                  -2.3766     1.0792  -2.202   0.0279 *  
## control_burgerking   1.0637     0.9292   1.145   0.2526    
## control_kfc         -9.3897     1.0658  -8.810   <2e-16 ***
## control_roys        -0.5613     1.0688  -0.525   0.5996    
## control_co_owned    -1.1685     0.7162  -1.632   0.1032    
## postperiod:nj        2.8451     1.5233   1.868   0.0622 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.484 on 786 degrees of freedom
##   (26 observations deleted due to missingness)
## Multiple R-squared:  0.1964, Adjusted R-squared:  0.1893 
## F-statistic: 27.45 on 7 and 786 DF,  p-value: < 2.2e-16

2.5 Regression: Panel Data

With panel data we can estimate the difference-in-differences effect using a fixed-effects regression with unit and period fixed effects, or, equivalently, using a regression with the dependent variable in first differences:

library(plm)
library(lmtest)
ck1994$Dit <- ck1994$nj * ck1994$postperiod
ck1994 <- pdata.frame(ck1994, index = c("id", "postperiod"))  # plm.data is deprecated
did.reg <- plm(emptot ~ postperiod + Dit, data = ck1994, model = "within")
coeftest(did.reg, vcov = function(x) vcovHC(x, cluster = "group", type = "HC1"))
## 
## t test of coefficients:
## 
##             Estimate Std. Error t value Pr(>|t|)  
## postperiod1  -2.2833     1.2465 -1.8319  0.06775 .
## Dit           2.7500     1.3359  2.0585  0.04022 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
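The first-difference route mentioned above can be sketched as follows (a simulated two-period panel with made-up numbers): differencing the outcome within each unit removes the unit fixed effect, and regressing the difference on the treatment dummy recovers the DiD effect.

```r
set.seed(3)
n     <- 100
alpha <- rnorm(n)                          # unit fixed effects
treat <- rep(c(0, 1), each = n / 2)        # treated in the post-period
y0    <- alpha + rnorm(n)                  # outcome in period 0
y1    <- alpha + 1 + 2 * treat + rnorm(n)  # common trend + true effect of 2
dy    <- y1 - y0                           # first difference: alpha cancels

fd <- lm(dy ~ treat)
coef(fd)["treat"]  # close to the true effect of 2
```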

3. Discussion

Why is the difference in differences estimator useful?

As stated before, DiD is a method for estimating treatment effects with non-experimental data; that is its most useful feature. DiD is also a version of fixed-effects estimation. Whereas the fixed-effects model assumes E(Y_0it | i, t) = α_i + λ_t, DiD makes a similar assumption at the group level: E(Y_0it | s, t) = γ_s + λ_t. So the expected value of the untreated outcome is the sum of a group effect and a time effect. What's the difference? For DiD you don't necessarily need panel data, as long as your repeated cross sections are drawn from the same aggregate unit s. This makes DiD applicable to a wider array of data than standard fixed-effects models, which require panel data.

Can we trust difference in differences?

The most important assumption in DiD is the parallel trends assumption. Never trust a study that does not show these trends graphically! Papers in the 1990s might have gotten away with this, but our understanding of DiD is much better nowadays. If there is no convincing graph showing parallel pre-treatment trends in the outcome for the treatment and control groups, be cautious. If the parallel trends assumption holds and we can credibly rule out any other time-varying changes that may confound the treatment, then DiD is a trustworthy method. A further word of caution concerns the treatment of standard errors. With many years of data, you need to adjust the standard errors for autocorrelation. This was long neglected, but since Bertrand et al. (2004), "How Much Should We Trust Differences-In-Differences Estimates?", we know that this is an issue. The paper provides several remedies for dealing with autocorrelation. The easiest is to cluster on the individual panel identifier, which allows arbitrary correlation of the residuals within each individual time series and corrects for both autocorrelation and heteroscedasticity.
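The clustering remedy can be sketched like this (a simulated panel with a unit-level error component that induces serial correlation; assumes the `sandwich` package is installed alongside `lmtest`):

```r
library(lmtest)
library(sandwich)

set.seed(2)
n_id  <- 200
id    <- rep(seq_len(n_id), each = 2)            # 200 units, 2 periods each
time  <- rep(c(0, 1), times = n_id)              # pre / post period
treat <- rep(rep(c(0, 1), each = n_id / 2), each = 2)  # half the units treated
u     <- rep(rnorm(n_id), each = 2)              # unit effect -> autocorrelation
y     <- 1 + treat + time + 2 * treat * time + u + rnorm(2 * n_id)
d     <- data.frame(y, id, time, treat)

fit <- lm(y ~ treat * time, data = d)
# Cluster-robust standard errors, clustered on the panel identifier
coeftest(fit, vcov = vcovCL(fit, cluster = d$id))
```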