The idea with propensity score methods is to compare units who, based on observables, had very similar probabilities of being placed into the treatment group even though those units differed with regards to actual treatment assignment. If conditional on X, two units have the same probability of being treated, then we say they have similar propensity scores. If two units have the same propensity score, but one is the treatment group and the other is not, and the conditional independence assumption credibly holds in the data, then differences between their observed outcomes are attributable to the treatment.
In this tutorial, we aim to examine the effect of the mother smoked during pregnancy on birth weight of baby. Because the birth weight of babies who have the mother smoked on average are different from babies who have the mother does not smoke, we will use propensity score matching method to get more creditable casual estimates of smoking during pregnancy.
Variables:
First, we need to load packages and data:
library(MatchIt)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
birth_weight <- read.csv("C:/Users/DuyTung/Desktop/Birth_Weight.csv")
To view the first few records in the dataset
head(birth_weight)
## bweight mbsmoke mmarried mage medu fbaby
## 1 3459 0 1 24 14 0
## 2 3260 0 0 20 10 0
## 3 3572 0 1 22 9 0
## 4 2948 0 1 26 12 0
## 5 2410 0 1 20 12 1
## 6 3147 0 0 27 12 1
Let’s do a summary statistic that quantitatively describes main characteristics of variables in the data set.
library(pastecs)
##
## Attaching package: 'pastecs'
## The following objects are masked from 'package:dplyr':
##
## first, last
stat.desc(birth_weight, basic=F)
## bweight mbsmoke mmarried mage medu
## median 3.390000e+03 0.000000000 1.000000000 26.00000000 12.00000000
## mean 3.361680e+03 0.186126670 0.699698406 26.50452391 12.68957346
## SE.mean 8.495534e+00 0.005713167 0.006728658 0.08247237 0.03699661
## CI.mean.0.95 1.665528e+01 0.011200523 0.013191368 0.16168505 0.07253094
## var 3.350322e+05 0.151516173 0.210165822 31.57345507 6.35373311
## std.dev 5.788196e+02 0.389250784 0.458438460 5.61902617 2.52066124
## coef.var 1.721817e-01 2.091321920 0.655194375 0.21200253 0.19864034
## fbaby
## median 0.00000000
## mean 0.43795778
## SE.mean 0.00728274
## CI.mean.0.95 0.01427763
## var 0.24620380
## std.dev 0.49618928
## coef.var 1.13296145
birth_weight %>%
group_by(mbsmoke) %>%
summarise(obs = n(),
mean = mean(bweight),
std_error = sd(bweight)/sqrt(obs))
## # A tibble: 2 x 4
## mbsmoke obs mean std_error
## <int> <int> <dbl> <dbl>
## 1 0 3778 3413. 9.28
## 2 1 864 3138. 19.1
t-test
with(birth_weight, t.test(bweight ~ mbsmoke))
##
## Welch Two Sample t-test
##
## data: bweight by mbsmoke
## t = 12.971, df = 1303.3, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 233.6210 316.8827
## sample estimates:
## mean in group 0 mean in group 1
## 3412.912 3137.660
The result of t-test shows that the birth weight of babies who have the mother smoked are significantly lower from babies who have the mother does not smoke. Should we conclude that the treatment (smoking during pregnancy) is dangerous? Maybe! However, is this a reliable result? Before answering this question, we need to check whether there is a difference in covariates between two groups.
Here is the mean for each covariate by the treatment status:
bw_cov <- c("mmarried","mage","medu", "fbaby")
birth_weight %>%
group_by(mbsmoke) %>%
select(one_of(bw_cov)) %>%
summarise_all(funs(mean(., na.rm = T)))
## Adding missing grouping variables: `mbsmoke`
## Warning: funs() is soft deprecated as of dplyr 0.8.0
## Please use a list of either functions or lambdas:
##
## # Simple named list:
## list(mean = mean, median = median)
##
## # Auto named with `tibble::lst()`:
## tibble::lst(mean, median)
##
## # Using lambdas
## list(~ mean(., trim = .2), ~ median(., na.rm = TRUE))
## This warning is displayed once per session.
## # A tibble: 2 x 5
## mbsmoke mmarried mage medu fbaby
## <int> <dbl> <dbl> <dbl> <dbl>
## 1 0 0.751 26.8 12.9 0.453
## 2 1 0.473 25.2 11.6 0.372
In addition, we carry out t-tests to evaluate whether these means are statistically distinguishable: The ‘mmarried’ variable
with(birth_weight, t.test(mmarried ~ mbsmoke))
##
## Welch Two Sample t-test
##
## data: mmarried by mbsmoke
## t = 15.118, df = 1175.9, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.2419887 0.3141636
## sample estimates:
## mean in group 0 mean in group 1
## 0.7514558 0.4733796
The ‘mage’ variable
with(birth_weight, t.test(mage ~ mbsmoke))
##
## Welch Two Sample t-test
##
## data: mage by mbsmoke
## t = 8.1218, df = 1348, p-value = 1.028e-15
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 1.246770 2.040861
## sample estimates:
## mean in group 0 mean in group 1
## 26.81048 25.16667
The ‘medu’ variable
with(birth_weight, t.test(medu ~ mbsmoke))
##
## Welch Two Sample t-test
##
## data: medu by mbsmoke
## t = 15.279, df = 1454.4, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 1.125228 1.456708
## sample estimates:
## mean in group 0 mean in group 1
## 12.92986 11.63889
The ‘fbaby’ variable
with(birth_weight, t.test(fbaby ~ mbsmoke))
##
## Welch Two Sample t-test
##
## data: fbaby by mbsmoke
## t = 4.4517, df = 1314.6, p-value = 9.24e-06
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.04565306 0.11759102
## sample estimates:
## mean in group 0 mean in group 1
## 0.4531498 0.3715278
Now let’s prepare a logit (probit) model to estimate the propensity scores. That is, we will use the covariates to predict the probability of being exposed (smoking during pregnancy). The more true covariates we use, the better our prediction of the probability of being exposed. We calculate a propensity score for all subjects, exposed and unexposed. In particular, we will create the logit model of the treatment (i.e. smoking during pregnancy) regressed upon the following covariates: mmarried, mage, medu, and fbaby.
pscores.model <- glm(mbsmoke ~ mmarried + mage + medu + fbaby, family = binomial("logit"), data = birth_weight)
summary(pscores.model)
##
## Call:
## glm(formula = mbsmoke ~ mmarried + mage + medu + fbaby, family = binomial("logit"),
## data = birth_weight)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.6037 -0.6002 -0.4817 -0.3744 2.3845
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.024681 0.244809 4.186 2.84e-05 ***
## mmarried -1.036252 0.090000 -11.514 < 2e-16 ***
## mage -0.003663 0.008388 -0.437 0.662
## medu -0.127238 0.016766 -7.589 3.23e-14 ***
## fbaby -0.476541 0.087178 -5.466 4.59e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 4461.5 on 4641 degrees of freedom
## Residual deviance: 4115.2 on 4637 degrees of freedom
## AIC: 4125.2
##
## Number of Fisher Scoring iterations: 4
Using this model, we can now calculate the propensity score for each mother. It is simply the mother’s predicted probability of being Treated, given the estimates from the logit model. Below, we calculate this propensity score using fitted.values and add the predicted propensity score into the original dataframe.
birth_weight$PScores <- pscores.model$fitted.values
After estimating the propensity score, it is useful to plot histograms of the estimated propensity scores by treatment status. We create histograms showing the density of propensity score distribution in the treated and control groups.
hist(birth_weight$PScores[birth_weight$mbsmoke==1],main = "PScores of Response = 1")
hist(birth_weight$PScores[birth_weight$mbsmoke==0],main = "PScores of Response = 0")
We can now proceed with the matching algorithms with our ‘pscores.model’ and the estimated propensity scores. The matching algorithms create sets of participants for treatment and control groups. A matched set will consist of at least one person from the treatment group (i.e., the smoked mother) and one from the control group (i.e., the non-smoked mothers) with similar propensity scores. The basic goal is to approximate a random experiment, eliminating many of the problems that come with observational data analysis.
There are various matching algorithms in R, namely, exact matching, nearest neighbor, optimal matching, full matching and caliper matching. Let’s try the Nearest Neighbour Matching algorithm. Nearest Matching is a technique that matches individuals with controls (it could be two or more controls per treated unit) based on a distance. We use the package MatchIt for this. This package estimates the propensity score in the background and then matches observations based on the method of choice (“nearest” in this case).
match_nearest <- matchit(pscores.model, method="nearest", radio=1,data=birth_weight)
We can get some information about how successful the matching was using summary() and plot()
summary(match_nearest)
##
## Call:
## matchit(formula = pscores.model, data = birth_weight, method = "nearest",
## radio = 1)
##
## Summary of balance for all data:
## Means Treated Means Control SD Control Mean Diff eQQ Med eQQ Mean
## distance 0.2484 0.1719 0.1028 0.0765 0.0619 0.0764
## mmarried 0.4734 0.7515 0.4322 -0.2781 0.0000 0.2778
## mage 25.1667 26.8105 5.6455 -1.6438 2.0000 1.6863
## medu 11.6389 12.9299 2.5344 -1.2910 1.0000 1.2836
## fbaby 0.3715 0.4531 0.4979 -0.0816 0.0000 0.0810
## eQQ Max
## distance 0.2293
## mmarried 1.0000
## mage 3.0000
## medu 6.0000
## fbaby 1.0000
##
##
## Summary of balance for matched data:
## Means Treated Means Control SD Control Mean Diff eQQ Med eQQ Mean
## distance 0.2484 0.2483 0.1205 0.0001 0 0.0004
## mmarried 0.4734 0.4826 0.5000 -0.0093 0 0.0093
## mage 25.1667 24.9514 5.4384 0.2153 0 0.2616
## medu 11.6389 11.6053 2.3053 0.0336 0 0.0984
## fbaby 0.3715 0.3634 0.4813 0.0081 0 0.0081
## eQQ Max
## distance 0.0162
## mmarried 1.0000
## mage 3.0000
## medu 5.0000
## fbaby 1.0000
##
## Percent Balance Improvement:
## Mean Diff. eQQ Med eQQ Mean eQQ Max
## distance 99.8182 100 99.4529 92.9373
## mmarried 96.6702 0 96.6667 0.0000
## mage 86.9038 100 84.4887 0.0000
## medu 97.4000 100 92.3354 16.6667
## fbaby 90.0739 0 90.0000 0.0000
##
## Sample sizes:
## Control Treated
## All 3778 864
## Matched 864 864
## Unmatched 2914 0
## Discarded 0 0
plot(match_nearest, type="hist")
To create a dataframe containing only the matched observations, we can use the match.data() function:
data_matched <- match.data(match_nearest)
dim(data_matched)
## [1] 1728 9
Let’s do a summary statistic for covariates in the matched sample
library(tableone)
table_match <- CreateTableOne(vars = bw_cov,strata = "mbsmoke",data = data_matched,test = FALSE)
print(table_match, smd = TRUE)
## Stratified by mbsmoke
## 0 1 SMD
## n 864 864
## mmarried (mean (SD)) 0.48 (0.50) 0.47 (0.50) 0.019
## mage (mean (SD)) 24.95 (5.44) 25.17 (5.30) 0.040
## medu (mean (SD)) 11.61 (2.31) 11.64 (2.17) 0.015
## fbaby (mean (SD)) 0.36 (0.48) 0.37 (0.48) 0.017
The first thing we see above is that we matched 864 controls to 864 treated subjects and also we can see we get very small numbers for SMD in the covariates, hence we can conclude that we did a pretty good job of matching with nearest neighbour. We have a great balance in the dataset now to proceed with our further analysis.
We’ll do more formal assessment for covariate balance in the matched sample by: visual inspection and t-tests of difference-in-means.
It is useful to plot the mean of each covariate against the estimated propensity score, separately by treatment status. If matching is done well, the treatment and control groups will have (near) identical means of each covariate at each value of the propensity score. Below is an example using these covariates in our model. Here I use a loess smoother to estimate the mean of each covariate, by treatment status, at each value of the propensity score.
fn_bal <- function(dta, variable) {
dta$variable <- dta[, variable]
dta$mbsmoke <- as.factor(dta$mbsmoke)
support <- c(min(dta$variable), max(dta$variable))
ggplot(dta, aes(x = distance, y = variable, color = mbsmoke)) +
geom_point(alpha = 0.2, size = 1.3) +
geom_smooth(method = "loess", se = F) +
xlab("Propensity score") +
ylab(variable) +
theme_bw() +
ylim(support)
}
library(gridExtra)
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
grid.arrange(
fn_bal(data_matched, "mmarried") + theme(legend.position = "none"),
fn_bal(data_matched, "mage"),
fn_bal(data_matched, "medu"),
fn_bal(data_matched, "fbaby") + theme(legend.position = "none"),
nrow = 3, widths = c(1, 0.8)
)
## Warning: Removed 72 rows containing missing values (geom_smooth).
## Warning: Removed 6 rows containing missing values (geom_smooth).
## Warning: Removed 65 rows containing missing values (geom_smooth).
We will test the difference in covariates between groups after matching using t-tests. Ideally, we should not be able to reject the null hypothesis of no mean difference for each covariate:
with(data_matched, t.test(mmarried ~ mbsmoke))
##
## Welch Two Sample t-test
##
## data: mmarried by mbsmoke
## t = 0.38507, df = 1726, p-value = 0.7002
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.03790285 0.05642137
## sample estimates:
## mean in group 0 mean in group 1
## 0.4826389 0.4733796
with(data_matched, t.test(mage ~ mbsmoke))
##
## Welch Two Sample t-test
##
## data: mage by mbsmoke
## t = -0.83319, df = 1724.9, p-value = 0.4049
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.7220458 0.2914903
## sample estimates:
## mean in group 0 mean in group 1
## 24.95139 25.16667
with(data_matched, t.test(medu ~ mbsmoke))
##
## Welch Two Sample t-test
##
## data: medu by mbsmoke
## t = -0.31178, df = 1719.5, p-value = 0.7552
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.2447155 0.1775858
## sample estimates:
## mean in group 0 mean in group 1
## 11.60532 11.63889
with(data_matched, t.test(fbaby ~ mbsmoke))
##
## Welch Two Sample t-test
##
## data: fbaby by mbsmoke
## t = -0.34909, df = 1726, p-value = 0.7271
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.05362164 0.03741794
## sample estimates:
## mean in group 0 mean in group 1
## 0.3634259 0.3715278
We will now do a t test to test our hypothesis that there is an negative effect of smoking during pregnancy on birth weight of baby:
with(data_matched, t.test(bweight ~ mbsmoke))
##
## Welch Two Sample t-test
##
## data: bweight by mbsmoke
## t = 7.4043, df = 1698.7, p-value = 2.066e-13
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 157.1736 270.4491
## sample estimates:
## mean in group 0 mean in group 1
## 3351.471 3137.660
From the results above we see a very small p-value that makes the model highly significant. The birth weight of babies who have the mother smoked on average are 3137.66 gram, while the birth weight of babies who have the mother does not smoke on average are 3355.99 gram.
Further, we can do OLS regression with or without covariates using the matched sample: OLS Regression without covariates
ols1 <- lm(bweight ~ mbsmoke, data_matched)
summary(ols1)
##
## Call:
## lm(formula = bweight ~ mbsmoke, data = data_matched)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3011.47 -316.02 48.53 376.59 1880.34
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3351.47 20.42 164.135 < 2e-16 ***
## mbsmoke -213.81 28.88 -7.404 2.05e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 600.2 on 1726 degrees of freedom
## Multiple R-squared: 0.03079, Adjusted R-squared: 0.03022
## F-statistic: 54.82 on 1 and 1726 DF, p-value: 2.051e-13
OLS Regression with covariates
ols2 <- lm(bweight ~ mbsmoke+mmarried + mage + medu + fbaby, data_matched)
summary(ols2)
##
## Call:
## lm(formula = bweight ~ mbsmoke + mmarried + mage + medu + fbaby,
## data = data_matched)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3096.23 -301.59 34.19 369.30 1917.14
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3243.827 95.594 33.933 < 2e-16 ***
## mbsmoke -212.276 28.688 -7.400 2.13e-13 ***
## mmarried 147.406 31.109 4.738 2.33e-06 ***
## mage -3.024 3.044 -0.993 0.321
## medu 8.948 6.833 1.309 0.191
## fbaby 22.283 31.332 0.711 0.477
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 596 on 1722 degrees of freedom
## Multiple R-squared: 0.04652, Adjusted R-squared: 0.04375
## F-statistic: 16.8 on 5 and 1722 DF, p-value: 3.04e-16