STATISTICS LAB FOR CAUSAL & ROBUST MACHINE LEARNING

Causal Random Forests

How can one perform statistical inference using machine learning algorithms? In general this is still largely an open question, though for some specific tasks and questions of interest progress has been made. In this project you will learn how to use, adapt, and modify random forests for causal inference questions, pertaining to but not limited to average treatment effects.

Paper

By Susan Athey and Guido Imbens (Nobel laureate): the paper entitled "Recursive Partitioning for Heterogeneous Treatment Effects," published in PNAS in 2016.
Make sure to also read the Supplementary Materials to this paper.

Start getting familiar with the GitHub webpage that accompanies this paper (and two additional ones).

Zoom Meetings

The Zoom link is the same for all types of meetings:
https://ucsd.zoom.us/j/97835636701
  • Regular Meeting Time: Tuesdays @ 5pm
  • Office Hours: Thursdays @ 7pm
Contact Email: jbradic@ucsd.edu

Supplementary Readings

  • For an introduction to average treatment effects and the potential outcomes framework, please read the first 28 pages of these NOTES from Stefan Wager.
  • In addition, you could read Ch. 4 of these NOTES, up to Section 4.2.1.
  • This ML tutorial on causal inference might also be useful if you are interested.
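For readers new to the potential outcomes framework, here is a minimal Python sketch (entirely synthetic; the numbers and variable names are illustrative, not from any of the readings) of why randomization lets a simple difference in means estimate the average treatment effect, even though each unit reveals only one of its two potential outcomes:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Hypothetical potential outcomes Y(0) and Y(1) for each unit.
# The individual effect is tau_i = Y_i(1) - Y_i(0); here the true ATE is 2.0.
y0 = rng.normal(loc=1.0, scale=1.0, size=n)
y1 = y0 + 2.0 + 0.5 * rng.normal(size=n)

# Randomized treatment assignment: we observe only one outcome per unit.
w = rng.binomial(1, 0.5, size=n)
y_obs = np.where(w == 1, y1, y0)

# Under randomization, treated and control groups are comparable,
# so the difference in means is unbiased for the ATE.
ate_hat = y_obs[w == 1].mean() - y_obs[w == 0].mean()
print(round(ate_hat, 2))  # should be close to 2.0
```

The missing-data problem discussed in the meetings is visible here: `y0` and `y1` are never observed together for the same unit, which is why observational (non-randomized) assignment requires the more careful methods covered in the paper.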

Weekly Schedule and Summary


  1. Week 1: Read the introduction of the paper and the literature review. During the meeting we discussed, in general terms, what this algorithm does in terms of the notions of Honest Tree and Honest Inference, and briefly talked about other existing algorithms.
  2. Week 2: Read the sections entitled The Problem, Honest Inference for Population Averages, and The Honest Target. We discussed some misconceptions about the goal of the algorithms and how to best prepare for future meetings. During the meeting we discussed the missing-data problem at the center of causal inference and why any off-the-shelf algorithm will necessarily have biases.
  3. Week 3: Read the same sections as Week 2 and add the subsections entitled The Adaptive Target, The Implementation of CART, and Honest Splitting. During the meeting we discussed the random forest algorithm, including the construction of the tree/splits versus the estimators constructed within each leaf. We also discussed what the MSE is and its purpose in this paper (to evaluate the performance of forest/tree-based algorithms), and emphasized that the MSE is a population quantity which we do not know and typically need to estimate.
  4. Week 4: Reread the sections from Weeks 2 and 3 and fix the errors. Add to the reading the sections on Honest Inference for Treatment Effects, Modifying Conventional CART for Treatment Effects, and Modifying the Honest Approach, with a focus on adding details. Also read and get familiar with the GitHub webpage that accompanies this paper, linked above. During the meeting we discussed issues with random forest predictions, including how they do and do not relate to splitting. We discussed proper notation and how to always begin with what we want to estimate and only then with how it is being estimated. We also discussed the meaning and estimation of the mean squared error.
  5. Week 5: Read the section on Four Partitioning Estimators for Causal Effects and integrate it with the GitHub page and the readings of previous sections. Watch the video of Susan Athey talking about heterogeneous treatment effects and random forests, https://www.youtube.com/watch?v=oZoizsX3bts (also placed below for convenience). During the meeting we discussed the conditional expectation as a regression model, and imputed outcomes: why and how they lead to improvements over off-the-shelf methods. Issues of random forest estimation came up: leaves vs. splits. We discussed the checkpoint writeup, what should be included in it, and the plan for next week.
  6. Week 6: Start thinking about what kind of data you would like to work on for the Winter Quarter. Read the section on the Simulation Study and the real dataset from the GitHub page. During the meeting we discussed the implementation in R/Python and the setting of the Simulation Study. We also discussed ideas for the real data and the upcoming project, and settled on heterogeneity, high dimensions, and/or the random forest construction of the proposed new causal tree method.
  7. Week 7: Present ideas and proposals for the Final Project for 180B, the status of the implementation of the paper for 180A, and the status of the writeup (the skeleton of the proposal). During the meeting we discussed the structure of the writeup for 180A and gave detailed feedback on the current mid-term writeup. We also discussed some possible issues with the simulations and reviewed the simulation code in both R and Python; R seems to be better at integrating all the needed packages. The students' simulations are making progress; what remains is to repeat the simulations and discuss how the confidence intervals are constructed.
  8. Week 8: Refine ideas and showcase notes for the Final Project Proposal writeup. Begin coding up the algorithm for the Fall Quarter. During the meeting we discussed Causal Datasets in R (https://cran.r-project.org/web/packages/causaldata/causaldata.pdf) and a real dataset on voter turnout in this paper (Appendix section), which inspired us to collect our own similar dataset: https://www.pnas.org/doi/10.1073/pnas.1804597116#sec-4, with the original data here: https://www.povertyactionlab.org/evaluation/social-pressure-and-voter-turnout-united-states. Student suggestion: http://sbp-brims.org/2019/proceedings/papers/working_papers/Savas.pdf. It remains to be seen whether the same or similar data can be scraped or downloaded. The students successfully replicated the causal tree findings.
  9. Week 9: Showcase your code and some initial replication findings. Have the majority of the Fall writeup done and showcase it to get immediate feedback. During the meeting discussed ....
  10. Week 10: Small tune-ups for the project write-ups.
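The honest split/estimate separation that recurs throughout the schedule above can be sketched in a few lines of Python. This is an illustrative simplification, not the paper's algorithm: it grows a one-level "tree" by minimizing the within-leaf variance of a transformed-outcome proxy (with a known propensity of 0.5) on one half of the data, then estimates the within-leaf treatment effects on the held-out half, so the same observations never pick the split and estimate the effect.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4000
x = rng.uniform(-1, 1, size=n)
w = rng.binomial(1, 0.5, size=n)
tau = np.where(x > 0, 2.0, 0.0)        # heterogeneous effect: 0 left, 2 right
y = tau * w + rng.normal(size=n)

# Honest sample splitting: one half chooses the split, the other estimates.
half = n // 2
xs, ws, ys = x[:half], w[:half], y[:half]   # splitting sample
xe, we, ye = x[half:], w[half:], y[half:]   # estimation sample

# Transformed outcome: with known propensity e = 0.5, the proxy
# Y* = Y (W - e) / (e (1 - e)) has conditional mean tau(x).
proxy = ys * (ws - 0.5) / 0.25

# One-level "tree": pick the threshold minimizing within-leaf sum of
# squared errors of the proxy (a stand-in for the causal-tree criterion).
def sse(v):
    return ((v - v.mean()) ** 2).sum() if v.size else 0.0

thresholds = np.linspace(-0.9, 0.9, 37)
best = min(thresholds, key=lambda t: sse(proxy[xs <= t]) + sse(proxy[xs > t]))
print("split at", round(float(best), 2))

# Honest step: within each leaf, the effect estimate is a difference in
# means computed on the held-out sample only.
for m in (xe <= best, xe > best):
    tau_hat = ye[m & (we == 1)].mean() - ye[m & (we == 0)].mean()
    print(round(float(tau_hat), 2))  # roughly 0 for the left leaf, 2 for the right
```

The point the meetings kept returning to is visible in the last loop: the tree construction (the splits) and the estimators within each leaf are logically separate steps, and honesty means they use disjoint data.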

Susan Athey Guest Talk on Estimating Heterogeneous Treatment Effects

See Some Examples of Random Forests
