Causal data mining

Oct 3, 2020

In the era of big data, computational scientists use computational and machine learning techniques to detect and analyze patterns in large datasets. Because observational data generally lack exogenous variation, drawing valid causal conclusions from them is challenging. Although causal inference is central to economics and quantitative social science, existing approaches do not extend naturally to large-scale data with complex structure. For example, the independent variable of interest in causal inference is typically binary (i.e., simply treatment or control), but in big data it can be categorical, continuous, or multi-dimensional.

I am interested in “complex” causal inference, which aims to discover causal knowledge beyond the conventional potential outcomes framework with binary treatment variables. In close collaboration with leading social platforms such as WeChat and Facebook, I uncover causal knowledge from large-scale observational and experimental data.

For observational data, I leveraged a large-scale natural experiment [1] in which exogeneity is introduced by WeChat’s algorithm. The “group red packet” on WeChat is a monetary gift shared among multiple recipients. Importantly, WeChat’s algorithm randomly splits the gift amount, so each recipient receives a random share. This yields a treatment variable (the random amount received) that is exogenously assigned, which enables me to identify the impact of the amount received on multiple important outcome variables with high internal validity.
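The identification logic can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the analysis in [1]: the splitting rule (`split_packet`), the outcome model, the effect size of 0.3, and all function names are hypothetical. The point it shows is that comparing recipients within the same packet, where shares are random, recovers the effect even when group-level factors bias a naive comparison across packets.

```python
import random

def split_packet(total, n, rng):
    """Hypothetical splitting rule: cut [0, total] at n-1 uniform points."""
    cuts = sorted(rng.uniform(0, total) for _ in range(n - 1))
    edges = [0.0] + cuts + [total]
    return [edges[i + 1] - edges[i] for i in range(n)]

def simulate(num_groups=2000, true_effect=0.3, seed=0):
    """Simulated rows of (group_id, amount_received, outcome)."""
    rng = random.Random(seed)
    rows = []
    for g in range(num_groups):
        n = rng.randint(3, 8)            # number of recipients
        total = rng.uniform(5, 50)       # total gift amount
        # Group-level factor correlated with the total: a confounder
        # for any comparison made across packets.
        group_shock = 0.2 * total + rng.gauss(0, 5)
        for amount in split_packet(total, n, rng):
            outcome = true_effect * amount + group_shock + rng.gauss(0, 1)
            rows.append((g, amount, outcome))
    return rows

def ols_slope(xs, ys):
    """Slope of a simple least-squares regression of ys on xs."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den

def demean_within_groups(rows):
    """Subtract each packet's mean amount and mean outcome."""
    totals = {}
    for g, x, y in rows:
        n, sx, sy = totals.get(g, (0, 0.0, 0.0))
        totals[g] = (n + 1, sx + x, sy + y)
    xs = [x - totals[g][1] / totals[g][0] for g, x, _ in rows]
    ys = [y - totals[g][2] / totals[g][0] for g, _, y in rows]
    return xs, ys

rows = simulate()
naive = ols_slope([x for _, x, _ in rows], [y for _, _, y in rows])
estimate = ols_slope(*demean_within_groups(rows))
print(f"naive slope: {naive:.3f}, within-packet slope: {estimate:.3f}")
```

In this simulation the within-packet slope should land close to the assumed effect, while the naive slope is inflated by the group-level factor. Demeaning within packets is one simple way to condition on the total amount and the number of recipients; other specifications are possible.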

For experimental data, in collaboration with Facebook Core Data Science, I developed an approach to exploring and analyzing complex patterns of social contagion [2]. In an experiment on a social network, an individual’s outcome is affected not only by her own treatment assignment but also by the treatment assignments of her network neighbors (e.g., Facebook friends). The intervention is therefore complex, because it comprises the assignment conditions of both the individual and her neighbors, and analyzing this intervention space directly is challenging. Our study combines causal inference in networks with machine learning to provide a solution: we first characterize the treatment assignment conditions of the individual and her neighbors using labelled network motifs, and then apply a tree-based algorithm that helps researchers categorize the intervention automatically.
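The two steps described above can be sketched in a few lines. This is a toy illustration under stated assumptions, not the algorithm of [2]: the two motif features, the random graph, the simulated outcome, and the depth-one tree (`best_split`) are all hypothetical stand-ins chosen to keep the example self-contained.

```python
import random
from itertools import combinations

def motif_features(adj, treated, ego):
    """Two hypothetical labelled-motif features for one ego:
    the share of treated neighbors (dyads), and the share of connected
    neighbor pairs in which both neighbors are treated (closed triads)."""
    nbrs = sorted(adj[ego])
    dyad = sum(treated[j] for j in nbrs) / len(nbrs)
    closed = [(j, k) for j, k in combinations(nbrs, 2) if k in adj[j]]
    triad = (sum(treated[j] and treated[k] for j, k in closed) / len(closed)
             if closed else 0.0)
    return dyad, triad

def sse(values):
    """Sum of squared deviations from the mean."""
    if not values:
        return 0.0
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)

def best_split(X, y):
    """Depth-one regression tree: choose the (feature, threshold) pair
    that minimizes the summed squared error of the two leaf means."""
    best = None
    for f in range(len(X[0])):
        # Drop the largest value so the right-hand leaf is never empty.
        for t in sorted({row[f] for row in X})[:-1]:
            left = [yi for row, yi in zip(X, y) if row[f] <= t]
            right = [yi for row, yi in zip(X, y) if row[f] > t]
            score = sse(left) + sse(right)
            if best is None or score < best[0]:
                best = (score, f, t)
    return best[1], best[2]

# Toy random graph with Bernoulli(0.5) treatment assignment (assumption).
rng = random.Random(1)
n = 300
adj = {i: set() for i in range(n)}
for i, j in combinations(range(n), 2):
    if rng.random() < 0.03:
        adj[i].add(j)
        adj[j].add(i)
treated = {i: rng.random() < 0.5 for i in range(n)}

egos = [i for i in range(n) if adj[i]]  # skip isolated nodes
X = [motif_features(adj, treated, i) for i in egos]
# Simulated outcome (assumption): a spillover that switches on once
# more than half of an ego's neighbors are treated, plus noise.
y = [(1.0 if dyad > 0.5 else 0.0) + rng.gauss(0, 0.1) for dyad, _ in X]

feature, threshold = best_split(X, y)
print(f"split on feature {feature} at threshold {threshold:.3f}")
```

Here the tree partitions egos into exposure conditions based on their motif features, which is the role the tree-based step plays in the text. A full analysis would grow deeper trees, use richer motifs, and account for the dependence between neighboring units when estimating effects.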

In addition, the WeChat field experiment [3] contains 11 different treatment groups, each corresponding to a different motivation for prosocial behavior. This design helps us understand the underlying psychological and behavioral mechanisms rather than simply identify a causal effect. In [5], I proposed using hyper-realistic photos produced by generative adversarial networks to examine how users behave differently towards opponents associated with different photos. These photos are manipulated along gender, age, race, ethnicity, and facial expression, among other facial characteristics.

I will continue to study open questions in applied causal inference, including identifying complicated heterogeneous treatment effects, temporal patterns of treatment effects, and interaction effects among multiple treatments.

  • [1] Yuan, Y., Liu, T.X., Tan, C., Chen, Q., and Pentland, A.S., “Gift Contagion in Online Groups: Evidence From WeChat Red Packets”, R&R at Management Science (2020).
  • [2] Yuan, Y., Altenburger, K., and Kooti, F., “Causal Network Motifs: Identifying Heterogeneous Spillover Effects in A/B Tests”, submitted (2020).
  • [3] Yuan, Y., Nicolaides, C., Eckles, D., and Pentland, A., “Who motivates more workouts? Friends or Strangers”, working paper, presented at International Conference on Computational Social Science (2020).