Generating Data for Statistical Experiments

When evaluating a causal inference method, we often want to test it on data from a known causal model. CausalTables.jl allows us to define a DataGeneratingProcess (or DGP) to do just that.

Defining a DataGeneratingProcess

A data generating process describes a mechanism by which draws from random variables are simulated. It typically takes the form of a sequence of conditional distributions. CausalTables allows us to define a DGP as a DataGeneratingProcess object, which takes three arguments: the names of variables generated at each step, the types of these variables, and funcs, an array of functions of the form (; O...) -> *some code.

Suppose, for example, that we wanted to simulate data from the following DGP:

\[\begin{align*} W &\sim \text{DiscreteUniform}(1, 5) \\ X &\sim \text{Normal}(W, 1) \\ Y &\sim \text{Normal}(X + 0.2W, 1) \end{align*}\]

where X is the treatment, Y is the response, and W is a confounding variable affecting both X and Y. A verbose and inconvenient (albeit correct) way to define this DGP would be as follows:

using Distributions
using CausalTables

DataGeneratingProcess(
    [:W, :X, :Y],
    [:distribution, :distribution, :distribution],
    [
        (; O...) -> DiscreteUniform(1, 5), 
        (; O...) -> (@. Normal(O.W, 1)),
        (; O...) -> (@. Normal(O.X + 0.2 * O.W, 1))
    ]
)

where ; O... syntax is a shorthand for a function that takes keyword arguments corresponding to the names of the variables in the DGP.

However, a much more convenient way to define this DGP is using the @dgp macro, which takes a sequence of conditional distributions of the form [variable name] ~ Distribution(args...) and deterministic variable assignments of the form [variable name] = f(...) and automatically generates a valid DataGeneratingProcess. For example, the easier way to define the DGP above is as follows:

using CausalTables
distributions = @dgp(
        W ~ DiscreteUniform(1, 5),
        X ~ (@. Normal(W, 1)),
        Y ~ (@. Normal(X + 0.2 * W, 1))
    )

Note that with the @dgp macro, any symbol (that is, any string of characters prefixed by a colon, as in :W or :X) is automatically replaced with the corresponding previously-defined variable in the process. For instance, in Normal(:W, 1), the :W will be replaced automatically with the distribution we defined as W earlier in the sequence.

Defining a StructuralCausalModel

In CausalTables.jl, a StructuralCausalModel is a data generating process endowed with some causal interpretation. Constructing a StructuralCausalModel allows users to randomly draw a CausalTable with the necessary components from the DataGeneratingProcess they've defined. With the above DataGeneratingProcess in hand, we can define a StructuralCausalModel object like so – treatment, response, and confounder variables in the causal model are specified as keyword arguments to the DataGeneratingProcess constructor:

dgp = StructuralCausalModel(
    distributions;
    treatment = :X,
    response = :Y,
    confounders = [:W]
)

Networks of Causally-Connected Units

In some cases, we might work with data in which units may not be causally independent, but rather, in which one unit's variables could dependent on some summary function of its neighbors. Generating data from such a model can be done by adding lines of the form Xs $ NetworkSummary to the @dgp macro.

Here's an example of how such a StructuralCausalModel might be constructed:

using Graphs
using CausalTables
using Distributions

dgp = @dgp(
        W ~ DiscreteUniform(1, 5),
        n = length(W),
        A = Graphs.adjacency_matrix(erdos_renyi(n, 0.5)),
        Ws $ Sum(:W, :A),
        X ~ (@. Normal(Ws, 1)),
        Xs $ Sum(:X, :A),
        Y ~ (@. Normal(Xs + 0.2 * Ws, 1))
    )

scm = StructuralCausalModel(
    dgp;
    treatment = :X,
    response = :Y,
    confounders = [:W, :Ws]
)

API

CausalTables.DataGeneratingProcessType
mutable struct DataGeneratingProcess

A struct representing a data generating process.

Fields

  • names: An array of symbols representing the names of the variables.
  • types: An array of symbols representing the types of the variables.
  • funcs: An array of functions representing the generating functions for each variable.
source
CausalTables.StructuralCausalModelType
struct StructuralCausalModel

A struct representing a structural causal model (SCM). This includes a DataGeneratingProcess

Arguments

  • dgp::DataGeneratingProcess: The data generating process from which random data will be drawn.
  • treatment::Vector{Symbol}: The variables representing the treatment.
  • response::Vector{Symbol}: The variables representing the response.
  • confounders::Vector{Symbol}: The variables representing the confounders.
  • arraynames: Names of auxiliary variables used in the DataGeneratingProcess that are not included as "tabular" variables. Most commonly used to denote names of adjacency matrices used to compute summary functions of previous steps.
source
Base.randMethod
rand(scm::StructuralCausalModel, n::Int)

Generate random data from a Structural Causal Model (SCM) using the specified number of samples.

Arguments

  • scm::StructuralCausalModel: The Structural Causal Model from which to generate data.
  • n::Int: The number of samples to generate.

Returns

A CausalTable object containing the generated data.

source