Turning Your Data Into a `CausalTable`

In Julia, most datasets are stored in a Table: a data structure with a Tables.jl-compatible interface. One of the main purposes of CausalTables.jl is to wrap a Table of data in Julia in order to provide it as input to some other causal inference package. Given a Table of some data, we can turn it into a CausalTable by specifying the treatment, response, and control variables.
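As a rough illustration of what "Tables.jl-compatible" means, the sketch below checks two plain Julia structures against the Tables.jl interface. The column names `A` and `Y` and their values are made up for this example; only functions from Tables.jl itself are used.

```julia
using Tables

# A NamedTuple of equal-length vectors is a column-oriented table
cols = (A = [0, 1, 1], Y = [2.0, 3.5, 4.1])

# A vector of NamedTuples is a row-oriented table
rows = [(A = 0, Y = 2.0), (A = 1, Y = 3.5), (A = 1, Y = 4.1)]

# Both satisfy the Tables.jl interface
cols_ok = Tables.istable(cols)
rows_ok = Tables.istable(rows)

# Any table can be converted to a column-oriented NamedTuple of vectors
converted = Tables.columntable(rows)
```

A `DataFrame` passes the same check, which is why the Boston Housing example can hand one directly to the constructor.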

Constructing the `CausalTable`

The code below provides an example of how to wrap the Boston Housing dataset as a CausalTable to answer causal questions of the form "How would changing nitrous oxide air pollution (NOX) within Boston-area towns affect median home value (MEDV)?" Any dataset in a Tables.jl-compliant format can be wrapped as a CausalTable. In this example, we turn a DataFrame from DataFrames.jl into a CausalTable object.

using CausalTables
using MLDatasets: BostonHousing
using DataFrames

# get data in a Tables.jl-compliant format
tbl = BostonHousing().dataframe

# Wrapping the dataset in a CausalTable
ctbl = CausalTable(tbl; treatment = :NOX, response = :MEDV)

When only treatment and response are specified, all other variables are assumed to be confounders. However, one can also explicitly specify the causes of both treatment and response by passing them as a NamedTuple of lists to the CausalTable constructor. In the example below, we specify the causes of the treatment NOX as [:CRIM, :INDUS] and the causes of the response MEDV as [:CRIM, :INDUS, :NOX].

ctbl = CausalTable(tbl; treatment = :NOX, response = :MEDV,
                        causes = (NOX = [:CRIM, :INDUS], MEDV = [:CRIM, :INDUS, :NOX]))

CausalTable
┌─────────┬─────────┬─────────┬───────┬─────────┬─────────┬─────────┬───────────
│    CRIM │      ZN │   INDUS │  CHAS │     NOX │      RM │     AGE │     DIS  ⋯
│ Float64 │ Float64 │ Float64 │ Int64 │ Float64 │ Float64 │ Float64 │ Float64  ⋯
├─────────┼─────────┼─────────┼───────┼─────────┼─────────┼─────────┼───────────
│ 0.00632 │    18.0 │    2.31 │     0 │   0.538 │   6.575 │    65.2 │    4.09  ⋯
│ 0.02731 │     0.0 │    7.07 │     0 │   0.469 │   6.421 │    78.9 │  4.9671  ⋯
│ 0.02729 │     0.0 │    7.07 │     0 │   0.469 │   7.185 │    61.1 │  4.9671  ⋯
│ 0.03237 │     0.0 │    2.18 │     0 │   0.458 │   6.998 │    45.8 │  6.0622  ⋯
│ 0.06905 │     0.0 │    2.18 │     0 │   0.458 │   7.147 │    54.2 │  6.0622  ⋯
│ 0.02985 │     0.0 │    2.18 │     0 │   0.458 │    6.43 │    58.7 │  6.0622  ⋯
│ 0.08829 │    12.5 │    7.87 │     0 │   0.524 │   6.012 │    66.6 │  5.5605  ⋯
│ 0.14455 │    12.5 │    7.87 │     0 │   0.524 │   6.172 │    96.1 │  5.9505  ⋯
│    ⋮    │    ⋮    │    ⋮    │   ⋮   │    ⋮    │    ⋮    │    ⋮    │    ⋮     ⋱
│ 0.23912 │     0.0 │    9.69 │     0 │   0.585 │   6.019 │    65.3 │  2.4091  ⋯
│ 0.17783 │     0.0 │    9.69 │     0 │   0.585 │   5.569 │    73.5 │  2.3999  ⋯
│ 0.22438 │     0.0 │    9.69 │     0 │   0.585 │   6.027 │    79.7 │  2.4982  ⋯
│ 0.06263 │     0.0 │   11.93 │     0 │   0.573 │   6.593 │    69.1 │  2.4786  ⋯
│ 0.04527 │     0.0 │   11.93 │     0 │   0.573 │    6.12 │    76.7 │  2.2875  ⋯
│ 0.06076 │     0.0 │   11.93 │     0 │   0.573 │   6.976 │    91.0 │  2.1675  ⋯
│ 0.10959 │     0.0 │   11.93 │     0 │   0.573 │   6.794 │    89.3 │  2.3889  ⋯
│ 0.04741 │     0.0 │   11.93 │     0 │   0.573 │    6.03 │    80.8 │   2.505  ⋯
└─────────┴─────────┴─────────┴───────┴─────────┴─────────┴─────────┴───────────
                                                  6 columns and 490 rows omitted
Summaries: NamedTuple()
Arrays: NamedTuple()

Note that a full representation of the causes of each variable (often referred to as a "directed acyclic graph") is not required, though it can be specified. Only the causes of the treatment and response are necessary as input; CausalTables.jl can automatically compute other types of variables one might be interested in, such as confounders or mediators.
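To give some intuition for how confounders can follow from the causes of the treatment and response alone, here is a minimal sketch in plain Julia. It assumes a confounder is a variable listed as a cause of both the treatment and the response; it is not the package's internal implementation.

```julia
# The same partial edgelist used in the example above
causes = (NOX = [:CRIM, :INDUS], MEDV = [:CRIM, :INDUS, :NOX])
treatment_name = :NOX
response_name = :MEDV

# Assumption: confounders are the variables that cause both treatment and response
confounder_names = intersect(causes[treatment_name], causes[response_name])

# The treatment itself should appear among the causes of the response
treatment_affects_response = treatment_name in causes[response_name]
```

Here `confounder_names` contains `:CRIM` and `:INDUS`, the two variables listed under both `NOX` and `MEDV`.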

Warning

When provided, the partial edgelist represented by causes assumes that if variable A is not listed as a cause of B, then no "causal path" exists between A and B – the two variables are uncorrelated. This differs slightly from the common definition of a directed acyclic graph edge in causal inference, where A can be considered a cause of B even if it only acts through another variable C. In this case, specify both A and C as causes of B in causes when constructing the CausalTable.
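One way to follow this convention when starting from a graph of direct causes only is to expand each variable's list to include every variable with a causal path into it. The sketch below does this in plain Julia; the variable names `A`, `B`, `C` and the helper `all_causes` are hypothetical and not part of CausalTables.jl.

```julia
# Direct causes only: A -> C -> B
direct = Dict(:C => [:A], :B => [:C])

# Collect every variable that has a causal path into `v` (hypothetical helper)
function all_causes(direct, v)
    seen = Symbol[]
    stack = copy(get(direct, v, Symbol[]))
    while !isempty(stack)
        u = pop!(stack)
        u in seen && continue
        push!(seen, u)
        # Also visit the causes of `u`, since they act on `v` through `u`
        append!(stack, get(direct, u, Symbol[]))
    end
    return sort(seen)
end

# Under the convention described in the warning, both A and C are listed for B
expanded = (B = all_causes(direct, :B), C = all_causes(direct, :C))
```

The resulting `expanded` NamedTuple has the same shape as the `causes` argument shown earlier, with `B` listing both `:A` and `:C`.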

After wrapping a dataset in a CausalTable object, the Tables.jl interface is available to call on the CausalTable as well. Below, we demonstrate a few of these functions, as well as additional utility functions for causal inference tasks made available by CausalTables.jl.

using Tables

# Examples of using the Tables.jl interface
Tables.getcolumn(ctbl, :NOX) # extract specific column
Tables.subset(ctbl, 1:5)     # extract specific rows
Tables.columnnames(ctbl)     # obtain all column names

(:CRIM, :ZN, :INDUS, :CHAS, :NOX, :RM, :AGE, :DIS, :RAD, :TAX, :PTRATIO, :B, :LSTAT, :MEDV)

In addition, the CausalTable object has several utility functions that can be used to extract different types of variables relevant to causal inference from the CausalTable object.

# Additional utility functions for CausalTables
treatment(ctbl)              # get CausalTable of treatment variables
response(ctbl)               # get CausalTable of response variables
treatmentparents(ctbl)       # get CausalTable of parents of the treatment
responseparents(ctbl)        # get CausalTable of parents of the response (treatment and confounders)

parents(ctbl, :NOX)          # get CausalTable of parents of a particular variable

confounders(ctbl)            # get CausalTable of confounders
mediators(ctbl)              # get CausalTable of mediators
instruments(ctbl)            # get CausalTable of instruments

data(ctbl)                   # get underlying wrapped dataset of a CausalTable

Although the CausalTable object is immutable, one can create a copy with new attribute values using the replace function. The code below demonstrates how to replace the response variable of the CausalTable object ctbl with :CRIM and its causes with nothing. Setting causes = nothing is a quick shortcut to specify that all unlabeled variables are confounders of the treatment-response relationship.

# Replace one or more attributes of the CausalTable.
# Setting `causes = nothing` is a quick shortcut to specify
# that all unlabeled variables are confounders of the treatment-response relationship
CausalTables.replace(ctbl; response = :CRIM, causes = nothing)
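The general pattern behind replacing fields of an immutable object can be sketched in plain Julia. The `Labels` struct and the `with_fields` function below are hypothetical stand-ins used only for illustration; they are not the CausalTables.jl implementation.

```julia
# Hypothetical immutable type with two labelled fields
struct Labels
    treatment::Symbol
    response::Symbol
end

# Build a new object, taking each field from kwargs when given, otherwise from `o`
function with_fields(o::Labels; kwargs...)
    vals = (get(kwargs, f, getfield(o, f)) for f in fieldnames(Labels))
    return Labels(vals...)
end

original = Labels(:NOX, :MEDV)
updated = with_fields(original; response = :CRIM)
```

The original object is left untouched; only the returned copy carries the new response label.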

Tables with Network-Dependent Units

The previous example assumes that each unit (row in the Table, in this case tbl) is "causally independent" of every other unit – that is, the treatment of one unit does not affect the response of any other unit. This is a component of the "stable unit treatment value assumption" (SUTVA) often used in causal inference. In some cases, however, we might work with data in which units are not causally independent, but rather in which one unit's variables depend on some summary function of its neighbors.

In this case, one must instead perform causal inference on the summary functions of each unit's neighbors (Aronow and Samii, 2017). To do this, each CausalTable has two relevant arguments that can be used to correct SUTVA violations. The arrays argument is a NamedTuple that can store adjacency matrices and other miscellaneous parameters that denote the causal relationships between variables. The summaries argument is a NamedTuple of NetworkSummary objects that can be used to summarize the network relationships between units by referencing variables in either the underlying data or the arrays argument of CausalTable (or both).

The code below provides an example of how such a CausalTable might be constructed to consider a summary function treatment in the case of causally-dependent units, using the Karate Club dataset. In this example, treatment is defined as the number of friends a club member has, denoted by the summary function parameter summaries = (friends = Friends(:F),). Hence, this answers the causal question "how would changing a subject's number of friends (friends) affect which club they are likely to join (labels_clubs)?"

We store the network relationships between units as an adjacency matrix F by assigning it to the arrays parameter. This allows the Friends(:F) summary function to access it when calling summarize(ctbl). More detail on the types of NetworkSummary that can be used in a dependent-data CausalTable can be found in Network Summaries.

using CausalTables
using MLDatasets
using Graphs

# Get a Table of Karate Club data from MLDatasets
data = KarateClub()
tbl = data.graphs[1].node_data

# Convert the karate club data into a Graphs.jl graph object
g = SimpleGraphFromIterator([Edge(x...) for x in zip(data.graphs[1].edge_index...)])

# Store the "friends" as an adjacency matrix in a NamedTuple
# Note that the input to arrays must be a NamedTuple, even if it holds only one array,
# so the trailing comma is necessary.
m = (F = Graphs.adjacency_matrix(g),)

# Construct a CausalTable with the adjacency matrix stored in `arrays` and a summary variable recording the number of friends
ctbl = CausalTable(tbl; treatment = :friends, response = :labels_clubs, arrays = m, summaries = (friends = Friends(:F),))

One can then call the function summarize(ctbl) to compute the values of the summary function on the causal table.
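As a rough picture of what a "number of friends" summary computes, the sketch below counts each unit's neighbors as a row sum of a small adjacency matrix. The four-person network is made up for this example, and treating the summary as a plain row sum is an assumption about how Friends behaves, not a statement of its implementation.

```julia
# Hypothetical undirected friendship network among four people
F = [0 1 1 0;
     1 0 0 0;
     1 0 0 1;
     0 0 1 0]

# Number of friends of each person: the sum across that person's row
friend_counts = vec(sum(F; dims = 2))
```

Person 1 is connected to persons 2 and 3, so the first entry of `friend_counts` is 2.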

Based on these summaries, it is also possible to extract two matrices from the CausalTable object: the adjacency_matrix and the dependency_matrix. The adjacency_matrix denotes which units are causally dependent upon one another: an entry of 1 in cell $(i,j)$ indicates that some variable in unit i exhibits a causal relationship to some variable in unit j. The dependency_matrix denotes which units are statistically dependent upon one another: an entry of 1 in cell $(i,j)$ indicates that the data of unit i is correlated with the data in unit j. Two units are correlated if they either are causally dependent (neighbors in the adjacency matrix) or share a common neighbor in the adjacency matrix.

CausalTables.adjacency_matrix(ctbl) # get adjacency matrix
CausalTables.dependency_matrix(ctbl) # get dependency matrix
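The dependency rule described above (neighbors, or units sharing a common neighbor) can be sketched with plain matrix operations. The network below is a made-up path 1-2-3-4, and the treatment of the diagonal is an assumption of this sketch rather than a description of what dependency_matrix returns.

```julia
# Hypothetical adjacency matrix for the path 1 - 2 - 3 - 4
A = [0 1 0 0;
     1 0 1 0;
     0 1 0 1;
     0 0 1 0]

# Entry (i, j) of A * A counts the neighbors that units i and j have in common
shares_neighbor = (A * A) .> 0

# Dependent if directly connected or sharing a neighbor.
# In this sketch every unit with at least one neighbor is also marked
# as dependent on itself, because the diagonal of A * A holds the degrees.
D = (A .> 0) .| shares_neighbor
```

Units 1 and 3 are dependent through their common neighbor 2, while units 1 and 4 are neither adjacent nor share a neighbor, so `D[1, 4]` is false.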

API

Base.replace — Method

replace(o::CausalTable; kwargs...)

Replace the fields of a CausalTable object with the provided keyword arguments.

Arguments

o::CausalTable: The CausalTable object to be replaced.
kwargs...: Keyword arguments specifying the new values for the fields.

Returns

A new CausalTable object with the specified fields replaced.
