Is Julia Ready for Statisticians?

Salvador Balkus

Who Am I?

  • 3rd year PhD in Biostatistics
  • Advised by Nima Hejazi, affiliated with NSAPH (Dominici lab)
  • Background in computer/data science
  • Experience in…
    • R
    • Python
    • Java
    • MATLAB
    • C
    • Mathematica

What is Julia?

“We want a language that’s open source, with a liberal license. We want the speed of C with the dynamism of Ruby. We want a language that’s homoiconic, with true macros like Lisp, but with obvious, familiar mathematical notation like Matlab. We want something as usable for general programming as Python, as easy for statistics as R, as natural for string processing as Perl, as powerful for linear algebra as Matlab, as good at gluing programs together as the shell. Something that is dirt simple to learn, yet keeps the most serious hackers happy. We want it interactive and we want it compiled.”

Bezanson et al. (2012)

What is Julia?

Julia: “runs like C but reads like Python” (Perkel 2019)

The Experiment

Two years ago (Fall 2022) I switched to using Julia for almost all statistical tasks.

Classes

  • BST 232: Algorithms
  • BST 245: Multivariate and Longitudinal
  • BST 249: Bayesian

Research

  • Observational Causal Inference on Networks
  • Debiased Machine Learning
  • CausalTables.jl

What, exactly, is different about Julia?

What I Won’t Be Covering

  • Speed (it’s fast)
  • Unicode support
β₀ = 2.5
β₁ = 0.2
x = 9

y = β₀ + β₁ * x

My Scoring System: Julia vs R

On each slide, I will score Julia on its ease of use in various domains compared to R.

JScore values:

  • 4: Highly recommend Julia over all other languages
  • 3: Julia is much better for some use cases
  • 2: Fine, but not a reason to choose Julia over others
  • 1: Avoid using in Julia unless necessary for other tasks

Round 1: Scientific Practice

IDE — JScore: 2

Literate Programming — JScore: 2

Literate.jl

Don’t forget to set engine: julia in the document header!

Reproducibility — JScore: 4

Dependency Management

  • Automatically tracks dependencies in Project.toml
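For instance, the standard Pkg workflow records everything automatically (a minimal sketch using Julia’s built-in Pkg API):

using Pkg
Pkg.activate(".")        # activate this project's environment
Pkg.add("DataFrames")    # recorded in Project.toml; exact versions pinned in Manifest.toml
Pkg.instantiate()        # on another machine: recreate the exact same environment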

Reproducibility — JScore: 4

Julia supports test- and doc-specific dependencies
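For instance, test and docs dependencies live in their own sub-environments (a sketch of the conventional package layout):

MyPackage/
├── Project.toml         # the package's own dependencies
├── test/Project.toml    # test-only dependencies (e.g. Test)
└── docs/Project.toml    # docs-only dependencies (e.g. Documenter)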

Reproducibility — JScore: 4

Documentation
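Docstrings attach directly to code and are rendered into a website by Documenter.jl; a minimal sketch:

"""
    add_one(x)

Return `x + 1`. This docstring is searchable from the REPL via `?add_one`
and rendered to HTML by Documenter.jl.
"""
add_one(x) = x + 1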

Reproducibility — JScore: 4

Running Experiments

using DrWatson
initialize_project("Example")      # scaffold folders, Project.toml, and a git repo

@quickactivate "Example"           # activate the project's environment

params = @strdict a b v method     # collect local variables into a labeled Dict
...
path = datadir("sims", "1.jld2")   # path inside the project's data/ directory
wsave(path, sim1)                  # save results, creating directories as needed

Science Score: 8/12 (Better than R)

Round 2: Data Analysis

Data Processing (Iris) — JScore: 1

R: dplyr

library(dplyr)
data("iris")

iris |> group_by(Species) |> 
    summarize(a = mean(Petal.Length) / mean(Sepal.Length),
              b = sum(Petal.Length))

Julia: DataFrames.jl

using CSV, DataFrames, StatsBase
path = joinpath(pkgdir(DataFrames), "docs", "src", "assets", "iris.csv")
iris = CSV.read(path, DataFrame)

iris_gdf = groupby(iris, :Species)
combine(iris_gdf, [:PetalLength, :SepalLength] =>
            ((p, s) -> (a = mean(p)/mean(s), b = sum(p))) => AsTable)

Plotting

R: ggplot2

library(ggplot2)
p = ggplot(iris) + 
    geom_point(aes(
        x = Petal.Length,
        y = Sepal.Length,
        color = Species
    ))

Plotting — JScore: 2

Julia: Plots.jl

using StatsPlots; gr()
@df iris scatter(
    :PetalLength, :SepalLength,
    group = :Species, size = (600, 400))

Plotting

Julia supports multiple “backends”

(The same plot, rendered interactively with the Plotly backend, was embedded here.)
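Switching is a single call, and the same plotting code renders interactively (a sketch, assuming the Plotly backend dependencies are installed):

using StatsPlots
plotly()    # switch from GR to the interactive Plotly backend
@df iris scatter(:PetalLength, :SepalLength, group = :Species)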

GLMs — JScore: 1

R: glm function – perfect

Julia: GLM.jl – terrible

using GLM, DataFrames
# Float32 columns are enough to trip GLM's internal Float64 assumptions:
df = DataFrame(x = randn(Float32, 10),
               y = randn(Float32, 10))
lm(@formula(y ~ x), df)
MethodError: MethodError(GLM.delbeta!, (GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}, Vector{Int64}}}([1.0 3.055166006088257; 1.0 0.6637405753135681; … ; 1.0 -0.5296103954315186; 1.0 -1.343801736831665], [0.0, 0.0], [0.0, 0.0], [0.0, 0.0], LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}, Vector{Int64}}([4.002796686977798 0.6608336050292363; 2.645182564854622 3.092458398501759], 'U', [2, 1], 2, -1.0, 0), [6.9173836729581e-310 6.9173807184827e-310; 6.9173836729581e-310 6.91722498160464e-310; … ; 6.9173836729581e-310 6.9173836729581e-310; 6.9173836725905e-310 0.0], [3.4766779039175e-310 8.4879831683e-314; 4.243991583e-314 6.9172818793886e-310]), Float32[1.2254993, 0.53429693, -0.3176004, -0.94373393, -0.8283784, 0.3207153, 0.7854858, -0.18182965, -0.5882426, 0.28917944]), 0x0000000000007bfd)
MethodError: no method matching delbeta!(::GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}, Vector{Int64}}}, ::Vector{Float32})

Closest candidates are:
  delbeta!(!Matched::GLM.SparsePredChol{T}, ::Vector{T}) where T
   @ GLM ~/.julia/packages/GLM/vM20T/src/linpred.jl:220
  delbeta!(!Matched::GLM.SparsePredChol{T}, ::Vector{T}, !Matched::Vector{T}) where T
   @ GLM ~/.julia/packages/GLM/vM20T/src/linpred.jl:213
  delbeta!(::GLM.DensePredChol{T, <:LinearAlgebra.CholeskyPivoted}, !Matched::Vector{T}, !Matched::Vector{T}) where T<:Union{Float32, Float64}
   @ GLM ~/.julia/packages/GLM/vM20T/src/linpred.jl:159
  ...

Stacktrace:
 [1] fit!(obj::LinearModel{GLM.LmResp{Vector{Float32}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}, Vector{Int64}}}})
   @ GLM ~/.julia/packages/GLM/vM20T/src/lm.jl:94
 [2] fit(::Type{LinearModel}, X::Matrix{Float64}, y::Vector{Float32}, allowrankdeficient_dep::Nothing; wts::Vector{Float32}, dropcollinear::Bool)
   @ GLM ~/.julia/packages/GLM/vM20T/src/lm.jl:143
 [3] fit(::Type{LinearModel}, X::Matrix{Float64}, y::Vector{Float32}, allowrankdeficient_dep::Nothing)
   @ GLM ~/.julia/packages/GLM/vM20T/src/lm.jl:134
 [4] fit(::Type{LinearModel}, f::FormulaTerm{Term, Term}, data::DataFrame, args::Nothing; contrasts::Dict{Symbol, Any}, kwargs::@Kwargs{})
   @ StatsModels ~/.julia/packages/StatsModels/mPD8T/src/statsmodel.jl:88
 [5] fit
   @ ~/.julia/packages/StatsModels/mPD8T/src/statsmodel.jl:78 [inlined]
 [6] lm
   @ ~/.julia/packages/GLM/vM20T/src/lm.jl:157 [inlined]
 [7] lm(X::FormulaTerm{Term, Term}, y::DataFrame)
   @ GLM ~/.julia/packages/GLM/vM20T/src/lm.jl:157
 [8] top-level scope
   @ ~/Research/is-julia-ready-for-statistics/presentation.qmd:409

Survival Models — JScore: 1

Julia: Survival.jl

  • Event Times
  • Kaplan-Meier
  • Nelson-Aalen
  • Cox model
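For example, a Kaplan-Meier fit looks roughly like this (a sketch with made-up data, assuming Survival.jl’s EventTime and fit API):

using Survival
times  = [2.0, 3.0, 5.0, 8.0, 12.0]         # follow-up times
status = [true, false, true, true, false]   # true = event, false = censored
km = fit(KaplanMeier, EventTime.(times, status))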

R: built-in survival

  • All of Julia’s functions, plus plots and better metrics
  • Log-rank tests
  • Accelerated failure time
  • Time-dependent covariates
  • Competing risks
  • Machine learning with ranger

Mixed-effect Models — JScore: 2

Both written by Douglas Bates.

Julia: MixedModels.jl

using MixedModels
df = MixedModels.dataset(
    :penicillin)
model = fit(MixedModel, 
    @formula(diameter ~ 1 + 
    (1|plate) + (1|sample)), df)

R: lme4

library(lme4)
model = lmer(diameter ~ 1 + 
        (1|plate) + (1|sample),
        data = Penicillin);

Can also consider nlme, plm, or glmm

Data Analysis Score: 7/20 (Use at your own risk)

Round 3: Statistical Modeling

Distributions — JScore: 4

Julia: Distributions.jl

using Distributions
# 1. Define your distribution(s)
d = [Normal(0, 1), Beta(2, 2)]
# 2. Do whatever you want with it
rand.(d)
pdf.(d, 0.5)
quantile.(d, 0.05)
mean.(d)
  • Every conceivable distribution in one place!
  • Automatic convolutions, order statistics, etc.
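For instance (a short sketch; convolve and truncated are part of Distributions.jl):

convolve(Normal(0, 1), Normal(2, 3))    # sum of independent normals: Normal(2, √10)
truncated(Normal(0, 1), 0, Inf)         # half-normal via truncation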

R: r[dist] function family

# Evaluate first distribution
rnorm(1, mean = 0, sd = 1)
pnorm(1, mean = 0, sd = 1)
qnorm(0.05, mean = 0, sd = 1)

# Evaluate second distribution
rbeta(1, 2, 2)
pbeta(0.5, 2, 2)
qbeta(0.05, 2, 2)
  • Every distribution has a different set of functions
  • No distribution objects: properties like the mean or variance cannot be queried programmatically

Bayesian Models — JScore: 4

Julia: Turing.jl

using Turing
@model function demo(y, n)
    s² ~ InverseGamma(2, 3)
    m ~ Normal(0, sqrt(s²))
    for i in 1:n
        y[i] ~ Normal(m, sqrt(s²))
    end
end

n = 10
y = rand(Normal(1, 1), n)
chain = sample(demo(y, n), NUTS(), 
        MCMCThreads(), 1_500, 3)

R: Nimble, Stan, JAGS, etc.

  • R packages define restricted, domain-specific languages for sampling
  • Turing samples from any arbitrary Julia code

Machine Learning — JScore: 4

Julia: MLJ.jl

  • GLM, Decision Tree, Random Forest, XGBoost, LightGBM
  • Can fit entire pipeline at once (via linear pipeline, model stacking/ensembling, learning networks)
  • Self-tuning models, various cross-validation schemes
  • Extremely easy to implement custom models

R: sl3

  • Slightly wider model selection (earth, gam)
  • Discrete Super Learner
  • Beyond just regression (density estimation, etc.)

Proof of how easy MLJ is

Template for a custom model; similar to scikit-learn’s estimator interface

mutable struct MyRegressor <: MLJBase.Deterministic
    myparam::Float64
end

function MLJBase.fit(model::MyRegressor, verbosity, X, y)
    fitresult = ... # fit parameters
    cache = nothing # optional; to cache for future updates
    report = nothing # optional; to expose internals
    return fitresult, cache, report
end

function MLJBase.predict(model::MyRegressor, fitresult, Xnew)
    ...
end

mach = machine(MyRegressor(0.1), X, y) |> fit!
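Once defined, the custom model plugs into the rest of MLJ, e.g. cross-validation (a sketch using MLJ’s evaluate; rms is one of its built-in measures):

evaluate(MyRegressor(0.1), X, y,
         resampling = CV(nfolds = 5),
         measure = rms)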

Deep Learning — JScore: 3

Julia: Flux.jl / Lux.jl

  • Highly flexible, more generic
  • Can differentiate any arbitrary Julia code, including random draws (see the sketch below)
  • Better for highly custom models
  • Harder to diagnose errors

Other languages: PyTorch, Tensorflow, etc.

  • Easier to use
  • Easier to learn (better docs)
  • Easier to deploy and scale
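A minimal Flux training loop, for flavor (a sketch assuming Flux’s explicit-gradient API, v0.14+, on toy data):

using Flux
model = Chain(Dense(4 => 8, relu), Dense(8 => 1))
x, y = rand(Float32, 4, 100), rand(Float32, 1, 100)   # toy inputs and targets
opt = Flux.setup(Adam(0.01), model)
for epoch in 1:100
    grads = Flux.gradient(m -> Flux.mse(m(x), y), model)
    Flux.update!(opt, model, grads[1])
end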

Other things I didn’t cover

  • GEE models: Julia’s MEstimation.jl solves arbitrary estimating equations via automatic differentiation
  • Graphs: Graphs.jl faster than networkx or igraph, but often need different packages for weighted graphs, etc.
  • Optimization: Many packages; Optimization.jl brings them all together
  • Parallel Computing: easily supports Tasks, multi-threading, distributed Julia processes, message passing, and GPU
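For instance, multi-threading a simulation takes a single macro (a sketch; start Julia with julia -t auto to use all cores):

using Statistics
results = Vector{Float64}(undef, 1_000)
Threads.@threads for i in 1:1_000     # iterations are spread across threads
    results[i] = mean(rand(100))      # an embarrassingly parallel simulation
end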

Statistical Modeling Score: 15/16 (Julia’s chief strength)

Total Score: 30/48 (Worth using)

Should you learn Julia?

Conclusions

  • Learn Julia as your second language
  • Anything scientific and high-performance should be done in Julia
  • Need Rcpp or Stan? A custom model? Simulation on the cluster? STOP: Learn Julia!