Is Julia Ready for Statisticians?

Salvador Balkus

Who Am I?

  • 3rd year PhD in Biostatistics
  • Advised by Nima Hejazi, affiliated with NSAPH (Dominici lab)
  • Background in computer/data science
  • Experience in…
    • R
    • Python
    • Java
    • MATLAB
    • C
    • Mathematica

What is Julia?

“We want a language that’s open source, with a liberal license. We want the speed of C with the dynamism of Ruby. We want a language that’s homoiconic, with true macros like Lisp, but with obvious, familiar mathematical notation like Matlab. We want something as usable for general programming as Python, as easy for statistics as R, as natural for string processing as Perl, as powerful for linear algebra as Matlab, as good at gluing programs together as the shell. Something that is dirt simple to learn, yet keeps the most serious hackers happy. We want it interactive and we want it compiled.”

Bezanson et al. (2012)

What is Julia?

Julia: “runs like C but reads like Python” (Perkel 2019)

The Experiment

Two years ago (Fall 2022) I switched to using Julia for almost all statistical tasks.

Classes

  • BST 232: Algorithms
  • BST 245: Multivariate and Longitudinal
  • BST 249: Bayesian

Research

  • Observational Causal Inference on Networks
  • Debiased Machine Learning
  • CausalTables.jl

What, exactly, is different about Julia?

What I Won’t Be Covering

  • Speed (it’s fast)
  • Unicode support
β₀ = 2.5
β₁ = 0.2
x = 9

y = β₀ + β₁ * x

My Scoring System: Julia vs R

On each slide, I will score Julia on its ease of use in various domains compared to R.

JScore values:

  • 4: Highly recommend Julia over all other languages
  • 3: Julia is much better for some use cases
  • 2: Fine, but not a reason to choose Julia over others
  • 1: Avoid using in Julia unless necessary for other tasks

Round 1: Scientific Practice

IDE — JScore: 2

Literate Programming — JScore: 2

Literate.jl

Don’t forget to set engine: julia in the document header!

Reproducibility — JScore: 4

Dependency Management

  • Automatically tracks dependencies in Project.toml
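For instance, the standard Pkg workflow records everything automatically (a minimal sketch using Julia’s built-in Pkg API):

using Pkg
Pkg.activate(".")        # activate this project's environment
Pkg.add("DataFrames")    # recorded in Project.toml; exact versions pinned in Manifest.toml
Pkg.instantiate()        # on another machine: recreate the exact same environment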

Reproducibility — JScore: 4

Julia supports test- and doc-specific dependencies
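For instance, test and docs dependencies live in their own sub-environments (a sketch of the conventional package layout):

MyPackage/
├── Project.toml         # the package's own dependencies
├── test/Project.toml    # test-only dependencies (e.g. Test)
└── docs/Project.toml    # docs-only dependencies (e.g. Documenter)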

Reproducibility — JScore: 4

Documentation
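Docstrings attach directly to code and are rendered into a website by Documenter.jl; a minimal sketch:

"""
    add_one(x)

Return `x + 1`. This docstring is searchable from the REPL via `?add_one`
and rendered to HTML by Documenter.jl.
"""
add_one(x) = x + 1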

Reproducibility — JScore: 4

Running Experiments

using DrWatson
initialize_project("Example")      # scaffold folders, Project.toml, and a git repo

@quickactivate "Example"           # activate the project's environment

params = @strdict a b v method     # collect local variables into a labeled Dict
...
path = datadir("sims", "1.jld2")   # path inside the project's data/ directory
wsave(path, sim1)                  # save results, creating directories as needed

Science Score: 8/12 (Better than R)

Round 2: Data Analysis

Data Processing (Iris) — JScore: 1

R: dplyr

library(dplyr)
data("iris")

iris |> group_by(Species) |> 
    summarize(a = mean(Petal.Length) / mean(Sepal.Length),
              b = sum(Petal.Length))

Julia: DataFrames.jl

using CSV, DataFrames, StatsBase
path = joinpath(pkgdir(DataFrames), "docs", "src", "assets", "iris.csv")
iris = CSV.read(path, DataFrame)

iris_gdf = groupby(iris, :Species)
combine(iris_gdf, [:PetalLength, :SepalLength] =>
            ((p, s) -> (a = mean(p)/mean(s), b = sum(p))) => AsTable)

Plotting

R: ggplot2

library(ggplot2)
p = ggplot(iris) + 
    geom_point(aes(
        x = Petal.Length,
        y = Sepal.Length,
        color = Species
    ))

Plotting — JScore: 2

Julia: Plots.jl

using StatsPlots; gr()
@df iris scatter(
    :PetalLength, :SepalLength,
    group = :Species, size = (600, 400))

Plotting

Julia supports multiple “backends”

(The same plot, rendered interactively with the Plotly backend, was embedded here.)
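Switching is a single call, and the same plotting code renders interactively (a sketch, assuming the Plotly backend dependencies are installed):

using StatsPlots
plotly()    # switch from GR to the interactive Plotly backend
@df iris scatter(:PetalLength, :SepalLength, group = :Species)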

GLMs — JScore: 1

R: glm function – perfect

Julia: GLM.jl – terrible

using GLM, DataFrames
# Float32 columns are enough to trip GLM's internal Float64 assumptions:
df = DataFrame(x = randn(Float32, 10),
               y = randn(Float32, 10))
lm(@formula(y ~ x), df)
MethodError: MethodError(GLM.delbeta!, (GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}, Vector{Int64}}}([1.0 3.055166006088257; 1.0 0.6637405753135681; … ; 1.0 -0.5296103954315186; 1.0 -1.343801736831665], [0.0, 0.0], [0.0, 0.0], [0.0, 0.0], LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}, Vector{Int64}}([4.002796686977798 0.6608336050292363; 2.645182564854622 3.092458398501759], 'U', [2, 1], 2, -1.0, 0), [6.9173836729581e-310 6.9173807184827e-310; 6.9173836729581e-310 6.91722498160464e-310; … ; 6.9173836729581e-310 6.9173836729581e-310; 6.9173836725905e-310 0.0], [3.4766779039175e-310 8.4879831683e-314; 4.243991583e-314 6.9172818793886e-310]), Float32[1.2254993, 0.53429693, -0.3176004, -0.94373393, -0.8283784, 0.3207153, 0.7854858, -0.18182965, -0.5882426, 0.28917944]), 0x0000000000007bfd)
MethodError: no method matching delbeta!(::GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}, Vector{Int64}}}, ::Vector{Float32})

Closest candidates are:
  delbeta!(!Matched::GLM.SparsePredChol{T}, ::Vector{T}) where T
   @ GLM ~/.julia/packages/GLM/vM20T/src/linpred.jl:220
  delbeta!(!Matched::GLM.SparsePredChol{T}, ::Vector{T}, !Matched::Vector{T}) where T
   @ GLM ~/.julia/packages/GLM/vM20T/src/linpred.jl:213
  delbeta!(::GLM.DensePredChol{T, <:LinearAlgebra.CholeskyPivoted}, !Matched::Vector{T}, !Matched::Vector{T}) where T<:Union{Float32, Float64}
   @ GLM ~/.julia/packages/GLM/vM20T/src/linpred.jl:159
  ...

Stacktrace:
 [1] fit!(obj::LinearModel{GLM.LmResp{Vector{Float32}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}, Vector{Int64}}}})
   @ GLM ~/.julia/packages/GLM/vM20T/src/lm.jl:94
 [2] fit(::Type{LinearModel}, X::Matrix{Float64}, y::Vector{Float32}, allowrankdeficient_dep::Nothing; wts::Vector{Float32}, dropcollinear::Bool)
   @ GLM ~/.julia/packages/GLM/vM20T/src/lm.jl:143
 [3] fit(::Type{LinearModel}, X::Matrix{Float64}, y::Vector{Float32}, allowrankdeficient_dep::Nothing)
   @ GLM ~/.julia/packages/GLM/vM20T/src/lm.jl:134
 [4] fit(::Type{LinearModel}, f::FormulaTerm{Term, Term}, data::DataFrame, args::Nothing; contrasts::Dict{Symbol, Any}, kwargs::@Kwargs{})
   @ StatsModels ~/.julia/packages/StatsModels/mPD8T/src/statsmodel.jl:88
 [5] fit
   @ ~/.julia/packages/StatsModels/mPD8T/src/statsmodel.jl:78 [inlined]
 [6] lm
   @ ~/.julia/packages/GLM/vM20T/src/lm.jl:157 [inlined]
 [7] lm(X::FormulaTerm{Term, Term}, y::DataFrame)
   @ GLM ~/.julia/packages/GLM/vM20T/src/lm.jl:157
 [8] top-level scope
   @ ~/Research/is-julia-ready-for-statistics/presentation.qmd:409

Survival Models — JScore: 1

Julia: Survival.jl

  • Event Times
  • Kaplan-Meier
  • Nelson-Aalen
  • Cox model
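For example, a Kaplan-Meier fit looks roughly like this (a sketch with made-up data, assuming Survival.jl’s EventTime and fit API):

using Survival
times  = [2.0, 3.0, 5.0, 8.0, 12.0]         # follow-up times
status = [true, false, true, true, false]   # true = event, false = censored
km = fit(KaplanMeier, EventTime.(times, status))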

R: built-in survival

  • All of Julia’s functions, plus plots and better metrics
  • Log-rank tests
  • Accelerated failure time
  • Time-dependent covariates
  • Competing risks
  • Machine learning with ranger

Mixed-effect Models — JScore: 2

Both written by Douglas Bates.

Julia: MixedModels.jl

using MixedModels
df = MixedModels.dataset(
    :penicillin)
model = fit(MixedModel, 
    @formula(diameter ~ 1 + 
    (1|plate) + (1|sample)), df)

R: lme4

library(lme4)
model = lmer(diameter ~ 1 + 
        (1|plate) + (1|sample),
        data = Penicillin);

Can also consider nlme, plm, or glmm

Data Analysis Score: 7/20 (Use at your own risk)

Round 3: Statistical Modeling

Distributions — JScore: 4

Julia: Distributions.jl

using Distributions
# 1. Define your distribution(s)
d = [Normal(0, 1), Beta(2, 2)]
# 2. Do whatever you want with it
rand.(d)
pdf.(d, 0.5)
quantile.(d, 0.05)
mean.(d)
  • Every conceivable distribution in one place!
  • Automatic convolutions, order statistics, etc.
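For instance (a short sketch; convolve and truncated are part of Distributions.jl):

convolve(Normal(0, 1), Normal(2, 3))    # sum of independent normals: Normal(2, √10)
truncated(Normal(0, 1), 0, Inf)         # half-normal via truncation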

R: r[dist] function family

# Evaluate first distribution
rnorm(1, mean = 0, sd = 1)
pnorm(1, mean = 0, sd = 1)
qnorm(0.05, mean = 0, sd = 1)

# Evaluate second distribution
rbeta(1, 2, 2)
pbeta(0.5, 2, 2)
qbeta(0.05, 2, 2)
  • Every distribution has a different set of functions
  • No distribution objects: properties like the mean or variance cannot be queried programmatically

Bayesian Models — JScore: 4

Julia: Turing.jl

using Turing
@model function demo(y, n)
    s² ~ InverseGamma(2, 3)
    m ~ Normal(0, sqrt(s²))
    for i in 1:n
        y[i] ~ Normal(m, sqrt(s²))
    end
end

n = 10
y = rand(Normal(1, 1), n)
chain = sample(demo(y, n), NUTS(), 
        MCMCThreads(), 1_500, 3)

R: Nimble, Stan, JAGS, etc.

  • R packages define restricted, domain-specific languages for sampling
  • Turing samples from any arbitrary Julia code

Machine Learning — JScore: 4

Julia: MLJ.jl

  • GLM, Decision Tree, Random Forest, XGBoost, LightGBM
  • Can fit entire pipeline at once (via linear pipeline, model stacking/ensembling, learning networks)
  • Self-tuning models, various cross-validation schemes
  • Extremely easy to implement custom models

R: sl3

  • Slightly wider model selection (earth, gam)
  • Discrete Super Learner
  • Beyond just regression (density estimation, etc.)

Proof of how easy MLJ is

Template for a custom model; similar to scikit-learn’s estimator interface

mutable struct MyRegressor <: MLJBase.Deterministic
    myparam::Float64
end

function MLJBase.fit(model::MyRegressor, verbosity, X, y)
    fitresult = ... # fit parameters
    cache = nothing # optional; to cache for future updates
    report = nothing # optional; to expose internals
    return fitresult, cache, report
end

function MLJBase.predict(model::MyRegressor, fitresult, Xnew)
    ...
end

mach = machine(MyRegressor(0.1), X, y) |> fit!
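Once defined, the custom model plugs into the rest of MLJ, e.g. cross-validation (a sketch using MLJ’s evaluate; rms is one of its built-in measures):

evaluate(MyRegressor(0.1), X, y,
         resampling = CV(nfolds = 5),
         measure = rms)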

Deep Learning — JScore: 3

Julia: Flux.jl / Lux.jl

  • Highly flexible, more generic
  • Can differentiate any arbitrary Julia code, including random draws (see the sketch below)
  • Better for highly custom models
  • Harder to diagnose errors

Other languages: PyTorch, Tensorflow, etc.

  • Easier to use
  • Easier to learn (better docs)
  • Easier to deploy and scale
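A minimal Flux training loop, for flavor (a sketch assuming Flux’s explicit-gradient API, v0.14+, on toy data):

using Flux
model = Chain(Dense(4 => 8, relu), Dense(8 => 1))
x, y = rand(Float32, 4, 100), rand(Float32, 1, 100)   # toy inputs and targets
opt = Flux.setup(Adam(0.01), model)
for epoch in 1:100
    grads = Flux.gradient(m -> Flux.mse(m(x), y), model)
    Flux.update!(opt, model, grads[1])
end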

Other things I didn’t cover

  • GEE models: Julia’s MEstimation.jl solves arbitrary estimating equations via automatic differentiation
  • Graphs: Graphs.jl faster than networkx or igraph, but often need different packages for weighted graphs, etc.
  • Optimization: Many packages; Optimization.jl brings them all together
  • Parallel Computing: easily supports Tasks, multi-threading, distributed Julia processes, message passing, and GPU
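For instance, multi-threading a simulation takes a single macro (a sketch; start Julia with julia -t auto to use all cores):

using Statistics
results = Vector{Float64}(undef, 1_000)
Threads.@threads for i in 1:1_000     # iterations are spread across threads
    results[i] = mean(rand(100))      # an embarrassingly parallel simulation
end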

Statistical Modeling Score: 15/16 (Julia’s chief strength)

Total Score: 30/48 (Worth using)

Should you learn Julia?

Conclusions

  • Learn Julia as your second language
  • Anything scientific and high-performance should be done in Julia
  • Need Rcpp or Stan? A custom model? Simulation on the cluster? STOP: Learn Julia!