Simulating a multi-level dataset for linear mixed effects analysis

This tutorial takes a simulation-based approach to understanding mixed effects models. We start with the fundamental equations specifying the model structure. Then, we simulate data with two variance components and fit several mixed effects models to it.
Why do we do this? Even experienced analysts do not always know the implications and effects of the assumptions and procedures used in modeling. A simulation-based approach allows you to interrogate how the model performs under different scenarios (i.e., data and experimental design conditions). This provides a way of understanding the model's behavior, including bias, variance, power, and false positive rates.
This video on simulating mixed effects data by Jeanette Mumford is a clear and gentle introduction that may be helpful. It is part of the Mumford mixed effects models series on YouTube. She also has a Mumford mixed models in fMRI series.
Table of Contents

Model formulation

First level:
This describes how individual-participant data are generated from a set of fixed predictors and participant-specific model parameters ("true" regression intercept and slope values for each participant). The outcome data vector y_i for participant i is generated by a within-person design matrix Z_i times a participant-specific coefficient vector beta_i, plus a participant-specific error vector epsilon_i.
lindquist_mfx_level1.png
Note that Z, the design matrix stacked across participants, is block diagonal here.
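The first-level model in the figure can be written compactly as follows (a standard formulation; the symbols y_i, Z_i, beta_i, and the error variance notation are inferred from the surrounding text and simulation code):

```latex
y_i = Z_i \beta_i + \epsilon_i, \qquad \epsilon_i \sim N(0, \sigma_i^2 I_{n_i})
```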
Second level:
This describes how the participant-specific model parameters beta_i are generated from between-person (group-level) population-level parameters. beta_i is generated by a between-person design matrix (see below) times the fixed population-level coefficient vector (b_g in the simulation code below), plus a participant-specific deviation u_i.
Combined two-level model
X is a second-level design matrix of between-person predictors for the intercept and other model parameters.
lindquist_mfx_level2.png
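In the same notation, the second-level and combined models can be sketched as follows (gamma denotes the fixed population-level coefficients, b_g in the simulation code; u_i ~ N(0, U_g) matches the group covariance used below):

```latex
% Second level
\beta_i = X_i \gamma + u_i, \qquad u_i \sim N(0, U_g)
% Combined model (substituting the second level into the first)
y_i = Z_i X_i \gamma + Z_i u_i + \epsilon_i
```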
Alternative notations
There are several other notation systems that differ somewhat from the one above. Here are the equations used in the Mumford video lectures.
mumford_mixedfx_equations.png

A simple example

Let's generate some data with made-up slopes. We have a simple model with a linear predictor and an intercept for each person.
 
disp('Design matrix');
Design matrix
Zi = [1 1 1 1 1; 0 1 2 3 4]'
Zi = 5×2
     1     0
     1     1
     1     2
     1     3
     1     4
 
disp('Number of subjects')
Number of subjects
m = 3
m = 3
 
disp('Number of observations within-person, and num predictors')
Number of observations within-person, and num predictors
[n_within, p] = size(Zi)
n_within = 5
p = 2
 
disp('Z is the Kronecker product of an [m x m] identity matrix and Zi');
Z is the Kronecker product of an [m x m] identity matrix and Zi
disp('The Kronecker product expands Zi into a block diagonal matrix for every subject.');
The Kronecker product expands Zi into a block diagonal matrix for every subject.
 
Z = kron(eye(m), Zi)
Z = 15×6
     1     0     0     0     0     0
     1     1     0     0     0     0
     1     2     0     0     0     0
     1     3     0     0     0     0
     1     4     0     0     0     0
     0     0     1     0     0     0
     0     0     1     1     0     0
     0     0     1     2     0     0
     0     0     1     3     0     0
     0     0     1     4     0     0
     ⋮
 
disp('Made-up true betas')
Made-up true betas
disp('Intercept and slope for each of m persons, concatenated')
Intercept and slope for each of m persons, concatenated
beta = [0 2 1 3 4 6]'
beta = 6×1
     0
     2
     1
     3
     4
     6
 
disp('True response');
True response
y = Z*beta % Now each person's fit is a combination of their individual Zi and beta_i
y = 15×1
     0
     2
     4
     6
     8
     1
     4
     7
    10
    13
     ⋮
 
% Plot the true response without error
ycell = mat2cell(y, n_within * ones(m, 1))';
x1 = repmat({Zi(:, 2)}, 1, m);
create_figure('example');
handles = line_plot_multisubject(x1, ycell);
________________________________________
Input data:
X scaling: No centering or z-scoring
Y scaling: No centering or z-scoring
Transformations: No data transformations before plot
Correlations: r = 0.68 across all observations, based on untransformed input data
Stats on slopes after transformation, subject is random effect:
Mean b = 3.67, t( 2) = 3.05, p = 0.092735, num. missing: 0
Average within-person r = 1.00 +- 0.00 (std)
Between-person r (across subject means) = NaN, p = NaN
________________________________________
set(handles.point_handles, 'MarkerSize', 8, 'Marker', 'o'); % we can use handles to reformat things
set(gca, 'XLim', [-0.5 4.5])
xlabel('X1'); ylabel('y');
The response (y) will also be measured with error, so let's add some errors.
Here, the errors are normally distributed with variance 1, and they are homoscedastic across subjects.
This is a common assumption in mixed effects models (e.g., lmer in R), but not all models make it.
e = randn(size(y, 1), 1);
 
y = y + e;
 
% Plot the true response with error
ycell = mat2cell(y, n_within * ones(m, 1))';
x1 = repmat({Zi(:, 2)}, 1, m);
create_figure('example');
handles = line_plot_multisubject(x1, ycell);
________________________________________
Input data:
X scaling: No centering or z-scoring
Y scaling: No centering or z-scoring
Transformations: No data transformations before plot
Correlations: r = 0.66 across all observations, based on untransformed input data
Stats on slopes after transformation, subject is random effect:
Mean b = 3.53, t( 2) = 2.67, p = 0.115979, num. missing: 0
Average within-person r = 0.98 +- 0.03 (std)
Between-person r (across subject means) = NaN, p = NaN
________________________________________
set(handles.point_handles, 'MarkerSize', 8, 'Marker', 'o'); % we can use handles to reformat things
set(gca, 'XLim', [-0.5 4.5])
xlabel('X1'); ylabel('y');
 
The variation in betas 1, 3, and 5 reflects individual differences in the intercept.
The variation in betas 2, 4, and 6 reflects individual differences in the slope of X1.
We set these to fixed values above, but in the next section, these individual differences will be randomly selected from a distribution. The "true" individual differences among subjects are thus random effects.

A more flexible simulation

Specify parameters and model structure

 
% number of subjects
m = 20;
 
% n observations within-subject
% For now, assume same across all subjects
 
n_within = 20;
 
n_i = n_within * ones(m, 1);
 
% Number of model parameters (regressors), including intercept
p = 2;
 
% Within-person residual noise variance
% For now, assume same across all subjects
sig_2_within = 1.1;
 
% s2 is 1 x m vector of residual variances
s2 = sig_2_within .* ones(1, m);
Fixed population-level effects
Specify a column vector of true population-level values for each effect, intercept followed by other parameters.
b_g = [5 2]';
 
% Check the size
if any(size(b_g) - [p 1])
    error('Population effects b_g is the wrong size')
end
Group-level covariance matrix
Diagonals are variances for each of the p parameters; by convention, the intercept is first.
Off-diagonals are covariances.
For example, for an intercept and slope (simple multilevel regression), U_g would be a 2 x 2 matrix with the variances of the intercept and slope on the diagonal.
 
U_g = [20 0; 0 10];
 
% Check the size
if ~all(size(U_g) == p)
    error('Covariance matrix U_g is the wrong size')
end
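If you expect intercepts and slopes to correlate across people, U_g can include a nonzero off-diagonal covariance. A hypothetical sketch (the covariance value -7 is made up for illustration; any covariance matrix must be symmetric positive definite, which chol can verify):

```matlab
% Hypothetical group covariance with correlated intercept and slope
% cov = r * sqrt(var_int * var_slope); here r is approx -0.5, so cov is approx -7
U_g_corr = [20 -7; -7 10];

% A valid covariance matrix must be symmetric positive definite
[~, notposdef] = chol(U_g_corr);
if notposdef
    error('U_g_corr is not positive definite')
end
```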
 
Fixed design matrix
These can be the same for all subjects, or different. They'd be the same if we were using a fixed experimental design with the same numbers of observations per person, and different if (1) our predictors are measured variables or (2) we have different numbers of observations.
We'll assume they are the same across participants for now, but allow the code to be easily extended to different design matrices across participants.
Predictors can also be categorical or continuous, and correlated or uncorrelated. We'll generate a simple effects-coded [1 -1] predictor for now, with approximately balanced numbers of observations.
The intercept is first (a column of 1's) by convention.
Z_within = [ones(n_within, 1)]; % intercept
 
Z_tmp = randn(n_within, p - 1);
 
Z_tmp(Z_tmp >= 0) = 1;
Z_tmp(Z_tmp < 0) = -1; % strict inequality, so values of exactly 0 stay mapped to 1
 
Z_within = [Z_within Z_tmp];
 
Z_i = {};
for i = 1:m % allows subjects to vary later if needed
    Z_i{i} = Z_within;
end
 
% Note: this could be modified to be more general
regressortype = 'binary'; % 'continuous' or 'binary'; the predictor generated above is effects-coded binary
Now we have all the variables we need to simulate data:
n_i: number of observations for each person i = 1...m
b_g: population-level true fixed-effect parameters
s2: residual variances for each person (1 x m)
U_g: group covariance matrix
Z_i: design matrix for each person

Questions to answer

Under most input parameter choices, the dataset will include a random subject intercept and random subject slope.
Which input parameter would you change to make the dataset consistent with a "random intercept only" model?
Which input parameter would you change to make the dataset consistent with a "random slope only" model?

Create simulated dataset

Generate subject-level true parameters
Assume a normal distribution with covariance U_g
beta_i here is a [p x m] matrix where column i is the [p x 1] true beta coefficient vector for subject i.
beta_i = mvnrnd(b_g, U_g, m)';
Generate data (y)
y = cell(1, m);

for i = 1:m
    y{i} = Z_i{i} * beta_i(:, i) + sqrt(s2(i)) .* randn(n_i(i), 1);
end
To fit the dataset, we'll need these variables:
y{i = 1...m} The DV/outcome
Z_i{i = 1...m} The design matrix for each person
CANlab functions like glmfit_multilevel, igls, and rigls can use these cell arrays directly.
For fitlm and fitlme in MATLAB, and lmer in R (etc.), we will want these in stacked, 'long format' tabular form for mixed effects models. This requires a grouping variable (here, a subject ID) as well. We will also create a column for each predictor and leave out the intercept, as it is added automatically.
% Make a long-format table with simulated data
 
sim_data = table(cat(1, y{:}), 'VariableNames', {'Y'});
 
sid_tmp = {}; % Subject ID
var_tmp = {}; % predictor variables

for i = 1:m

    % Subject ID
    sid_tmp{i} = i .* Z_i{i}(:, 1); % use intercept in case n's vary

    for j = 1:(p-1)
        var_tmp{j}{i} = Z_i{i}(:, j+1);
    end

end
 
sim_data.sid = cat(1, sid_tmp{:});
 
% add other variables
for j = 1:(p-1)
    varname = sprintf('x%02d', j);
    sim_data.(varname) = cat(1, var_tmp{j}{:});
end
 
sim_data
sim_data = 400×3 table
            Y        sid    x01
   1     -4.0320     1       1
   2     -3.8698     1       1
   3     -4.0851     1      -1
   4     -3.9542     1       1
   5     -3.8040     1      -1
   6      0.1725     1      -1
   7     -2.8657     1      -1
   8     -2.5798     1      -1
   9     -2.6348     1      -1
  10      0.5730     1      -1
  11     -3.2829     1      -1
  12     -3.1335     1      -1
  13     -3.4389     1      -1
  14     -4.9082     1       1
  15     -2.4587     1      -1
  16     -3.7196     1      -1
  17     -2.6849     1      -1
  18     -5.1986     1       1
  19     -5.8346     1       1
  20     -3.2149     1       1
  21     19.9078     2       1
  22     21.5983     2       1
  23      1.4332     2      -1
  24     21.0363     2       1
  25      5.0081     2      -1
  26      3.0901     2      -1
  27      4.1288     2      -1
  28      4.5823     2      -1
  29      2.9716     2      -1
  30      3.3803     2      -1
  31      2.8012     2      -1
  32      3.2771     2      -1
  33      3.1316     2      -1
  34     20.5364     2       1
  35      4.0396     2      -1
  36      0.9159     2      -1
  37      3.4813     2      -1
  38     20.6233     2       1
  39     20.8694     2       1
  40     20.4360     2       1
  41      3.1898     3       1
  42      3.2318     3       1
  43      3.3912     3      -1
  44      6.2155     3       1
  45      3.6417     3      -1
  46      3.1544     3      -1
  47      1.9438     3      -1
  48      5.4872     3      -1
  49      3.6102     3      -1
  50      3.5736     3      -1
  51      2.6696     3      -1
  52      4.5774     3      -1
  53      3.6798     3      -1
  54      4.4016     3       1
  55      3.6392     3      -1
  56      5.2595     3      -1
  57      3.1658     3      -1
  58      2.6947     3       1
  59      2.6559     3       1
  60      4.4640     3       1
  61     13.0131     4       1
  62     12.5867     4       1
  63     17.5766     4      -1
  64     12.1720     4       1
  65     16.2980     4      -1
  66     15.2236     4      -1
  67     15.2107     4      -1
  68     15.0313     4      -1
  69     14.5745     4      -1
  70     13.0986     4      -1
  71     14.5713     4      -1
  72     14.9110     4      -1
  73     14.1387     4      -1
  74     13.2338     4       1
  75     16.6579     4      -1
  76     15.6732     4      -1
  77     12.6105     4      -1
  78     13.6146     4       1
  79     12.7171     4       1
  80     13.4009     4       1
  81     12.3251     5       1
  82     13.1484     5       1
  83      9.1100     5      -1
  84     15.9891     5       1
  85      9.3342     5      -1
  86      8.9014     5      -1
  87      9.1219     5      -1
  88      7.5665     5      -1
  89      9.4183     5      -1
  90      8.2049     5      -1
  91      9.1913     5      -1
  92     10.1201     5      -1
  93      8.3289     5      -1
  94     12.6770     5       1
  95      8.9499     5      -1
  96      9.5138     5      -1
  97      9.0417     5      -1
  98     13.1640     5       1
  99     12.8173     5       1
 100     11.4494     5       1

Plot the dataset

For plotting, we may need to extract a predictor. The code below extracts the first non-intercept predictor (if p > 1).
We'll use the CANlab function line_plot_multisubject. It gives us some output and basic stats on the relationship between x1 and y as well.
if p > 1
    x1 = cellfun(@(x) x(:, 2), Z_i, 'UniformOutput', false);

    create_figure('multisubject data')
    disp('line_plot_multisubject output:')
    [han, ~, ~, slope_stats] = line_plot_multisubject(x1, y);

    xlabel('x1')
    ylabel('y')
    title('Simulated data')

else
    disp('Intercept only: No x1-y relationship to plot')
end
ans =
  Figure (multisubject data) with properties:

    Number: 2
    Name: 'multisubject data'
    Color: [1 1 1]
    Position: [584 595 560 420]
    Units: 'pixels'

  Show all properties
line_plot_multisubject output:
Warnings:
________________________________________
Y: input cells with low varability: 3
________________________________________

________________________________________
Input data:
X scaling: No centering or z-scoring
Y scaling: No centering or z-scoring
Transformations: No data transformations before plot
Correlations: r = 0.47 across all observations, based on untransformed input data
Stats on slopes after transformation, subject is random effect:
Mean b = 3.54, t( 19) = 4.27, p = 0.000413, num. missing: 0
Average within-person r = 0.58 +- 0.71 (std)
Between-person r (across subject means) = -0.00, p = 1.000000
________________________________________
 

Fit the model

glmfit_multilevel

glmfit_multilevel is a CANlab function. It is a fast and simple option for running a two-level mixed effects model with participant as a random effect. It implements a random-intercept, random-slope model across second-level units (e.g., participants): it fits regressions for individual second-level units, and then (optionally, if the 'weighted' keyword is entered) uses a precision-weighted least squares approach to model group effects. It thus treats participants as a random effect. This is appropriate when second-level units are participants and first-level units are observations (e.g., trials) within participants, which is the use case glmfit_multilevel was designed for.
Options:
glmfit_multilevel includes some options that are not available in many mixed effects implementations, including bootstrapping and sign permutation for inference.
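A hypothetical sketch of such calls (the 'weighted' keyword is mentioned above; the 'boot' and 'signperm' keywords are assumptions based on typical CANlab conventions, so check help glmfit_multilevel for the actual option names):

```matlab
% Precision-weighted group model (keyword confirmed in the text above)
stats_w = glmfit_multilevel(y, x1, [], 'weighted', 'verbose');

% Bootstrap or sign-permutation inference (keyword names assumed; verify in the help)
stats_boot = glmfit_multilevel(y, x1, [], 'boot');
stats_sp   = glmfit_multilevel(y, x1, [], 'signperm');
```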
disp('glmfit_multilevel output:')
glmfit_multilevel output:
 
stats = glmfit_multilevel(y, x1, [], 'verbose');
Analysis description: Second Level of Multilevel Model
GLS analysis
Observations: 20, Predictors: 1
Outcome names: 1st-level B1  1st-level B2
Weighting option: unweighted
Inference option: t-test
Other Options:
  Plots: No
  Robust: no
  Save: No  Output file: glmfit_general_output.txt
Total time: 0.00 s
________________________________________
Second Level of Multilevel Model

Outcome variables:
  1st-level B1  1st-level B2
Adj. mean      3.64      3.54

2nd-level B01
          1st-level B1    1st-level B2
Coeff         3.64            3.54
STE           1.14            0.83
t             3.20            4.27
Z             2.83            3.53
p          0.00469         0.00041
________________________________________
 

fitlme

This is MATLAB's object-oriented equivalent to lmer in R. It fits a variety of mixed effects models. A list of methods and properties is shown below:
These models, like many other statistics objects implementing linear models, allow you to specify model formulas using Wilkinson notation. There is also a matrix input version.
help LinearMixedModel
LinearMixedModel  Fitted linear mixed effects model.
    LME = FITLME(...) fits a linear mixed effects model to data. The fitted model LME
    is a LinearMixedModel modeling a response variable as a linear function of fixed
    effect predictors and random effect predictors.

    LinearMixedModel methods:
        coefCI - Coefficient confidence intervals
        coefTest - Linear hypothesis test on coefficients
        predict - Compute predicted values given predictor values
        random - Generate random response values given predictor values
        partialDependence - Partial Dependence values.
        plotPartialDependence - Plot Partial Dependence.
        plotResiduals - Plot of residuals
        designMatrix - Fixed and random effects design matrices
        fixedEffects - Stats on fixed effects
        randomEffects - Stats on random effects
        covarianceParameters - Stats on covariance parameters
        fitted - Fitted response
        residuals - Various types of residuals
        response - Response used to fit the model
        compare - Compare fitted models
        anova - Marginal tests for fixed effect terms
        disp - Display a fitted model

    LinearMixedModel properties:
        FitMethod - Method used for fitting (either 'ML' or 'REML')
        MSE - Mean squared error (estimate of residual variance)
        Formula - Representation of the model used in this fit
        LogLikelihood - Log of likelihood function at coefficient estimates
        DFE - Degrees of freedom for residuals
        SSE - Error sum of squares
        SST - Total sum of squares
        SSR - Regression sum of squares
        CoefficientCovariance - Covariance matrix for coefficient estimates
        CoefficientNames - Coefficient names
        NumCoefficients - Number of coefficients
        NumEstimatedCoefficients - Number of estimated coefficients
        Coefficients - Coefficients and related statistics
        Rsquared - R-squared and adjusted R-squared
        ModelCriterion - AIC and other model criteria
        VariableInfo - Information about variables used in the fit
        ObservationInfo - Information about observations used in the fit
        Variables - Table of variables used in fit
        NumVariables - Number of variables used in fit
        VariableNames - Names of variables used in fit
        NumPredictors - Number of predictors
        PredictorNames - Names of predictors
        ResponseName - Name of response
        NumObservations - Number of observations in the fit
        ObservationNames - Names of observations in the fit

    See also fitlme, fitlmematrix, LinearModel, GeneralizedLinearModel, NonLinearModel.

    Documentation for LinearMixedModel
 
disp('fitlme output:')
fitlme output:
 
lme = fitlme(sim_data,'Y ~ x01 + (1 + x01|sid)')
lme =
Linear mixed-effects model fit by ML

Model information:
    Number of observations             400
    Fixed effects coefficients           2
    Random effects coefficients         40
    Covariance parameters                4

Formula:
    Y ~ 1 + x01 + (1 + x01 | sid)

Model fit statistics:
    AIC       BIC     LogLikelihood    Deviance
    1433.1    1457    -710.54          1421.1

Fixed effects coefficients (95% CIs):
    Name                 Estimate    SE         tStat     DF     pValue        Lower     Upper
    {'(Intercept)'}      3.6373      1.1069     3.2859    398    0.0011066     1.4611    5.8135
    {'x01'        }      3.5392      0.80765    4.3821    398    1.5057e-05    1.9514    5.127

Random effects covariance parameters (95% CIs):
Group: sid (20 Levels)
    Name1                  Name2                  Type          Estimate    Lower       Upper
    {'(Intercept)'}        {'(Intercept)'}        {'std' }      4.944       3.6236      6.7455
    {'x01'        }        {'(Intercept)'}        {'corr'}      -0.21241    -0.57555    0.22071
    {'x01'        }        {'x01'        }        {'std' }      3.6031      2.6389      4.9195

Group: Error
    Name             Estimate    Lower     Upper
    {'Res Std'}      1.0772      1.0013    1.1588
 
anova(lme, 'DFMethod', 'Satterthwaite')
ans =
ANOVA marginal tests: DFMethod = 'Satterthwaite'

    Term                 FStat     DF1    DF2    pValue
    {'(Intercept)'}      10.797    1      20     0.0036938
    {'x01'        }      19.203    1      20     0.00028801
pval = coefTest(lme,[1 0]);
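The fitted lme object can also return the estimated fixed effects and per-subject deviations directly; a brief sketch using methods listed in the help text above:

```matlab
% Fixed-effect estimates (should be near the true b_g = [5 2]')
[beta_fixed, beta_names] = fixedEffects(lme);

% Per-subject random-effect deviations u_i
% (40 values here: intercept and slope deviations for each of 20 subjects)
[B, Bnames] = randomEffects(lme);

% Estimated covariance parameters for the random effects, plus residual variance
[psi, mse] = covarianceParameters(lme);
```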

igls / rigls

disp('igls output:')
igls output:
 
igls_stats = igls(y, x1, 'plot', 'all');
MST (diag of Ystar_mtx): 40.85556
MSE (diag of resid): 0.00000

igls.m output report:
---------------------------------------
Data: 20 observations x 20 subjects
Fit Type: igls
AR(p) model: No
Converged: Yes
Max iterations: 5, Completed iterations: 1
Epsilon for convergence: 0.010000
Elapsed time: 0.32 s

Statistics: Tests of inference on fixed population parameters
Parameter    est.     t( 19)    p
Intcpt.      3.637    3.29      0.003885 **
Slope1       3.539    4.38      0.000320 ***

Statistics: Tests of significance on random effects (variances)
Parameter    est.      LRT        p
Intcpt.      24.440    1047.88    0.000000 ***
Slope1       12.979    832.30     0.000000 ***
---------------------------------------
disp('Est. random-effect std. deviations (sqrt of variances): '); disp(sqrt(igls_stats.betastar)');
Est. random-effect std. deviations (sqrt of variances):
    4.9436    3.6026
 
 
disp('Estimated fixed effects')
Estimated fixed effects
igls_stats.beta
ans = 2×1
    3.6373
    3.5392
 
disp('True fixed effects')
True fixed effects
b_g
b_g = 2×1
     5
     2
 
disp('Estimated random-effect variances')
Estimated random-effect variances
igls_stats.betastar
ans = 2×1
   24.4395
   12.9790
 
disp('True random-effect variances')
True random-effect variances
diag(U_g)
ans = 2×1
    20
    10
 

Hands-on Activities

  1. Identify and compare residual variance terms, slope estimates, and t-values across models.
  2. Modify the code to generate one or more continuous predictors instead of categorical/experimental ones.
  3. Turn the data-generation script into a function, so that you can efficiently simulate power and false positive rates using repeated simulations -- two key properties of any statistical model.
A MATLAB version is included in the function sim_generate_mixedfx_data1, which is called below. You can create your own version in the software package of your choice (e.g., Python, R).
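As a sketch of activity 3, a repeated-simulation loop might look like this (assumes sim_generate_mixedfx_data1 is on the path with the signature used below; setting b_g(2) = 0 would turn the power estimate into a false positive rate):

```matlab
nsim = 100;                 % number of simulated datasets
p_slope = zeros(nsim, 1);   % p-value for the x01 fixed effect in each simulation

for s = 1:nsim
    % Generate a new dataset with the parameters defined above
    [y, Z_i, x1, sim_data] = sim_generate_mixedfx_data1(m, n_i, p, b_g, s2, U_g, regressortype);

    % Fit the random-intercept, random-slope model and save the slope p-value
    lme = fitlme(sim_data, 'Y ~ x01 + (1 + x01|sid)');
    p_slope(s) = lme.Coefficients.pValue(2);
end

% Proportion of significant slope tests:
% power (TPR) if b_g(2) ~= 0, false positive rate if b_g(2) == 0
tpr = mean(p_slope < 0.05)
```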
 
regressortype = 'categorical';
 
[y, Z_i, x1, sim_data] = sim_generate_mixedfx_data1(m, n_i, p, b_g, s2, U_g, regressortype);
 
sim_data
sim_data = 400×3 table
            Y        sid    x01
   1      7.1248     1       1
   2      2.8940     1      -1
   3      4.7317     1      -1
   4      4.0255     1      -1
   5      3.7738     1      -1
   6      2.7529     1      -1
   7      3.6547     1      -1
   8      2.1745     1      -1
   9      4.8593     1       1
  10      6.9221     1       1
  11      6.6253     1       1
  12      3.8653     1      -1
  13      7.6663     1       1
  14      5.2197     1       1
  15      0.2628     1      -1
  16      3.5789     1      -1
  17      3.0592     1      -1
  18      1.8200     1      -1
  19      2.3574     1      -1
  20      6.2340     1       1
  21      8.7746     2       1
  22      8.0283     2      -1
  23      8.6770     2      -1
  24      4.6879     2       1
  25      7.0477     2       1
  26      7.3928     2      -1
  27      7.1164     2       1
  28      8.9804     2      -1
  29      6.9855     2      -1
  30      7.4228     2       1
  31      7.6702     2       1
  32      7.6151     2      -1
  33      9.3147     2      -1
  34      7.7790     2      -1
  35      5.7913     2       1
  36      6.8585     2       1
  37      5.9845     2       1
  38      6.8890     2       1
  39      7.3789     2      -1
  40      6.4040     2      -1
  41     18.1539     3      -1
  42     12.3011     3       1
  43     15.7589     3      -1
  44     17.2003     3      -1
  45     11.3619     3       1
  46     11.0500     3       1
  47     16.2744     3      -1
  48     18.8776     3      -1
  49     12.2960     3       1
  50     18.6082     3      -1
  51     11.9249     3       1
  52     11.5412     3       1
  53     16.8547     3      -1
  54     17.1640     3      -1
  55     17.7628     3      -1
  56     17.6201     3      -1
  57     16.3719     3      -1
  58     18.2178     3      -1
  59     18.2842     3      -1
  60     10.4895     3       1
  61     11.4245     4       1
  62     11.3352     4       1
  63      9.9247     4       1
  64     11.4908     4       1
  65      8.5905     4       1
  66     11.0003     4       1
  67     11.3002     4      -1
  68      9.2207     4       1
  69     12.6647     4      -1
  70     13.6362     4      -1
  71      9.0691     4       1
  72      8.3491     4       1
  73     10.1232     4       1
  74      9.0750     4       1
  75     12.3376     4      -1
  76      7.8899     4       1
  77     14.0128     4      -1
  78     10.2648     4       1
  79     12.9940     4      -1
  80     11.4644     4      -1
  81     -3.5661     5      -1
  82      1.3834     5       1
  83     -5.1804     5      -1
  84     -4.0787     5      -1
  85     -3.9882     5      -1
  86      2.6419     5       1
  87     -2.2328     5      -1
  88     -4.2172     5      -1
  89      0.1179     5       1
  90     -1.7255     5      -1
  91     -5.0100     5      -1
  92     -3.9038     5      -1
  93      1.2711     5       1
  94      2.0460     5       1
  95     -2.9332     5      -1
  96     -4.4540     5      -1
  97      2.3555     5       1
  98     -4.2770     5      -1
  99     -4.9389     5      -1
 100     -4.4625     5      -1
 

Extended simulations

With a bit more extension, these simulations can be used to answer a number of questions.
  1. Simulate data in R and fit using lmer. Compare/contrast the output and create a "cheat sheet" comparing the two.
  2. Compare TPR and FPR across methods: MATLAB, R, and glmfit_multilevel.
  3. Compare with repeated-measures (RM) ANOVA.
  4. Simulate and compare true positive rates (TPR, power) and false positive rates (FPR) in these interesting cases: