Simulating a multi-level dataset for linear mixed effects analysis

This tutorial takes a simulation-based approach to understanding mixed effects models. We start with the fundamental equations specifying the model structure. Then, we simulate data with two variance components and fit several mixed effects models to it.
Why do we do this? Even experienced analysts do not always know the implications and effects of the assumptions and procedures used in modeling. A simulation-based approach allows you to interrogate how the model performs under different scenarios (i.e., data and experimental design conditions). This provides a way of understanding the model's behavior, including bias, variance, power, and false positive rates.
This video on simulating mixed effects data by Jeanette Mumford is a clear and gentle introduction that may be helpful. It is part of the Mumford mixed effects models series on YouTube. She also has a Mumford mixed models in fMRI series.
Table of Contents

Model formulation

First level:
This describes how individual-participant data are generated from a set of fixed predictors and participant-specific model parameters ("true" regression intercept and slope values for each participant). The outcome data vector y_i for participant i is generated by a within-person design matrix Z_i times a participant-specific coefficient vector beta_i, plus a participant-specific error vector epsilon_i.
lindquist_mfx_level1.png
Note that Z, the design matrix stacked across participants, is block diagonal here.
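The first-level model in the figure can be written compactly as follows (a standard formulation; the symbols y_i, Z_i, beta_i, and the error variance notation are inferred from the surrounding text and simulation code):

```latex
y_i = Z_i \beta_i + \epsilon_i, \qquad \epsilon_i \sim N(0, \sigma_i^2 I_{n_i})
```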
Second level:
This describes how the participant-specific model parameters beta_i are generated from between-person (group-level) population-level parameters. beta_i is generated by a between-person design matrix (see below) times the fixed population-level coefficient vector (b_g in the simulation code below), plus a participant-specific deviation u_i.
Combined two-level model
X is a second-level design matrix of between-person predictors for the intercept and other model parameters.
lindquist_mfx_level2.png
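In the same notation, the second-level and combined models can be sketched as follows (gamma denotes the fixed population-level coefficients, b_g in the simulation code; u_i ~ N(0, U_g) matches the group covariance used below):

```latex
% Second level
\beta_i = X_i \gamma + u_i, \qquad u_i \sim N(0, U_g)
% Combined model (substituting the second level into the first)
y_i = Z_i X_i \gamma + Z_i u_i + \epsilon_i
```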
Alternative notations
There are several other notation systems that differ somewhat from the one above. Here are the equations used in the Mumford video lectures.
mumford_mixedfx_equations.png

A simple example

Let's generate some data with made-up slopes. We have a simple model with a linear predictor and an intercept for each person.
 
disp('Design matrix');
Design matrix
Zi = [1 1 1 1 1; 0 1 2 3 4]'
Zi = 5×2
     1     0
     1     1
     1     2
     1     3
     1     4
 
disp('Number of subjects')
Number of subjects
m = 3
m = 3
 
disp('Number of observations within-person, and num predictors')
Number of observations within-person, and num predictors
[n_within, p] = size(Zi)
n_within = 5
p = 2
 
disp('Z is the Kronecker product of an [m x m] identity matrix and Zi');
Z is the Kronecker product of an [m x m] identity matrix and Zi
disp('The Kronecker product expands Zi into a block diagonal matrix for every subject.');
The Kronecker product expands Zi into a block diagonal matrix for every subject.
 
Z = kron(eye(m), Zi)
Z = 15×6
     1     0     0     0     0     0
     1     1     0     0     0     0
     1     2     0     0     0     0
     1     3     0     0     0     0
     1     4     0     0     0     0
     0     0     1     0     0     0
     0     0     1     1     0     0
     0     0     1     2     0     0
     0     0     1     3     0     0
     0     0     1     4     0     0
     ⋮
 
disp('Made-up true betas')
Made-up true betas
disp('Intercept and slope for each of m persons, concatenated')
Intercept and slope for each of m persons, concatenated
beta = [0 2 1 3 4 6]'
beta = 6×1
     0
     2
     1
     3
     4
     6
 
disp('True response');
True response
y = Z*beta % Now each person's fit is a combination of their individual Zi and beta_i
y = 15×1
     0
     2
     4
     6
     8
     1
     4
     7
    10
    13
     ⋮
 
% Plot the true response without error
ycell = mat2cell(y, n_within * ones(m, 1))';
x1 = repmat({Zi(:, 2)}, 1, m);
create_figure('example');
handles = line_plot_multisubject(x1, ycell);
________________________________________
Input data:
X scaling: No centering or z-scoring
Y scaling: No centering or z-scoring
Transformations: No data transformations before plot
Correlations: r = 0.68 across all observations, based on untransformed input data
Stats on slopes after transformation, subject is random effect:
Mean b = 3.67, t( 2) = 3.05, p = 0.092735, num. missing: 0
Average within-person r = 1.00 +- 0.00 (std)
Between-person r (across subject means) = NaN, p = NaN
________________________________________
set(handles.point_handles, 'MarkerSize', 8, 'Marker', 'o'); % we can use handles to reformat things
set(gca, 'XLim', [-0.5 4.5])
xlabel('X1'); ylabel('y');
The response (y) will also be measured with error, so let's add some errors.
Here, the errors are normally distributed with variance 1, and they are homoscedastic across subjects.
This is a common assumption in mixed effects models (e.g., lmer in R), but not all models make it.
e = randn(size(y, 1), 1);
 
y = y + e;
 
% Plot the true response with error
ycell = mat2cell(y, n_within * ones(m, 1))';
x1 = repmat({Zi(:, 2)}, 1, m);
create_figure('example');
handles = line_plot_multisubject(x1, ycell);
________________________________________
Input data:
X scaling: No centering or z-scoring
Y scaling: No centering or z-scoring
Transformations: No data transformations before plot
Correlations: r = 0.66 across all observations, based on untransformed input data
Stats on slopes after transformation, subject is random effect:
Mean b = 3.53, t( 2) = 2.67, p = 0.115979, num. missing: 0
Average within-person r = 0.98 +- 0.03 (std)
Between-person r (across subject means) = NaN, p = NaN
________________________________________
set(handles.point_handles, 'MarkerSize', 8, 'Marker', 'o'); % we can use handles to reformat things
set(gca, 'XLim', [-0.5 4.5])
xlabel('X1'); ylabel('y');
 
The variation in betas 1, 3, and 5 reflects individual differences in the intercept.
The variation in betas 2, 4, and 6 reflects individual differences in the slope of X1.
We set these to fixed values above, but in the next section, these individual differences will be randomly selected from a distribution. The "true" individual differences among subjects are thus random effects.

A more flexible simulation

Specify parameters and model structure

 
% number of subjects
m = 20;
 
% n observations within-subject
% For now, assume same across all subjects
 
n_within = 20;
 
n_i = n_within * ones(m, 1);
 
% Number of model parameters (regressors), including intercept
p = 2;
 
% Within-person residual noise variance
% For now, assume same across all subjects
sig_2_within = 1.1;
 
% s2 is 1 x m vector of residual variances
s2 = sig_2_within .* ones(1, m);
Fixed population-level effects
Specify a column vector of true population-level values for each effect, intercept followed by other parameters.
b_g = [5 2]';
 
% Check the size
if any(size(b_g) - [p 1])
    error('Population effects b_g is the wrong size')
end
Group-level covariance matrix
Diagonals are variances for each of the p parameters; by convention, the intercept is first.
Off-diagonals are covariances.
For example, for an intercept and slope (simple multilevel regression), U_g would be a 2 x 2 matrix with the variances of the intercept and slope on the diagonal.
 
U_g = [20 0; 0 10];
 
% Check the size
if ~all(size(U_g) == p)
    error('Covariance matrix U_g is the wrong size')
end
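If you expect intercepts and slopes to correlate across people, U_g can include a nonzero off-diagonal covariance. A hypothetical sketch (the covariance value -7 is made up for illustration; any covariance matrix must be symmetric positive definite, which chol can verify):

```matlab
% Hypothetical group covariance with correlated intercept and slope
% cov = r * sqrt(var_int * var_slope); here r is approx -0.5, so cov is approx -7
U_g_corr = [20 -7; -7 10];

% A valid covariance matrix must be symmetric positive definite
[~, notposdef] = chol(U_g_corr);
if notposdef
    error('U_g_corr is not positive definite')
end
```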
 
Fixed design matrix
These can be the same for all subjects, or different. They'd be the same if we were using a fixed experimental design with the same numbers of observations per person, and different if (1) our predictors are measured variables or (2) we have different numbers of observations.
We'll assume they are the same across participants for now, but allow the code to be easily extended to different design matrices across participants.
Predictors can also be categorical or continuous, and correlated or uncorrelated. We'll generate a simple effects-coded [1 -1] predictor for now, with approximately balanced numbers of observations.
The intercept is first (a column of 1's) by convention.
Z_within = [ones(n_within, 1)]; % intercept
 
Z_tmp = randn(n_within, p - 1);
 
Z_tmp(Z_tmp >= 0) = 1;
Z_tmp(Z_tmp < 0) = -1; % strict inequality, so values of exactly 0 stay mapped to 1
 
Z_within = [Z_within Z_tmp];
 
Z_i = {};
for i = 1:m % allows subjects to vary later if needed
    Z_i{i} = Z_within;
end
 
% Note: this could be modified to be more general
regressortype = 'binary'; % 'continuous' or 'binary'; the predictor generated above is effects-coded binary
Now we have all the variables we need to simulate data:
n_i: number of observations for each person i = 1...m
b_g: population-level true fixed-effect parameters
s2: residual variances for each person (1 x m)
U_g: group covariance matrix
Z_i: design matrix for each person

Questions to answer

Under most input parameter choices, the dataset will include a random subject intercept and random subject slope.
Which input parameter would you change to make the dataset consistent with a "random intercept only" model?
Which input parameter would you change to make the dataset consistent with a "random slope only" model?

Create simulated dataset

Generate subject-level true parameters
Assume a normal distribution with covariance U_g
beta_i here is a [p x m] matrix where column i is the [p x 1] true beta coefficient vector for subject i.
beta_i = mvnrnd(b_g, U_g, m)';
Generate data (y)
y = cell(1, m);

for i = 1:m
    y{i} = Z_i{i} * beta_i(:, i) + sqrt(s2(i)) .* randn(n_i(i), 1);
end
To fit the dataset, we'll need these variables:
y{i = 1...m} The DV/outcome
Z_i{i = 1...m} The design matrix for each person
CANlab functions like glmfit_multilevel, igls, and rigls can use these cell arrays directly.
For fitlm and fitlme in MATLAB, and lmer in R (etc.), we will want these in stacked, 'long format' tabular form for mixed effects models. This requires a grouping variable (here, a subject ID) as well. We will also create a column for each predictor and leave out the intercept, as it is added automatically.
% Make a long-format table with simulated data
 
sim_data = table(cat(1, y{:}), 'VariableNames', {'Y'});
 
sid_tmp = {}; % Subject ID
var_tmp = {}; % predictor variables

for i = 1:m

    % Subject ID
    sid_tmp{i} = i .* Z_i{i}(:, 1); % use intercept in case n's vary

    for j = 1:(p-1)
        var_tmp{j}{i} = Z_i{i}(:, j+1);
    end

end
 
sim_data.sid = cat(1, sid_tmp{:});
 
% add other variables
for j = 1:(p-1)
    varname = sprintf('x%02d', j);
    sim_data.(varname) = cat(1, var_tmp{j}{:});
end
 
sim_data
sim_data = 400×3 table
            Y        sid    x01
   1     -4.0320     1       1
   2     -3.8698     1       1
   3     -4.0851     1      -1
   4     -3.9542     1       1
   5     -3.8040     1      -1
   6      0.1725     1      -1
   7     -2.8657     1      -1
   8     -2.5798     1      -1
   9     -2.6348     1      -1
  10      0.5730     1      -1
  11     -3.2829     1      -1
  12     -3.1335     1      -1
  13     -3.4389     1      -1
  14     -4.9082     1       1
  15     -2.4587     1      -1
  16     -3.7196     1      -1
  17     -2.6849     1      -1
  18     -5.1986     1       1
  19     -5.8346     1       1
  20     -3.2149     1       1
  21     19.9078     2       1
  22     21.5983     2       1
  23      1.4332     2      -1
  24     21.0363     2       1
  25      5.0081     2      -1
  26      3.0901     2      -1
  27      4.1288     2      -1
  28      4.5823     2      -1
  29      2.9716     2      -1
  30      3.3803     2      -1
  31      2.8012     2      -1
  32      3.2771     2      -1
  33      3.1316     2      -1
  34     20.5364     2       1
  35      4.0396     2      -1
  36      0.9159     2      -1
  37      3.4813     2      -1
  38     20.6233     2       1
  39     20.8694     2       1
  40     20.4360     2       1
  41      3.1898     3       1
  42      3.2318     3       1
  43      3.3912     3      -1
  44      6.2155     3       1
  45      3.6417     3      -1
  46      3.1544     3      -1
  47      1.9438     3      -1
  48      5.4872     3      -1
  49      3.6102     3      -1
  50      3.5736     3      -1
  51      2.6696     3      -1
  52      4.5774     3      -1
  53      3.6798     3      -1
  54      4.4016     3       1
  55      3.6392     3      -1
  56      5.2595     3      -1
  57      3.1658     3      -1
  58      2.6947     3       1
  59      2.6559     3       1
  60      4.4640     3       1
  61     13.0131     4       1
  62     12.5867     4       1
  63     17.5766     4      -1
  64     12.1720     4       1
  65     16.2980     4      -1
  66     15.2236     4      -1
  67     15.2107     4      -1
  68     15.0313     4      -1
  69     14.5745     4      -1
  70     13.0986     4      -1
  71     14.5713     4      -1
  72     14.9110     4      -1
  73     14.1387     4      -1
  74     13.2338     4       1
  75     16.6579     4      -1
  76     15.6732     4      -1
  77     12.6105     4      -1
  78     13.6146     4       1
  79     12.7171     4       1
  80     13.4009     4       1
  81     12.3251     5       1
  82     13.1484     5       1
  83      9.1100     5      -1
  84     15.9891     5       1
  85      9.3342     5      -1
  86      8.9014     5      -1
  87      9.1219     5      -1
  88      7.5665     5      -1
  89      9.4183     5      -1
  90      8.2049     5      -1
  91      9.1913     5      -1
  92     10.1201     5      -1
  93      8.3289     5      -1
  94     12.6770     5       1
  95      8.9499     5      -1
  96      9.5138     5      -1
  97      9.0417     5      -1
  98     13.1640     5       1
  99     12.8173     5       1
 100     11.4494     5       1

Plot the dataset

For plotting, we may need to extract a predictor. The code below extracts the first non-intercept predictor (if p > 1).
We'll use the CANlab function line_plot_multisubject. It gives us some output and basic stats on the relationship between x1 and y as well.
if p > 1
    x1 = cellfun(@(x) x(:, 2), Z_i, 'UniformOutput', false);

    create_figure('multisubject data')
    disp('line_plot_multisubject output:')
    [han, ~, ~, slope_stats] = line_plot_multisubject(x1, y);

    xlabel('x1')
    ylabel('y')
    title('Simulated data')

else
    disp('Intercept only: No x1-y relationship to plot')
end
ans =
  Figure (multisubject data) with properties:

    Number: 2
    Name: 'multisubject data'
    Color: [1 1 1]
    Position: [584 595 560 420]
    Units: 'pixels'

  Show all properties
line_plot_multisubject output:
Warnings:
________________________________________
Y: input cells with low varability: 3
________________________________________

________________________________________
Input data:
X scaling: No centering or z-scoring
Y scaling: No centering or z-scoring
Transformations: No data transformations before plot
Correlations: r = 0.47 across all observations, based on untransformed input data
Stats on slopes after transformation, subject is random effect:
Mean b = 3.54, t( 19) = 4.27, p = 0.000413, num. missing: 0
Average within-person r = 0.58 +- 0.71 (std)
Between-person r (across subject means) = -0.00, p = 1.000000
________________________________________
 

Fit the model

glmfit_multilevel

glmfit_multilevel is a CANlab function. It is a fast and simple option for running a two-level mixed effects model with participant as a random effect. It implements a random-intercept, random-slope model across second-level units (e.g., participants): it fits regressions for individual second-level units, and then (optionally, if the 'weighted' keyword is entered) uses a precision-weighted least squares approach to model group effects. It thus treats participants as a random effect. This is appropriate when second-level units are participants and first-level units are observations (e.g., trials) within participants, which is the use case glmfit_multilevel was designed for.
Options:
glmfit_multilevel includes some options that are not available in many mixed effects implementations, including bootstrapping and sign permutation for inference.
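A hypothetical sketch of such calls (the 'weighted' keyword is mentioned above; the 'boot' and 'signperm' keywords are assumptions based on typical CANlab conventions, so check help glmfit_multilevel for the actual option names):

```matlab
% Precision-weighted group model (keyword confirmed in the text above)
stats_w = glmfit_multilevel(y, x1, [], 'weighted', 'verbose');

% Bootstrap or sign-permutation inference (keyword names assumed; verify in the help)
stats_boot = glmfit_multilevel(y, x1, [], 'boot');
stats_sp   = glmfit_multilevel(y, x1, [], 'signperm');
```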
disp('glmfit_multilevel output:')
glmfit_multilevel output:
 
stats = glmfit_multilevel(y, x1, [], 'verbose');
Analysis description: Second Level of Multilevel Model
GLS analysis
Observations: 20, Predictors: 1
Outcome names: 1st-level B1  1st-level B2
Weighting option: unweighted
Inference option: t-test
Other Options:
  Plots: No
  Robust: no
  Save: No  Output file: glmfit_general_output.txt
Total time: 0.00 s
________________________________________
Second Level of Multilevel Model

Outcome variables:
  1st-level B1  1st-level B2
Adj. mean      3.64      3.54

2nd-level B01
          1st-level B1    1st-level B2
Coeff         3.64            3.54
STE           1.14            0.83
t             3.20            4.27
Z             2.83            3.53
p          0.00469         0.00041
________________________________________
 

fitlme

This is MATLAB's object-oriented equivalent to lmer in R. It fits a variety of mixed effects models. A list of methods and properties is shown below:
These models, like many other statistics objects implementing linear models, allow you to specify model formulas using Wilkinson notation. There is also a matrix input version.
help LinearMixedModel
LinearMixedModel  Fitted linear mixed effects model.
    LME = FITLME(...) fits a linear mixed effects model to data. The fitted model LME
    is a LinearMixedModel modeling a response variable as a linear function of fixed
    effect predictors and random effect predictors.

    LinearMixedModel methods:
        coefCI - Coefficient confidence intervals
        coefTest - Linear hypothesis test on coefficients
        predict - Compute predicted values given predictor values
        random - Generate random response values given predictor values
        partialDependence - Partial Dependence values.
        plotPartialDependence - Plot Partial Dependence.
        plotResiduals - Plot of residuals
        designMatrix - Fixed and random effects design matrices
        fixedEffects - Stats on fixed effects
        randomEffects - Stats on random effects
        covarianceParameters - Stats on covariance parameters
        fitted - Fitted response
        residuals - Various types of residuals
        response - Response used to fit the model
        compare - Compare fitted models
        anova - Marginal tests for fixed effect terms
        disp - Display a fitted model

    LinearMixedModel properties:
        FitMethod - Method used for fitting (either 'ML' or 'REML')
        MSE - Mean squared error (estimate of residual variance)
        Formula - Representation of the model used in this fit
        LogLikelihood - Log of likelihood function at coefficient estimates
        DFE - Degrees of freedom for residuals
        SSE - Error sum of squares
        SST - Total sum of squares
        SSR - Regression sum of squares
        CoefficientCovariance - Covariance matrix for coefficient estimates
        CoefficientNames - Coefficient names
        NumCoefficients - Number of coefficients
        NumEstimatedCoefficients - Number of estimated coefficients
        Coefficients - Coefficients and related statistics
        Rsquared - R-squared and adjusted R-squared
        ModelCriterion - AIC and other model criteria
        VariableInfo - Information about variables used in the fit
        ObservationInfo - Information about observations used in the fit
        Variables - Table of variables used in fit
        NumVariables - Number of variables used in fit
        VariableNames - Names of variables used in fit
        NumPredictors - Number of predictors
        PredictorNames - Names of predictors
        ResponseName - Name of response
        NumObservations - Number of observations in the fit
        ObservationNames - Names of observations in the fit

    See also fitlme, fitlmematrix, LinearModel, GeneralizedLinearModel, NonLinearModel.

    Documentation for LinearMixedModel
 
disp('fitlme output:')
fitlme output:
 
lme = fitlme(sim_data,'Y ~ x01 + (1 + x01|sid)')
lme =
Linear mixed-effects model fit by ML

Model information:
    Number of observations             400
    Fixed effects coefficients           2
    Random effects coefficients         40
    Covariance parameters                4

Formula:
    Y ~ 1 + x01 + (1 + x01 | sid)

Model fit statistics:
    AIC       BIC     LogLikelihood    Deviance
    1433.1    1457    -710.54          1421.1

Fixed effects coefficients (95% CIs):
    Name                 Estimate    SE         tStat     DF     pValue        Lower     Upper
    {'(Intercept)'}      3.6373      1.1069     3.2859    398    0.0011066     1.4611    5.8135
    {'x01'        }      3.5392      0.80765    4.3821    398    1.5057e-05    1.9514    5.127

Random effects covariance parameters (95% CIs):
Group: sid (20 Levels)
    Name1                  Name2                  Type          Estimate    Lower       Upper
    {'(Intercept)'}        {'(Intercept)'}        {'std' }      4.944       3.6236      6.7455
    {'x01'        }        {'(Intercept)'}        {'corr'}      -0.21241    -0.57555    0.22071
    {'x01'        }        {'x01'        }        {'std' }      3.6031      2.6389      4.9195

Group: Error
    Name             Estimate    Lower     Upper
    {'Res Std'}      1.0772      1.0013    1.1588
 
anova(lme, 'DFMethod', 'Satterthwaite')
ans =
ANOVA marginal tests: DFMethod = 'Satterthwaite'

    Term                 FStat     DF1    DF2    pValue
    {'(Intercept)'}      10.797    1      20     0.0036938
    {'x01'        }      19.203    1      20     0.00028801
pval = coefTest(lme,[1 0]);
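The fitted lme object can also return the estimated fixed effects and per-subject deviations directly; a brief sketch using methods listed in the help text above:

```matlab
% Fixed-effect estimates (should be near the true b_g = [5 2]')
[beta_fixed, beta_names] = fixedEffects(lme);

% Per-subject random-effect deviations u_i
% (40 values here: intercept and slope deviations for each of 20 subjects)
[B, Bnames] = randomEffects(lme);

% Estimated covariance parameters for the random effects, plus residual variance
[psi, mse] = covarianceParameters(lme);
```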

igls / rigls

disp('igls output:')
igls output:
 
igls_stats = igls(y, x1, 'plot', 'all');
MST (diag of Ystar_mtx): 40.85556
MSE (diag of resid): 0.00000

igls.m output report:
---------------------------------------
Data: 20 observations x 20 subjects
Fit Type: igls
AR(p) model: No
Converged: Yes
Max iterations: 5, Completed iterations: 1
Epsilon for convergence: 0.010000
Elapsed time: 0.32 s

Statistics: Tests of inference on fixed population parameters
Parameter    est.     t( 19)    p
Intcpt.      3.637    3.29      0.003885 **
Slope1       3.539    4.38      0.000320 ***

Statistics: Tests of significance on random effects (variances)
Parameter    est.      LRT        p
Intcpt.      24.440    1047.88    0.000000 ***
Slope1       12.979    832.30     0.000000 ***
---------------------------------------
disp('Est. random-effect std. deviations (sqrt of variances): '); disp(sqrt(igls_stats.betastar)');
Est. random-effect std. deviations (sqrt of variances):
    4.9436    3.6026
 
 
disp('Estimated fixed effects')
Estimated fixed effects
igls_stats.beta
ans = 2×1
    3.6373
    3.5392
 
disp('True fixed effects')
True fixed effects
b_g
b_g = 2×1
     5
     2
 
disp('Estimated random-effect variances')
Estimated random-effect variances
igls_stats.betastar
ans = 2×1
   24.4395
   12.9790
 
disp('True random-effect variances')
True random-effect variances
diag(U_g)
ans = 2×1
    20
    10
 

Hands-on Activities

  1. Identify and compare residual variance terms, slope estimates, and t-values across models.
  2. Modify the code to generate one or more continuous predictors instead of categorical/experimental ones.
  3. Turn the data-generation script into a function, so that you can efficiently simulate power and false positive rates using repeated simulations -- two key properties of any statistical model.
A MATLAB version is included in the function sim_generate_mixedfx_data1, which is called below. You can create your own version in the software package of your choice (e.g., Python, R).
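As a sketch of activity 3, a repeated-simulation loop might look like this (assumes sim_generate_mixedfx_data1 is on the path with the signature used below; setting b_g(2) = 0 would turn the power estimate into a false positive rate):

```matlab
nsim = 100;                 % number of simulated datasets
p_slope = zeros(nsim, 1);   % p-value for the x01 fixed effect in each simulation

for s = 1:nsim
    % Generate a new dataset with the parameters defined above
    [y, Z_i, x1, sim_data] = sim_generate_mixedfx_data1(m, n_i, p, b_g, s2, U_g, regressortype);

    % Fit the random-intercept, random-slope model and save the slope p-value
    lme = fitlme(sim_data, 'Y ~ x01 + (1 + x01|sid)');
    p_slope(s) = lme.Coefficients.pValue(2);
end

% Proportion of significant slope tests:
% power (TPR) if b_g(2) ~= 0, false positive rate if b_g(2) == 0
tpr = mean(p_slope < 0.05)
```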
 
regressortype = 'categorical';
 
[y, Z_i, x1, sim_data] = sim_generate_mixedfx_data1(m, n_i, p, b_g, s2, U_g, regressortype);
 
sim_data
sim_data = 400×3 table
            Y        sid    x01
   1      7.1248     1       1
   2      2.8940     1      -1
   3      4.7317     1      -1
   4      4.0255     1      -1
   5      3.7738     1      -1
   6      2.7529     1      -1
   7      3.6547     1      -1
   8      2.1745     1      -1
   9      4.8593     1       1
  10      6.9221     1       1
  11      6.6253     1       1
  12      3.8653     1      -1
  13      7.6663     1       1
  14      5.2197     1       1
  15      0.2628     1      -1
  16      3.5789     1      -1
  17      3.0592     1      -1
  18      1.8200     1      -1
  19      2.3574     1      -1
  20      6.2340     1       1
  21      8.7746     2       1
  22      8.0283     2      -1
  23      8.6770     2      -1
  24      4.6879     2       1
  25      7.0477     2       1
  26      7.3928     2      -1
  27      7.1164     2       1
  28      8.9804     2      -1
  29      6.9855     2      -1
  30      7.4228     2       1
  31      7.6702     2       1
  32      7.6151     2      -1
  33      9.3147     2      -1
  34      7.7790     2      -1
  35      5.7913     2       1
  36      6.8585     2       1
  37      5.9845     2       1
  38      6.8890     2       1
  39      7.3789     2      -1
  40      6.4040     2      -1
  41     18.1539     3      -1
  42     12.3011     3       1
  43     15.7589     3      -1
  44     17.2003     3      -1
  45     11.3619     3       1
  46     11.0500     3       1
  47     16.2744     3      -1
  48     18.8776     3      -1
  49     12.2960     3       1
  50     18.6082     3      -1
  51     11.9249     3       1
  52     11.5412     3       1
  53     16.8547     3      -1
  54     17.1640     3      -1
  55     17.7628     3      -1
  56     17.6201     3      -1
  57     16.3719     3      -1
  58     18.2178     3      -1
  59     18.2842     3      -1
  60     10.4895     3       1
  61     11.4245     4       1
  62     11.3352     4       1
  63      9.9247     4       1
  64     11.4908     4       1
  65      8.5905     4       1
  66     11.0003     4       1
  67     11.3002     4      -1
  68      9.2207     4       1
  69     12.6647     4      -1
  70     13.6362     4      -1
  71      9.0691     4       1
  72      8.3491     4       1
  73     10.1232     4       1
  74      9.0750     4       1
  75     12.3376     4      -1
  76      7.8899     4       1
  77     14.0128     4      -1
  78     10.2648     4       1
  79     12.9940     4      -1
  80     11.4644     4      -1
  81     -3.5661     5      -1
  82      1.3834     5       1
  83     -5.1804     5      -1
  84     -4.0787     5      -1
  85     -3.9882     5      -1
  86      2.6419     5       1
  87     -2.2328     5      -1
  88     -4.2172     5      -1
  89      0.1179     5       1
  90     -1.7255     5      -1
  91     -5.0100     5      -1
  92     -3.9038     5      -1
  93      1.2711     5       1
  94      2.0460     5       1
  95     -2.9332     5      -1
  96     -4.4540     5      -1
  97      2.3555     5       1
  98     -4.2770     5      -1
  99     -4.9389     5      -1
 100     -4.4625     5      -1
 

Extended simulations

With a bit more extension, these simulations can be used to answer a number of questions.
  1. Simulate data in R and fit using lmer. Compare/contrast the output and create a "cheat sheet" comparing the two.
  2. Compare TPR and FPR across methods: MATLAB, R, and glmfit_multilevel.
  3. Compare with repeated-measures (RM) ANOVA.
  4. Simulate and compare true positive rates (TPR, power) and false positive rates (FPR) in these interesting cases: