Simulation Study: CFA

Does respecting latent structure help prediction, or does flexible ML beat it under messy conditions?


CFA Simulation and Model Comparison

This simulation generates data from a Confirmatory Factor Analysis (CFA) measurement model:

  • Two latent factors (F1, F2) are simulated with a correlation r_F12.
  • Six items (X1–X6) are generated as linear combinations of these factors plus residual error, using a specified loading matrix (Λ) and residual variances (Θ).
  • Optional complexities (heterogeneous loadings, cross-loadings, residual correlations, ordinalization) mimic common violations of the “clean CFA” assumption. (Currently, items are continuous — the ordinal option is available for future runs.)
  • An outcome variable (Y) is then created as a function of the latent factors, with optional nonlinearities (interaction, quadratic, threshold, mixture).

This setup means each dataset has a known CFA structure:
$$X = \Lambda F + E$$ where X are the observed items, F are the latent factors, and E are residuals.
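As a concrete illustration, the minimal R sketch below generates one dataset from this measurement model under clean conditions. Parameter names such as r_F12 and beta_Y mirror the simulation settings described later in this README, but the values and code structure are illustrative, not the script's actual implementation.

```r
# Minimal sketch of the data-generating process under clean conditions.
# Values are illustrative; the real script builds these from preset_cfg().
library(MASS)  # mvrnorm() for multivariate normal draws

set.seed(123)
N     <- 500
r_F12 <- 0.30  # latent factor correlation
F_mat <- mvrnorm(N, mu = c(0, 0),
                 Sigma = matrix(c(1, r_F12, r_F12, 1), nrow = 2))

# Loading matrix Lambda: three indicators per factor, simple structure
Lambda <- rbind(c(0.8, 0), c(0.7, 0), c(0.6, 0),
                c(0, 0.8), c(0, 0.7), c(0, 0.6))
# Residual variances Theta chosen so each standardized item has unit variance
Theta <- diag(1 - rowSums(Lambda^2))

E <- mvrnorm(N, mu = rep(0, 6), Sigma = Theta)
X <- F_mat %*% t(Lambda) + E  # X = Lambda F + E
colnames(X) <- paste0("X", 1:6)

# Outcome: linear in the latent factors plus noise (nonlinear terms optional)
beta_Y <- c(0.5, 0.3)
Y   <- as.numeric(F_mat %*% beta_Y + rnorm(N, sd = 0.8))
dat <- data.frame(X, Y = Y)
```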


How CFA and Feature Sets Are Used

  • A simple two-factor CFA model is fitted in lavaan on the training items (X1–X6).

  • The fitted model is then applied to both train and test sets to generate factor scores (see the sketch after this list).

  • Factor scores serve as predictors of the outcome Y alongside other feature sets, allowing direct comparison of different ways of representing latent constructs:

    • Items – six raw observed indicators (X1–X6) used directly as predictors.
    • Sum scores – classical shortcut: compute the mean of items per factor (F1_sum = mean of X1–X3, F2_sum = mean of X4–X6).
    • Factor scores – CFA-estimated latent variables (F1, F2) extracted with lavaan. This represents the psychometric approach.
    • All – combine all representations (items + sum scores + factor scores) into one feature set, letting the prediction model decide which inputs contribute most.
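A minimal sketch of the CFA step, assuming `train` and `test` data frames with items X1–X6 already exist; the model syntax below is the standard lavaan two-factor specification, though the script's exact call may differ:

```r
library(lavaan)

# Simple-structure two-factor measurement model
model <- '
  F1 =~ X1 + X2 + X3
  F2 =~ X4 + X5 + X6
'
fit <- cfa(model, data = train)

# Factor scores for both splits, from the model fitted on training data only
fs_train <- lavPredict(fit, newdata = train)  # matrix with columns F1, F2
fs_test  <- lavPredict(fit, newdata = test)
```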

Measurement Error in the Simulation

A key motivation for this study is to understand how measurement error affects prediction when using different representations of latent constructs.

How measurement error is built in

In CFA, each observed item is modeled as:

$$X = \Lambda F + E$$

  • ΛF (signal): the part of item variance explained by the latent factor(s).
  • E (error): the residual variance not explained by the factors.

In the simulation, measurement error appears in several ways:

  • Baseline residual variance:
    Even in clean conditions, items have error variance (e.g., for standardized items, a loading of 0.8 means 64% signal and 36% error; worked out after this list).

  • Cross-loadings:
    Items are allowed to load on both factors, contaminating measurement and reducing construct clarity.

  • Residual correlations:
    Item errors are correlated, breaking local independence and introducing shared error variance.

  • Heterogeneous loadings:
    Some items are strong measures, others are weak. Weak items carry more error relative to signal.
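For a standardized item with loading λ and no cross-loadings, the signal/error split referenced above follows directly from the measurement model:

$$ \operatorname{Var}(X) = \lambda^2 \operatorname{Var}(F) + \theta = \lambda^2 + \theta = 1, \qquad \lambda = 0.8 \;\Rightarrow\; \lambda^2 = 0.64 \text{ (signal)}, \;\; \theta = 0.36 \text{ (error)} $$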

Why this matters for prediction

  • Factor scores (CFA) try to recover latent factors by modeling Λ and Θ, thereby reducing measurement error.
  • Sum scores and raw items mix signal and error without correction.
  • XGBoost does not model error explicitly but can sometimes down-weight noisy predictors if they hurt prediction.

Consequences

  • Under clean measurement (L0–L1), CFA factor scores + OLS produce stable, interpretable predictors because error is well modeled.
  • Under messy measurement (L2–L4), CFA assumptions are violated (cross-loadings, correlated residuals, uneven loadings). In these conditions, the benefits of CFA weaken, and flexible ML methods (like XGBoost) can outperform by directly leveraging predictive patterns, even if they partially reflect noise.

Why compare CFA and ML?

  • CFA factor scores represent the theory-driven measurement approach:

    • Latent constructs are extracted from noisy items.
    • Assumes a linear measurement model and local independence.
    • Prioritizes validity and interpretability.
  • OLS regression provides the classical statistical baseline:

    • Estimates coefficients by minimizing the sum of squared residuals.
    • Can be run on items, sum scores, or factor scores.
    • Sensitive to measurement error, scaling, and misspecification.
    • The fitted model takes the form:

    $$ Y = \beta_0 + \beta_1X_1 + \beta_2X_2 + \dots + \beta_pX_p + \varepsilon $$

  • XGBoost (gradient boosting trees) represents the flexible machine learning approach:

    • Captures nonlinearities, thresholds, and interactions automatically.
    • Does not rely on latent structure assumptions.
    • Robust to noise but less interpretable, and depends on hyperparameters.
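The contrast can be sketched in a few lines of R, assuming the objects from the CFA sketch above (`fs_train`, `fs_test`, `train`, `test`); hyperparameters are placeholders, not the study's tuned values:

```r
library(xgboost)

# OLS on CFA factor scores: the theory-driven pipeline
ols_dat  <- data.frame(fs_train, Y = train$Y)
ols_fit  <- lm(Y ~ F1 + F2, data = ols_dat)
pred_ols <- predict(ols_fit, newdata = data.frame(fs_test))

# XGBoost on the raw items: no latent structure assumed
dtrain  <- xgb.DMatrix(as.matrix(train[, paste0("X", 1:6)]), label = train$Y)
xgb_fit <- xgb.train(params = list(eta = 0.1, max_depth = 3,
                                   subsample = 0.8, colsample_bytree = 0.8),
                     data = dtrain, nrounds = 200, verbose = 0)
pred_xgb <- predict(xgb_fit, as.matrix(test[, paste0("X", 1:6)]))
```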


Notes and Clarifications

  • CFA factor scores are estimates, not true factors.
    This explains why they do not always dominate even under clean conditions: factor scores are imperfect proxies of the latent variables.

  • OLS on factor scores is not “pure CFA.”
    The pipeline uses CFA to estimate factor scores, then applies OLS regression. This is a hybrid approach (psychometric measurement feeding into classical regression), not a direct comparison of CFA vs ML.

  • Ordinal items are optional but currently disabled.
    All items are simulated as continuous in the present analyses. An ordinalization option exists in the code, but if enabled, the items are still treated as continuous by lavaan. This is acceptable for prediction-focused comparisons, but would matter for psychometric model fit evaluation.


What the presets test

  • L0_clean: Ideal conditions → CFA factor scores and OLS should perform well.
  • L1_nonlinear: Adds interaction → XGBoost gains an edge by capturing nonlinearity.
  • L2_measure / L3_mixed: Add cross-loadings, residual correlations, and heterogeneous loadings → CFA assumptions are violated, ML becomes more competitive.
  • L4_spicy: Combines strong nonlinearities + measurement noise + mixture subgroups → stress-test scenario where ML flexibility often outperforms psychometric factor scores.

Key takeaway

This framework lets us ask:

  • How much do theory-driven measurement models (CFA factor scores) help prediction when data are clean?
  • How much do flexible ML models gain when data include nonlinearities and messy measurement structures?

In other words, it bridges psychometric measurement and predictive machine learning, showing under which conditions each approach has advantages.


Contents

  • 00_ALL_IN_ONE_CFA.R
    End-to-end driver script for the simulation study. Specifically:

    • Defines presets (L0_clean, L1_nonlinear, L2_measure, L3_mixed, L4_spicy) via preset_cfg().
    • Generates replicated CFA-style datasets for each preset.
    • Splits datasets into training/test sets (default 70/30).
    • Fits predictive models on multiple input types:
      • OLS regression (items, sums, factor scores, or combined).
      • XGBoost (tree-based gradient boosting).
    • Collects predictions (Ŷ) vs. true outcomes (Y).
    • Computes performance metrics:
      • RMSE (Root Mean Squared Error).
      • R² (variance explained in Y).
    • Aggregates results by model, score type, and preset, saving tidy CSVs.
    • Produces summary plots (R², RMSE distributions, boxplots, rainclouds).
  • scripts/utils_pilot.R
    Helper functions used by the main script, e.g. reproducible train/test splitting, metric calculations, and plotting utilities (a hypothetical sketch of the split helper follows this list).

  • data/
    Project output directory (created automatically).

    • data/sim/ – raw simulated datasets and manifest.
    • data/pred/ – per-dataset predictions + index.
    • data/out/ – aggregated metrics and plots.
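For orientation, a reproducible split helper might look like the sketch below; the function name and interface are hypothetical, not necessarily what utils_pilot.R defines:

```r
# Hypothetical split helper (the actual utils_pilot.R interface may differ)
split_train_test <- function(dat, test_ratio = 0.3, seed = 1) {
  set.seed(seed)  # reproducible split
  idx <- sample(nrow(dat), round(nrow(dat) * test_ratio))
  list(train = dat[-idx, ], test = dat[idx, ])
}

parts <- split_train_test(dat, test_ratio = 0.3, seed = 42)
train <- parts$train
test  <- parts$test
```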

How to Run

  1. Open 00_ALL_IN_ONE_CFA.R.
  2. Set the mode variable at the top to one of the presets (see the snippet after these steps):
    • L0_clean (baseline, linear)
    • L1_nonlinear (adds mild interaction)
    • L2_measure (adds measurement error + quadratic)
    • L3_mixed (heterogeneous loadings + more nonlinearity)
    • L4_spicy (full chaos: nonlinearities, threshold, mixture, cross-loadings).
  3. Run the script. It will:
    • Generate datasets (with replicates).
    • Train models (OLS, XGBoost).
    • Save predictions, metrics, and plots.
  4. Inspect results in data/out/ (metrics CSVs + plots).
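The top of the script would then look roughly like this (illustrative; only preset_cfg() and the preset names are taken from the script description above):

```r
# Choose a preset and build its configuration
mode <- "L2_measure"      # one of: L0_clean, L1_nonlinear, L2_measure, L3_mixed, L4_spicy
cfg  <- preset_cfg(mode)  # settings list for that preset
```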

Preset Configurations

The simulation includes five preset modes, each adding layers of complexity. Below is a comparison of the main design features:

| Preset | Nonlinearities | Measurement tweaks | Factor loadings | Other twists |
| --- | --- | --- | --- | --- |
| L0_clean | None (linear only) | None (no cross-loadings, no residual cov, continuous items) | Balanced (0.80, 0.70, 0.60) | Baseline “clean” scenario |
| L1_nonlinear | Interaction only (β_int = 0.20) | None (clean measurement) | Balanced | Mild departure from linearity |
| L2_measure | Interaction (0.20) + quadratic (0.10) | Cross-loadings (0.20); residual cov (ρ = 0.30) on (1,2) and (5,6) | Balanced | First level of measurement “mess” |
| L3_mixed | Interaction (0.25) + quadratic (0.20) | Same as L2 (cross-loadings 0.20, residual cov 0.30) | Heterogeneous (0.85, 0.55, 0.35) | Mix of nonlinearity + uneven measurement strength |
| L4_spicy | Interaction (0.40) + quadratic (0.30) + threshold (β_thr = 0.25, thr_F1 = 1.0) + mixture (50% alt group: β_Y_B = [0.6, –0.2], β_int_B = 0.20) | Heavy cross-loadings (0.30); residual cov (ρ = 0.40) on (1,2), (3,4), (5,6) | Heterogeneous (0.85, 0.55, 0.35) | “Chaos mode”: strongest nonlinearities + group mixture |

Simulation Settings

  1. Core design
    • N → sample size per dataset (number of respondents).
    • r_F12 → correlation between the two latent factors (F1, F2).
    • R2_target_Y → target proportion of variance in the outcome (Y) explained by the latent predictors (noise is adjusted accordingly).
    • β_Y → regression coefficients linking F1 and F2 to Y (strength of prediction).
  2. Factor loadings (measurement model)
    • load_F1 / load_F2 → default loadings for the three indicators of each factor.
    • hetero_loads → if TRUE, use uneven (strong, medium, weak) loadings.
    • load_F1_hetero / load_F2_hetero → actual heterogeneous loadings if used.
  3. Measurement tweaks (extra imperfections)
    • crossload_size → artificial cross-loadings between factors.
    • resid_cov_pairs → pairs of items with correlated residuals.
    • resid_cov_rho → strength of residual correlation.
    • make_ordinal → discretize items into Likert-type categories. (Currently disabled, so all items are continuous.)
    • likert_K → number of Likert categories (e.g., 4, 5).
  4. Outcome nonlinearities
    • has_interaction, β_int → interaction term (F1 × F2).
    • has_quadratic, β_quad → quadratic effect (F1²).
    • has_threshold, thr_F1, β_thr → threshold effect (extra bump if F1 > thr).
    • has_mixture, mix_p, β_Y_B, β_int_B → mixture model with subgroup-specific betas.
  5. Reproducibility
    • seed_base → starting random seed.
    • replicates → number of datasets generated per run.
  6. Training settings
    • test_ratio → proportion of cases reserved for testing (default 30%).
    • use_xgboost, use_sum_scores, use_factor_scores, use_combined → toggles for which models/feature sets are run.
    • scale_items, scale_sum, scale_factor, scale_all → scaling options for OLS (trees are scale-invariant).
  7. XGBoost parameters
    • xgb_eta → learning rate.
    • xgb_max_depth → tree depth.
    • xgb_subsample, xgb_colsample → row and feature subsampling.
    • xgb_nrounds, xgb_esr → max boosting iterations and early stopping.
  8. Session tag
    • session_tag → label to identify runs (e.g., “L0_clean”, “L4_spicy”).
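Collected into one object, a configuration covering these settings could look like the hypothetical list below (illustrative values; the actual structure returned by preset_cfg() may differ):

```r
cfg <- list(
  # 1. Core design
  N = 500, r_F12 = 0.30, R2_target_Y = 0.50, beta_Y = c(0.5, 0.3),
  # 2. Factor loadings
  load_F1 = c(0.80, 0.70, 0.60), load_F2 = c(0.80, 0.70, 0.60),
  hetero_loads = FALSE,
  # 3. Measurement tweaks
  crossload_size = 0, resid_cov_pairs = list(), resid_cov_rho = 0,
  make_ordinal = FALSE, likert_K = 5,
  # 4. Outcome nonlinearities
  has_interaction = FALSE, beta_int = 0,
  has_quadratic   = FALSE, beta_quad = 0,
  has_threshold   = FALSE, thr_F1 = 1.0, beta_thr = 0,
  has_mixture     = FALSE, mix_p = 0.5,
  # 5. Reproducibility
  seed_base = 123, replicates = 20,
  # 6. Training settings
  test_ratio = 0.30, use_xgboost = TRUE, use_sum_scores = TRUE,
  use_factor_scores = TRUE, use_combined = TRUE,
  # 7. XGBoost parameters
  xgb_eta = 0.1, xgb_max_depth = 3, xgb_subsample = 0.8,
  xgb_colsample = 0.8, xgb_nrounds = 500, xgb_esr = 25,
  # 8. Session tag
  session_tag = "L0_clean"
)
```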

Evaluation Metrics

Predictive performance is evaluated using:

  • Root Mean Squared Error (RMSE):

    $$ RMSE = \sqrt{\tfrac{1}{n} \sum (y - \hat{y})^2} $$

  • Coefficient of Determination (R²):

    $$ R^2 = 1 - \tfrac{MSE}{Var(Y)} $$
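Both metrics reduce to one-liners in R (a minimal sketch matching the formulas above):

```r
# RMSE: root mean squared prediction error
rmse <- function(y, yhat) sqrt(mean((y - yhat)^2))

# Out-of-sample R^2: 1 - MSE relative to the outcome's variance
r2 <- function(y, yhat) 1 - mean((y - yhat)^2) / mean((y - mean(y))^2)
```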


Results Overview

The figures below summarize predictive performance (R² and RMSE) across all presets, models, and feature sets.
(Note: these are predictive metrics, not CFA model fit indices like CFI or RMSEA.)


1. Bar plots (mean ± SE by preset)

  • What you see:
    Each panel corresponds to a preset (L0_clean → L4_spicy). Bars show average R² or RMSE with error bars (± standard error), split by model (lm vs xgboost) and feature set (items, sum, factor, all).

  • Interpretation:

    • In L0_clean and L1_nonlinear, OLS is equal to or slightly better than XGBoost.
    • Starting from L2_measure, XGBoost catches up and then surpasses OLS as data complexity rises.
    • By L4_spicy, the gap is clear: OLS struggles, while XGBoost maintains higher R² and lower RMSE.

2. Line plots (trends across presets)

  • What you see:
    Average R² (or RMSE) plotted across presets, shown separately for each feature set. Lines track how performance changes as the data-generating process becomes more complex.

  • Interpretation:

    • R² declines steadily from clean to spicy scenarios — prediction becomes harder overall.
    • OLS lines drop more steeply than XGBoost lines, especially for raw items and sum scores.
    • RMSE increases with preset complexity, again with OLS deteriorating faster.
    • This shows how nonlinearity and measurement mess increasingly favor ML methods.

3. Heatmaps (Δ performance: XGBoost – OLS)

  • What you see:
    Differences in R² and RMSE between models, by feature set and preset.
    • Positive ΔR² (blue) → XGBoost outperforms OLS.
    • Negative ΔR² (red) → OLS outperforms XGBoost.
  • Interpretation:
    • Early presets (L0–L1): values are negative (OLS slightly better).
    • Later presets (L2–L4): values flip to positive — XGBoost increasingly outperforms OLS.
    • The strongest improvements are for items and combined predictors under L4_spicy.
    • RMSE heatmaps confirm the same pattern: XGBoost yields lower errors in complex scenarios.

4. Raincloud plots (distribution across replicates)

  • What you see:
    Each raincloud shows the distribution of R² (or RMSE) across replicates, for a given preset, score type, and model.
    • Shape of the cloud = distribution.
    • Black bar = median.
    • Dots = individual dataset results.
  • Interpretation:
    • L0_clean: Distributions overlap strongly — both models are similar.
    • L2_measure & L3_mixed: XGBoost distributions shift upward (higher R²) and downward (lower RMSE).
    • L4_spicy: Clear separation — XGBoost consistently outperforms OLS across almost all replicates.

5. General trends

  • OLS on factor scores performs best when assumptions are clean (L0–L1).
  • XGBoost shines when:
    • nonlinear terms are present (interaction, quadratic, threshold),
    • items are noisy (cross-loadings, residual correlations), or
    • groups follow different rules (mixture).
  • Overall complexity reduces prediction quality for both models, but XGBoost is more robust under stress.

Takeaway

  • OLS with factor scores = strong baseline under classical CFA conditions.
  • XGBoost = flexible alternative that handles messy, nonlinear, and heterogeneous data better.
  • This simulation illustrates the trade-off:
    • Theory-driven measurement (CFA) is optimal in clean, interpretable contexts.
    • Machine learning becomes advantageous when real-world data deviate from ideal assumptions.
