Uncategorized · 18.02.2026

Variable selection in chemometrics: Improving the robustness of models

Julie

When asked why some models hold up in production while others fall apart at the first batch change, I always come back to the same topic: variable selection. The title "Variable Selection in Chemometrics: Improving the Robustness of Models" says it all: less luck, more reliability, and wavelengths that really tell the story. This guide shares my field practice, the pitfalls I have encountered, and a clear method for gaining solidity without losing interpretability.

Variable Selection in Chemometrics: Improving the Robustness of Models

Variable selection is not just a mathematical exercise. It is a filter that separates useful information from instrumental noise, sampling variability, and misleading correlations. Used well, it reduces collinearity, limits overfitting, and strengthens interpretability. It can also cut costs by guiding the choice of a simpler sensor or a narrower spectral window.

I recall a NIR calibration for moisture in milk powders: by removing three windows influenced by temperature, the external error dropped and model maintenance became far less stressful. The dimensionality reduction took nothing away from the physics of the problem; it made it visible.

Understanding the families of variable selection approaches

Filters: fast, model-independent

These techniques evaluate each variable before learning (correlation with Y, mutual information, univariate tests, stability of loadings from a PCA). Advantages: speed, simplicity, low risk of model bias. Limitations: local view, inability to capture subtle interactions. I use them for a first screening, especially when the spectrum is wide and redundant.

Wrappers: performance first

Wrappers build models to compare subsets of variables (RFE, stepwise, genetic algorithms, interval searches such as iPLS). Effective but computationally expensive, they require strict cross-validation to avoid selecting a subset that merely got lucky. Their strength: aligning selection with the final metric. Their weakness: sensitivity to noise when sampling is limited.
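A compact wrapper example, assuming synthetic data: scikit-learn's RFECV runs recursive feature elimination with the cross-validation built in, so the retained subset is chosen on CV error rather than training fit. The ridge model and channel indices are illustrative choices, not a recommendation.

```python
# Wrapper-style selection sketch: recursive feature elimination with
# cross-validation (RFECV) around a ridge model. Synthetic data only.
import numpy as np
from sklearn.feature_selection import RFECV
from sklearn.linear_model import Ridge

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 60))
y = 2.0 * X[:, 10] - 1.5 * X[:, 30] + rng.normal(scale=0.2, size=100)

selector = RFECV(Ridge(alpha=1.0), step=5, cv=5,
                 scoring="neg_root_mean_squared_error")
selector.fit(X, y)
print(selector.n_features_, selector.support_[10], selector.support_[30])
```

Note that the CV here only picks the subset size; an honest risk estimate still needs an outer loop, as discussed in the validation section.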

Embedded: sparsity in the algorithm

Some models learn and select at the same time: penalties (LASSO, Elastic Net), trees/forests, or PLS with importance scores (PLS-VIP). These are my go-to methods for industrial calibrations, as they balance bias and variance while keeping good scientific traceability when parameterized correctly.
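As an embedded-selection sketch under synthetic data, ElasticNetCV tunes the penalty by cross-validation while its L1 component zeroes out uninformative channels; the signal positions (5 and 80) are assumptions of the example.

```python
# Embedded selection sketch: an Elastic Net whose L1 part drives most
# coefficients exactly to zero. Penalty strength tuned by internal CV.
import numpy as np
from sklearn.linear_model import ElasticNetCV

rng = np.random.default_rng(2)
X = rng.normal(size=(120, 150))
y = 3.0 * X[:, 5] + 2.0 * X[:, 80] + rng.normal(scale=0.3, size=120)

model = ElasticNetCV(l1_ratio=[0.5, 0.9, 1.0], cv=5, random_state=0).fit(X, y)
selected = np.flatnonzero(model.coef_)    # indices of surviving channels
print(len(selected), 5 in selected, 80 in selected)
```

The `l1_ratio` grid controls the L1/L2 balance: values near 1 give sparser models, smaller values spread weight over correlated channels, which can help on collinear spectra.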

| Family | Examples | Strengths | Limitations | When to use |
| --- | --- | --- | --- | --- |
| Filters | Corr(Y), mutual information, PCA loadings | Fast, transparent | Ignore interactions | Rough screening, wide spectra |
| Wrappers | RFE, GA, iPLS | Optimized for the metric | Heavy, sensitive to noise | Refining around informative bands |
| Embedded | L1/L2, PLS-VIP, trees | Integrated sparsity | Tuning is crucial | Robust and explainable models |

Concrete strategies to strengthen robustness

Preprocessing and spectral coherence

Before any selection, stabilize the physics: baseline correction, normalization, SNV, Savitzky–Golay derivatives. Your variables then stop carrying the imprint of particle size distribution or the optical path. To explore this step further, I have detailed the best practices in this post on spectral data preprocessing: preprocessing, a crucial step in chemometrics.
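A minimal sketch of two of these steps, assuming spectra stored row-wise in a NumPy array: SNV is a simple row-wise centring and scaling, and the Savitzky-Golay derivative comes from `scipy.signal.savgol_filter`. Window length and polynomial order are illustrative, not prescriptive.

```python
# Preprocessing sketch: Standard Normal Variate (per-spectrum centring and
# scaling) followed by a Savitzky-Golay first derivative. Synthetic spectra.
import numpy as np
from scipy.signal import savgol_filter

def snv(X):
    """Centre and scale each spectrum (row) by its own mean and std."""
    return (X - X.mean(axis=1, keepdims=True)) / X.std(axis=1, keepdims=True)

rng = np.random.default_rng(3)
spectra = rng.normal(size=(10, 300)) + np.linspace(0, 5, 300)  # drifting baseline

Xp = savgol_filter(snv(spectra), window_length=11, polyorder=2, deriv=1, axis=1)
print(Xp.shape)
```

In a real pipeline these parameters belong inside the cross-validation loops, so that the preprocessing choice is validated with everything else.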

Methodical validation: avoiding mirages

Selection must be included inside the cross-validation, not performed beforehand. Better still, nested cross-validation confines the optimization to an inner loop and evaluates in an outer loop. You get an honest estimate of risk and hyperparameters less prone to opportunistic tuning. This resource covers common traps: reminders on cross-validation.
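The nesting can be sketched in a few lines with scikit-learn: putting the selector inside a Pipeline guarantees it is refit in each fold, GridSearchCV is the inner loop, cross_val_score the outer loop. The univariate selector and ridge model are placeholders for whatever selection/model pair you actually use.

```python
# Nested cross-validation sketch: selection + hyperparameters tuned in an
# inner loop, risk estimated in an outer loop. Synthetic data only.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 80))
y = X[:, 3] - X[:, 40] + rng.normal(scale=0.2, size=100)

pipe = Pipeline([("select", SelectKBest(f_regression)), ("model", Ridge())])
inner = GridSearchCV(pipe, {"select__k": [5, 10, 20],
                            "model__alpha": [0.1, 1.0]}, cv=5)
outer_scores = cross_val_score(inner, X, y, cv=5,
                               scoring="neg_root_mean_squared_error")
print(outer_scores.mean())   # honest (outer-loop) risk estimate
```

Because the selector lives inside the pipeline, no fold ever sees variables chosen on its own validation data, which is exactly the leakage the text warns against.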

Stability of selection: thinking in ensembles

I value the constancy of the chosen variables as much as the error metric. Bootstrap, "stability selection", permutations, or MC-UVE help verify that a subset reappears under perturbations. If the retained bands vary from one fold to another, the selection may be capturing local noise. Seeking stability reduces nasty surprises when transferring the model.
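A bootstrap version of this check can be sketched as follows, assuming synthetic data and a LASSO as the selector: refit on resamples and count how often each channel survives; low-frequency channels are the suspects.

```python
# Stability-of-selection sketch: selection frequency of each channel over
# bootstrap resamples of a LASSO fit. Synthetic data only.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(5)
X = rng.normal(size=(100, 50))
y = 2.0 * X[:, 7] + rng.normal(scale=0.2, size=100)

n_boot = 50
counts = np.zeros(50)
for _ in range(n_boot):
    idx = rng.integers(0, 100, size=100)          # bootstrap resample
    coef = Lasso(alpha=0.05).fit(X[idx], y[idx]).coef_
    counts += coef != 0                           # which channels survived?
freq = counts / n_boot
print(freq[7])   # a genuinely informative channel should be selected (almost) always
```

A frequency threshold then turns this into a decision rule; the 70 % cutoff mentioned later in this post is one reasonable convention.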

Spectral intervals rather than isolated points

Physically coherent regions (for example around the O–H harmonics) survive instrument changes better than isolated wavelengths. Interval-based methods (like iPLS) often provide a good compromise between finesse and robustness, while facilitating dialogue with process experts.

Domain knowledge and artefacts

Identify the “easy” but misleading variables: surface water, temperature markers, bands linked to a process additive. These signals give models that perform well on one batch but poorly on another. A quick physical audit of candidate variables saves weeks of statistical iterations.

Avoiding recurring pitfalls

  • Preprocessing, PCA or PLS computed on the whole dataset before splitting: that’s data leakage. Compute them in each CV fold.
  • Hyperparameter optimization on the final test set: biased metric. Keep an untouched evaluation set.
  • Comparing 50 methods without controlling for multiplicity: chance winners abound. Use replications and report uncertainty.
  • No Y-permutation (Y-scrambling) check: without this guard, a model can "succeed" on a random signal.
  • Overlooking maintenance costs: overly aggressive selection can break at the slightest recalibration.

Guided example: a robust pipeline on NIR data

1) Partitioning and ground rules

Stratified splitting by batch to preserve structure. Reserve a frozen external set. Everything related to variable selection is done inside the folds. I measure risk with the RMSEP and the stability of the subset.

2) Preprocessing

SNV + SG derivative (short windows to limit noise), then light smoothing. Parameters tuned in the inner loop. I check the impact on residual dispersion and the compactness of the scores.

3) Selection and modelling

Two parallel tracks: a) an L1/L2-penalized model (LASSO/Elastic Net) to encourage sparsity; b) an interval search (iPLS) to anchor the physics. The selected variables must remain stable across multiple re-splits and be consistent with the chemistry.

4) External evaluation and diagnostic

Applied to the held-out test set, compared with the full-spectrum model, with residual analysis by batch. If the variables change strongly from one draw to another, I re-examine the interval granularity or the CV scheme. PLS-VIP importances guide the discussion with the team; for a reminder of the framework, see PLS regression.

Personal rule: if a band does not appear at least 70% of the time in resampling, I consider it suspect, even if the metric is flattering.

Parsimony or reasoned redundancy?

A minimalist subset is appealing, but controlled redundancy provides insurance against instrument or supplier drift. I aim for a robust core of informative variables, surrounded by "buffer" variables that stabilize the prediction. This safety margin prevents a minor optical variation from destabilizing the model.

Another lever: prefer windows slightly wider than the theoretical absorption band. Real signals breathe, and a margin protects against spectral shifts or imperfect baseline corrections.

Interpret, document, and communicate

Selection is durable only if it can be narrated. Associate each variable or interval with a physico-chemical hypothesis. Archive the preprocessing version, the variable list, the metric, and the explained variance. A future audit can then distinguish a process drift from an instrumental drift.

In my files, a simple schematic summarizes the chain: samples → preprocessing → selection method → hyperparameters → performance. This “identity sheet” avoids misunderstandings during annual recalibrations.

Checklist before final validation

  • Preprocessing recalculated in each fold, no footprint left between training and validation.
  • Cross-validation scheme adapted to the experimental design (by batch, by day, by instrument).
  • Uncertainty report on the metric and on the selected variables via resampling.
  • Interpretable variables, linked to a plausible physico-chemical transition or property.
  • Transferability test: another instrument, another batch, another operator.
  • Maintenance plan: alert thresholds, re-fit frequency, strategy for outliers.

What to take away for robust models

Variable selection is not a chase for the maximum score; it is a conversation between chemistry, metrology, and the algorithm. By combining careful preprocessing, intelligent penalization, interval search, and rigorous evaluation, we obtain lean, traceable models that are robust to real-world surprises. Take the time to document, test your choices against the physics, and keep a periodic testing protocol on hand. Your predictions will be calmer, your deployments more serene.

Want to go further? Return to the fundamentals of PLS and establish a strict validation hygiene; these two reflexes, supported by thoughtful selection, will durably transform how your models age in the field.

chimiometrie.fr – All rights reserved.