Uncategorized 19.02.2026

Cross-validation in chemometrics: Principles and best practices

Julie

When asked how to make a predictive model reliable in the laboratory, I always come back to the same foundation: cross-validation. In chemometrics, it is what brings order to uncertainty, protects against performance illusions, and paves the way for a smooth deployment, from bench testing to production. This guide shares my field-tested reference points, my default choices, and the traps I have learned to avoid while training teams and supporting industry.


Validating a model means testing its ability to generalize beyond the training sample. Cross-validation segments the data into folds, then systematically evaluates predictions on held-out subsets. Its primary role is to contain overfitting, the main source of disappointment in production. It also illuminates the balance between bias (a model too simple) and variance (a model too unstable), two forces pulling in opposite directions. In practice, it provides an internal estimate of error, often summarized by metrics such as Q², RMSECV, or accuracy in classification, while guiding hyperparameter selection and model complexity.

Why cross-validation structures your chemometrics projects

A good model is not judged by a nice training R² alone. It must absorb the small variations of daily routine: sample lots, operators, minor instrumental drifts. Internal validation helps anticipate these perturbations. It paves the way for an even more demanding check, the external test set, reserved for samples never seen during development. This clean separation between calibration, internal validation, and final test lets you tell a credible performance story to your quality team, your partners, and production.

The cross-validation schemes tailored to analytical data

Stratified k-fold: the default balance

Folding into k folds (usually 5 to 10) offers a robust compromise between bias and estimation variance. In classification, preserve the class proportions in each fold; in regression, stratify the response by quantiles. This stratification prevents some folds from being too easy or too hard. For modest datasets (n ≲ 100), I often repeat the CV several times to stabilize the error estimate and the hyperparameter choices.
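As a minimal sketch with scikit-learn on synthetic data (the quantile binning is one common way to stratify a continuous response; variable names are illustrative):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 10))   # e.g. 120 samples, 10 variables
y = rng.normal(size=120)         # continuous response (regression)

# Stratify a continuous y by binning it into quantiles, so that every
# fold covers the whole response range.
n_bins = 5
edges = np.quantile(y, np.linspace(0, 1, n_bins + 1)[1:-1])
y_binned = np.digitize(y, edges)  # bin labels 0..n_bins-1

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
fold_bin_counts = [np.bincount(y_binned[val], minlength=n_bins)
                   for _, val in cv.split(X, y_binned)]
# each validation fold now holds a near-equal share of every quantile bin
```

In classification, the class labels themselves play the role of `y_binned`.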

Leave-one-out: appealing, but often misleading

Leave-one-out (LOOCV) uses n−1 samples to train and a single one to test, repeated n times. It seems optimal when data are scarce. In practice, it tends to underestimate the generalization error and to produce estimates with high variance. I reserve it for very simple cases, or to quickly compare modeling ideas, never to settle critical choices.

Venetian blinds and contiguous blocks: respect the structure

In spectroscopy, nearby subsamples (replicates, spectral neighborhood, time series) are often too similar. Folds built as regular bands (venetian blinds) or as consecutive blocks enforce a healthy separation. As soon as the acquisition order matters, chronological segmentation applies: we test in the future relative to training. It is the only honest way to judge robustness against drifts.
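Both fold assignments fit in a few lines of NumPy (the function names below are mine, not a standard API):

```python
import numpy as np

def venetian_blinds(n_samples, n_folds):
    # Regular interleaved bands: sample i goes to fold i % n_folds.
    return np.arange(n_samples) % n_folds

def contiguous_blocks(n_samples, n_folds):
    # Consecutive blocks of near-equal size, preserving acquisition order.
    return np.minimum(np.arange(n_samples) * n_folds // n_samples, n_folds - 1)

venetian_blinds(10, 3)    # → array([0, 1, 2, 0, 1, 2, 0, 1, 2, 0])
contiguous_blocks(10, 3)  # → array([0, 0, 0, 0, 1, 1, 1, 2, 2, 2])
```

With replicates acquired back to back, venetian blinds can still place near-duplicates in different folds; contiguous or chronological blocks are the safer default.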

Monte Carlo and repeated CV: to stabilize the estimation

Repeated validation (random resampling with a fixed training fraction) reduces the impact of “unlucky” partitions. It is suitable when sample sizes vary strongly by batch, or to refine an error curve as a function of a hyperparameter (complexity, regularization). Record the random seed and always report the distribution of errors, not just the mean.

Group k-fold and block by batch: avoid confusion

Whenever dependencies exist (samples from the same patient, batch, day, operator), fold by group. The model should never see, during training, elements too close to those kept for internal testing. This constraint sometimes changes perceived performance, but it reflects your real-use case. Better to have a conservative estimate than a brilliant model… on paper.
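scikit-learn's GroupKFold enforces this constraint directly (the batch labels below are synthetic):

```python
import numpy as np
from sklearn.model_selection import GroupKFold

rng = np.random.default_rng(1)
X = rng.normal(size=(60, 5))
y = rng.normal(size=60)
batch = np.repeat(np.arange(12), 5)   # 12 batches of 5 replicates

cv = GroupKFold(n_splits=4)
for train_idx, val_idx in cv.split(X, y, groups=batch):
    # no batch ever sits on both sides of the split
    assert set(batch[train_idx]).isdisjoint(batch[val_idx])
```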

Scheme | When to use | Strengths | Considerations
k-fold (5–10) | General regression and classification | Good compromise, easy to replicate | Stratify; repeat if n is small
LOOCV | Very small datasets, quick comparisons | Uses almost all the data | High variance, optimistic
Venetian blinds / blocks | Series, correlated acquisitions | Respects local correlations | Define the block width carefully
Group k-fold | Batches, subjects, operators | Prevents leakage | Requires reliable metadata
Repeated / Monte Carlo | Stabilizing the error estimate | Yields a full error distribution | Record the seed and the number of runs

Setting up unbiased validation: pipelines and data leakage

The golden rule: any computation that learns from the data must be redone within every fold, independently. Never compute an SNV, a centering and scaling, a PCA, or a hyperparameter selection on the whole dataset and then validate: that is data leakage. Integrate your preprocessing and your variable selection into a single pipeline that trains only on the training-fold data before predicting the validation fold.

Two additional guardrails matter just as much. First, keep the replicates of the same sample in the same fold, to avoid overestimating performance. Second, fix the segmentation choices before looking at the metrics, to avoid “choosing the folding that works best,” a subtle but costly bias in real life.

Choosing the number of components with a carefully conducted CV

For PLS and PCR, I systematically plot the validation error (often the RMSECV) as a function of the number of latent variables. The minimum is not always the best choice: I apply a parsimony rule (the “one standard error” rule) to retain the smallest number of factors whose performance remains within a margin statistically equivalent to the minimum. This approach yields models that are more stable under field perturbations.

If you are torn between PCR and PLS, CV is your most reliable arbiter. It also helps tune other hyperparameters (penalties of a regularized model, depth of a tree, kernel of an SVM). Don’t forget to repeat the folding several times and to report the uncertainty (error bars, quantiles) rather than a single value.

Metrics that truly count when validating a model

In regression, systematically report R², Q², RMSEC, RMSECV, and RMSEP. Each indicator tells a part of the story: internal fit, estimated generalization, and performance on external samples. In classification, specify accuracy, sensitivity, specificity, AUC, and, for rare classes, the F1-score. The definitions and detailed caveats are gathered here: R², RMSECV and RMSEP. Keep units consistent and put the error into context against analytical variability (R&R, LOD/LOQ, industry requirements).

Real-world example: from NIR spectroscopy to production deployment

We had to estimate the moisture content of a pharmaceutical powder by NIR. After standard preprocessing (SNV, Savitzky–Golay derivative, spectral alignment), we imposed cross-validation in blocks by manufacturing batch. LOOCV gave flattering errors; the batch-wise scheme, more realistic, revealed inter-batch drift. We adjusted the sampling plan, strengthened calibration at the moisture extremes, and reduced the number of PLS factors via the RMSECV curve. The model held for six months without recalibration, then was updated on a new reference batch, planned from the start.

Good practices and pitfalls to avoid in the lab

  • Define the folds before any exploration of performance and document them.
  • Group replicates, batches, subjects or days of acquisition in the same fold.
  • Integrate preprocessing and hyperparameter selection into the CV pipeline.
  • Avoid tuning at random: grid search or Bayesian search with a log of trials.
  • Repeat CV (at least 5–10 repetitions when n is modest) and report the distribution of the error.
  • Prefer a conservative estimate and explain choices with respect to the final use.
  • Reserve an external set for the final word and routinely monitor post-deployment drift.

Special cases: time-series, batches, rare classes

For processes monitored over time, mixing past and future is forbidden. CV by time blocks respects the acquisition order and avoids the performance mirage. On rare classes, stratification must preserve the ratio in each fold, and optimization should aim for metrics suited to them (AUC, F1). In the presence of labeled batches, choose a group k-fold; I readily accept an apparently higher error to gain credibility during method transfers or quality audits.
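scikit-learn's TimeSeriesSplit implements exactly this forward-only constraint:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(50).reshape(-1, 1)    # 50 acquisitions in chronological order
cv = TimeSeriesSplit(n_splits=5)
for train_idx, val_idx in cv.split(X):
    # every validation sample lies strictly after every training sample
    assert train_idx.max() < val_idx.min()
```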

Going further: ethics, traceability and nested validation

Transparency is as much a scientific asset as a regulatory one. Keep the random seed, the exact definition of the folds, the software versions, and the history of experiments. For projects rich in hyperparameters (SVMs, neural networks), I use nested validation, with an inner loop for tuning and an outer loop for an unbiased estimate of performance. This separation avoids overfitting the hyperparameter space and provides a more honest measure, ready to be shared with quality.
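In scikit-learn, nesting amounts to wrapping a tuned estimator inside an outer CV; a sketch on synthetic data (the SVR model and the C grid are illustrative):

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score

rng = np.random.default_rng(13)
X = rng.normal(size=(70, 10))
y = X[:, 0] + rng.normal(scale=0.1, size=70)

inner = KFold(n_splits=3, shuffle=True, random_state=0)   # tuning loop
outer = KFold(n_splits=5, shuffle=True, random_state=1)   # honest estimate

# Hyperparameters are chosen inside each outer training fold only
tuned = GridSearchCV(SVR(), {"C": [0.1, 1, 10]}, cv=inner,
                     scoring="neg_root_mean_squared_error")
outer_rmse = -cross_val_score(tuned, X, y, cv=outer,
                              scoring="neg_root_mean_squared_error")
```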

What to keep in mind for your chemometric models

Your validation protocol is a contract of trust. Respect the data structure, ban artificial proximities between training and test, prefer simplicity when two configurations perform equally, and always speak in terms of uncertainty. Internal validation lights the way, external testing confirms the path. With these guidelines, you will build models that live up to their promises beyond the lab notebook, in contact with real samples and the constraints of a production line.

chimiometrie.fr – All rights reserved.