Uncategorized • 19.02.2026

Unsupervised classification (HCA): A chemometric approach

Julie

Are you looking to turn complex measurements into readable groups without imposing labels? That is exactly what unsupervised classification by hierarchical cluster analysis (HCA) offers in the laboratory. I have used this approach for years to explore spectral signatures, sort production batches and spot hidden behaviors. Below you will find a clear explanation, concrete methodological choices, practical feedback and an operational guide. If you are starting out in chemometrics, the goal is simple: gain insight into the data before modeling.

Understanding unsupervised classification (HCA) in chemometrics

HCA stands for Hierarchical Cluster Analysis. In French, it is often referred to as CAH (Classification ascendante hiérarchique). The principle: merge the most similar samples step by step until they form a hierarchy, visualized as a dendrogram. No class is imposed from the start; the structure comes from the data itself.
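As a minimal sketch of this principle, the example below builds a hierarchy with SciPy on synthetic data and cuts it into flat groups. The data, the group count and the choice of Ward linkage are assumptions made for illustration, not settings taken from a real study.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram

rng = np.random.default_rng(0)

# Hypothetical data: 3 groups of 10 samples, 5 variables each
centers = (0.0, 5.0, 10.0)
X = np.vstack([c + rng.normal(scale=0.5, size=(10, 5)) for c in centers])

# Agglomerative clustering: each row of Z records one merge
# (the two merged clusters, the merge height, the new cluster size)
Z = linkage(X, method="ward")

# Cut the hierarchy into at most 3 flat clusters
labels = fcluster(Z, t=3, criterion="maxclust")

# no_plot=True returns the dendrogram layout without drawing it;
# drop the argument to draw the tree with matplotlib
tree = dendrogram(Z, no_plot=True)
print(labels)
```

The linkage matrix `Z` is the hierarchy itself: any number of groups can be read from it afterwards without rerunning the clustering.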

In analytical chemistry, this mapping reveals families of raw materials, manufacturing profiles or degradation states. On NIR or Raman spectra, weak but consistent patterns stand out. I like to start with an exploratory hierarchical clustering before any predictive modeling: it lets me understand the landscape, identify special cases, then decide on a plan of action.

Preparing data before a robust HCA

The quality of the grouping depends first on the preprocessing. Dominant amplitudes often overwhelm fine information, and instrumental variance creates spurious similarity. At minimum, center and scale the variables: autoscaling (mean-centering, then scaling to unit variance) puts each variable on an equal footing. In spectroscopy, baseline correction, drift correction and normalization are decisive.
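Centering and scaling each variable can be sketched in a few lines of NumPy. The helper name `autoscale` and the small matrix are hypothetical; the handling of constant columns is one possible convention among several.

```python
import numpy as np

def autoscale(X):
    """Mean-center each column and scale it to unit standard deviation."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    std = X.std(axis=0, ddof=1)
    # Assumption: constant columns are left at zero rather than divided by zero
    std[std == 0] = 1.0
    return (X - mean) / std, mean, std

# Hypothetical matrix: 3 samples, 3 variables on very different scales
X = np.array([[1.0, 1000.0, 7.0],
              [2.0, 3000.0, 7.0],
              [3.0, 2000.0, 7.0]])

X_scaled, mean, std = autoscale(X)
print(X_scaled)
```

Returning `mean` and `std` alongside the scaled matrix makes it possible to apply exactly the same transformation to new samples later.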

On NIR spectra of flours, I have found that a simple normalization such as SNV (Standard Normal Variate), combined with a Savitzky–Golay smoothing derivative, largely removes texture effects and reveals chemical differences. To go deeper, the preprocessing of spectral data deserves a dedicated read, because each matrix has its quirks.
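A possible sketch of this combination is shown below, using `scipy.signal.savgol_filter` for the derivative. The order of the two steps (SNV first), the window length, the polynomial order and the synthetic spectra are all assumptions to adapt to each instrument and matrix.

```python
import numpy as np
from scipy.signal import savgol_filter

def snv(spectra):
    """Standard Normal Variate: center and scale each spectrum (row) on its own."""
    spectra = np.asarray(spectra, dtype=float)
    mean = spectra.mean(axis=1, keepdims=True)
    std = spectra.std(axis=1, ddof=1, keepdims=True)
    return (spectra - mean) / std

def snv_savgol(spectra, window_length=11, polyorder=2, deriv=1):
    """SNV followed by a Savitzky-Golay smoothed derivative along the wavelength axis."""
    return savgol_filter(snv(spectra), window_length=window_length,
                         polyorder=polyorder, deriv=deriv, axis=1)

# Hypothetical spectra: the same absorption band seen with two different
# multiplicative and additive effects, mimicking scatter from texture
x = np.linspace(0.0, 1.0, 200)
band = np.exp(-((x - 0.5) ** 2) / 0.005)
spectra = np.vstack([1.0 * band + 0.2,
                     2.5 * band + 1.0])

processed = snv_savgol(spectra)
print(np.abs(processed[0] - processed[1]).max())
```

In this idealized case the two rows become identical after SNV, because they differ only by an offset and a scale factor. Real scatter effects are wavelength-dependent, so the correction is only approximate in practice.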

Handling outliers and missing values

Before running the HCA, check for extreme values, near-constant columns and missing data. A single outlier can pull an entire group toward an artificial branch. My ritual: visual inspection, robust statistics and, if needed, cautious imputation. An HCA becomes reliable when the sources of variability are understood, not just cleaned.
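These checks can be sketched as a small screening function. It relies on the median and the median absolute deviation (MAD) as one example of robust statistics; the function name, the thresholds and the data are hypothetical, and the flags are meant to prompt inspection, not automatic deletion.

```python
import numpy as np

def screen_matrix(X, z_threshold=3.5, min_std=1e-8):
    """Report missing values, near-constant columns and rows with extreme values."""
    X = np.asarray(X, dtype=float)
    missing_per_column = np.isnan(X).sum(axis=0)
    near_constant = np.nanstd(X, axis=0) < min_std

    # Robust (modified) z-score based on the median and the MAD
    median = np.nanmedian(X, axis=0)
    mad = np.nanmedian(np.abs(X - median), axis=0)
    mad = np.where(mad == 0, np.nan, mad)  # no score for zero-spread columns
    robust_z = 0.6745 * (X - median) / mad
    with np.errstate(invalid="ignore"):
        flags = np.abs(robust_z) > z_threshold  # NaN compares as False
    outlier_rows = np.where(flags.any(axis=1))[0]
    return missing_per_column, near_constant, outlier_rows

# Hypothetical data: one constant column, one missing value, one extreme sample
X = np.array([[1.0, 5.0, 10.0],
              [1.1, 5.0, 11.0],
              [0.9, 5.0, np.nan],
              [1.0, 5.0, 9.0],
              [1.2, 5.0, 10.5],
              [9.0, 5.0, 10.0]])

missing, near_constant, outlier_rows = screen_matrix(X)
print(missing, near_constant, outlier_rows)
```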

Distances and aggregation methods: choose according to the problem’s chemistry

Two ingredients structure your hierarchy: the similarity measure and the linkage rule used to merge groups. My preferences change with the nature of the variables, their scale and the noise.

Measure / Linkage	When to use	Strengths / Notes
Euclidean distance	Centered and scaled data; comparable signals	Intuitive; sensitive to residual amplitudes
Manhattan (L1)	Presence of extreme values, robustness	Less sensitive to outliers, can smooth too much
Correlation	Shape of the profile more important than intensity	Ignores scale; useful for normalized spectra
Mahalanobis	Correlated variables, informative covariance	Requires reliable covariance estimation
Single / complete / average linkage	Controls compactness vs. chaining	Complete linkage often yields compact clusters
Ward’s method	Minimize intra-group inertia	Often the most readable for centered matrices

In practice, my default is Ward's method with Euclidean distance on autoscaled data. For chromatographic fingerprints, correlation sometimes gives a more relevant view, because it compares the shape of the signal rather than its raw height.
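The contrast between a distance driven by intensity and one driven by shape can be illustrated on synthetic profiles. In this sketch, two peak shapes are each recorded at a low and a high intensity; the profiles, the noise level and the choice of average linkage for the correlation case are assumptions, and the data are left unscaled on purpose to make the effect visible.

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(1)
x = np.linspace(0.0, 1.0, 100)
shape_a = np.exp(-((x - 0.3) ** 2) / 0.01)
shape_b = np.exp(-((x - 0.7) ** 2) / 0.01)

# Hypothetical profiles: rows 0-1 share shape A, rows 2-3 share shape B,
# and rows 0 and 2 are weak while rows 1 and 3 are five times stronger
profiles = np.vstack([1.0 * shape_a, 5.0 * shape_a,
                      1.0 * shape_b, 5.0 * shape_b])
profiles += rng.normal(scale=0.01, size=profiles.shape)

# Ward linkage works on Euclidean distances, so intensity dominates here
Z_ward = linkage(profiles, method="ward")
labels_ward = fcluster(Z_ward, t=2, criterion="maxclust")

# Correlation distance (1 - Pearson r) ignores scale and compares shapes
d_corr = pdist(profiles, metric="correlation")
Z_corr = linkage(d_corr, method="average")
labels_corr = fcluster(Z_corr, t=2, criterion="maxclust")

print("Ward / Euclidean:     ", labels_ward)
print("Average / correlation:", labels_corr)
```

With the Euclidean distance the two weak profiles end up together although their peaks differ, whereas the correlation distance pairs the profiles by shape.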

Interpreting the dendrogram and determining the number of classes

The dendrogram cut is not just an arbitrary horizontal line. Look for jumps in merge height, which signal costly mergers; test several cuts and compare them with process and product knowledge. Metrics help: cluster validation by bootstrap stability, the inconsistency coefficient, and the silhouette score of the final partition. The cophenetic correlation coefficient indicates how faithfully the hierarchy reflects the initial dissimilarities.
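Several of these indicators can be computed directly from the linkage matrix. The sketch below covers the largest height jump, the inconsistency coefficient, the cophenetic correlation and a hand-written mean silhouette; bootstrap stability is not covered. The synthetic data and the helper `mean_silhouette` are assumptions made for illustration.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.cluster.hierarchy import linkage, fcluster, cophenet, inconsistent

def mean_silhouette(D, labels):
    """Mean silhouette width from a square distance matrix and flat labels."""
    scores = np.zeros(len(labels))
    for i in range(len(labels)):
        same = labels == labels[i]
        same[i] = False
        if not same.any():
            continue  # singleton cluster: silhouette taken as 0
        a = D[i, same].mean()  # mean distance to own cluster
        b = min(D[i, labels == k].mean()
                for k in np.unique(labels) if k != labels[i])
        scores[i] = (b - a) / max(a, b)
    return scores.mean()

rng = np.random.default_rng(2)
# Hypothetical data: 3 groups of 15 samples, 4 variables
centers = np.array([[0, 0, 0, 0], [6, 0, 0, 0], [0, 6, 0, 0]], dtype=float)
X = np.vstack([c + rng.normal(scale=0.5, size=(15, 4)) for c in centers])

d = pdist(X)
Z = linkage(X, method="ward")

# Cophenetic correlation: agreement between tree heights and original distances
ccc, _ = cophenet(Z, d)

# Largest jump between successive merge heights suggests where to cut
jumps = np.diff(Z[:, 2])
k_from_jump = len(X) - (np.argmax(jumps) + 1)

# Inconsistency coefficient of each merge (last column of R)
R = inconsistent(Z, d=2)

# Mean silhouette for several candidate cuts
D = squareform(d)
sil = {k: mean_silhouette(D, fcluster(Z, t=k, criterion="maxclust"))
       for k in range(2, 6)}
best_k = max(sil, key=sil.get)

print(f"cophenetic correlation: {ccc:.3f}")
print("cut suggested by height jump:", k_from_jump)
print("cut suggested by silhouette: ", best_k)
```

When the indicators disagree, that disagreement is itself informative: it usually means the hierarchy has no single natural cut and the samples need a closer look.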

When two rival solutions emerge, I go back to the samples: what physically distinguishes them? In one pharmaceutical dossier, the best cut separated tablets according to residual moisture, later confirmed by Karl Fischer titration. HCA is most convincing when the chemical interpretation follows the calculation.

Practical cases from the laboratory

NIR and agricultural raw materials

On flours, HCA revealed three families aligned with protein content. After SNV and a Savitzky–Golay derivative, the structure became clearer and allowed finer control of incoming materials.

Fermentations and batch monitoring

In bioprocesses, HCA on time profiles (pH, dissolved oxygen, spectroscopic signals) separated the “healthy” tanks from those susceptible to lactic contamination. Triggering investigations early prevented batch losses.

Chromatographic fingerprints

For plant extracts, correlation with complete linkage grouped the profiles by chemotype. Targeted analysis of discriminant peaks facilitated quality documentation. A pragmatic detail: excessive smoothing sometimes masks key markers.

The value of HCA lies less in the software than in the ability to listen to what the branches say. Statistics propose; chemistry validates.

HCA, PCA and k-means: which tool, and when?

HCA explores and structures. Principal Component Analysis (PCA) projects the data and visualizes directions of variance; k-means imposes a number of groups and optimizes their compactness. In practice, I proceed as follows: PCA to see the big picture, HCA to read hierarchical proximities, k-means to stabilize a final partition. To brush up on the basics, I refer you to this clear resource on PCA in chemometrics.

In very noisy matrices, a prior PCA serves as a filter: reducing the dimension to the relevant components stabilizes the distances. When the classes are already expected, as in production, k-means is fast and usually sufficient; for exploratory screening, HCA tells a richer story.
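The PCA, HCA, k-means sequence can be sketched with NumPy and SciPy alone. PCA is computed here through an SVD of the centered matrix, and k-means is seeded with the centroids of the HCA groups; that seeding is a design choice made for this illustration, and the synthetic data and the number of components are assumptions.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.cluster.vq import kmeans2

def pca_scores(X, n_components):
    """Scores on the first principal components, via SVD of the centered data."""
    Xc = X - X.mean(axis=0)
    U, s, _ = np.linalg.svd(Xc, full_matrices=False)
    return U[:, :n_components] * s[:n_components]

rng = np.random.default_rng(3)
# Hypothetical data: 3 groups of 20 samples whose structure lives in
# 2 latent directions, spread over 50 noisy measured variables
latent_centers = np.array([[0.0, 0.0], [8.0, 0.0], [0.0, 8.0]])
latent = np.repeat(latent_centers, 20, axis=0) + rng.normal(scale=0.5, size=(60, 2))
loadings = rng.normal(size=(2, 50))
X = latent @ loadings + rng.normal(scale=1.0, size=(60, 50))

# 1) PCA as a filter: keep only the leading components
scores = pca_scores(X, n_components=2)

# 2) HCA on the scores to read the hierarchy and pick a cut
Z = linkage(scores, method="ward")
labels_hca = fcluster(Z, t=3, criterion="maxclust")

# 3) k-means started from the HCA centroids to settle the final partition
init = np.vstack([scores[labels_hca == k].mean(axis=0) for k in (1, 2, 3)])
centroids, labels_km = kmeans2(scores, init, minit="matrix")

print(labels_hca)
print(labels_km)
```

Starting k-means from the HCA centroids avoids its dependence on random initialization, at the cost of inheriting the cut chosen on the dendrogram.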

Step-by-step procedure to deploy an HCA in routine

Define the objective: incoming quality control, quality investigation, exploratory study.
Document data acquisition: batches, calibrations, system limits.
Clean and preprocess: correction of instrumental noise, normalization, autoscaling, handling of missing data.
Reduce dimension if needed (PCA or variable selection).
Choose distance and linkage according to the physico-chemistry and the interpretation perspective.
Run the HCA, examine the dendrogram, test several cuts.
Validate: stability, business relevance, metrological coherence.
Document the decision rules and integrate into the quality workflow.
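The computational part of the procedure above can be condensed into one function that also returns a record of the settings used, which helps with documentation. This is a minimal sketch: the function name `run_hca`, its defaults and the synthetic data are hypothetical, and the objective, acquisition records and chemical validation remain outside the code.

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, fcluster, cophenet

def run_hca(X, n_clusters, n_components=None, metric="euclidean", method="ward"):
    """Autoscale, optionally reduce by PCA, cluster, cut, and log the settings."""
    X = np.asarray(X, dtype=float)
    if np.isnan(X).any():
        raise ValueError("Handle missing values before clustering.")
    if method == "ward" and metric != "euclidean":
        raise ValueError("Ward linkage assumes Euclidean distances.")

    # Autoscaling; constant columns carry no information and are dropped
    std = X.std(axis=0, ddof=1)
    keep = std > 0
    Xs = (X[:, keep] - X[:, keep].mean(axis=0)) / std[keep]

    # Optional dimension reduction (Xs is already centered)
    if n_components is not None:
        U, s, _ = np.linalg.svd(Xs, full_matrices=False)
        Xs = U[:, :n_components] * s[:n_components]

    d = pdist(Xs, metric=metric)
    Z = linkage(d, method=method)
    labels = fcluster(Z, t=n_clusters, criterion="maxclust")
    ccc, _ = cophenet(Z, d)

    record = {"metric": metric, "method": method, "n_clusters": n_clusters,
              "n_components": n_components, "n_variables_kept": int(keep.sum()),
              "cophenetic_correlation": float(ccc)}
    return labels, Z, record

rng = np.random.default_rng(4)
# Hypothetical data: 3 groups of 12 samples, 6 variables on different scales
centers = np.array([[0, 0, 0, 0, 0, 0],
                    [6, 6, 6, 0, 0, 0],
                    [0, 0, 0, 6, 6, 6]], dtype=float)
X = np.vstack([c + rng.normal(scale=0.5, size=(12, 6)) for c in centers])
X = X * np.array([1.0, 10.0, 100.0, 1.0, 10.0, 100.0])

labels, Z, record = run_hca(X, n_clusters=3, n_components=2)
print(labels)
print(record)
```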

Practitioner tips

Keep a raw version and a preprocessed version for comparison.
Test Ward + Euclidean on autoscaled data as the baseline configuration.
Pick reference samples from each cluster for chemical verification.
Record the transformations applied: traceability and reproducibility come first.

Unsupervised classification (HCA): best practices and limits

HCA excels at revealing proximities and generating hypotheses. The method remains sensitive to scales, redundant variables and measurement artefacts. A judicious choice of preprocessing, systematic comparison with the context and a few quality indicators help avoid the common traps.

If you work with spectra or erratic profiles, invest time in the preprocessing settings, then check your dendrogram against orthogonal measurements. This analytical discipline turns an exploratory tool into a genuine decision-making lever.