
Analytical tests that are underpinned by statistical classification models are often based on reference sets of relatively few (in statistical terms) “authentic samples”. There is a risk that these do not reflect the full variability within an authentic food population. Laboratories building these models therefore need to take great care in how they process the reference data (e.g. dimension reduction, feature selection) to avoid over-fitting. Over-fitting produces a model that is too tightly tailored to the reference set, and which then fails when applied to a sample that differs in some way. There is best-practice guidance available for laboratories – see signposts on FAN.
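
One common route to over-fitting is choosing features using the whole reference set before cross-validating. The sketch below illustrates this on invented data, assuming Python with scikit-learn. The dataset is pure random noise (28 samples, 4 classes, 500 features, all hypothetical), and `SelectKBest` with LDA is used only as a convenient stand-in for a feature-selection step. It is not taken from any specific guidance document.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(42)

# Hypothetical data: 28 samples, 4 classes, 500 features of pure noise,
# so there is no genuine class signal for any model to find.
y = np.repeat(np.arange(4), 7)
X = rng.normal(size=(28, 500))

cv = LeaveOneOut()

# Leaky: features are chosen using every sample, including those later held out.
X_selected = SelectKBest(f_classif, k=10).fit_transform(X, y)
leaky_acc = cross_val_score(LinearDiscriminantAnalysis(), X_selected, y, cv=cv).mean()

# Honest: feature selection is refitted inside each cross-validation fold.
honest_model = make_pipeline(SelectKBest(f_classif, k=10), LinearDiscriminantAnalysis())
honest_acc = cross_val_score(honest_model, X, y, cv=cv).mean()

print(f"selection outside CV: {leaky_acc:.2f}")
print(f"selection inside CV:  {honest_acc:.2f}")
```

With four classes of equal size, chance-level accuracy is 0.25. Because the data carry no real signal, any accuracy well above that in the first figure would come from the selection step having already seen the held-out samples.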

This latest research (open access) develops specific recommendations and a workflow for laboratories to deal with dimension reduction. It comes from a statistics journal rather than an analytical science journal. The authors evaluated different statistical approaches using a model dataset of ICP-MS data from 28 apples of 4 origin classes. Their workflow used Principal Component Analysis (PCA) for feature extraction, followed by supervised classification using Linear Discriminant Analysis (LDA) and Partial Least Squares Discriminant Analysis (PLS-DA), and the performance and stability of the two algorithms were systematically compared. The dataset was normalised, scaled and transformed prior to modelling. Each model was validated by leave-one-out cross-validation and evaluated using accuracy, sensitivity, specificity, balanced accuracy, detection prevalence, p-value and Cohen’s Kappa.
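
A workflow of this general shape (scaling, PCA, LDA, leave-one-out cross-validation) can be sketched in Python with scikit-learn. This is a minimal illustration under stated assumptions, not the authors' code. The data are synthetic, the class sizes (10, 8, 6, 4), the 12 features and the choice of 3 principal components are invented, and `StandardScaler` stands in for the paper's normalisation, scaling and transformation steps, which are not detailed here.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import accuracy_score, balanced_accuracy_score, cohen_kappa_score
from sklearn.model_selection import LeaveOneOut, cross_val_predict
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for an elemental dataset: 28 samples in 4 classes.
# Class sizes and the number of features are assumptions for illustration.
class_sizes = [10, 8, 6, 4]
y = np.repeat(np.arange(4), class_sizes)
class_means = rng.normal(0.0, 2.0, size=(4, 12))
X = class_means[y] + rng.normal(0.0, 1.0, size=(28, 12))

# Putting every step in one pipeline means the scaler and PCA are refitted
# on the training samples of each fold, never on the held-out sample.
model = make_pipeline(
    StandardScaler(),
    PCA(n_components=3),
    LinearDiscriminantAnalysis(),
)

# Leave-one-out: each sample is predicted by a model trained on the other 27.
y_pred = cross_val_predict(model, X, y, cv=LeaveOneOut())

results = {
    "accuracy": accuracy_score(y, y_pred),
    "balanced_accuracy": balanced_accuracy_score(y, y_pred),
    "cohen_kappa": cohen_kappa_score(y, y_pred),
}
print(results)
```

Sensitivity and specificity per class could be derived from `sklearn.metrics.confusion_matrix(y, y_pred)` in the same way.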

They report that, as a linear projection-based classifier, LDA provided higher robustness and interpretability in small and unbalanced datasets. In contrast, PLS-DA, which is optimised for covariance maximisation, exhibited higher apparent sensitivity but lower reproducibility under similar conditions. They also emphasise the importance of the dimensionality reduction strategy (PCA-based variable selection versus the latent space extraction built into PLS-DA) in controlling over-fitting and improving model generalisability.

They conclude that their proposed algorithmic workflow provides a reproducible and statistically sound approach for evaluating discriminant methods in chemometric classification.
