statistics (2)

Analytical tests that are underpinned by statistical classification models are often based on reference sets of relatively few (in statistical terms) “authentic samples”. There is a risk that these may not reflect the entire scope of variability within an authentic food population. Laboratories building these models therefore need to take great care in how they process the reference data (e.g. dimension reduction, feature selection) to avoid the problem of over-fitting. Over-fitting results in a statistical model that is too tightly tailored to the reference set and so fails when applied to a sample that differs from it in some way. There is best-practice guidance available for laboratories – see signposts on FAN.

This latest research (open access) develops specific recommendations and a workflow for laboratories to deal with dimension reduction. It comes from a statistical, rather than analytical, scientific journal. The authors evaluated different statistical approaches using a model dataset of ICP-MS data from 28 apples of 4 origin classes. Their workflow applied normalisation, scaling, and transformation to the data, used Principal Component Analysis (PCA) for feature extraction, and then compared supervised classification by Linear Discriminant Analysis (LDA) and Partial Least Squares Discriminant Analysis (PLS-DA). Model performance and stability were systematically assessed: each model was validated via leave-one-out cross-validation and evaluated using accuracy, sensitivity, specificity, balanced accuracy, detection prevalence, p-value, and Cohen’s Kappa.
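To make the validation step concrete, the leave-one-out loop can be sketched as follows. This is a minimal, illustrative example only: the toy two-class data and the simple nearest-centroid classifier are stand-ins, not the authors' ICP-MS dataset or their LDA/PLS-DA models.

```python
# Minimal sketch of leave-one-out cross-validation: each sample in turn is
# held out, the classifier is trained on the rest, and the held-out
# prediction is scored. Nearest-centroid classification is used here as a
# simple stand-in for LDA/PLS-DA.
from statistics import mean

def nearest_centroid_predict(train, test_x):
    # train: list of (features, label); classify test_x by its closest class mean
    by_label = {}
    for x, y in train:
        by_label.setdefault(y, []).append(x)
    centroids = {y: [mean(col) for col in zip(*xs)] for y, xs in by_label.items()}
    def sq_dist(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))
    return min(centroids, key=lambda y: sq_dist(test_x, centroids[y]))

def loo_accuracy(samples):
    # Leave-one-out: train on all samples except index i, test on sample i
    hits = 0
    for i, (x, y) in enumerate(samples):
        train = samples[:i] + samples[i + 1:]
        hits += nearest_centroid_predict(train, x) == y
    return hits / len(samples)

# Two hypothetical, well-separated origin classes (illustrative data)
data = [([1.0, 1.1], "A"), ([0.9, 1.0], "A"), ([1.1, 0.9], "A"),
        ([3.0, 3.1], "B"), ([2.9, 3.0], "B"), ([3.1, 2.9], "B")]
print(loo_accuracy(data))  # → 1.0
```

With only 28 samples, leave-one-out is a natural choice: every observation serves once as the test case, so none of the scarce reference data is wasted on a fixed hold-out split.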

They report that, as a linear projection-based classifier, LDA provided higher robustness and interpretability on small and unbalanced datasets. In contrast, PLS-DA, which is optimised for covariance maximisation, exhibited higher apparent sensitivity but lower reproducibility under similar conditions. They also emphasise the importance of dimensionality reduction strategies, such as PCA-based variable selection versus latent-space extraction in PLS-DA, in controlling overfitting and improving model generalisability.

They conclude that their proposed algorithmic workflow provides a reproducible and statistically sound approach for evaluating discriminant methods in chemometric classification.

Read more…

This article (purchase required) uses data mining of nearly 72,000 official food inspections conducted in China between 2018 and 2023. It tests the hypothesis that data manipulation by local food inspection agencies has led to an overall underestimate of food fraud and food safety incidents in China.

The authors examined the distribution of non-compliant samples near the qualified standard value using exceedance multiples. To quantify the extent of data manipulation, they used an exhaustive algorithm to construct counterfactual estimates.
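The basic screening idea, computing an exceedance multiple for each failed sample and examining how failures cluster just above the limit, can be sketched in a few lines. This is a hypothetical illustration, not the authors' algorithm: the `share_near_limit` helper, the toy data, and the 1.1 band threshold are all assumptions for demonstration, and the authors' counterfactual estimation is considerably more involved.

```python
# Illustrative sketch of an "exceedance multiple" screen: for each failed
# sample, divide the measured value by its standard limit, then look at what
# share of failures sits only marginally above the limit.
def exceedance_multiples(failures):
    # failures: list of (measured_value, standard_limit) for non-compliant samples
    return [value / limit for value, limit in failures]

def share_near_limit(failures, band=1.1):
    # Fraction of failures with an exceedance multiple in (1.0, band].
    # If borderline failures are re-labelled as passes, this share would be
    # anomalously low compared with the expected distribution.
    # (band=1.1 is an arbitrary illustrative choice.)
    multiples = exceedance_multiples(failures)
    near = sum(1 for m in multiples if 1.0 < m <= band)
    return near / len(multiples)

# Hypothetical failed samples against a limit of 0.5 mg/kg
fails = [(0.55, 0.5), (0.52, 0.5), (1.8, 0.5), (0.51, 0.5)]
print(share_near_limit(fails))  # → 0.75
```

Comparing the observed share of near-limit failures with a counterfactual expectation is what lets a gap just above the standard value be read as evidence of re-labelling rather than genuine compliance.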

They report an abnormal distribution of unqualified samples near the standard value, indicating potential data manipulation. Robustness tests supported this inference.

They conclude that over 11% of unqualified (failed) samples may have been adjusted to qualified status during 2018–2023, with higher manipulation rates in eastern regions than in central and western regions. The manipulation rate of unqualified samples across 25 sample provinces ranged from 8.13% to 16.30%.

Read more…