Nontargeted analysis for food authenticity by liquid chromatography–mass spectrometry (LC-MS) can provide data on thousands of chemical features. However, most studies that train machine learning models for food authentication use sample sizes in the tens or hundreds. Training sets of this size are typically considered too small, because such a high feature-to-sample ratio makes the models prone to overfitting.
This study (open access) aimed to mitigate this issue with a machine learning protocol designed for sub-optimal training sets, using honey as an example. A recursive feature elimination (RFE) pipeline was developed specifically to address the challenges of optimizing the honey chemical fingerprint for multiclass machine learning classifiers when only a limited number of samples with imperfect labels is available. A support vector machine was used for both RFE and classification, reducing the 2028 nontargeted features to just 54 (a 97.3% reduction) without any loss of classification performance.
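To make the idea of SVM-based recursive feature elimination concrete, here is a minimal sketch using scikit-learn. It is not the authors' pipeline: the data are synthetic, and the sample count, feature count, `C` value, step size and target of 20 features are all assumptions chosen for illustration. The k-nearest neighbours filter and the handling of imperfect labels described in the study are not reproduced here.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for an LC-MS feature table (assumption, not study data):
# few samples, many features, six classes.
X, y = make_classification(
    n_samples=120,
    n_features=200,
    n_informative=15,
    n_classes=6,
    n_clusters_per_class=1,
    random_state=0,
)

# Scale features so the SVM weights are comparable across features.
# In a real evaluation the scaler should be fitted inside each training fold.
X_scaled = StandardScaler().fit_transform(X)

# A linear kernel exposes coef_, which RFE uses to rank features.
# step=0.1 drops 10% of the original feature count per iteration.
selector = RFE(
    estimator=SVC(kernel="linear", C=1.0),
    n_features_to_select=20,
    step=0.1,
)
selector.fit(X_scaled, y)

selected = np.flatnonzero(selector.support_)
reduction = 1 - selected.size / X.shape[1]
print(f"kept {selected.size} of {X.shape[1]} features ({reduction:.1%} reduction)")
```

The same arithmetic applied to the figures reported in the study gives 1 − 54/2028 ≈ 0.973, which matches the stated 97.3% reduction.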
The authors report that the resulting model was a 6-class classifier, capable of identifying monofloral blueberry, buckwheat, clover, goldenrod, linden, or other honey with a nested cross-validation Matthews correlation coefficient (MCC) of 0.803 ± 0.046. The development of a k-nearest neighbours filter and the decision to continue the RFE process beyond the iteration with the highest classification score were instrumental in achieving this outcome.
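For readers unfamiliar with the metric, the multiclass Matthews correlation coefficient can be computed directly from a confusion matrix. The sketch below uses the standard multiclass generalisation of the MCC; the function name `multiclass_mcc` and the example confusion matrix are hypothetical and are not taken from the study.

```python
from math import sqrt


def multiclass_mcc(confusion):
    """Multiclass MCC from a square confusion matrix.

    confusion[i][j] is the number of samples whose true class is i
    and whose predicted class is j.
    """
    k = len(confusion)
    total = sum(sum(row) for row in confusion)            # all samples
    correct = sum(confusion[i][i] for i in range(k))      # diagonal
    true_counts = [sum(confusion[i]) for i in range(k)]   # row sums
    pred_counts = [sum(confusion[i][j] for i in range(k)) for j in range(k)]

    numerator = correct * total - sum(
        p * t for p, t in zip(pred_counts, true_counts)
    )
    denominator = sqrt(
        (total**2 - sum(p * p for p in pred_counts))
        * (total**2 - sum(t * t for t in true_counts))
    )
    # Convention: return 0 when the denominator vanishes
    # (for example, when every sample is assigned to a single class).
    return numerator / denominator if denominator else 0.0


# Made-up 3-class confusion matrix, for illustration only.
example = [
    [8, 2, 0],
    [1, 7, 2],
    [0, 1, 9],
]
mcc_example = multiclass_mcc(example)

# A perfect classifier scores exactly 1.
mcc_perfect = multiclass_mcc([[5, 0, 0], [0, 5, 0], [0, 0, 5]])
print(round(mcc_example, 3), mcc_perfect)
```

Unlike plain accuracy, the MCC accounts for the full confusion matrix, so it is less flattering when a classifier performs well only on the most common classes.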
They conclude that this work shows a complete pipeline that automates feature selection from nontargeted LC-MS spectra when working with a limited number of samples and imperfect labels. This process can also be expanded to other food groups and spectral data.