A robust penalized classification method for high-dimensional spectroscopic data

Project Number
RI 6/14 ZY

Project Duration
May 2015 - May 2018


In this project we aim to develop a robust penalized classification method for high-dimensional spectroscopic data. Spectroscopic data are very high-dimensional, with as many as 1000 wavelengths per spectrum. The spectral differences between sample types account for only a small part of the variability in the spectra. How to extract the most useful spectral features for identifying compositional differences between sample types has been an attractive and challenging research area in recent years. A naïve implementation of classical linear discriminant analysis (LDA) in the high-dimensional setting gives poor classification results, and the results are hard to interpret, because of the singularity problem and the highly correlated spectral features. Many traditional approaches to this problem perform feature selection before classification, which may discard information that is useful for classification. Modern classification methods, such as kernel methods and artificial neural networks (ANN), require careful tuning of critical parameters and expensive computation, and they are prone to overfitting. Penalized discriminant methods instead build a sparse linear combination of all spectral features by adding a penalty to the within-class covariance matrix. This extends LDA to the high-dimensional setting so that classification and feature selection are performed simultaneously. The approach has many advantages for the classification of spectroscopic data in terms of accuracy, stability and interpretability. However, how to design the penalty functions is still an open and challenging issue. In our work, we aim to develop a simple, robust and interpretable classifier for high-dimensional spectroscopic data that serves both classification and interpretation. Suitable penalty functions will be designed to model spectral features and spectral correlations.
The proposed method has two main advantages: it can handle the situation where the number of predictors is much larger than the number of samples without filtering out any spectral structure, and it can incorporate the highly correlated structure of adjacent spectral bands into the model. Most importantly, the discriminant directions of the classifier are physically interpretable: the spectral regions that are informative for classification are emphasized by large weights. This is scientifically important, as it gives investigators a guide to spectral feature selection and provides evidence regarding the active ingredients in the object under study.
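
The idea of penalizing the within-class covariance matrix can be illustrated with a minimal sketch. It is not the method developed in this project: the function names, the synthetic data, the tuning values `lam` and `gamma`, and the choice of a ridge term plus a first-difference smoothness term (a simple stand-in for modelling correlated adjacent bands) are all assumptions made for illustration. This particular penalty does not produce sparse weights; a sparsity-inducing penalty such as an L1 term would be needed for that.

```python
import numpy as np


def within_class_covariance(X, y):
    """Pooled within-class covariance of X (n_samples x n_bands)."""
    n, p = X.shape
    Sw = np.zeros((p, p))
    for label in np.unique(y):
        Xc = X[y == label]
        centered = Xc - Xc.mean(axis=0)
        Sw += centered.T @ centered
    return Sw / n


def penalized_direction(X, y, lam=1.0, gamma=1.0):
    """Two-class discriminant direction using a penalized Sw.

    Hypothetical penalty (an assumption, not the project's design):
    lam * I (ridge) + gamma * D^T D, where D takes first differences
    between adjacent bands and so favours smooth weight profiles.
    """
    n, p = X.shape
    Sw = within_class_covariance(X, y)
    D = np.diff(np.eye(p), axis=0)  # (p-1) x p first-difference operator
    Sw_pen = Sw + lam * np.eye(p) + gamma * (D.T @ D)
    delta = X[y == 1].mean(axis=0) - X[y == 0].mean(axis=0)
    beta = np.linalg.solve(Sw_pen, delta)
    return beta, Sw, Sw_pen


# Synthetic "spectra": far more bands than samples, with the class
# difference confined to a narrow band region (assumed for the demo).
rng = np.random.default_rng(0)
n, p = 40, 200
y = np.repeat([0, 1], n // 2)
X = rng.normal(size=(n, p))
X[y == 1, 90:110] += 1.0

beta, Sw, Sw_pen = penalized_direction(X, y)

# The unpenalized Sw is rank deficient when p > n, so classical LDA
# cannot invert it; the penalized matrix is positive definite.
rank_Sw = np.linalg.matrix_rank(Sw)
chol = np.linalg.cholesky(Sw_pen)  # raises LinAlgError if not positive definite
```

In this sketch `beta` holds one weight per spectral band, so plotting it against wavelength would show which regions the direction emphasizes. The values of `lam` and `gamma` would normally be chosen by cross-validation.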

Funding Source

Related Projects