Mass Spectrometry Based Proteomic Biomarker Selection And Sample Prediction
Abstract
High-resolution MALDI-TOF (matrix-assisted laser
desorption/ionization time-of-flight) mass spectrometry has recently
shown promise as a screening tool for detecting discriminatory
peptide/protein patterns. The major computational obstacle in
finding such patterns is the large number of mass/charge peaks
(features, biomarkers, data points) in a spectrum. To tackle this
problem, we have developed methods for data preprocessing and
biomarker selection. The preprocessing consists of binning, baseline
correction, and normalization. For biomarker detection, we develop
the Extended Markov Blanket (EMB) algorithm, which combines
redundant feature removal with discriminant feature selection.
Biomarker selection is coupled with a support vector machine (SVM) to
achieve sample prediction from high-resolution proteomic profiles.
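The three preprocessing steps named above can be sketched as follows. This is a minimal illustration only: the fixed-width bins, the sliding-window minimum used as a baseline estimate, and the scaling to unit total intensity are assumptions chosen for simplicity, and the function names and default parameters are hypothetical rather than the settings used in this work.

```python
# Minimal sketch of binning, baseline correction, and normalization.
# All parameter values and method choices here are illustrative assumptions.

def bin_spectrum(mz, intensity, mz_min, mz_max, bin_width):
    """Sum the intensities of peaks falling into fixed-width m/z bins."""
    n_bins = int(round((mz_max - mz_min) / bin_width))
    binned = [0.0] * n_bins
    for m, i in zip(mz, intensity):
        if mz_min <= m < mz_max:
            idx = min(int((m - mz_min) / bin_width), n_bins - 1)
            binned[idx] += i
    return binned

def correct_baseline(values, window):
    """Subtract a baseline estimated as the minimum in a sliding window."""
    n = len(values)
    corrected = []
    for k in range(n):
        lo, hi = max(0, k - window), min(n, k + window + 1)
        corrected.append(values[k] - min(values[lo:hi]))
    return corrected

def normalize(values):
    """Scale intensities so that they sum to 1 (if the total is positive)."""
    total = sum(values)
    return [v / total for v in values] if total > 0 else list(values)

def preprocess(mz, intensity, mz_min=1000.0, mz_max=1010.0,
               bin_width=1.0, window=2):
    """Apply the three steps in order: bin, correct baseline, normalize."""
    binned = bin_spectrum(mz, intensity, mz_min, mz_max, bin_width)
    return normalize(correct_baseline(binned, window))
```

Binning reduces the number of features before any selection step runs, which is why it comes first in this sketch; the order of the remaining two steps follows the order in which they are listed above.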
Disease progresses through several stages, and each stage has its
own corresponding biomarkers. To handle the resulting
multi-class problem, we propose a classification method and a feature
selection method. The classification method combines two
schemes: error-correcting output coding (ECOC) and pairwise coupling
(PWC). To predict a test sample, the results of both
schemes are aggregated. In the PWC scheme, the important features for
each pair of classes are found using EMB
feature selection.
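To make the two multi-class schemes concrete, the sketch below shows only their decoding steps, taking the outputs of already trained binary classifiers as given. It rests on several assumptions: the code matrix and class labels are hypothetical, ECOC is decoded by minimum Hamming distance, and the pairwise scheme is reduced to simple vote counting rather than full probability coupling. How the two results are aggregated is not specified here, so the sketch does not combine them.

```python
# Sketch of the decoding steps only; training the binary SVMs is not shown.
# The code matrix, class names, and voting rule are illustrative assumptions.

def ecoc_decode(code_matrix, bits):
    """Return the class whose codeword is closest to `bits` in Hamming distance."""
    def hamming(a, b):
        return sum(x != y for x, y in zip(a, b))
    return min(code_matrix, key=lambda c: hamming(code_matrix[c], bits))

def pairwise_vote(pair_winners, classes):
    """Tally one vote per class pair; `pair_winners` maps (a, b) to the winner."""
    votes = {c: 0 for c in classes}
    for winner in pair_winners.values():
        votes[winner] += 1
    return votes

# Hypothetical three-stage problem with a 5-bit code per stage.
CODE_MATRIX = {
    "I": (1, 1, 0, 0, 1),
    "II": (0, 1, 1, 0, 0),
    "III": (1, 0, 1, 1, 0),
}

# Outputs of five binary classifiers for one test sample (assumed values).
ecoc_prediction = ecoc_decode(CODE_MATRIX, (0, 1, 1, 1, 0))

# Winners of the three pairwise classifiers for the same sample (assumed values).
pair_votes = pairwise_vote(
    {("I", "II"): "II", ("I", "III"): "III", ("II", "III"): "II"},
    classes=CODE_MATRIX.keys(),
)
```

In a pairwise scheme each binary classifier can be trained on its own feature subset, which is where per-pair feature selection fits in.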
To identify the molecular formulae of the biomarkers, we develop a
de novo peptide sequencing method. De novo peptide sequencing,
which determines the amino acid sequence of a peptide from tandem mass
spectrometry (MS/MS) data, is increasingly used in
proteomics for protein identification. Current de novo methods
generally rely on graph-theoretic approaches, which usually produce a
large number of candidate sequences and incur a heavy computational
cost when trying to resolve a sequence with little ambiguity. We present
a novel de novo sequencing algorithm that greatly reduces the
number of candidate sequences. Using certain properties of the b-
and y-ion series in an MS/MS spectrum, we propose a reliable two-way
parallel search algorithm that filters the peptide candidates,
which are then further pruned by an intensity-based screening
criterion.
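One well-known property of the b- and y-ion series is complementarity: for singly charged fragments of a peptide with n residues, the m/z values of b_i and y_(n-i) sum to the neutral peptide mass plus two proton masses. The sketch below illustrates that relation only; whether it is among the properties exploited by the proposed algorithm is an assumption, and the residue table is a small subset of standard monoisotopic masses.

```python
# Sketch: singly charged b- and y-ion m/z values and their complementarity.
# Residue masses are standard monoisotopic values (Da) for a few amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "L": 113.08406, "K": 128.09496, "E": 129.04259,
}
PROTON = 1.00728   # mass of a proton (Da)
WATER = 18.01056   # mass of H2O (Da)

def b_ions(peptide):
    """m/z of b_1..b_(n-1): N-terminal prefix masses plus one proton."""
    masses, total = [], 0.0
    for aa in peptide[:-1]:
        total += RESIDUE_MASS[aa]
        masses.append(total + PROTON)
    return masses

def y_ions(peptide):
    """m/z of y_1..y_(n-1): C-terminal suffix masses plus water and one proton."""
    masses, total = [], 0.0
    for aa in reversed(peptide[1:]):
        total += RESIDUE_MASS[aa]
        masses.append(total + WATER + PROTON)
    return masses

def neutral_mass(peptide):
    """Neutral monoisotopic mass: all residues plus one water."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER

def complementary_pairs(peptide):
    """Pair each b_i with y_(n-i); every pair sums to M + 2 * PROTON."""
    b, y = b_ions(peptide), y_ions(peptide)
    n = len(peptide)
    return [(b[i - 1], y[n - i - 1]) for i in range(1, n)]
```

A search that extends a prefix from the N-terminus and a suffix from the C-terminus at the same time can use such a relation to cross-check each candidate peak against its expected partner.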
Linear discriminant analysis (LDA) is a traditional statistical
technique for feature reduction that has been widely used in diverse
application areas. When the dimensionality exceeds the sample size,
however, classical LDA suffers from the singularity problem. Because
the dimensionality of mass spectrometry data is very high,
the singularity problem inevitably arises. Another drawback of
classical LDA is its linearity, which causes it to fail on
nonlinear problems. Nonlinear LDA methods have been proposed to
address this, but they are computationally expensive.
We propose a new fast kernel discriminant analysis (FKDA).
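The singularity problem described above can be illustrated numerically. With n samples in c classes, the within-class scatter matrix has rank at most n - c, so it cannot be inverted once the number of features exceeds that bound. The sketch below is a toy demonstration of this fact, not part of FKDA; the data values and helper names are hypothetical.

```python
# Toy demonstration: the within-class scatter matrix is rank-deficient
# when there are more features than samples. Data values are made up.

def within_class_scatter(samples, labels):
    """S_w: sum over classes of (x - class mean)(x - class mean)^T."""
    d = len(samples[0])
    s_w = [[0.0] * d for _ in range(d)]
    for c in set(labels):
        members = [x for x, y in zip(samples, labels) if y == c]
        mean = [sum(col) / len(members) for col in zip(*members)]
        for x in members:
            diff = [xi - mi for xi, mi in zip(x, mean)]
            for i in range(d):
                for j in range(d):
                    s_w[i][j] += diff[i] * diff[j]
    return s_w

def matrix_rank(matrix, tol=1e-9):
    """Rank via Gaussian elimination with partial pivoting."""
    m = [row[:] for row in matrix]
    rank, rows, cols = 0, len(m), len(m[0])
    for col in range(cols):
        pivot = max(range(rank, rows), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < tol:
            continue  # no usable pivot in this column
        m[rank], m[pivot] = m[pivot], m[rank]
        for r in range(rank + 1, rows):
            factor = m[r][col] / m[rank][col]
            for k in range(col, cols):
                m[r][k] -= factor * m[rank][k]
        rank += 1
        if rank == rows:
            break
    return rank

# Four samples, six features, two classes: far fewer samples than features.
SAMPLES = [
    [1.0, 2.0, 0.0, 3.0, 1.0, 0.0],
    [2.0, 0.0, 1.0, 1.0, 0.0, 4.0],
    [0.0, 1.0, 3.0, 2.0, 5.0, 1.0],
    [4.0, 1.0, 1.0, 0.0, 2.0, 2.0],
]
LABELS = [0, 0, 1, 1]

S_W = within_class_scatter(SAMPLES, LABELS)
RANK = matrix_rank(S_W)  # bounded by n - c = 4 - 2, well below the 6 features
```

Mass spectra typically have thousands of m/z features and far fewer samples, so the same rank deficiency appears at a much larger scale.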