Thesis (Ph.D.)--University of Rochester. School of Medicine and Dentistry. Dept. of Biostatistics and Computational Biology, 2008.
Identifying those genes that are differentially expressed in individuals with
cancer could lead to new avenues of treatment or prevention. Gene array information
can be collected for an individual and paired with information about a
specific outcome. We consider binary outcomes and censored time-to-event (survival) outcomes.
Improved construction of prognostic gene signatures (risk scores based
on groups of genes that predict an individual’s outcome or risk thereof
with reasonable accuracy) would be a step forward in gene array research.
We consider the problem of selecting which variables (genes) to include in
a prognostic gene signature from high-dimensional data, within the logistic
regression and Cox proportional hazards model frameworks. Constructing a prognostic
gene signature from gene array data is very difficult because the number of
candidate genes greatly exceeds the sample size available in most studies. In
this case, commonly used statistical techniques perform poorly at choosing genes
that accurately predict the endpoint of interest.
We propose a search algorithm that lies between forward selection and searching
over all subsets. This method uses an evolving subgroup paradigm to intelligently
select variables over a relatively large model space. We show that our
method applied to the Cox proportional hazards model for censored survival time
data or the logistic regression model for binary data can yield significant improvements
in the number of predictive variables selected and the prediction error
compared with the LASSO method, forward selection, and univariate selection,
as demonstrated with a simulation study.
We also investigate validation methods. Censored survival time data confound
many common validation methods because the outcome is not always observed,
and splitting a data set can lead to a lack of uncensored observations in
some partitions. We pursue validation with the goal of selecting the ideal number
of variables to include. We propose a robust version of partial likelihood
cross-validation and modified versions of the c-index as metrics to select model
complexity. We propose analogous methods in the logistic model setting.
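As a point of reference for the c-index mentioned above, the following is a minimal sketch of the standard, unmodified Harrell concordance index for right-censored data. It is not the modified version proposed in this thesis, and the function and argument names are illustrative assumptions. A pair of individuals is usable only when the shorter observed time is an uncensored event, which is why heavy censoring in a small partition can leave few usable pairs.

```python
from itertools import combinations


def harrell_c_index(times, events, risk_scores):
    """Standard Harrell c-index (illustrative sketch, not the thesis's variant).

    times:       observed follow-up times
    events:      1 if the event was observed, 0 if censored
    risk_scores: higher score means higher predicted risk (shorter survival)
    """
    concordant = 0.0
    usable = 0
    for i, j in combinations(range(len(times)), 2):
        if times[i] == times[j]:
            continue  # tied times are skipped in this simplified sketch
        # a is the individual with the shorter observed time
        a, b = (i, j) if times[i] < times[j] else (j, i)
        if not events[a]:
            continue  # shorter time is censored, so the true ordering is unknown
        usable += 1
        if risk_scores[a] > risk_scores[b]:
            concordant += 1.0  # higher risk failed first: concordant
        elif risk_scores[a] == risk_scores[b]:
            concordant += 0.5  # tied scores count as half
    if usable == 0:
        raise ValueError("no usable pairs; too few uncensored observations")
    return concordant / usable
```

A value of 0.5 corresponds to random ordering and 1.0 to perfect ordering of the usable pairs. The `ValueError` branch illustrates the partitioning problem described above: a validation fold with no uncensored observations gives no usable pairs at all.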
We apply our new methods to publicly available lymphoma gene array data
(Rosenwald et al., 2002) with 7399 genes and 240 individuals. After splitting
the full data into training and validation portions, running our method on the
training data set leads to predictive results in the validation data set.