
Subset Selection for High-Dimensional Data, with Applications to Gene Array Data

URL to cite or link to: http://hdl.handle.net/1802/8411

APearsonPhDThesis.pdf   1.20 MB
UR-only until 01/2012
Thesis (Ph.D.)--University of Rochester. School of Medicine and Dentistry. Dept. of Biostatistics and Computational Biology, 2008.
Identifying genes that are differentially expressed in individuals with cancer could lead to new avenues of treatment or prevention. Gene array information can be collected for an individual and paired with information about a specific outcome; we consider binary outcomes and censored time-to-event (survival) outcomes. Improved construction of prognostic gene signatures (risk scores based on groups of genes that predict an individual’s outcome, or risk thereof, with reasonable accuracy) would be a step forward in gene array research. We consider the problem of selecting which variables (genes) to include in a prognostic gene signature from high-dimensional data, within the frameworks of the logistic regression model and the Cox proportional hazards model. Constructing a prognostic gene signature from gene array data is very difficult because the number of candidate genes greatly exceeds the sample size available in most studies. In this setting, commonly used statistical techniques perform poorly at choosing genes that accurately predict the endpoint of interest.
We propose a search algorithm that lies between forward selection and an exhaustive search over all subsets. This method uses an evolving subgroup paradigm to intelligently select variables over a relatively large model space. In a simulation study, we show that our method, applied to the Cox proportional hazards model for censored survival time data or to the logistic regression model for binary data, can yield significant improvements in both the number of predictive variables selected and the prediction error, compared with the LASSO, forward selection, and univariate selection.
We also investigate validation methods. Censored survival time data confound many common validation methods because the outcome is not always observed, and splitting a data set can leave some partitions lacking uncensored observations. We pursue validation with the goal of selecting the ideal number of variables to include. We propose a robust version of partial likelihood cross-validation and modified versions of the c-index as metrics for selecting model complexity, along with analogous methods for the logistic model setting. We apply our new methods to publicly available lymphoma gene array data (Rosenwald et al., 2002) with 7399 genes and 240 individuals. After splitting the full data into training and validation portions, running our method on the training data set leads to predictive results in the validation data set.
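For readers unfamiliar with the c-index mentioned in the abstract, the following is a minimal sketch of the standard (Harrell) concordance index for right-censored data. It is not the modified c-index proposed in the thesis, whose details are not given here; the function name `harrell_c_index` and its argument names are illustrative assumptions. The sketch also shows why censoring complicates validation: a pair of individuals is only usable when the shorter observed time is an actual event, so a partition with few uncensored observations yields few usable pairs.

```python
from itertools import combinations


def harrell_c_index(times, events, risk_scores):
    """Standard Harrell concordance index (illustrative sketch only).

    times:       observed follow-up times
    events:      1 if the event was observed, 0 if the time is censored
    risk_scores: higher score means higher predicted risk (shorter survival)
    """
    concordant = 0.0
    usable = 0
    for i, j in combinations(range(len(times)), 2):
        if times[i] == times[j]:
            continue  # tied times are skipped in this simplified version
        # Let `a` be the individual with the shorter observed time.
        a, b = (i, j) if times[i] < times[j] else (j, i)
        if not events[a]:
            # The shorter time is censored, so the true ordering of the
            # two event times is unknown and the pair cannot be scored.
            continue
        usable += 1
        if risk_scores[a] > risk_scores[b]:
            concordant += 1.0  # higher risk paired with earlier event
        elif risk_scores[a] == risk_scores[b]:
            concordant += 0.5  # tied predictions count as half
    if usable == 0:
        raise ValueError("no usable pairs: too few uncensored observations")
    return concordant / usable
```

A value of 1.0 means the risk scores order every usable pair correctly, 0.5 is what uninformative scores would give on average, and the `ValueError` branch corresponds to the situation the abstract describes, where a data split leaves a partition without enough uncensored observations.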
Contributor(s):
Alexander T. Pearson - Author

Derick R. Peterson - Thesis Advisor

Primary Item Type:
Thesis
Language:
English
Subject Keywords:
Subset Selection; Gene Array; High Dimensional Data
Date will be made available to public:
2012-01-01   
License Grantor / Date Granted:
Susan Love / 2009-11-04 08:23:44.182
Date Deposited:
2009-11-04 08:23:44.182
Date Last Updated:
2012-09-26 16:35:14.586719
Submitter:
Susan Love

Copyright © This item is protected by copyright, with all rights reserved.

All Versions

Name:
Subset Selection for High-Dimensional Data, with Applications to Gene Array Data
Version:
1
Created Date:
2009-11-04 08:23:44.182