Master's Thesis

A machine learning approach for identifying subtypes of autism

Applying data mining techniques to categorize Autism Spectrum Disorder (ASD) has great potential for improving our understanding of this widespread and complex medical condition. Since the 1960s, the Autism Research Institute (ARI) has been collecting data on the symptoms of children with ASD. The raw ARI database has over 120 factors and over 40,000 unique entries. In this thesis, we define a novel four-step approach to identifying subtypes in this Autism data set. Although our research is applied to the ARI data set, the approach can be applied to other data sets and can use various machine learning algorithms. The four steps of our approach are 1) pre-process the data, 2) apply clustering algorithms, 3) apply classification algorithms, and 4) evaluate the results. We used the Weka data mining tool to implement these four steps. From Weka, we used the Expectation Maximization (EM) clustering algorithm and the JRip classification algorithm. In addition, we developed our own Weka clustering algorithm adapted from Minimum Message Length (MML) communication theory. Using our approach and these machine learning algorithms, we were able to identify twenty-five unique subtypes and nine major overarching types. These results show how useful data mining can be in many fields, especially in medical taxonomy. Through the evaluation step of our approach, we are able to show which machine learning algorithms are more efficient with the ARI database and other similar data sets. In addition, our approach lays the groundwork for future research in the emerging fields of data mining and Autism.

The research in this thesis makes a twofold contribution to the sciences. The first is to the Computer Science community: we define a robust data mining approach that can be applied to various data sets, and we have created a novel adaptation of the MML algorithm and introduced it into the Weka palette of algorithms. The second is to the Autism community. The recent rise in occurrences of this disorder has created a pressing need to understand its many aspects. Our research is a first step toward segmenting the Autism Spectrum Disorder into smaller, comprehensible facets.

Keywords: Data Mining, Machine Learning, Autism, Clustering, Classification, Expectation Maximization algorithm, Minimum Message Length algorithm, JRip algorithm.
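
The four-step approach summarized above (pre-process, cluster, classify, evaluate) can be illustrated with a minimal sketch. This is not the thesis implementation, which used Weka with EM, an MML-based clusterer, and JRip. Here a toy k-modes clusterer stands in for the clustering step, a single-attribute rule learner (OneR style) stands in for JRip, and the records, attribute values, and function names are all hypothetical. Evaluation is done on the training records purely to keep the example short.

```python
from collections import Counter

def preprocess(rows):
    """Step 1: drop records that contain a missing answer (None)."""
    return [r for r in rows if None not in r]

def cluster(rows, k, iters=10):
    """Step 2: toy k-modes clustering (a stand-in for EM or MML)."""
    # Seed the centers with the first k distinct records.
    centers = []
    for r in rows:
        if r not in centers:
            centers.append(r)
        if len(centers) == k:
            break
    labels = []
    for _ in range(iters):
        # Assign each record to the center with the fewest mismatches.
        labels = [
            min(range(len(centers)),
                key=lambda c: sum(a != b for a, b in zip(r, centers[c])))
            for r in rows
        ]
        # Move each center to the per-attribute mode of its members.
        for c in range(len(centers)):
            members = [r for r, lab in zip(rows, labels) if lab == c]
            if members:
                centers[c] = tuple(Counter(col).most_common(1)[0][0]
                                   for col in zip(*members))
    return labels

def one_rule(rows, labels):
    """Step 3: learn one-attribute rules (a stand-in for JRip)."""
    best = None
    for attr in range(len(rows[0])):
        table = {}
        for r, lab in zip(rows, labels):
            table.setdefault(r[attr], Counter())[lab] += 1
        # Each attribute value predicts its most common cluster label.
        rule = {v: cnt.most_common(1)[0][0] for v, cnt in table.items()}
        correct = sum(rule[r[attr]] == lab for r, lab in zip(rows, labels))
        if best is None or correct > best[2]:
            best = (attr, rule, correct)
    return best[0], best[1]

def evaluate(rows, labels, attr, rule):
    """Step 4: fraction of records whose cluster the rule reproduces."""
    hits = sum(rule.get(r[attr]) == lab for r, lab in zip(rows, labels))
    return hits / len(rows)

# Hypothetical records: three yes/no symptom answers per child.
records = [
    ("yes", "no", "no"),
    ("yes", "no", "yes"),
    ("yes", None, "no"),   # incomplete record, removed in step 1
    ("no", "yes", "yes"),
    ("no", "yes", "no"),
    ("no", "yes", "yes"),
]

clean = preprocess(records)
labels = cluster(clean, k=2)
attr, rule = one_rule(clean, labels)
accuracy = evaluate(clean, labels, attr, rule)
print(labels, attr, rule, accuracy)
```

In this sketch the classification step serves to describe each cluster with a human-readable rule, so a high agreement score in step 4 suggests the clusters can be characterized by a small number of attributes.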
