Statistical Methods for Identifying Demographic Structure in DNA Sequence Alignments

Rohrlach, Adam Benjamin

Please use this identifier to cite or link to this item: https://hdl.handle.net/2440/120353

Type:	Thesis
Title:	Statistical Methods for Identifying Demographic Structure in DNA Sequence Alignments
Author:	Rohrlach, Adam Benjamin
Issue Date:	2018
School/Discipline:	School of Mathematical Sciences
Abstract:	All life on Earth, from viruses and bacteria, trees and flowers, to birds and human beings, can be traced back to a single common ancestor. However, the evolutionary history that led to this diversity of life is a complicated story that we do not yet fully understand. Since the discovery of the structure of deoxyribonucleic acid (DNA) in 1953, and the development of DNA sequencing technology, researchers have been using similarities and differences in the genomes of organisms to better understand the relationships between species. However, due to the complexity of the evolutionary history of life, simplifying assumptions must be made to make mathematical models tractable. It is therefore of paramount importance for researchers to be able to identify when the simplifying assumptions of a specific model are unreasonable. In this thesis we present two projects and, although they differ in implementation, both investigate simplifying assumptions in the closely related fields of population genetics and phylogenetics. We also present applications of our projects where the results of our work are not used to assess assumptions for further analyses, but are of standalone interest to researchers. Our first project is concerned with the development of a method for constructing coordinate representations for single-copy DNA, such as mitochondrial DNA (mtDNA) or Y-chromosomal DNA, analogous to the use of PCA for nuclear DNA. We construct a coordinate system that, given p informative sites in an alignment of n individuals, returns p-dimensional coordinates for each of the n individuals. We order the dimensions by the proportion of variability in the overall genetic diversity that each dimension captures. From these coordinates in "genetic space" researchers may perform a number of downstream analyses. It is possible to optimally visualise high-dimensional sequence data in two or three dimensions.
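As a rough illustration of the idea of coordinates in genetic space, the following sketch applies an ordinary PCA (via SVD) to one-hot encoded variable sites of a toy alignment. This is an assumption made for illustration only and is not the method developed in the thesis; the function name `genetic_coordinates` and the toy sequences are hypothetical.

```python
import numpy as np

def genetic_coordinates(alignment):
    """PCA-style stand-in: coordinates for each sequence in an alignment.

    `alignment` is a list of equal-length strings. Returns the coordinates
    (one row per individual) and the proportion of variance per dimension.
    """
    seqs = np.array([list(s) for s in alignment])  # n x L character matrix
    # Keep only variable sites (more than one allele observed).
    variable = [j for j in range(seqs.shape[1]) if len(set(seqs[:, j])) > 1]
    # One-hot encode every allele observed at every variable site.
    columns = []
    for j in variable:
        for allele in sorted(set(seqs[:, j])):
            columns.append((seqs[:, j] == allele).astype(float))
    X = np.column_stack(columns)
    X -= X.mean(axis=0)  # centre each indicator column
    # SVD returns singular values in descending order, so the dimensions
    # are already ordered by the variance they capture.
    U, S, _ = np.linalg.svd(X, full_matrices=False)
    coords = U * S
    explained = S**2 / np.sum(S**2)
    return coords, explained

# Hypothetical toy alignment: sites 1 and 4 are perfectly linked.
toy = ["ACGTA", "ACGTT", "TCGAA", "TCGAT"]
coords, explained = genetic_coordinates(toy)
```

In this toy case the first dimension separates the two sequences beginning with A from the two beginning with T, because the two linked sites together carry most of the variation.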
One may use our method to identify closely related individuals, to identify sites in the alignment that are closely linked, or to use the same coordinate space to find sites that are closely linked with groups of individuals. Finally, one may choose to test for significant relationships between the structure of the coordinates in genetic space and metadata recorded on sequenced individuals, indicating demographic variables that are highly related to the evolutionary history of an alignment. This final application of our method, where one may test for demographic structure in sequence data, is of key importance to the theme of discovering when the simplifying assumptions of analyses are not reasonable. By comparing coordinates in genetic space with any demographic variables of interest, researchers may explore whether or not the individuals in the alignment indicate population substructure. For example, one may investigate whether there appears to be a phylogeographic structure to the individuals forming distinct subpopulations, and whether migration appears to occur between subpopulations. Using empirical data, we show that our method can readily recover tree-like structure, identify strong genetic groupings based on qualitative traits, and recover phylogeographic signal given provenanced sampling information. We show that our method can even be used to suggest routes of migration based on mtDNA. Finally, we apply our method to modern Aboriginal Australian mtDNA to show strong evidence for discrete geographic populations of Aboriginal Australian peoples that display permanence on the Australian landscape dating back to the original colonisation of Australia 50 thousand years before present (kya). Our second project is concerned with identifying departures from a tree-like evolutionary history at the species level.
It is not uncommon for closely related species (Species A and C say) to still be capable of interbreeding, and producing viable "hybrid" offspring (Species B say). Under these conditions, a phylogenetic tree cannot describe the evolutionary history of the hybrid species, and instead an admixture graph may be a better description. We begin by considering the evolutionary history of three species: a hybrid organism that has undergone some independent evolution (Species B), and two "parent" organisms, Species A and C. Relatively long, contiguous regions of the genome of Species B will have undergone no recombination since the admixture event. These regions will have been contributed by either Species A (and hence will be more closely related to Species A), or Species C. We aim to estimate the proportion of the genome contributed by Species A, denoted γ, by considering the proportion of informative site patterns that indicate evidence for the two possible ancestries. The mixing proportion γ is the parameter of interest in our analyses. However, due to the classical problem of the non-identifiability of mixing parameters in multinomial distributions, we describe two Bayesian methods for estimating γ. Our first method places prior distributions on the parameters of the model, and uses Approximate Bayesian Computation (ABC) to estimate the marginal posterior distribution of γ. Our second, closely related method instead estimates the marginal posterior distribution of γ via numerical integration. We show via a simulation study that our methods can accurately estimate the true value of γ, and perform well under biologically reasonable scenarios. However, we also find that our methods suffer from a relatively small positive bias for small values of γ, i.e., when one of the parent species contributes very little to the genome of the hybrid species. We compare the performance of our method to the popular method of the ratio of f4 statistics.
We do this by estimating the proportion of Neanderthal ancestry in pre-ice age European human samples and comparing our results to the findings of Fu et al. [18]. We show that our method recovers extremely similar estimates of Neanderthal ancestry, with no apparent systematic bias when compared to the results of Fu et al. Finally, we apply our method to the genomes of Late Pleistocene European bison (Bison bonasus) and Steppe bison (Bison priscus) to understand the evolutionary history of bovid megafauna in Europe over the last seventy thousand years. It was thought that before 10 kya the only bovid present in Europe was the Steppe bison. However, in bone samples dating from the present day back to approximately 70 kya, mtDNA indicated that a second bison species, more closely related to modern cattle than to the Steppe bison, was also roaming Europe before 10 kya. After nuclear DNA was sequenced, we were able to show that this new species of bovid was actually a hybrid offspring of Aurochs (the ancestor of modern cattle) and Steppe bison, from a hybridisation event that occurred approximately 120 kya. We used our method, in concert with the ratio of f4 statistics, to show that the hybrid species contained approximately 10% Aurochs and 90% Steppe bison ancestry.
Advisor:	Bean, Nigel; Tuke, Jonathan; Holland, Barbara
Dissertation Note:	Thesis (Ph.D.) -- University of Adelaide, School of Mathematical Sciences, 2018
Keywords:	Population genetics; phylogeography; mtDNA; admixture; bioinformatics
Provenance:	This electronic version is made publicly available by the University of Adelaide in accordance with its open access policy for student theses. Copyright in this thesis remains with the author. This thesis may incorporate third party material which has been used by the author pursuant to Fair Dealing exceptions. If you are the owner of any included third party copyright material you wish to be removed from this electronic version, please complete the take down form located at: http://www.adelaide.edu.au/legals
Appears in Collections:	Research Theses

Files in This Item:

File	Description	Size	Format
Rohrlach2019_PhD.pdf		20.13 MB	Adobe PDF

Adelaide Research & Scholarship