High-Throughput Sequencing And Natural Selection: Studies Of Recent Sweep Inferences And A New Computational Approach For Transcription Identification

Wang, Zhen

Files

zw47.pdf (2.64 MB)

Permanent Link(s)

https://hdl.handle.net/1813/38964

Collections

Cornell Theses and Dissertations

Author(s)

Wang, Zhen

Abstract

Short-read high-throughput sequencing is the most popular approach for collecting massive amounts of DNA sequence data at declining cost in nearly all fields of current biological research. Its many varieties have been employed for different research purposes, e.g. genomic sequencing for variant detection and RNA-seq for transcriptome profiling. However, the individual reads and the resulting called sequences frequently have missing and error-prone base calls, and appropriate corrections and evaluations are necessary for drawing conclusions. I examined how missing data and sequence errors affect the power and prediction accuracy of two frequently used methods for inferring recent positive selection from such datasets. I showed that the variant-frequency-based method, SweepFinder, is very sensitive to data quality: its sensitivity and prediction accuracy are greatly compromised by missing data or sequence errors. In contrast, the haplotype-based method, iHS, is very robust to missing data and sequence errors and is able to efficiently detect signals of recent selective sweeps with a very low false discovery rate. I then applied four different computational approaches to high-throughput resequencing data from a 2.1 Mbp segment of the Drosophila melanogaster X chromosome to compare and discuss their performance. The study emphasized the advantages of linkage disequilibrium-based methods over site frequency-based approaches in detecting recent sweeps when applied to incomplete data. There are also many challenges in other applications of high-throughput sequencing, including the discovery of novel transcription active regions (TARs) in RNA-seq analysis. Here, I present a flexible statistical program, HPIBD (HMM-based Peak Identification and Boundary Definition), for de novo analysis of RNA-seq datasets. It avoids the use of arbitrary read-depth cutoffs and has built-in tolerance to read gaps. It makes statistical TAR predictions, estimates peak boundaries, and evaluates the confidence in each prediction. I implemented the model and showed that HPIBD performs robustly under various validations and when benchmarked against Cufflinks.

Date Issued

2014-08-18

Keywords

Population Genetics; Natural Selection; High-throughput Sequencing

Committee Chair

Aquadro, Charles F

Committee Member

Clark, Andrew
Keinan, Alon

Degree Discipline

Genetics

Degree Name

Ph. D., Genetics

Degree Level

Doctor of Philosophy

Types

dissertation or thesis
