UBC Theses and Dissertations

Copy number estimation for high-throughput short read shotgun sequencing de novo whole-genome assembly contigs

Lim, Yee Fay

Abstract

High-throughput short shotgun sequencing reads, also known as second-generation sequencing (SGS) reads, continue to be prevalent for de novo whole-genome assembly, whether alone or in combination with long-range information. Knowledge of contig multiplicity (copy number) is acknowledged to improve assembly correctness, contiguity, and coverage for SGS reads. Despite that, a principled, general solution for contig copy number estimation in de novo whole-genome SGS assembly has been unavailable. In the literature, the problem is generally unaddressed or given heuristic treatment. In this work, we introduce a novel, versatile, statistically informed contig copy number estimator, based on mixture models, for high-throughput short read shotgun sequencing de novo whole-genome assembly. In particular, this tool targets de Bruijn graph assembly, the dominant paradigm for de novo whole-genome SGS assembly. We show that it performs reliably at resolving multiplicities up to low repeat copy numbers; it is also robust over a range of genome characteristics, sequencing coverage levels, and assembly settings. Moreover, it is far more versatile than the closest existing alternative tools and usually outperforms them, often by a wide margin. At the same time, somewhat reduced though still robust performance in a limited set of experiments using real sequencing data suggests fundamental limitations to its usage of only length and read coverage data; incorporating other types of information, e.g. GC content, may be necessary to improve performance. Our code is publicly available at https://github.com/bcgsc/wgs-copynum-est; we hope this effort will provide a useful reference for similar future work.

Rights

Attribution-NonCommercial-ShareAlike 4.0 International