Contributions to the detection of non-reference sequences in population-scale NGS data

Krannich, Thomas (2022) Contributions to the detection of non-reference sequences in population-scale NGS data. PhD thesis, Freie Universität Berlin.

Full text not available from this repository.

Official URL: http://dx.doi.org/10.17169/refubium-34757

Abstract

Non-reference sequence (NRS) variants are a less frequently investigated class of genomic structural variants (SVs): DNA sequences found within an individual that are novel with respect to a given reference. NRS occur predominantly because a linear reference genome that was derived primarily from one or a few individuals lacks biological diversity and ancestral sequence. Newly sequenced individuals can therefore yield genomic sequences that are absent from the reference genome. With the increasing throughput of sequencing technologies, SV detection has become possible across tens of thousands of individuals. When using short-read data, the detection of NRS variants inevitably involves a de novo assembly, which is a complex computational problem and requires high-quality sequence data at high coverage. Previous studies have demonstrated how sequence data from multiple genomes can be combined for the reliable detection of NRS variants. However, the algorithms proposed in these studies have a limited capability to process large sets of genomes. This thesis introduces novel contributions for the discovery of NRS variants in many genomes, which scale to considerably larger numbers of genomes than previous methods. PopIns2, a practical software tool developed to apply the presented methods, is described in detail. The highlight among the new contributions is a procedure that merges contig assemblies of unaligned reads from many individuals into a single set of NRS by heuristically generating a weighted minimum path cover for a colored de Bruijn graph. Tests on simulated data show that PopIns2 ranks among the best approaches in terms of quality and reliability and that it yields the best precision as the number of genomes processed grows. Results on the Polaris Diversity Cohort and a set of 1000 Icelandic human genomes demonstrate unmatched scalability on population-scale datasets.

Item Type:	Thesis (PhD)
Subjects:	Mathematical and Computer Sciences > Computer Science
Divisions:	Department of Mathematics and Computer Science > Institute of Computer Science > Algorithmic Bioinformatics Group
ID Code:	2854
Deposited By:	Anja Kasseckert
Deposited On:	05 Sep 2022 13:50
Last Modified:	05 Sep 2022 13:50
