Approximate string matching for high-throughput sequencing

Siragusa, Enrico (2015) Approximate string matching for high-throughput sequencing. PhD thesis, Freie Universität Berlin.

Full text not available from this repository.

Official URL: https://refubium.fu-berlin.de/handle/fub188/11364

Abstract

Over the past years, high-throughput sequencing (HTS) has become an invaluable method of investigation in molecular and medical biology. HTS technologies make it possible to sequence an individual’s DNA sample cheaply and rapidly, in the form of billions of short DNA reads. The ability to assess the content of a DNA sample at base-level resolution opens the way to a myriad of applications, including individual genotyping and assessment of large structural variations, measurement of gene expression levels, and characterization of epigenetic features. Nonetheless, the quantity and quality of data produced by HTS instruments call for computationally efficient and accurate analysis methods.

In this thesis, I present novel methods for the mapping of high-throughput sequencing DNA reads, based on state-of-the-art approximate string matching algorithms and data structures. Read mapping is a fundamental step of any HTS data analysis pipeline in resequencing projects, where DNA reads are reassembled by aligning them back to a previously known reference genome. The ingenuity of approximate string matching methods is crucial to the design of efficient and accurate read mapping tools.

In the first part of this thesis, I cover practical indexing and filtering methods for exact and approximate string matching. I present state-of-the-art algorithms and data structures, give their pseudocode and discuss their implementation. Furthermore, I provide all implementations within SeqAn, the generic C++ template library for sequence analysis, which is freely available at http://www.seqan.de/. Subsequently, I experimentally evaluate all implemented methods, with the aim of guiding the engineering of new sequence alignment software. To the best of my knowledge, this is the first study providing a comprehensive exposition, implementation and evaluation of such methods.

In the second part of this thesis, I turn to the engineering and evaluation of read mapping tools. First, I present a novel method to find all mapping locations per read within a user-defined error rate; this method is published in the peer-reviewed journal Nucleic Acids Research and packaged in an open-source tool nicknamed Masai. Afterwards, I generalize this method to quickly report all co-optimal or suboptimal mapping locations per read within a user-defined error rate; this method, packaged in a tool called Yara, provides a more practical, yet sound solution to the read mapping problem. Extensive evaluations, both on simulated and real datasets, show that Yara has better speed and accuracy than de facto standard read mapping tools.
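
As an illustration of what "all mapping locations within a user-defined error rate" means, the following minimal Python sketch reports every end position in a reference at which a read matches with an edit distance within the error budget. It uses the classic dynamic-programming formulation often attributed to Sellers, not the indexing and filtering methods of the thesis, and it is not how Masai or Yara work internally. The function name `find_approximate_matches`, its parameters, and the rounding of the error budget are assumptions made for this example only.

```python
def find_approximate_matches(read, reference, error_rate):
    """Return (end_position, edit_distance) pairs for every position in
    `reference` where `read` matches some substring ending there within
    the error budget. Hypothetical helper, for illustration only."""
    m = len(read)
    # Assumption: the error budget is the error rate times the read length,
    # rounded down.
    max_errors = int(error_rate * m)

    # column[i] holds the minimum edit distance between read[:i] and any
    # substring of the reference ending at the current reference position.
    # column[0] stays 0 because a match may start anywhere in the reference.
    column = list(range(m + 1))
    matches = []

    for j, ref_char in enumerate(reference):
        previous_diagonal = column[0]
        for i in range(1, m + 1):
            cost = 0 if read[i - 1] == ref_char else 1
            current = min(
                previous_diagonal + cost,  # match or substitution
                column[i] + 1,             # extra reference character
                column[i - 1] + 1,         # extra read character
            )
            previous_diagonal = column[i]
            column[i] = current
        if column[m] <= max_errors:
            # Report the 1-based end position of the match in the reference.
            matches.append((j + 1, column[m]))

    return matches


if __name__ == "__main__":
    # Toy data: a 4-base read against a 14-base reference, 25% error rate.
    for end, distance in find_approximate_matches("ACGT", "TTACGTTTAGGTCC", 0.25):
        print(end, distance)
```

This sketch runs in time proportional to the read length times the reference length, which is impractical for billions of reads against a whole genome; that cost is the motivation for the indexing and filtering methods covered in the first part of the thesis.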

Item Type:	Thesis (PhD)
Subjects:	Mathematical and Computer Sciences > Computer Science
Divisions:	Department of Mathematics and Computer Science > Institute of Computer Science > Algorithmic Bioinformatics Group
ID Code:	2507
Deposited By:	Anja Kasseckert
Deposited On:	04 Mar 2021 14:34
Last Modified:	04 Mar 2021 14:34
