Indices and Applications in High-Throughput Sequencing

Weese, D. (2013) Indices and Applications in High-Throughput Sequencing. PhD thesis, Freie Universität Berlin.

PDF (PhD Thesis) - Published Version
2MB

Official URL: http://www.diss.fu-berlin.de/diss/receive/FUDISS_t...

Abstract

Recent advances in sequencing technology make it possible to produce billions of base pairs per day in the form of reads of length 100 bp and longer, and current developments promise the personal $1,000 genome in a couple of years. The analysis of these unprecedented amounts of data demands efficient data structures and algorithms. One such data structure is the substring index, which represents all substrings, or all substrings up to a certain length, contained in a given text. In this thesis, we first propose three substring indices, which we extend to be applicable to millions of sequences. We devise internal and external memory construction algorithms and a uniform framework for accessing the generalized suffix tree. Additionally, we propose different index-based applications, e.g. exact and approximate pattern matching and different repeat search algorithms. Second, we present the read mapping tool RazerS, which aligns millions of single or paired-end reads of arbitrary lengths to their potential genomic origin using either Hamming or edit distance. Our tool can work either losslessly or with a user-defined loss rate at higher speeds. Given the loss rate, we present a novel approach that guarantees not to lose more reads than specified. This enables the user to adapt to the problem at hand and provides a seamless tradeoff between sensitivity and running time. We compare RazerS with other state-of-the-art read mappers and show that it has the highest sensitivity and a comparable performance on various real-world datasets. Finally, we propose a general approach for frequency-based string mining, which has many applications, e.g. in contrast data mining. Our contribution is a novel and lightweight algorithm that is faster and uses less memory than the best available algorithms. We show its applicability for mining multiple databases with a variety of frequency constraints. As such, we use the notion of entropy from information theory to generalize the emerging substring mining problem to multiple databases. To demonstrate the improvement of our algorithm, we compare it with recent approaches in experiments on real-world data from various string domains, e.g. natural language, DNA, or protein sequences.
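
To make the idea of a substring index concrete, the following is a minimal sketch, not the thesis's implementation and not the SeqAn API. It assumes a plain suffix array as the index and a naive construction that is only suitable for toy inputs; the function names `build_suffix_array` and `find_occurrences` are hypothetical and chosen for illustration. Exact pattern matching then reduces to two binary searches over the sorted suffixes.

```python
def build_suffix_array(text: str) -> list[int]:
    """Return the start positions of all suffixes of `text` in lexicographic order.

    Naive construction by sorting suffix strings (illustrative only; practical
    indices use dedicated linear-time or external-memory algorithms).
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text: str, sa: list[int], pattern: str) -> list[int]:
    """Return all start positions of `pattern` in `text`, in increasing order."""
    m = len(pattern)

    # First binary search: leftmost suffix whose length-m prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # Second binary search: leftmost suffix whose length-m prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid

    # Every suffix in sa[start:lo] begins with the pattern.
    return sorted(sa[start:lo])


if __name__ == "__main__":
    genome = "GATTACAGATTACA"  # toy text, assumed for illustration
    index = build_suffix_array(genome)
    print(find_occurrences(genome, index, "TTA"))
```

Because the suffixes are sorted, all occurrences of a pattern form one contiguous range of the array, so each query costs a logarithmic number of string comparisons regardless of how often the pattern occurs.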

Item Type:	Thesis (PhD)
Uncontrolled Keywords:	HTS; full-text index; frequency string mining; read mapping; SeqAn
Subjects:	Biological Sciences; Mathematical and Computer Sciences
Divisions:	Department of Mathematics and Computer Science > Institute of Computer Science > Algorithmic Bioinformatics Group
ID Code:	1288
Deposited By:	AG Alg BioInf
Deposited On:	15 Jun 2013 19:50
Last Modified:	03 Mar 2017 14:41