Rahn, René (2023) Performance-Driven Algorithm Engineering. PhD thesis, Freie Universität Berlin.
Full text not available from this repository.
Official URL: http://dx.doi.org/10.17169/refubium-41699
Abstract
A number of technological advancements in high-throughput genome sequencing have led to the generation of exabyte-scale sequencing data worldwide in recent years. These developments have facilitated large-scale resequencing projects like the 1000 Genomes Project, which aim to catalog the genetic diversity of organisms and specific populations. In the context of medical research, incorporating population data, is of particular interest. However, existing applications are often optimised for analysing a few sequences, and the methods used cannot be easily transferred to these vast datasets without exceeding system resources or producing results within a suitable time frame. Simultaneously, the execution model of processors has evolved from sequential to highly parallel process execution, thanks to the addition of processor cores, vector processing units, and advances in superscalar processor designs. This ongoing development in high-performance computing requires applications and algorithms to scale with the growing levels of parallelism. However, highly optimised algorithms are often embedded within applications, making them practically inaccessible to the scientific community. In this dissertation, we first investigate a generic approach to parallelise and vectorise pairwise sequence alignments using a dynamically scalable concurrency model. We explore various techniques and code-level optimisations to effectively utilise the available parallelisms on modern high-performance CPUs. The results demonstrate that the dynamically accelerated pairwise sequence alignment scales proportionally with the number of cores and provides speed-ups of up to a factor of 2500 compared to the sequential reference implementation on modern hardware. Second, we propose a general solution for data-compressed acceleration of pattern matching algorithms by compressing a large collection of related DNA sequences and providing a set of composable algorithms to refine and optimise the applicable operations. Our research on data-compressed acceleration shows hundred-fold speed-ups for online searching over a pangenome comprising over 5000 reference sequences compared to the naive approach of individually searching all sequences. These speed-ups are achieved while utilising only a fraction of the main memory. Moreover, we implemented these features in dedicated modules of the open-source software library SeqAn (https://www.seqan.de/), making them accessible and adoptable by the entire research community. While doing so, we strived for a user-friendly API design so that these methods can be easily customised and extended, making them applicable to a wide range of applications in the domain of computational sequence analysis.
Item Type: | Thesis (PhD) |
---|---|
Subjects: | Mathematical and Computer Sciences > Computer Science |
Divisions: | Department of Mathematics and Computer Science > Institute of Computer Science > Algorithmic Bioinformatics Group |
ID Code: | 3251 |
Deposited By: | Anja Kasseckert |
Deposited On: | 06 Feb 2025 12:50 |
Last Modified: | 06 Feb 2025 12:52 |
Repository Staff Only: item control page