Genome Analysis Methods using Long Read Nanopore Sequencing

Giesselmann, Pay (2021) Genome Analysis Methods using Long Read Nanopore Sequencing. PhD thesis, Freie Universität Berlin.

Full text not available from this repository.

Official URL: http://dx.doi.org/10.17169/refubium-32662

Abstract

Third-generation long-read technologies are the latest advance in high-throughput DNA and RNA sequence analysis. Complementing the widespread second-generation short-read platforms, long-read sequencing opens unique applications by generating previously unattainable read lengths. Despite an error rate that remains higher than that of short reads, single-molecule real-time (SMRT) sequencing and nanopore sequencing have become state of the art for de novo genome assembly and the identification of structural variants. Continuous improvements in throughput and accuracy drive the rapid development of novel methods and applications. We identify major application fields and key bioinformatics software for long-read sequencing data analysis through a data-driven literature survey. Integrating citations and keywords into a literature graph provides a scalable approach to analyzing the exponentially growing number of publications related to third-generation sequencing. Although nanopore sequencing has sparked the development of countless bioinformatics tools, streamlined processing of nanopore data into standardized formats is still lacking. As an enabling step for its successful application, we developed Nanopype, a modular and scalable pipeline. Our approach covers the basic steps of basecalling, alignment, and methylation and structural variant detection, with exchangeable tools in each module. We propose a raw data management scheme, optimized for high-performance compute clusters, that handles multiple sequencing devices placed locally and remotely. Strict version control of the integrated tools and deployment as containerized software ensure reproducibility across projects and laboratories. Finally, we analyze disease-associated repeat regions using targeted nanopore sequencing and the Nanopype processing infrastructure.
The expansion of unstable genomic short tandem repeats (STRs) is of particular interest because it causes more than 30 Mendelian human disorders. Long stretches of repetitive sequence render these regions inaccessible to short-read sequencing by synthesis. Finding current nanopore basecalling algorithms insufficient to resolve the repeat length, we developed STRique, a software tool that detects and quantifies repeats from the raw nanopore signal. We demonstrate the precise analysis of repeat lengths in patient-derived samples containing C9orf72 and FMR1 repeat expansions. The additional integration of repeat and nearby promoter methylation levels reveals a repeat-length-dependent gain in methylation, suggesting an epigenetic response to the expansion. Taken together, this work increases the usability of third-generation nanopore sequencing and provides novel insights based on it.

Item Type:	Thesis (PhD)
Subjects:	Mathematical and Computer Sciences > Computer Science
Divisions:	Department of Mathematics and Computer Science > Institute of Computer Science > Algorithmic Bioinformatics Group
ID Code:	2852
Deposited By:	Anja Kasseckert
Deposited On:	05 Sep 2022 13:44
Last Modified:	05 Sep 2022 13:44
