Repository: Freie Universität Berlin, Math Department

Statistical analysis of high-throughput sequencing count data

Love, Michael I. (2013) Statistical analysis of high-throughput sequencing count data. PhD thesis, Freie Universität Berlin.

Full text not available from this repository.

Official URL: http://dx.doi.org/10.17169/refubium-9391

Abstract

High-throughput sequencing (HTS) refers to the simultaneous sequencing of millions of fragments of DNA, which can be either assembled to reconstitute a genome, or aligned to an existing reference genome. The protocol can be extended to assay a wide variety of biological states of the cell, including DNA copy number, mRNA abundance and various properties of chromatin. HTS experiments allow for these biological states to be quantified as read counts at genome-wide scale with a single experiment. Though the experiments are expensive and often datasets are produced with limited sample size, information can be shared across thousands of genomic ranges in order to obtain robust models which control for technical biases. In this thesis, I present three statistical models for analyzing HTS read count data, aimed at answering concise biological questions.

First, a hidden Markov model is developed for detecting copy number variants (CNVs) in individual samples while controlling for technical artifacts, such as variation in read counts due to local GC-content. Applied to a study of 248 male patients with X-linked intellectual disability, the model predicts 16 large CNVs, of which 10 candidate disease-causing CNVs were tested and all experimentally validated. The proposed software is then compared with state-of-the-art segmentation algorithms on normalized data, showing higher sensitivity while controlling the total rate of predicted CNVs.

Second, improvements for parameter estimation are made for a statistical model of differential gene expression from RNA-Seq data. The improvements involve the use of empirical Bayes priors (priors estimated using the observations from all genes) in order to moderate otherwise noisy estimates of dispersion and fold changes for individual genes. The improved model shows increased sensitivity and more robust estimation of fold change in comparison with other differential expression software packages for RNA-Seq.

Finally, a hierarchical Bayes model is used to associate transcription factor binding with chromatin and sequence features in regions of accessible chromatin. The hierarchical model incorporates three levels of parameters: one for individual experiments, one for experiments of the same cell type and one across all cell types. The model parameters are used to generate hypotheses regarding the DNA-binding behavior of a transcription factor, the glucocorticoid receptor.

In summary, this thesis describes a set of statistical methods for HTS read count data which can be used across various biological domains. The methods form a framework for robust estimation of variables and hypothesis testing.
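
The empirical Bayes idea mentioned in the abstract (a prior estimated from all genes, used to moderate noisy per-gene estimates) can be illustrated with a generic normal-normal shrinkage sketch. This is not the model developed in the thesis: the function name, the normal prior, the method-of-moments estimate of the prior variance and the example numbers are all assumptions made purely for illustration.

```python
from statistics import mean, pvariance


def shrink_estimates(estimates, std_errors):
    """Generic normal-normal empirical Bayes shrinkage (illustrative only).

    estimates:  noisy per-gene estimates, e.g. log fold changes (hypothetical)
    std_errors: their standard errors, all assumed to be > 0
    """
    # The prior mean is estimated from all genes at once.
    prior_mean = mean(estimates)

    # Method of moments: the spread of the observed estimates is roughly
    # prior variance + average sampling variance, so subtract the latter.
    sampling_var = mean([se ** 2 for se in std_errors])
    prior_var = max(pvariance(estimates) - sampling_var, 0.0)

    shrunken = []
    for y, se in zip(estimates, std_errors):
        # Weight on the gene's own estimate: close to 1 when its standard
        # error is small, close to 0 when its standard error is large.
        weight = prior_var / (prior_var + se ** 2)
        shrunken.append(prior_mean + weight * (y - prior_mean))
    return shrunken


# Made-up example values: the first and last estimates are the noisiest.
log_fold_changes = [2.0, -1.5, 0.3, -0.2, 0.1, 3.0]
std_errors = [1.0, 0.2, 0.2, 0.2, 0.2, 1.5]
moderated = shrink_estimates(log_fold_changes, std_errors)
```

In this sketch, estimates with large standard errors are pulled strongly toward the mean across all genes, while precisely measured estimates are left almost unchanged.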

Item Type: Thesis (PhD)
Subjects: Mathematical and Computer Sciences
Divisions: Department of Mathematics and Computer Science > Institute of Computer Science
ID Code: 2847
Deposited By: Anja Kasseckert
Deposited On: 05 Sep 2022 12:15
Last Modified: 08 Sep 2022 13:13
