Repository: Freie Universität Berlin, Math Department

Statistical models to capture protein-RNA interaction footprints from truncation-based CLIP-seq data

Krakau, Sabrina (2020) Statistical models to capture protein-RNA interaction footprints from truncation-based CLIP-seq data. PhD thesis, Freie Universität Berlin.

Full text not available from this repository.

Official URL:


Protein-RNA interactions play an important role in all post-transcriptional regulatory processes. High-throughput detection of protein-RNA interactions has been facilitated by the emerging CLIP-seq (crosslinking and immunoprecipitation combined with high-throughput sequencing) techniques. Enrichments in mapped reads as well as base transitions or deletions at crosslink sites can be used to infer binding regions. Single-nucleotide resolution has been achieved by techniques (iCLIP and eCLIP) that capture high fractions of cDNAs truncated at protein-RNA crosslink sites. Increasing numbers of datasets and derivatives of these protocols have been published in recent years, requiring tailored computational analyses. Existing methods, however, do not explicitly model the specifics of truncation patterns or possible biases caused by background binding or crosslinking sequence preferences. We present PureCLIP, a hidden Markov model based approach that simultaneously performs peak calling and individual crosslink site detection. It is capable of incorporating external data to correct for non-specific background signals and, for the first time, for crosslinking biases. We devised a comprehensive evaluation based on three strategies. Firstly, we developed a workflow to simulate iCLIP data, which starts from real RNA-seq data and known binding regions and then mimics the experimental steps of the iCLIP protocol, including the generation of background signals. Secondly, we used experimental iCLIP and eCLIP datasets, relying on the proteins’ known predominant binding regions. Thirdly, we assessed the agreement of called sites between replicates, assuming that target-specific signals are reproducible between replicates. On both simulated and real data, PureCLIP is consistently more precise in calling crosslink sites than other state-of-the-art methods.
In particular, when incorporating input control data and crosslink-associated motifs (CL-motifs), PureCLIP is up to 13% more precise than other methods, and we show that it has up to 20% higher agreement across replicates. Moreover, our method can optionally merge called crosslink sites into binding regions based on their distance, and we show that the resulting regions reflect the known binding regions with high resolution. Additionally, we demonstrate that our method achieves high precision robustly over a range of different settings and performs well for proteins with different binding characteristics. Lastly, we extended the method to include individual CLIP replicates and show that this can boost the precision even further. PureCLIP and its documentation are publicly available at
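
The optional merging of called crosslink sites into binding regions based on their distance can be illustrated with a minimal sketch. This is not PureCLIP's implementation: the function name `merge_sites`, the parameter `max_gap`, and its default value are hypothetical, and the sketch assumes all positions lie on a single chromosome and strand.

```python
from typing import List, Tuple


def merge_sites(positions: List[int], max_gap: int = 8) -> List[Tuple[int, int]]:
    """Merge crosslink site positions into closed regions (start, end).

    Two consecutive sites end up in the same region when the distance
    between them is at most `max_gap` nucleotides. The default of 8 is an
    arbitrary illustrative value, not one taken from the thesis.
    """
    regions: List[Tuple[int, int]] = []
    for pos in sorted(set(positions)):
        if regions and pos - regions[-1][1] <= max_gap:
            # Close enough to the previous site: extend the current region.
            regions[-1] = (regions[-1][0], pos)
        else:
            # Too far away (or first site): start a new single-site region.
            regions.append((pos, pos))
    return regions


if __name__ == "__main__":
    # Hypothetical site positions, for illustration only.
    print(merge_sites([100, 103, 110, 250, 251, 400], max_gap=8))
    # -> [(100, 110), (250, 251), (400, 400)]
```

An isolated site yields a region of width one, so single-nucleotide calls are preserved alongside the merged regions.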

Item Type: Thesis (PhD)
Subjects: Mathematical and Computer Sciences > Computer Science
Divisions: Department of Mathematics and Computer Science > Institute of Computer Science > Algorithmic Bioinformatics Group
ID Code: 2848
Deposited By: Anja Kasseckert
Deposited On: 05 Sep 2022 12:24
Last Modified: 05 Sep 2022 12:24
