Adam McDermaid, Xin Chen, Yiran Zhang, Juan Xie, Cankun Wang, Qin Ma. A new machine learning-based framework for mapping uncertainty analysis in RNA-Seq read alignment and gene expression estimation. Frontiers in Genetics. doi: 10.3389/fgene.2018.00313.
Figure 1: GeneQC workflow
Why is GeneQC needed?
Mapping uncertainty, in which a given RNA-seq read can be mapped to multiple genomic locations (Figure 2), is a prominent issue in reference-based RNA-seq analyses for most eukaryotic species. It can bias gene expression estimates, which in turn affects all downstream analyses, from assembly and differential gene expression to regulatory network elucidation, and it significantly affects both animal and plant species (Figure 3). Most RNA-seq studies perform read alignment without much concern for the quality of the alignment results, assuming them to be sufficient. However, our investigation has shown that mapping uncertainty is a widespread problem in alignment results. Just as most RNA-seq pipelines apply some quality control method to raw reads, a quality control process for read alignment must be included to verify the reliability of mapping results.
Figure 2: Mapping Uncertainty
Figure 3: Table of GeneQC
How do we approach this problem?
The first step in addressing mapping uncertainty is to develop a method for determining how severe the problem is for a given dataset. In GeneQC, we extract features from two levels, genomic and transcriptomic. At the genomic level, we look at the sequence similarity between two genomic locations (Figure 4 D1); at the transcriptomic level, we examine the proportion of reads shared between two genes or transcripts after initial alignment (Figure 4 D2). Additionally, we consider a network-level feature: the number of genes connected to the gene of interest (Figure 4 D3).
Figure 4: Network
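To make the transcriptomic and network features described above more concrete, here is a minimal Python sketch that derives a D2-like shared-read proportion and a D3-like connected-gene count from a toy set of read-to-gene assignments. It is an illustration under stated assumptions, not GeneQC's actual implementation: the function name `shared_read_features`, the input format, the choice of normalizing by the read count of the gene of interest, and the use of the maximum over gene pairs are all hypothetical.

```python
from collections import defaultdict


def shared_read_features(read_to_genes):
    """Compute illustrative shared-read features per gene.

    read_to_genes: dict mapping a read ID to the set of gene IDs it aligns to
    (a hypothetical input format chosen for this sketch).
    """
    # Invert the mapping: gene -> set of reads aligned to it.
    gene_reads = defaultdict(set)
    for read_id, genes in read_to_genes.items():
        for gene in genes:
            gene_reads[gene].add(read_id)

    features = {}
    for gene, reads in gene_reads.items():
        shared = {}
        for other, other_reads in gene_reads.items():
            if other == gene:
                continue
            overlap = len(reads & other_reads)
            if overlap:
                # Assumed definition: fraction of this gene's reads
                # that also align to the other gene.
                shared[other] = overlap / len(reads)
        features[gene] = {
            # D2-like value: largest shared-read proportion with any other gene.
            "max_shared_proportion": max(shared.values(), default=0.0),
            # D3-like value: number of genes sharing at least one read.
            "connected_genes": len(shared),
        }
    return features


# Toy data, invented purely for illustration.
toy_alignments = {
    "r1": {"geneA"},
    "r2": {"geneA", "geneB"},
    "r3": {"geneA", "geneB", "geneC"},
    "r4": {"geneB"},
    "r5": {"geneD"},
}
features = shared_read_features(toy_alignments)
```

In this toy example, geneD shares no reads with any other gene and so has no connected genes, while geneC's single read is fully shared with other genes.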
These three levels of information are combined using linear modeling into a single score, referred to as the D-score, which provides a clear measure of the level of mapping uncertainty for each annotated gene in a particular species. Additionally, mixture model distributions are used to categorize the mapping uncertainty, and thus to recommend which gene expression estimates are reliable following the read alignment step.
Figure 5: GeneQC parts
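The idea of combining D1, D2, and D3 into a D-score and then assigning a category can be sketched as follows. This is only a rough illustration: the weights, intercept, cutoffs, and level names below are placeholders invented for the example, and the fixed cutoffs stand in for category boundaries that GeneQC derives from mixture model distributions rather than hard-coding. The function names `d_score` and `uncertainty_level` are likewise hypothetical.

```python
def d_score(d1, d2, d3, weights=(0.4, 0.4, 0.2), intercept=0.0):
    """Linear combination of the three features.

    The weights and intercept are placeholder values, not fitted coefficients.
    d3 is assumed here to be pre-scaled to a range comparable to d1 and d2.
    """
    w1, w2, w3 = weights
    return intercept + w1 * d1 + w2 * d2 + w3 * d3


def uncertainty_level(score, cutoffs=(0.2, 0.5)):
    """Map a score to a category using fixed, hypothetical cutoffs.

    This is a simplification: a mixture-model approach would estimate
    the boundaries from the distribution of scores instead.
    """
    low_cutoff, high_cutoff = cutoffs
    if score < low_cutoff:
        return "low"
    if score < high_cutoff:
        return "moderate"
    return "high"


# Toy feature values, invented for illustration.
score = d_score(0.9, 0.6, 0.5)
level = uncertainty_level(score)
```

Keeping the scoring and the categorization in separate functions mirrors the two steps in the text: the score is a continuous measure, and the category is a later interpretation of it.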
GeneQC outputs a data frame that includes the Gene ID, D1, D2, D3, D-score, mapping uncertainty level, and alternative likelihood value for each annotated gene. This information provides a comprehensive view of the mapping uncertainty for each gene, particularly through the D-score and mapping uncertainty level. Users can then use this information to make well-informed decisions about their gene expression estimates and downstream analyses.
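As one possible way to act on such an output, the sketch below reads a small table and lists the genes in a given mapping uncertainty level, for example to flag them before differential expression analysis. The column names, the comma-separated layout, the level labels, and every value in the table are assumptions made up for this example (the alternative likelihood column is omitted for brevity); the real output format may differ.

```python
import csv
import io

# Toy table with hypothetical column names and invented values.
toy_output = """gene_id,D1,D2,D3,D_score,uncertainty_level
geneA,0.10,0.05,0,0.06,low
geneB,0.85,0.60,3,0.72,high
geneC,0.40,0.30,1,0.33,moderate
"""


def genes_by_level(table_text, level):
    """Return gene IDs whose uncertainty level matches the requested one."""
    reader = csv.DictReader(io.StringIO(table_text))
    return [row["gene_id"] for row in reader if row["uncertainty_level"] == level]


# Genes whose expression estimates might deserve extra scrutiny.
flagged = genes_by_level(toy_output, "high")
```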