Since there are many robust and capable servers for functional enrichment analysis, we feel it is better to use a previously tool/web server than to try to reproduce what is already available.
Examples of these services include the following:
For this example, we will look at how you can extract differentially expressed gene data for IRIS-EDA and how you can use DAVID for functional enrichment.
Once you have run DGE Analysis in the
DGE Analysis tab, simply
download the differentially expressed genes using the
Download Filtered Data button at the bottom of the page on the
prior tab (
This will download a set of significantly differentially expressed genes or IDs that can be opened up in R, Excel, LibreOffice Calc, or any other spreadsheet viewing software.
Once the spreadsheet is open, simply copy the IDs from the first column and open up DAVID Bioinformatics Resources (https://david.ncifcrf.gov/).
Click on the
Gene Functional Classification button on the left-hand
side of the page. This will be under the Shortcut to DAVID Tools.
On this page, click on the
Upload button on the left-hand side of the
page. In Step 1: Enter Gene List, paste your gene IDs into this text
box. If you have pasted this list into a text file, you can upload
For Step 2: Select Identifier, choose the ID source for streamlined
analysis. For example, the small example data set found in the tutorial is
derived from the
FLYBASE_GENE_ID database, therefore, I would select
that from the list. Your data, of course, may differ. If you don't know
the origin of the gene IDs,there is a
Not Sure option at the bottom of
For Step 3: List Type, choose
Gene List and finally, submit the list
Submit List button in Step 4: Submit List.
Once submitted, you can view which genes are enriched, clusters, etc. For more information about downstream analyses, please click here.
GEO (Gene Expression Omnibus) is a public data repository that accepts array- and high throughput sequence-based data. To submit data to GEO, you will need three components:
After submitting your data to the
DGE Analysis tab, you can fill out this
questionnaire which will populate the metadata template file required for GEO
Table 1: A comparative overview of citation counts for DGE tools and DGE servers. DGE analytical tools (Tool) are compared based on the following criteria: Current number of citations (Citations), percentage of total citations from the analytical tools presented (Citation %), year the analytical tool was published (Year), approximate citations per year based on data accrued through 2017 (Citations/Year), and if the analytical tool has an R-based application (R-based).
|Tool||Citations||Citation %||Year||Citations/Year (through 2017)||R-based?|
|edgeR (Robinson et al. 2010)||7093||32.39700375||2010||1013.285714||Yes|
|Cuffdiff (Trapnell et al. 2012)||4533||20.70430255||2012||906.6||No|
|Cuffdiff2 (Trapnell et al. 2013)||1508||6.887731799||2013||377||No|
|DESeq2 (Love et al. 2014)||4249||19.40714351||2014||1416.333333||Yes|
|limma (Ritchie et al. 2015)||2399||10.95733991||2015||1199.5||Yes|
|DEGseq (Wang et al. 2009)||1237||5.649949758||2009||154.625||Yes|
|baySeq (Hardcastle et al. 2010)||562||2.56691331||2010||80.28571429||Yes|
|SAMseq (Li et al. 2013)||275||1.256051886||2013||68.75||Yes|
|NOIseq (Tarazona et al. 2012)||38||0.173563533||2012||7.6||Yes|
IRIS-EDA can be freely accessed directly through this link or through R using the following code in the proceeding sections.
IRIS-EDA requires several packages to operate. To get these pacakages, the following commands can be entereed into an R terminal to check if you already have the necessary packages. If not, the following code will install any missing packages:
# CRAN packages <- c( "crosstalk", "dplyr", "DT", "gtools", "plotly", "shiny", "plyr", "shinyBS", "shinycssloaders", "shinythemes", "tibble", "tidyr", "Rcpp", "Hmisc", "ggplot2", "locfit", "GGally", "pheatmap", "reshape2", "backports", "digest", "fields", "psych", "stringr", "tools", "openxlsx", "Rtsne", "WGCNA", "flashClust", "parallel", "MCL", "kmed", "ape" ) np <- packages[!(packages %in% installed.packages()[, "Package"])] if(length(np)) install.packages(np)
You will also need several Bioconductor packages. Similar to the prior section, the following code will check and install any missing Bioconductor packages into your R library:
# Bioconductor bioc_packages <- c( "DESeq2", "edgeR", "limma", "QUBIC", "geneplotter", "GO.db", "impute", "preprocessCore", "AnnotationDbi" ) np <- bioc_packages[!(bioc_packages %in% installed.packages()[,"Package"])] if (!require("BiocManager")) install.packages("BiocManager") BiocManager::install(np)
To run BRIC analysis, you also need to download the source code for this clustering algorithm. Run this code to get the GitHub package:
# GitHub if (!require("devtools")) install.packages("devtools") devtools::install_github("OSU-BMBL/BRIC",force = T)
Once you have installed all of the necessary packages, you can now run the application locally by entering the following code into an R terminal:
Typically, the link will provide an easier route to using IRIS-EDA. In circumstances where internet connections will be limited (such as during travel), loading IRIS-EDA through R while internet is still available will allow users to utilize IRIS-EDA without an internet connection later on.
Installing the local application via GitHub will also let the user have access to more “developmental” versions of IRIS-EDA. Be warned though; using developmental versions of this application may be “cutting edge”, however, this could potentially break various features in the application. Therefore, using the server (http://bmbl.sdstate.edu/IRIS/) will provide a more “stable”, peer-reviewed release of the application.
Upon sample submission to IRIS-EDA, you have two options: (1) Perform analyses on a normal RNA-seq experiment or (2) run single-cell RNA-seq analysis.
If you choose the latter option, you will need to provide a file containing ID lengths to help reduce variability. This is crucial since filtration of low TPM (transcripts per million) reads can lead to better results. In order to calculate TPM, ID length approximations are needed in conjunction with raw count data. To submit this information to IRIS-EDA, you will need to provide ID data that follows the proceeding criteria:
id id_len_kbp gene001 1.4 gene002 5.6 gene003 0.8 ... ...
Note: This file must be in a comma seperated value (CSV) format, where column 1 is the ID (e.g. gene, transcript, etc.) and column 2 is the approximation of length in kilobase pairs.
There are many ways to obtain these values. A common procedure would be to parse the respective general fearture format version 3 (GFF3) file and determine the length through calculating the difference between the start and end locations.
If you need help parsing this information, a primitive
R function has
been made, which you can find
To run this function, you will need to install and load 3 packages into
your R library:
All of these packages can be found on the CRAN
repository, or by using the
install.packages() function provided in the base package of
This function has four parameters:
gff: the location to the GFF3 file on your computer
cts: your raw count matrix that you will provide to IRIS-EDA.
Note: this object must be a
matrix type and have the same
format that is found in the first section of this walkthrough (A
note about input data).
type: the GFF3 type column identifier (e.g. gene, transcript,
exon, CDS, etc.)
attribute: the GFF3 attribute (e.g.
Depending on the size of this file, this may take some time. This
function will return a
tibble data frame, so make sure you assign it
to an object before you use
write.csv(). After you perform this task,
the file can successfully be submitted to IRIS-EDA.
Once you have submitted the data, you will notice that the
Filter cutoff changes from
count data row sums to
The default is set to a value of
1, however, this can be changed at
the user's discretion.
Note (1): this value corresponds to which rowsums (i.e. ID sums) will be filtered out if they have a value that is less than the user parameter.
Note (2): “For users with 10X scRNA-Seq data or other data type with
expected low expression across all genes, we have a submission option
that changes default parameterizations to account for this. If
submitting 10X scRNA data, please use the
scRNA - 10X Genomics option.
IRIS-EDA requires two pieces of information for analysis. The first is an expression estimation matrix, also referred to as a count matrix, displaying the gene expression estimates for each sample. The format requires a CSV file with the row names to list the gen IDs and column names to list the sample IDs. The second required input is a condition matrix, wherein the factor levels for each sample are provided. This file requires a CSV format and row names to be the sample IDs matching the sample IDs from the expression estimation matrix and the column names to be the condition factors.
The data used for this tutorial are derived from 28 Vitis vinifera
(grape) samples with three distinct factors (Rootstock, row, and block).
This data can viewed as the “big” example data set found under the
Submit and QC tab under
1. Submission Parameters.
Typically, an expression matrix, also known as count data or a count
matrix refers to data where every i-th row and j-th column refer to
how many reads are assigned to gene (ID) i in sample j. For example,
if we have simplified count data for 4 samples and three genes, the
output will look something like this:
sample1 sample2 sample3 sample4 gene001 23 3 45 2 gene002 6 7 7 8 gene003 0 34 3 42
Note1: When loading count data into IRIS-EDA, make sure
that the first column is your gene IDs and that sample names are short,
concise, and avoid the use of mathematical operators (
^, etc.) and spaces between words. If a space is necessary for
legibility, please consider using an underscore (
Note2: For sample names, avoid starting any entries
with numerical elements (e.g.
This may potentially cause unexpected errors in downstream analyses!
Condition matrices, also known as metadata, details the design of your
experiment. In this type of matrix, every i-th row and j-th column
refer to factor levels assigned to sample i and factor j. For
example, if were to look at the samples given in the count
data section, the metadata
R output will look something
condition time sample1 treated 0h sample2 untreated 0h sample3 treated 24h sample4 untreated 24h
Note1: When loading metadata into IRIS-EDA, make sure
that the first column is your sample names and that column names and
treatment levels are short, concise, and avoid the use of mathematical
^, etc.) and spaces between words. If a
space is necessary for legibility, please consider using an underscore
Note2: For this data, avoid starting any entries with numerical elements (e.g. 1, 2, 42, etc.). This may potentially cause unexpected errors in downstream analyses!
Note3: Metadata can be expanded to fit the nature of your experiment (i.e. multiple factors can be added). The only thing that must remain consistent between these two matrices, is the sample information. Column names in count data must be the same as row names in the metadata.
Submit and QCtab near the top-left corner of the application:
1. Submission Paramters, select either
Start with a small example data set,
Start with a big example data set, or
Load my own datato upload user data. Note: User data requires one count matrix and one condition matrix:
2. Data Processing, select a filter cutoff to simplify and expedite computations by eliminating any rows that have below specified expression totals. The default argument for our application is
3. Right below the filter cutoff parameter, select a transformation method for the count data for displaying the expression estimate. Note: For more information about any of these topics, take a look at the FAQ section at the bottom of this document:
Submit to load the data under
3. Launch Overview:
5. After you click
Submit, the main page will now be populated with
several pieces of information in two sub-tabs,
File Summary and
File Summary provides a glimpse into the submitted data. This subtab
includes three main components.
count datasection will detail the first and last five rows of the count data and also include the total number of IDs and samples:
2. The second portion includes an overview of the condition data this is simply an
R console-based output of this submitted CSV file:
3. Finally, pre- and post-filtered gene ID counts gives a “before-and-after” count of the number of IDs that were filtered using the filter cutoff parameter under
2. Data Processing:
## S5.2: Count Summary
Count summary provides three interactive,
downloadable visualizations based on the file for each sample.
Submit and QCtab, you may proceed to the
Discovery-Driven Analyses tab will be populated with several
subtabs, similar to the
Submit and QC tab. The first subtab you will
see is the
Correlation tab. This tab provides correlation analysis of
the expression matrix by sample:
Correlationsubtab you will see several visualizations.
Interactive Correlation Analysisdisplays a heatmap of the correlation between samples with interactivity showing the actual correlation between the two intersecting samples. The example data shows most sample-sample correlations of 0.95 or larger, indicating relatively high correlation. The darker cells here signify less similar samples, which may yield more interesting differential expression results. This graph can indicate comparisons of interest in future analyses. In most cases, the high number of gene IDs with similar or identical expression estimates will cause correlations to be large, even for dissimilar genetic expression profiles:
4. Clicking on a cell will provided a scatterplot of the two intersecting samples with each data point representing one gene ID. A scatterplot with points falling more closely along the diagonal indicates samples with more similar genetic expression profiles. This scatterplot shows a clear trend of data points (gene IDs) falling along or close to the diagonal. That means these two samples have very similar genetic expression profiles. Data points that fall far from the diagonal line represent genes with dissimilar expression levels between the select samples:
6. The next visualization will be under the
PCA subtab. This subtab
provides Principal Component Analysis (PCA) for the expression
8. The next visualization will be under the
MDS subtab. This subtab
provides multi dimensional scaling for the expression estimation matrix.
This is similar to PCA except is develops components through non-linear
transformations of the gene expressions:
11. The next visualization will be under the
Heatmap subtab. This
subtab provides an interactive heatmap with rows representing the gene
IDs and columns representing sample IDs:
Biclusteringperforms a biclustering analysis using one of the selected biclustering tools (3) with a maximum bicluster size of the indicated cutoff value (2):
Clusteringtab allows for users to select one of three clustering methods respectively representing hierarchical, representative, and graph-based clustering: Weighted Gene Co-expression Network Analysis (WGCNA), k-medoids, and the Markov Clustering Algorithm (MCL). While all three methods have demonstrated high performance related to module detection, the default is WGCNA due to it being the best of the three. Here, users should be careful with large datasets, as these methods can take large amounts of time to run. Setting the variable cutoff option lower will reduce the time by selecting fewer of the most differentially expressed genes to use in clustering.