Confused? Consider reading the tutorial under:

More -> Tutorial

#### 3. Launch Overview

After you click "submit", you may proceed to either the "Data-Driven Analysis", "DGE Analysis", or "GEO" tabs for additional analyses.

#### 4. Launch Analysis

Since there are many robust and capable servers for functional enrichment analysis, we feel it is better to use a previously tool/web server than to try to reproduce what is already available.

Examples of these services include the following:

For this example, we will look at how you can extract differentially expressed gene data for IRIS-EDA and how you can use DAVID for functional enrichment.

1. Once you have run DGE Analysis in the DGE Analysis tab, simply download the differentially expressed genes using the Download Filtered Data button at the bottom of the page on the prior tab (Plots tab).

2. This will download a set of significantly differentially expressed genes or IDs that can be opened up in R, Excel, LibreOffice Calc, or any other spreadsheet viewing software.

3. Once the spreadsheet is open, simply copy the IDs from the first column and open up DAVID Bioinformatics Resources (https://david.ncifcrf.gov/).

4. Click on the Gene Functional Classification button on the left-hand side of the page. This will be under the Shortcut to DAVID Tools.

5. On this page, click on the Upload button on the left-hand side of the page. In Step 1: Enter Gene List, paste your gene IDs into this text box. If you have pasted this list into a text file, you can upload that instead.

6. For Step 2: Select Identifier, choose the ID source for streamlined analysis. For example, the small example data set found in the tutorial is derived from the FLYBASE_GENE_ID database, therefore, I would select that from the list. Your data, of course, may differ. If you don't know the origin of the gene IDs,there is a Not Sure option at the bottom of the list.

7. For Step 3: List Type, choose Gene List and finally, submit the list using the Submit List button in Step 4: Submit List.

8. Once submitted, you can view which genes are enriched, clusters, etc. For more information about downstream analyses, please click here.

## GEO Submission

GEO (Gene Expression Omnibus) is a public data repository that accepts array- and high throughput sequence-based data. To submit data to GEO, you will need three components:

• Processed data files
• Raw data files

After submitting your data to the DGE Analysis tab, you can fill out this questionnaire which will populate the metadata template file required for GEO submission.

### Questionnaire

#### 1. Submission Parameters

WARNING: If you have a large number of samples and / or data, this process can take a long time! Please take this into consideration.

In case of server problems, consider running the application locally. More information can be found here:

https://github.com/OSU-BMBL/iris/

#### 2. BRIC Parameters

Parameter definitions:
• -f : filtering overlapping blocks
• -k : minimum column width of block
• -o : number of blocks to report

#### 3. Launch Overview

More information about the source code for this algorithm can be found at the following link:

https://github.com/OSU-BMBL/BRIC

# S1: Citation information

Table 1: A comparative overview of citation counts for DGE tools and DGE servers. DGE analytical tools (Tool) are compared based on the following criteria: Current number of citations (Citations), percentage of total citations from the analytical tools presented (Citation %), year the analytical tool was published (Year), approximate citations per year based on data accrued through 2017 (Citations/Year), and if the analytical tool has an R-based application (R-based).

Tool Citations Citation % Year Citations/Year (through 2017) R-based?
edgeR (Robinson et al. 2010) 7093 32.39700375 2010 1013.285714 Yes
Cuffdiff (Trapnell et al. 2012) 4533 20.70430255 2012 906.6 No
Cuffdiff2 (Trapnell et al. 2013) 1508 6.887731799 2013 377 No
DESeq2 (Love et al. 2014) 4249 19.40714351 2014 1416.333333 Yes
limma (Ritchie et al. 2015) 2399 10.95733991 2015 1199.5 Yes
DEGseq (Wang et al. 2009) 1237 5.649949758 2009 154.625 Yes
baySeq (Hardcastle et al. 2010) 562 2.56691331 2010 80.28571429 Yes
SAMseq (Li et al. 2013) 275 1.256051886 2013 68.75 Yes
NOIseq (Tarazona et al. 2012) 38 0.173563533 2012 7.6 Yes

# S2: Accessibility

IRIS-EDA can be freely accessed directly through this link or through R using the following code in the proceeding sections.

## S2.1: Install CRAN packages

IRIS-EDA requires several packages to operate. To get these pacakages, the following commands can be entereed into an R terminal to check if you already have the necessary packages. If not, the following code will install any missing packages:

# CRAN
packages <- c(
"crosstalk", "dplyr", "DT", "gtools", "plotly", "shiny", "plyr",
"shinyBS", "shinycssloaders", "shinythemes", "tibble", "tidyr",
"Rcpp", "Hmisc", "ggplot2", "locfit", "GGally", "pheatmap",
"reshape2", "backports", "digest", "fields", "psych", "stringr",
"tools", "openxlsx", "Rtsne", "WGCNA", "flashClust", "parallel",
"MCL", "kmed", "ape"
)
np <- packages[!(packages %in% installed.packages()[, "Package"])]
if(length(np)) install.packages(np)


## S2.2: Install Bioconductor and Devlopmental packages

You will also need several Bioconductor packages. Similar to the prior section, the following code will check and install any missing Bioconductor packages into your R library:

# Bioconductor
bioc_packages <- c(
"DESeq2", "edgeR", "limma", "QUBIC", "geneplotter", "GO.db", "impute",
"preprocessCore", "AnnotationDbi"
)
np <- bioc_packages[!(bioc_packages %in% installed.packages()[,"Package"])]
if (!require("BiocManager")) install.packages("BiocManager")
BiocManager::install(np)


To run BRIC analysis, you also need to download the source code for this clustering algorithm. Run this code to get the GitHub package:

# GitHub
if (!require("devtools")) install.packages("devtools")
devtools::install_github("OSU-BMBL/BRIC",force = T)


## S2.3: Run the Shiny application

Once you have installed all of the necessary packages, you can now run the application locally by entering the following code into an R terminal:

shiny::runGitHub("iris", "OSU-BMBL")


Typically, the link will provide an easier route to using IRIS-EDA. In circumstances where internet connections will be limited (such as during travel), loading IRIS-EDA through R while internet is still available will allow users to utilize IRIS-EDA without an internet connection later on.

Installing the local application via GitHub will also let the user have access to more “developmental” versions of IRIS-EDA. Be warned though; using developmental versions of this application may be “cutting edge”, however, this could potentially break various features in the application. Therefore, using the server (http://bmbl.sdstate.edu/IRIS/) will provide a more “stable”, peer-reviewed release of the application.

# S3: Single-cell RNA-seq

## S3.1: Overview

Upon sample submission to IRIS-EDA, you have two options: (1) Perform analyses on a normal RNA-seq experiment or (2) run single-cell RNA-seq analysis.

If you choose the latter option, you will need to provide a file containing ID lengths to help reduce variability. This is crucial since filtration of low TPM (transcripts per million) reads can lead to better results. In order to calculate TPM, ID length approximations are needed in conjunction with raw count data. To submit this information to IRIS-EDA, you will need to provide ID data that follows the proceeding criteria:

id        id_len_kbp
gene001          1.4
gene002          5.6
gene003          0.8
...              ...


Note: This file must be in a comma seperated value (CSV) format, where column 1 is the ID (e.g. gene, transcript, etc.) and column 2 is the approximation of length in kilobase pairs.

## S3.2: A method to obtain this data

There are many ways to obtain these values. A common procedure would be to parse the respective general fearture format version 3 (GFF3) file and determine the length through calculating the difference between the start and end locations.

If you need help parsing this information, a primitive R function has been made, which you can find here. To run this function, you will need to install and load 3 packages into your R library:

• ape
• stringr
• dplyr

All of these packages can be found on the CRAN repository, or by using the install.packages() function provided in the base package of R.

This function has four parameters:

• gff: the location to the GFF3 file on your computer

• cts: your raw count matrix that you will provide to IRIS-EDA. Note: this object must be a matrix type and have the same format that is found in the first section of this walkthrough (A note about input data).

• type: the GFF3 type column identifier (e.g. gene, transcript, exon, CDS, etc.)

• attribute: the GFF3 attribute (e.g. ID, gene_id, transcript_id, gene_name, or transcript_name)

Depending on the size of this file, this may take some time. This function will return a tibble data frame, so make sure you assign it to an object before you use write.csv(). After you perform this task, the file can successfully be submitted to IRIS-EDA.

## S3.3: TPM filtration

Once you have submitted the data, you will notice that the Filter cutoff changes from count data row sums to TPM:

The default is set to a value of 1, however, this can be changed at the user's discretion.

Note (1): this value corresponds to which rowsums (i.e. ID sums) will be filtered out if they have a value that is less than the user parameter.

Note (2): “For users with 10X scRNA-Seq data or other data type with expected low expression across all genes, we have a submission option that changes default parameterizations to account for this. If submitting 10X scRNA data, please use the scRNA - 10X Genomics option.

# S4: A note about input data

IRIS-EDA requires two pieces of information for analysis. The first is an expression estimation matrix, also referred to as a count matrix, displaying the gene expression estimates for each sample. The format requires a CSV file with the row names to list the gen IDs and column names to list the sample IDs. The second required input is a condition matrix, wherein the factor levels for each sample are provided. This file requires a CSV format and row names to be the sample IDs matching the sample IDs from the expression estimation matrix and the column names to be the condition factors.

The data used for this tutorial are derived from 28 Vitis vinifera (grape) samples with three distinct factors (Rootstock, row, and block). This data can viewed as the “big” example data set found under the Submit and QC tab under 1. Submission Parameters.

## S4.1: Expression matrices

Typically, an expression matrix, also known as count data or a count matrix refers to data where every i-th row and j-th column refer to how many reads are assigned to gene (ID) i in sample j. For example, if we have simplified count data for 4 samples and three genes, the R output will look something like this:

        sample1 sample2 sample3 sample4
gene001      23       3      45       2
gene002       6       7       7       8
gene003       0      34       3      42


Note1: When loading count data into IRIS-EDA, make sure that the first column is your gene IDs and that sample names are short, concise, and avoid the use of mathematical operators (+, -, /, *, ^, etc.) and spaces between words. If a space is necessary for legibility, please consider using an underscore (_)

Note2: For sample names, avoid starting any entries with numerical elements (e.g. 1sample, 2_human, 42_amf, etc.). This may potentially cause unexpected errors in downstream analyses!

## S4.2: Condition matrices

Condition matrices, also known as metadata, details the design of your experiment. In this type of matrix, every i-th row and j-th column refer to factor levels assigned to sample i and factor j. For example, if were to look at the samples given in the count data section, the metadata R output will look something like this:

        condition time
sample1   treated   0h
sample2 untreated   0h
sample3   treated  24h
sample4 untreated  24h


Note1: When loading metadata into IRIS-EDA, make sure that the first column is your sample names and that column names and treatment levels are short, concise, and avoid the use of mathematical operators (+, -, /, *, ^, etc.) and spaces between words. If a space is necessary for legibility, please consider using an underscore (_).

Note2: For this data, avoid starting any entries with numerical elements (e.g. 1, 2, 42, etc.). This may potentially cause unexpected errors in downstream analyses!

Note3: Metadata can be expanded to fit the nature of your experiment (i.e. multiple factors can be added). The only thing that must remain consistent between these two matrices, is the sample information. Column names in count data must be the same as row names in the metadata.

# S5: Quality Control

1. Click on the Submit and QC tab near the top-left corner of the application:

1. Under 1. Submission Paramters, select either Start with a small example data set, Start with a big example data set, or Load my own data to upload user data. Note: User data requires one count matrix and one condition matrix:

1. Under 2. Data Processing, select a filter cutoff to simplify and expedite computations by eliminating any rows that have below specified expression totals. The default argument for our application is 10:

3. Right below the filter cutoff parameter, select a transformation method for the count data for displaying the expression estimate. Note: For more information about any of these topics, take a look at the FAQ section at the bottom of this document:

4. Click Submit to load the data under 3. Launch Overview:

5. After you click Submit, the main page will now be populated with several pieces of information in two sub-tabs, File Summary and Count Summary.

## S5.1: File Summary

File Summary provides a glimpse into the submitted data. This subtab includes three main components.

1. The count data section will detail the first and last five rows of the count data and also include the total number of IDs and samples:

2. The second portion includes an overview of the condition data this is simply an R console-based output of this submitted CSV file:

3. Finally, pre- and post-filtered gene ID counts gives a “before-and-after” count of the number of IDs that were filtered using the filter cutoff parameter under 2. Data Processing:

## S5.2: Count Summary Count summary provides three interactive, downloadable visualizations based on the file for each sample.

1. Box-and-Whisker Plot for transformed read counts by sample with interactivity showing the quartiles, maximum, and minimum count. With the example data, it appears that the box-and-whisker plot for each sample is similar to the other samples. If one sample had a plot varying greatly from the others, it would indicate some required investigation into that specific sample in terms of the number of raw reads provided and proportion of reads aligned.

1. Count Data Distribution plot showing the frequency of transformed count data by sample with interactivity displaying the value and frequency. Additionally, double-clicking a sample ID in the legend isolates just that sample's histogram. Additional sample IDs can be select for more specific comparisons also. With the example data, the histograms appear similar for each sample, indicating no significant derivation. Similar to the box-and-whisker plots, a single sample varying greatly from other samples may indicate a required investigation into the raw read counts or read alignment statistics.

1. Total Reads displays a histogram of total reads by sample with interactivity for displaying actual total read counts for each sample. Double-clicking on a sample ID in the legend isolates that sample's read count histogram and allows for selecting of specific adjacent sample IDs for comparison. Total reads counts for individual samples that vary greatly from the other total read counts may indicate some issues with data preparation (sequencing) or read alignment. Here, sample H26313 has a much lower total reads count than the other samples. This may be reflected in further comparative analyses.

# S6: Discovery-Driven Analyses

1. After examining your results on the Submit and QC tab, you may proceed to the Discovery-Driven Analyses tab:

2. The Discovery-Driven Analyses tab will be populated with several subtabs, similar to the Submit and QC tab. The first subtab you will see is the Correlation tab. This tab provides correlation analysis of the expression matrix by sample:

1. Under the Correlation subtab you will see several visualizations. Interactive Correlation Analysis displays a heatmap of the correlation between samples with interactivity showing the actual correlation between the two intersecting samples. The example data shows most sample-sample correlations of 0.95 or larger, indicating relatively high correlation. The darker cells here signify less similar samples, which may yield more interesting differential expression results. This graph can indicate comparisons of interest in future analyses. In most cases, the high number of gene IDs with similar or identical expression estimates will cause correlations to be large, even for dissimilar genetic expression profiles:

4. Clicking on a cell will provided a scatterplot of the two intersecting samples with each data point representing one gene ID. A scatterplot with points falling more closely along the diagonal indicates samples with more similar genetic expression profiles. This scatterplot shows a clear trend of data points (gene IDs) falling along or close to the diagonal. That means these two samples have very similar genetic expression profiles. Data points that fall far from the diagonal line represent genes with dissimilar expression levels between the select samples:

1. The Sample Distance Matrix provides a heatmap of the Euclidean distance between the gene expression vectors for each sample pair. The larger distance (darker red color) indicates samples with the most dissimilar genetic expression profiles. This matrix also includes a clustering of the samples based on the vectorized expression profiles. With the example data, two distinct clusters can be observed through the first branching of the dendrogram. Additionally, as with the correlation heatmap, specific cells with a darker color indicate a more dissimilar pair of samples based on genetic expression:

6. The next visualization will be under the PCA subtab. This subtab provides Principal Component Analysis (PCA) for the expression estimation matrix:

1. This analysis has the option of selecting a factor of interest. With the example data, selecting “Rootstock” as the factor of interest provides a visualization of the first two components for each sample. In this application, PCA is a linear transformation of the gene expression levels, with the first component representing the transformed dimension with the most variability, and each subsequent component decreasing in variability. This analysis has the potential to isolate samples based on expression levels. Here, there does not appear to be any specific rootstock that separates from the others. If there were, it could help develop directions for further analysis. The axis labels indicate the first principal component accounts for 37% of the variance between samples, whereas the second principal component accounts for 7%:

8. The next visualization will be under the MDS subtab. This subtab provides multi dimensional scaling for the expression estimation matrix. This is similar to PCA except is develops components through non-linear transformations of the gene expressions:

1. Looking back at our sample data, we observe similar results to that of the PCA results, with similar potential interpretations if any sample or groups of samples were to differentiate from the others:

1. t-SNE allows for visualization of results in two and three dimensions. The three-dimensional figure is also interactive, allowing users to move the axes to gain a better understanding of the three-dimensional layout of the sample points.

11. The next visualization will be under the Heatmap subtab. This subtab provides an interactive heatmap with rows representing the gene IDs and columns representing sample IDs:

1. This heatmap requires an ID cutoff to select the indicated number of gene IDs with the highest mean expression values for display. Selecting a cell displays a plot showing the total read counts for that specific gene ID by the selected factor. With the example data, the 20 most variable gene IDs are displayed. The yellow color indicates gene IDs with a higher expression level for that sample, and the darker blue color represents a low expression level for that sample. Selecting ID: rna25007 shows the read counts for that ID by rootstock factor. This shows the “OWN” rootstock seems to have a higher expression level for that ID, with the exception of one sample:

1. On the next subtab (1), Biclustering performs a biclustering analysis using one of the selected biclustering tools (3) with a maximum bicluster size of the indicated cutoff value (2):

1. Launching the analysis results in display of the first bicluster. Alternative clusters can be selected from the dropdown menu and the IDs and plot for each bicluster can be downloaded using the button below the visualization. With the sample data, the biclusters can help select the samples under which certain gene IDs are similarly expressed. Since gene expression levels can vary greatly over all samples and conditions, a biclustering approach can isolate similar expression patterns on a level where traditional clustering may miss. The first cluster for the example data shows that samples B20715, D21515, and H12915 are expressed similarly under the isolated gene IDs. Interpretations can be made similarly for each subsequent bicluster:

1. The Clustering tab allows for users to select one of three clustering methods respectively representing hierarchical, representative, and graph-based clustering: Weighted Gene Co-expression Network Analysis (WGCNA), k-medoids, and the Markov Clustering Algorithm (MCL). While all three methods have demonstrated high performance related to module detection, the default is WGCNA due to it being the best of the three. Here, users should be careful with large datasets, as these methods can take large amounts of time to run. Setting the variable cutoff option lower will reduce the time by selecting fewer of the most differentially expressed genes to use in clustering.

1. The WGCNA method generates figures related to sample and gene dendrograms, as well as a topological overlap matrix based on the genes. Additionally, users can download lists of the genes and samples with respective cluster assignments.