# Enrichment Analyses

## Performing Phylostratum and Divergence Stratum Enrichment Analyses

Phylostratum and Divergence Stratum enrichment analyses have been performed by several studies to correlate organ or metabolic pathway evolution with the origin of organ or pathway specific genes (Sestak and Domazet-Loso, 2015).

In detail, Phylostratum and Divergence Stratum enrichment analyses can be performed analogously to Gene Ontology and Kegg enrichment analyses to study the enrichment of evolutionary age or sequence divergence in a set of selected genes against the entire genome/transcriptome. In case specific age categories are significantly over- or underrepresented in the selected gene set, assumptions or potential correlations between the evolutionary origin of a particular organ or metabolic pathway can be implied.

In this vignette we will use the data set published by Sestak and Domazet-Loso, 2015 to demonstrate how to perform enrichment analyses using myTAI.

## Enrichment Analyses using PlotEnrichment()

The PlotEnrichment() function implemented in myTAI computes and visualizes the significance of enriched (over- or underrepresented) Phylostrata or Divergence Strata within an input set of process/tissue specific genes. In detail this function takes the Phylostratum or Divergence Stratum distribution of all genes stored in the input ExpressionSet as background set and the Phylostratum or Divergence Stratum distribution of the specific gene set and performs a Fisher’s exact test for each Phylostratum or Divergence Stratum to quantify the statistical significance of over- or under-represented Phylostrata or Divergence Strata within the set of selected genes. In other words, the frequency distribution of Phylostrata or Divergence Strata in the complete sample is compared with the frequency distribution of Phylostrata or Divergence Strata in the set of selected genes and over- or under-representation is visualized by log-odds (or odds), where a log-odd of zero means that both frequency distributions are equal (see also Sestak and Domazet-Loso, 2015).

### Example Data Set Retrieval

Before using the PlotEnrichment() function, we need to download the example data set from Sestak and Domazet-Loso, 2015.

Download the Phylostratigraphic Map of D. rerio:

# download the Phylostratigraphic Map of Danio rerio
# from Sestak and Domazet-Loso, 2015
destfile = "MBE_2015a_Drerio_PhyloMap.xlsx" )


Read the *.xlsx file storing the Phylostratigraphic Map of D. rerio and format it for the use with myTAI:

# install the readxl package

DrerioPhyloMap.MBEa <- read_excel("MBE_2015a_Drerio_PhyloMap.xlsx", sheet = 1, skip = 4)

# format Phylostratigraphic Map for use with myTAI
Drerio.PhyloMap <- DrerioPhyloMap.MBEa[ , 1:2]

# have a look at the final format
head(Drerio.PhyloMap)
  Phylostrata            ZFIN_ID
1           1 ZDB-GENE-000208-13
2           1 ZDB-GENE-000208-17
3           1 ZDB-GENE-000208-18
4           1 ZDB-GENE-000208-23
5           1  ZDB-GENE-000209-3
6           1  ZDB-GENE-000209-4

Now, Drerio.PhyloMap stores the Phylostratigraphic Map of D. rerio which is used as background set to perform enrichment analyses with PlotEnrichment().

### Enrichment Analyses

Now, the PlotEnrichment() function visualizes the over- and underrepresented Phylostrata of brain specific genes when compared with the total number of genes stored in the Phylostratigraphic Map of D. rerio.

# read expression data (organ specific genes) from Sestak and Domazet-Loso, 2015
Drerio.OrganSpecificExpression <- read_excel("MBE_2015a_Drerio_PhyloMap.xlsx", sheet = 2, skip = 3)

# select only brain specific genes
Drerio.Brain.Genes <- unique(na.omit(Drerio.OrganSpecificExpression[ , "brain"]))

# visualize enriched Phylostrata of genes annotated as brain specific
PlotEnrichment(Drerio.PhyloMap,
test.set     = Drerio.Brain.Genes,
measure      = "log-foldchange",
use.only.map = TRUE,
legendName   = "PS")

Here, the first argument is either a standard ExpressionSet object (in case use.only.map = FALSE: default) or a Phylostratigraphic Map or Divergence Map (in case use.only.map = TRUE; see Introduction for details). The second argument test.set specifies the gene ids also stored in the corresponding ExpressionSet or Phylostratigraphic Map/ Divergence Map for which enrichment shall be quantified and visualized.

To visualize the odds or log-odds of over- or underrepresented genes within the test.set the following procedure is performed:

• $$N_{ij}$$ denotes the number of genes in group j and deriving from PS $$i$$, with $$i = 1, .. , n$$ and where $$j = 1$$ denotes the background set and $$j = 2$$ denotes the test.set

• $$N_{i.}$$ denotes the total number of genes within PS $$i$$

• $$N_{.j}$$ denotes the total number of genes within group $$j$$

• $$N_{..}$$ is the total number of genes within all groups $$j$$ and all PS $$i$$

• $$f_{ij}$$ = $$N_{ij}$$ / $$N_{..}$$ and $$g_{ij}$$ = $$f_{ij}$$ / $$f_{.j}$$ denote relative frequencies between groups

• $$f_{i.}$$ denotes the between group sum of $$f_{ij}$$

The result is the fold-change value (odds; measure = "foldchange") denoted as $$C_2 = g_{i2} / f_{i.}$$ which is visualized above and below zero or the log fold-change value (log-odds; measure = "log-foldchange"), where $$log_2$$(C) = $$log_2$$($$g_{i2}$$) - $$log_2$$($$f_{i.}$$) which is visualized symmetrically above and below zero by PlotEnrichment(). Analogously, $$C_1 = g_{i1} / f_{i.}$$ but is not visualized by this function.

Internally, PlotEnrichment() performs a Fisher’s exact test for each Phylostratum or Divergence Stratum separately, to quantify the significance of over- or under-representation of corresponding Phylostrata or Divergence Strata within the test.set when compared with the entire ExpressionSet. PlotEnrichment() visualizes significantly enriched (over- or underrepresented) Phylostrata or Divergence Strata with asterisks ’*’.

Notation:

• ’*’ = P-Value $$\leq$$ 0.05
• ’**’ = P-Value $$\leq$$ 0.005
• ’***’ = P-Value $$\leq$$ 0.0005

Users will notice that when performing the PlotEnrichment() function, the p-values and the enrichment matrix (storing $$C_1$$ and $$C_2$$) will be returned.

PlotEnrichment(Drerio.PhyloMap,
test.set     = Drerio.Brain.Genes,
measure      = "log-foldchange",
use.only.map = TRUE,
legendName   = "PS")
$p.values PS1 PS2 PS3 PS4 PS5 PS6 8.283490e-01 8.362880e-05 6.778981e-02 1.373816e-02 7.946309e-13 6.017041e-01 PS7 PS8 PS9 PS10 PS11 PS12 2.185021e-03 2.281194e-03 8.943147e-01 5.699612e-01 4.717058e-02 9.367759e-01 PS13 PS14 3.929949e-03 1.593834e-05$enrichment.matrix
BG_Set     Test_Set
PS1  -0.001132832  0.007668216
PS2   0.023733936 -0.172380714
PS3  -0.040879607  0.250587496
PS4  -0.048920465  0.294399729
PS5  -0.114888949  0.603817643
PS6   0.008678915 -0.060350168
PS7  -0.062948352  0.367240944
PS8   0.115630474 -1.206210187
PS9  -0.007353969  0.048964218
PS10 -0.031971192  0.200141519
PS11  0.039742253 -0.303363314
PS12 -0.002418079  0.016311853
PS13  0.101449988 -0.984621732
PS14  0.098211044 -0.938724783

In case users are only interested in the p-values of the Fisher test and the enrichment matrix without illustrating the bar plot, they can specify the plot.bars = FALSE argument to only retrieve the numeric results.

# specify plot.bars = FALSE to retrieve only numeric results
EnrichmentResult <- PlotEnrichment(Drerio.PhyloMap,
test.set     = Drerio.Brain.Genes,
measure      = "log-foldchange",
use.only.map = TRUE,
legendName   = "PS",
plot.bars    = FALSE)

# access p-values quantifying the enrichment for each Phylostratum
EnrichmentResult$p.values  PS1 PS2 PS3 PS4 PS5 PS6 8.283490e-01 8.362880e-05 6.778981e-02 1.373816e-02 7.946309e-13 6.017041e-01 PS7 PS8 PS9 PS10 PS11 PS12 2.185021e-03 2.281194e-03 8.943147e-01 5.699612e-01 4.717058e-02 9.367759e-01 PS13 PS14 3.929949e-03 1.593834e-05 # access enrichment matrix storing C_1 and C_2 EnrichmentResult$enrichment.matrix
           BG_Set     Test_Set
PS1  -0.001132832  0.007668216
PS2   0.023733936 -0.172380714
PS3  -0.040879607  0.250587496
PS4  -0.048920465  0.294399729
PS5  -0.114888949  0.603817643
PS6   0.008678915 -0.060350168
PS7  -0.062948352  0.367240944
PS8   0.115630474 -1.206210187
PS9  -0.007353969  0.048964218
PS10 -0.031971192  0.200141519
PS11  0.039742253 -0.303363314
PS12 -0.002418079  0.016311853
PS13  0.101449988 -0.984621732
PS14  0.098211044 -0.938724783

### Defining the Background Set

The Fisher test which is performed inside PlotEnrichment() assumes that all genes stored in the input ExpressionSet or Phylostratigraphic Map/Divergence Map are used to define the background set for constructing the test statistic. However, since in most cases the test.set is an subset of the input ExpressionSet or Phylostratigraphic Map/Divergence Map one could also specify the complete.bg argument to remove all test.set genes from the background set when performing the Fisher test and visualization.

The following two examples allow users to compare the results when retaining all genes as background set compared with the option to remove test.set genes from the background set.

# complete.bg = TRUE (default) -> retain test.set genes in background set
PlotEnrichment(Drerio.PhyloMap,
test.set     = Drerio.Brain.Genes,
measure      = "log-foldchange",
complete.bg  = TRUE,
use.only.map = TRUE,
legendName   = "PS")
# complete.bg = FALSE -> remove test.set genes from background set
PlotEnrichment(Drerio.PhyloMap,
test.set     = Drerio.Brain.Genes,
measure      = "log-foldchange",
complete.bg  = FALSE,
use.only.map = TRUE,
legendName   = "PS")

Users will notice that although some p-values change, the qualitative result did not change. In border line cases however, the results might influence whether or not some Phylostrata or Divergence Strata are denoted as significantly enriched or not. So always be aware of the interpretation when retaining or removing the test.set from the background set, because both options are valid and have advantages and disadvantages and depend on a valid interpretation.

## Interpretation of Enrichment Results

For the D. rerio brain genes example you can see that PS4, PS5, and PS7 are significantly over-represented in the set of brain specific genes.

PlotEnrichment(Drerio.PhyloMap,
test.set     = Drerio.Brain.Genes,
measure      = "foldchange",
complete.bg  = TRUE,
use.only.map = TRUE,
legendName   = "PS")

Again, we retrieve the D. rerio specific taxonomy represented by PS1-14 using the taxonomy() funtion (see Introduction and Taxonomy for details).

# retrieve the taxonomy of D. rerio
taxonomy(organism = "Danio rerio")
                    id         name    rank
1   cellular organisms      no rank  131567
2            Eukaryota superkingdom    2759
3         Opisthokonta      no rank   33154
4              Metazoa      kingdom   33208
5            Eumetazoa      no rank    6072
6            Bilateria      no rank   33213
7        Deuterostomia      no rank   33511
8             Chordata       phylum    7711
9             Craniata    subphylum   89593
10          Vertebrata      no rank    7742
11       Gnathostomata      no rank    7776
12          Teleostomi      no rank  117570
13        Euteleostomi      no rank  117571
14      Actinopterygii   superclass    7898
15         Actinopteri        class  186623
16         Neopterygii     subclass   41665
17           Teleostei   infraclass   32443
18 Osteoglossocephalai      no rank 1489341
19       Clupeocephala      no rank  186625
20           Otomorpha      no rank  186634
21        Ostariophysi      no rank   32519
22            Otophysa      no rank  186626
23       Cypriniphysae   superorder  186627
24       Cypriniformes        order    7952
25         Cyprinoidea  superfamily   30727
26          Cyprinidae       family    7953
27               Danio        genus    7954
28         Danio rerio      species    7955

Sestak and Domazet-Loso, 2015 collapsed these 28 taxonomic nodes into 14 taxonomic nodes (see Figure 2 in Sestak and Domazet-Loso, 2015) and labelled them as phylostrata 1 to phylostrata 14, where PS1 represents cellular organisms and PS14 represents D. rerio specific genes. Based on the phylostratum categorization of Sestak and Domazet-Loso, 2015, PS4 represents Holozoa (= Metazoa + allies), PS5 represents Metazoa, and PS7 represents Bilateria.

Now, the over-representation results of brain specific genes returned by PlotEnrichment() provide evidence, that brain specific genes might indeed have originated during the emergence of the nervous system at the metazoan-eumetazoan transition leading to the interpretation that the vertebrate brain has a step wise adaptive history where most of its extant organization was already present in the chordate ancestor as argued by Sestak and Domazet-Loso, 2015.

This example shall illustrate how the PlotEnrichment() function can be used to trace the evolutionary origin of tissue or process specific genes by investigating their age enrichment.

In case users have an ExpressionSet storing the Phylostratigraphic Map of D. rerio as well as an expression set, they can furthermore use the PlotGeneSet() function implemented in myTAI to visualize the expression levels of brain specific genes which have been shown to be significantly enriched in Metazoa specific phylostrata.

Example:

# the best parameter setting to visualize this plot:
# png("DrerioBrainSpecificGeneExpression.png",700,400)
PlotGeneSet(ExpressionSet = DrerioPhyloExpressionSet,
gene.set      = Drerio.Brain.Genes,
plot.legend   = FALSE,
type          = "l",
lty           = 1,
lwd           = 4,
xlab          = "Ontogeny",
ylab          = "Expression Level")

# dev.off()

Here DrerioPhyloExpressionSet denotes a hypothetical ExpressionSet of D. rerio development.

Additionally, the SelectGeneSet() function allows users to obtain the ExpresisonSet subset of selected genes (gene.set) for subsequent analyses.

# select the ExpressionSet subset of Brain specific genes
Brain.PhyloExpressionSet <- SelectGeneSet( ExpressionSet = DrerioPhyloExpressionSet,
gene.set      = Drerio.Brain.Genes )

head(Brain.PhyloExpressionSet)

### Adjust P-values for Multiple Comparisons

In case a large number of Phylostrata or Divergence Strata is included in the input ExpressionSet, p-values returned by PlotEnrichment() should be adjusted for multiple comparisons. For this purpose PlotEnrichment() includes the argument p.adjust.method. Here, all methods implemented in ?p.adjust can be specified:

# adjust p-values for multiple comparisons with Benjamini & Hochberg (1995)
PlotEnrichment(Drerio.PhyloMap,
test.set        = Drerio.Brain.Genes,
measure         = "log-foldchange",
complete.bg     = FALSE,
use.only.map    = TRUE,
legendName      = "PS",
p.adjust.method = "BH")

Please consult these reviews (Biostatistics Handbook, Gelman et al., 2008, and Slides) to decide whether or not to apply p-value adjustment to your own dataset.

## References

Sestak MS and Domazet-Loso T. Phylostratigraphic Profiles in Zebrafish Uncover Chordate Origins of the Vertebrate Brain. Mol. Biol. Evol. (2015) 32 (2): 299-312.