Microarray, colon cancer

Author

Aurélien Ginolhac, Eric Koncina

Published

June 30, 2025

Objective

Check on published results concerning the relationship between a gene and a microRNA in colon cancer

Those kind of questions are optional

Microarray analysis

We will be working with GEO datasets, places where submitting microarray experiments is required by most journals. You will experience that despite being around for many years, manipulating those data can be a hassle.

Here, we will investigate the relationship between the expression of ENTPD5 and mir-182 as it was described by Pizzini et al.. Even if the data are already normalised and should be ready to use, you will see that reproducing the claimed results can be tedious.

Retrieve the GEO study

The GEO dataset of interest is GSE35834

Load the study using the file gse35834.rds using read_rds(). Save as gse35834
Tip

RDS are R binary format. Bioconductor objects are not flat files but complex structures (cf scheme below). Thus, we need to use this binary format to load them. The function readr::read_rds() recognises this format.

Of note, we could use GEOquery to fetch directly from NCBI but it is not always available.

What kind of object is gse35834?
Tip

To help figuring out the ExpressionSet object, see the figure below. Mind that for this project, the list GSE contains 2 ExpressionSets!

Figure 1
Two expression platforms were used in this study, which ones according to the main page GSE35834?

Tidying ExpressionSet objects

Before the infusion of the tidyverse in Bioconductor, we needed to perform all the operations described in Figure 1 in order to obtain a tibble where metadata (both about patients and genes) and expressions values are together. Hopefully, this is simplified now thanks to tidySummarizedExperiment and Stefano Mangiola work.

Install the tidySummarizedExperiment, from Bioconductor
How can you extract the mRNA or mir data to each element of gse35834?
Tip

The object is a list, each sub-object is the ExpressionSet we need to access. For example, gse35834[[1]] will sub-select and extract (so the double square brackets) for the mRNA platform

Apply the function makeSummarizedExperimentFromExpressionSet from tidySummarizedExperiment to the RNA part of gse35834. Assign the name tidy_rna
Extension of tibble to Bioconductor objects

You can see now that by calling the object tidy_rna, that the embedded tibble is ~ 1.8M rows but the SummarizedExperiment contains 22,846 features (= genes) and 80 samples. The samples (.sample) were already pivoted to the long format

# A SummarizedExperiment-tibble abstraction: 1,798,880 × 55
# Features=22486 | Samples=80 | Assays=exprs
   .feature     .sample   exprs title geo_accession status submission_date last_update_date type  channel_count source_name_ch1
   <chr>        <chr>     <dbl> <fct> <chr>         <fct>  <fct>           <fct>            <fct> <fct>         <fct>          
 1 10000_at     GSM875933  5.86 exon… GSM875933     Publi… Feb 15 2012     Jan 16 2014      RNA   1             colon cancer, …
 2 10001_at     GSM875933  6.60 exon… GSM875933     Publi… Feb 15 2012     Jan 16 2014      RNA   1             colon cancer, …
 3 10002_at     GSM875933  5.43 exon… GSM875933     Publi… Feb 15 2012     Jan 16 2014  [...]
Apply the function makeSummarizedExperimentFromExpressionSet from tidySummarizedExperiment to the microRNA part of gse35834. Assign the name tidy_mir

Explore and merge both expression meta-data

mRNA and mir meta-data

Extract the mRNA meta-data as a tibble to which you assign the name rna_meta.
Tip

To be consistent, follow those two guidelines:

  • Convert the SummarizedExperiment to a tibble with as_tibble()
  • Retain only the columns .feature, .sample, exprs, source_name_ch1 and
    • characteristics_ch1 rename to gender
    • characteristics_ch1.1 rename to patient
Extract the mir expression meta-data similarly. Save as mir_rna

Filter RNA expression data for the ENTPD5 gene

For the microRNA, the .feature contains the mir names, however for the mRNA, we only have the probeset ids. We need one more step to extract only the expression of the desired gene.

featureData can be accessed, and for example the first line for mRNA:

Biobase::fData(gse35834[[1]]) |> 
  head(1)
               ID ENTREZ_GENE_ID
10000_at 10000_at          10000
                                                                     Description
10000_at v-akt murine thymoma viral oncogene homolog 3 (protein kinase B, gamma)
         SPOT_ID
10000_at   10000

We can see that ID column matches the .feature of rna_meta and allows to associate the ENTREZ gene id.

Find the Entrez gene id for ENTPD5 on NCBI
Join the expression data rna_meta to the platform annotations (fData(gse35834[[1]])). Assign the name rna_expression_feature
Filter rna_expression_feature for the gene of interest (ENTPD5). Assign the name rna_expression_entpd5.
Tip

At this point you should get a tibble of 80 values and 9 columns

Join with the mir expression data

In both tibbles, are the samples identifiers (GSM[...]) identical?

To join tables, we need one or more common key(s). We would like to somehow join both information.
Knowing that both data frames have different sample columns,

Merge both meta-data to get the correspondences between RNA GSM* and mir GSM*. Save the result as rna_entpd5_mir.
Warning

To be clear, you MUST use specify by which columns you join both tables and NOT include the column .sample. Since it exists in both tables and have nothing in common.

Tip

When tables are joined by specific columns, remaining columns that have identical names get default suffixes ‘.x’ and ‘.y’ appended. However, you can make more friendly suffixes that match your actual data using the suffix = c("_rna", "_mir")

Filter for the miR-182 from rna_entpd5_mir. Assign the name rna_entpd5_mir_182

Tip
  • Filter for the mir of interest: hsa-miR-182_st (hsa = Homo sapiens)

We discard the complementary hsa-miR-182-start_st

How many rows do you obtain? How many are expected?
Find the discrepancy, write down the missing GSM sample but proceed anyway.
Tip

anti_join() returns the NON match keys.

Examine gene expression according to some meta data

Plot the gene expression distribution by Gender. Is there any obvious difference?
Tip

boxplots are one option, but we get so much more information using:

    geom_violin(trim = FALSE) +
    ggbeeswarm::geom_quasirandom()
Plot gene AND mir expression distribution by Gender. Is there any obvious difference?
Tip

You will need to tidy by pivoting rna and mir expression to the long format

Plot gene AND mir expression distributions by source (control / cancer). Is there any difference?
Tip

To make it easier, a quick hack is separate(expression, source_name_ch1, c("source", "rest"), sep = 12) get source as control / cancer.

Like it is stated in the summary of the study, the expression of mir-182 seems indeed higher in cancer while the ENTPD5 expression seems lower.

Plot relation ENTPD5 ~ mir-182 as a scatter-plot for all patients
  • Add a linear trend using geom_smooth() for all data + per source. You should remove the confidence intervals.
  • Does it support the claim of the study?