Yeast transcriptomics

Author

Aurelien Ginolhac, Eric Koncina

Published

June 30, 2025

Objective

Summarise a large transcriptomics study using linear regressions. Experience how wrangling real data can take 80% of the analysis effort.

Yeast data

In 2008, Brauer et al. used microarrays to test the effect of starvation on the growth rate of yeast. For example, they limited the yeast’s supply of glucose (a sugar metabolised into energy), leucine (an essential amino acid), or ammonia (a source of nitrogen) and assessed how yeast cells reacted to this stress by adapting the expression of certain genes. Brauer et al. tested several growth rates in their chemostat: the lower the growth rate, the more severe the starvation for a nutrient.

Retrieve the data

Create a `data` sub-folder and download `Brauer2008_DataSet1.tds` into it.

Load the Brauer2008_DataSet1.tds file as a tibble named original_data. This is the exact data that was published with the paper (though for some reason the link on the journal’s page is broken). It thus serves as a good example of tidying a biological dataset “found in the wild”.
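As a hedged sketch of the loading step, assuming the `.tds` file is plain tab-delimited text (the extension alone does not guarantee this), `readr::read_tsv()` returns a tibble. The file written below is a hypothetical stand-in with made-up gene names and values, not the published data.

```r
library(readr)

# Hypothetical stand-in for the real file: a tiny tab-delimited table
# written to a temporary location so the example runs anywhere.
toy_path <- tempfile(fileext = ".tds")
writeLines(
  c(
    "GID\tNAME\tG0.05",
    "GENE_A\tGENEA || process one || function one || YXX001C || 1\t-0.24",
    "GENE_B\t|| process two || function two || YXX002W || 2\t0.28"
  ),
  toy_path
)

# read_tsv() assumes tab-separated fields and returns a tibble
toy_original <- read_tsv(toy_path, show_col_types = FALSE)
toy_original
```

With the real file, the same call would presumably point at `data/Brauer2008_DataSet1.tds` and the result would be assigned to `original_data`.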

Tidying the data

Have a look at the dataset. Is the data “tidy”?

Many variables are stored in one column, `NAME`:

Gene name e.g. SFB2. Note that not all genes have a name.
Biological process e.g. “proteolysis and peptidolysis”
Molecular function e.g. “metalloendopeptidase activity”
Systematic ID e.g. YNL049C. Unlike a gene name, every gene in this dataset has a systematic ID.
Another ID number e.g. 1082129. We don’t know what this number means, and it’s not annotated in the paper. Oh, well, we will discard it eventually.

Use the appropriate `tidyr` function to separate these values and generate a column for each variable. Save as `cleaned_data`

Tip

Preferred names for the new columns are:

“name”
“BP”
“MF”
“systematic_name”
“number”
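One possible way to perform this split is `tidyr::separate()` with a regular expression matching two literal pipes. The sketch below runs on a made-up two-row tibble (the gene names, processes and IDs are hypothetical), so it only illustrates the mechanics.

```r
library(tibble)
library(tidyr)

# Toy data mimicking the layout of the NAME column (values are invented)
toy <- tibble(
  NAME = c(
    "GENEA || process one || function one || YXX001C || 1",
    "|| process two || function two || YXX002W || 2"
  )
)

# "\\|\\|" is a regex for two literal pipe characters
toy_sep <- separate(
  toy, NAME,
  into = c("name", "BP", "MF", "systematic_name", "number"),
  sep = "\\|\\|"
)
toy_sep
```

Note that the pieces keep whatever spaces surrounded the delimiter, which is why a trimming step is usually needed afterwards.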

Once you have separated the variables delimited by “||”, look more closely at the new values: in columns such as systematic_name and BP, the values are surrounded by whitespace, which is inconvenient for subsequent use.

For example, on the data below:

# A tibble: 5,537 x 44
   GID    YORF  name  BP    MF    systematic_name number GWEIGHT G0.05  G0.1 [...]
   <chr>  <chr> <chr> <chr> <chr> <chr>           <chr>    <int> <dbl> <dbl> [...]
 1 GENE1… A_06… "SFB… " ER… " mo… " YNL049C "     " 108…       1 -0.24 -0.13 [...]
 2 GENE4… A_06… ""    " bi… " mo… " YNL095C "     " 108…       1  0.28  0.13 [...]
 3 GENE4… A_06… "QRI… " pr… " me… " YDL104C "     " 108…       1 -0.02 -0.27 [...]
 [...]

The test `systematic_name == "YNL049C"` is FALSE, while `systematic_name == " YNL049C "` is TRUE.

Remove the white spaces that start and end strings in the columns `name` to `number`. Save as `cleaned_ws_data`

Tip

dplyr allows us to apply a function to selected columns using across(). To remove these white-spaces, stringr provides a function called str_trim(). Let’s test how the function works:

stringr::str_trim(" Removing whitespaces at both ends ")

[1] "Removing whitespaces at both ends"
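Combining the two functions from this tip could look like the sketch below, where `across()` applies `str_trim()` to a range of columns inside `mutate()`. The tibble is a made-up miniature with hypothetical values.

```r
library(dplyr)
library(stringr)
library(tibble)

# Toy data with padded strings, as left behind by the separation step
toy <- tibble(
  name            = c("GENEA ", ""),
  systematic_name = c(" YXX001C ", " YXX002W "),
  number          = c(" 1", " 2"),
  G0.05           = c(-0.24, 0.28)
)

# Trim every column from `name` to `number`; G0.05 is left untouched
toy_trimmed <- toy |>
  mutate(across(name:number, str_trim))
toy_trimmed
```

After trimming, an equality test such as `systematic_name == "YXX001C"` behaves as expected.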

We are not going to use every column of the dataframe. Remove the unnecessary columns: `number`, `GID`, `YORF` and `GWEIGHT`. Save as `cleaned_ws_data`
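Dropping columns by name can be sketched with `dplyr::select()` and a minus sign. The tibble below is a hypothetical miniature that only borrows the column names from the exercise.

```r
library(dplyr)
library(tibble)

# Toy data carrying the four columns to discard (values are invented)
toy <- tibble(
  GID             = "GENE_A",
  YORF            = "A_01",
  name            = "GENEA",
  systematic_name = "YXX001C",
  number          = "1",
  GWEIGHT         = 1L,
  G0.05           = -0.24
)

# A leading minus removes the listed columns and keeps everything else
toy_small <- toy |>
  select(-number, -GID, -YORF, -GWEIGHT)
names(toy_small)
```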

Do you think that our dataset is now tidy?

Column names must not contain values: pivot the tibble so that each column represents a variable. Save as `cleaned_data_melt`

Tip

At this point, store the sample names (G0.05 …) in a new column `sample` and the values in a column named `expression`.
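A minimal sketch of such a pivot with `tidyr::pivot_longer()` follows. It uses a made-up wide tibble with four sample columns; the column range in `cols` is an assumption that would need adapting to the real dataset.

```r
library(tibble)
library(tidyr)

# Toy wide data: one row per gene, one column per sample (values invented)
toy_wide <- tibble(
  name            = c("GENEA", ""),
  systematic_name = c("YXX001C", "YXX002W"),
  G0.05           = c(-0.24, 0.28),
  G0.1            = c(-0.13, 0.13),
  N0.05           = c(0.10, -0.05),
  N0.1            = c(0.20, -0.15)
)

# Column names go to `sample`, cell values go to `expression`
toy_long <- pivot_longer(
  toy_wide,
  cols      = G0.05:N0.1,
  names_to  = "sample",
  values_to = "expression"
)
toy_long
```

Each gene now occupies one row per sample, so two genes and four samples give eight rows.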

Observe the `sample` column, print the unique values of this column.

We are again facing the problem that two variables are stored in a single column. The first character is the nutrient (G, N, etc.); the rest is the growth rate.

Use the same function as before to split the `sample` column into two variables `nutrient` and `rate`. Assign the name `cleaned_data_melt_sp`

Tip

Use separate() with the appropriate delimiter in sep. Consider the convert argument: it converts strings to numbers when relevant, as is the case here.
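When `sep` is a number, `separate()` splits at that character position instead of at a pattern. The sketch below applies this to a few made-up sample labels and uses `convert = TRUE` so the rate becomes numeric.

```r
library(tibble)
library(tidyr)

# Toy sample labels: one nutrient letter followed by a growth rate
toy <- tibble(sample = c("G0.05", "G0.1", "N0.05", "U0.3"))

# Distinct labels, as a quick check before splitting
unique(toy$sample)

# sep = 1 splits after the first character;
# convert = TRUE turns "0.05" into the number 0.05
toy_sp <- separate(
  toy, sample,
  into    = c("nutrient", "rate"),
  sep     = 1,
  convert = TRUE
)
toy_sp
```

Without `convert = TRUE`, `rate` would stay a character column and could not be used directly as a continuous axis or regression predictor.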

Turn nutrient letters into more explicit words

Right now, the nutrients are designated by a single letter. It would be nice to have the full word instead. One could nest if_else() calls such as if_else(nutrient == "G", "Glucose", if_else(nutrient == "L", "Leucine", etc ...)), but that would be cumbersome.

Using the following correspondences and `dplyr::recode()`, recode all nutrient names with their full explicit names. Save as `cleaned_data_melt_nut`

Here is the list of the correspondences:

G = "Glucose", L = "Leucine", P = "Phosphate",
S = "Sulfate", N = "Ammonia", U = "Uracil"
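The correspondences above can be passed to `dplyr::recode()` as named arguments, as in this sketch on a made-up vector of nutrient letters. Recent dplyr releases mark `recode()` as superseded in favour of `case_match()`, but it still works.

```r
library(dplyr)
library(tibble)

# Toy column of single-letter nutrient codes
toy <- tibble(nutrient = c("G", "L", "P", "S", "N", "U", "G"))

# Each argument maps an old value (name) to its replacement (value)
toy_nut <- toy |>
  mutate(nutrient = recode(
    nutrient,
    G = "Glucose", L = "Leucine", P = "Phosphate",
    S = "Sulfate", N = "Ammonia", U = "Uracil"
  ))
toy_nut
```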

Cleaning up missing data

Two variables must be present for further analysis:

Gene expression values named as expression
Systematic id (gene ids) named as systematic_name

Delete observations that are missing or empty (`""`) in either of the two mandatory variables. How many rows did you remove? Save as `cleaned_brauer`
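One way to express this filter, and to count the removed rows, is sketched below on a made-up four-row tibble in which only the first row is complete.

```r
library(dplyr)
library(tibble)

# Toy data: row 2 has an empty ID, row 3 a missing expression,
# row 4 a missing ID (all values invented)
toy <- tibble(
  systematic_name = c("YXX001C", "", "YXX003C", NA),
  expression      = c(0.1, 0.2, NA, 0.3)
)

# Keep rows where both variables are present and non-empty
toy_clean <- toy |>
  filter(
    !is.na(expression),
    !is.na(systematic_name),
    systematic_name != ""
  )

# Number of rows dropped by the filter
n_removed <- nrow(toy) - nrow(toy_clean)
n_removed
```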

Representing the data

Tidying the data is a crucial step that makes the data easy to handle and to plot.

Plot the expression data of the LEU1 gene

Plot the data for the gene whose `name` is LEU1 and draw a line for each `nutrient` showing the expression as a function of the growth rate.
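A plot of this kind could be sketched as follows: filter on `name`, then map `nutrient` to colour so that `geom_line()` draws one line per nutrient. The expression values are invented for illustration and `GENEB` is a hypothetical second gene.

```r
library(dplyr)
library(ggplot2)
library(tibble)

# Toy tidy data: two genes, two nutrients, three growth rates (invented)
toy <- tibble(
  name       = rep(c("LEU1", "GENEB"), each = 6),
  nutrient   = rep(rep(c("Glucose", "Leucine"), each = 3), times = 2),
  rate       = rep(c(0.05, 0.1, 0.2), times = 4),
  expression = c(-0.3, -0.1, 0.2, 1.5, 1.1, 0.6,
                 0.1, 0.0, -0.1, 0.2, 0.3, 0.1)
)

# Keep only LEU1, then draw one coloured line per nutrient
leu1 <- filter(toy, name == "LEU1")
p_leu1 <- ggplot(leu1, aes(x = rate, y = expression, colour = nutrient)) +
  geom_line()
p_leu1
```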

Plot the expression data of a biological process

For this, we don’t need to filter by single gene names, as the raw data provides us with some information on the biological process for each gene.

Extract all the genes in the leucine biosynthesis process (column `BP`) and plot the expression as a function of the growth rate for each nutrient.

Perform a linear regression on top of the plots

Let’s play with the graph a little more. These trends look vaguely linear.

Add a linear regression with the appropriate `ggplot2` function and carefully adjust the `method` argument.
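As an illustration, `geom_smooth()` fits a straight line per group when `method = "lm"` is set; its default smoother is not linear, hence the need to set the argument. The data below are made up, and `se = FALSE` is an optional choice that hides the confidence band.

```r
library(ggplot2)
library(tibble)

# Toy tidy data for one gene under two nutrient limitations (invented)
toy <- tibble(
  nutrient   = rep(c("Glucose", "Leucine"), each = 4),
  rate       = rep(c(0.05, 0.1, 0.2, 0.3), times = 2),
  expression = c(-0.4, -0.2, 0.1, 0.3, 1.2, 0.9, 0.4, 0.1)
)

# Points plus one least-squares line per nutrient (colour defines the groups)
p_lm <- ggplot(toy, aes(x = rate, y = expression, colour = nutrient)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)
p_lm
```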

Switch to another biological process

Once the dataset is tidy, it is very easy to switch to another biological process.

Instead of the “leucine biosynthesis”, plot the data corresponding to “sulfur metabolism”.

Tip

You can combine the facet headers using + in facet_wrap(). Adding the systematic name provides a label when the gene name is missing.
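The combined facet header from this tip can be sketched as below, on a made-up tibble in which one gene has an empty `name`. Its panel is still identifiable through the systematic name.

```r
library(ggplot2)
library(tibble)

# Toy data: the second gene has no name, only a systematic ID (invented)
toy <- tibble(
  name            = rep(c("GENEA", ""), each = 3),
  systematic_name = rep(c("YXX001C", "YXX002W"), each = 3),
  rate            = rep(c(0.05, 0.1, 0.2), times = 2),
  expression      = c(0.5, 0.3, 0.1, -0.2, 0.0, 0.4)
)

# One panel per name/systematic_name pair; both appear in the strip label
p_facet <- ggplot(toy, aes(x = rate, y = expression)) +
  geom_point() +
  facet_wrap(~ name + systematic_name)
p_facet
```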

What can you conclude from these graphs? How do the yeast cells react to the lack of a specific nutrient?
