stringr::str_trim(" Removing whitespaces at both ends ")
[1] "Removing whitespaces at both ends"
Aurelien Ginolhac, Eric Koncina
June 30, 2026
Objective: summarise a large transcriptomics study using linear regressions. Experience how wrangling real data can actually take 80% of the analysis job.
In 2008, Brauer et al. used microarrays to test the effect of starvation on the growth rate of yeast. For example, they limited the yeast’s supply of glucose (a sugar metabolized into energy), leucine (an essential amino acid), or ammonia (a source of nitrogen) and assessed how yeast cells reacted to this stress and how they adapted the expression of certain genes. Brauer et al. tested several growth rates in their chemostat: the lower the growth rate, the more severe the starvation for a nutrient.
Download Brauer2008_DataSet1.tds inside a data sub-folder you should create.
Load the Brauer2008_DataSet1.tds file as a tibble named original_data.
This is the exact data that was published with the paper (though for some reason the link on the journal’s page is broken). It thus serves as a good example of tidying a biological dataset “found in the wild”.
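A minimal sketch of the loading step, assuming the .tds file is tab-delimited (the text does not state the delimiter). To stay self-contained it reads a hypothetical one-row stand-in written to a temporary file; the identifiers in that row are made up for illustration.

```r
library(readr)

# Hypothetical one-row stand-in for the real file (values are illustrative only)
demo_path <- tempfile(fileext = ".tds")
writeLines(c(
  "GID\tYORF\tNAME\tGWEIGHT\tG0.05",
  "GENE_A\tY_A\tGENEX || process A || function A || YNL049C || 1082129\t1\t-0.24"
), demo_path)

# Assumption: tab-delimited content, so read_tsv() returns a tibble directly
original_demo <- read_tsv(demo_path, show_col_types = FALSE)

# With the real file this would become (path as described in the text):
# original_data <- read_tsv("data/Brauer2008_DataSet1.tds")
```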
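The NAME column stores several variables in a single string, ending with a number such as 1082129. We don’t know what this number means, and it’s not annotated in the paper. Oh, well, we will discard it eventually.
Use the appropriate tidyr function to separate these values and generate a column for each variable. Save as cleaned_data.
Preferred names for new columns are: name, BP, MF, systematic_name and number.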
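One possible way to split such a column is tidyr::separate() with an escaped "||" delimiter. The sketch below runs on a hypothetical one-row tibble; the gene label, process and function are placeholders, not values from the dataset.

```r
library(tibble)
library(tidyr)

# Hypothetical row mimicking the layout of the NAME column
original_demo <- tibble(
  GID  = "GENE_A",
  NAME = "GENEX || process A || function A || YNL049C || 1082129"
)

# "|" is a regex metacharacter, so both bars must be escaped
cleaned_demo <- separate(
  original_demo, NAME,
  into = c("name", "BP", "MF", "systematic_name", "number"),
  sep = "\\|\\|"
)
```

Note that the white-spaces around each "||" are kept in the new columns.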
Once you have separated the variables delimited by “||”, take a closer look at the new values: you will see that in columns like systematic_name, BP etc. the values are surrounded by white-spaces, which is inconvenient for subsequent use.
For example, on the data below
# A tibble: 5,537 x 44
GID YORF name BP MF systematic_name number GWEIGHT G0.05 G0.1 [...]
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <int> <dbl> <dbl> [...]
1 GENE1… A_06… "SFB… " ER… " mo… " YNL049C " " 108… 1 -0.24 -0.13 [...]
2 GENE4… A_06… "" " bi… " mo… " YNL095C " " 108… 1 0.28 0.13 [...]
3 GENE4… A_06… "QRI… " pr… " me… " YDL104C " " 108… 1 -0.02 -0.27 [...]
[...]
the test systematic_name == "YNL049C" is FALSE while systematic_name == " YNL049C " is TRUE.
Remove the white-spaces in the columns from name to number. Save as cleaned_ws_data.
dplyr allows us to apply a function to selected columns using across(). To remove these white-spaces, stringr provides a function called str_trim(). Let’s test how the function works:
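Combining across() with str_trim() could look like the following sketch. The tibble is a hypothetical stand-in with padded values; only the column names follow the exercise.

```r
library(tibble)
library(dplyr)
library(stringr)

# Hypothetical row with the padding left over from splitting on "||"
cleaned_demo <- tibble(
  GID = "GENE_A",
  name = "GENEX ", BP = " process A ", MF = " function A ",
  systematic_name = " YNL049C ", number = " 1082129"
)

before <- cleaned_demo$systematic_name == "YNL049C"   # FALSE: padding breaks the test

# Trim every column in the range name:number, leave GID untouched
cleaned_ws_demo <- cleaned_demo %>%
  mutate(across(name:number, str_trim))

after <- cleaned_ws_demo$systematic_name == "YNL049C" # TRUE once trimmed
```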
Remove the columns number, GID, YORF and GWEIGHT. Save as cleaned_ws_data.
Reshape the sample columns into long format. Save as cleaned_data_melt.
At this point we are storing the sample name (will contain G0.05 …) as a new column sample and values in a column named expression.
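A sketch of these two steps, assuming the reshaping is done with tidyr::pivot_longer() (the text does not name the function). The row is a hypothetical stand-in; only the column names and the two expression values come from the printed output above.

```r
library(tibble)
library(dplyr)
library(tidyr)

cleaned_ws_demo <- tibble(
  name = "GENEX", systematic_name = "YNL049C",
  number = "1082129", GID = "GENE_A", YORF = "Y_A", GWEIGHT = 1L,
  G0.05 = -0.24, G0.1 = -0.13
)

melt_demo <- cleaned_ws_demo %>%
  # drop the columns that are not needed downstream
  select(-c(number, GID, YORF, GWEIGHT)) %>%
  # every remaining non-annotation column is a sample
  pivot_longer(
    cols = -c(name, systematic_name),
    names_to = "sample",
    values_to = "expression"
  )
```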
Look at the sample column: print the unique values of this column.
We are again facing the problem that two variables are stored in a single column. The nutrient (G, N etc.) is the first character, then the growth rate.
Separate the sample column into two variables, nutrient and rate. Assign the name cleaned_data_melt_sp.
Use separate() and the appropriate delimitation in sep. Consider using the convert argument: it converts strings to numbers when relevant, as is the case here.
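Since the nutrient is always the first character, one option is a positional split with sep = 1. The sample values below are hypothetical examples in the same format as G0.05.

```r
library(tibble)
library(tidyr)

melt_demo <- tibble(
  sample = c("G0.05", "N0.1", "L0.3"),
  expression = c(-0.24, 0.28, -0.02)
)

# sep = 1 splits after the first character; convert = TRUE turns "0.05" into 0.05
melt_sp_demo <- separate(
  melt_demo, sample,
  into = c("nutrient", "rate"),
  sep = 1, convert = TRUE
)
```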
Right now, the nutrients are designated by a single letter. It would be nice to have the full word instead. One could nest if and else calls, such as if_else(nutrient == "G", "Glucose", if_else(nutrient == "L", "Leucine", etc ...)), but that would be cumbersome.
Using dplyr::recode(), recode all nutrient names with their full explicit names. Save as cleaned_data_melt_nut.
Here is the list of the correspondences:
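Only two correspondences are spelled out in the text (G for Glucose, L for Leucine), so the sketch below recodes just those on a hypothetical vector; the remaining letters would be added as further name = "label" pairs.

```r
library(dplyr)

nutrient <- c("G", "L", "G", "N")

# Values without a matching pair (here "N") are returned unchanged
nutrient_full <- recode(nutrient, G = "Glucose", L = "Leucine")
```

Inside a pipeline this would typically sit in a mutate() call, for example mutate(nutrient = recode(nutrient, ...)).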
Two variables must be present for the further analysis:
expression
systematic_name
Remove the rows with a missing value (for instance an empty string "") in any of the two mandatory variables. How many rows did you remove? Save as cleaned_brauer.
Tidying the data is a crucial step allowing easy handling and representing.
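A possible filter, assuming that "missing" means either NA or an empty string (the exact definition is not fully legible in the text). The four rows are hypothetical.

```r
library(tibble)
library(dplyr)

melt_nut_demo <- tibble(
  systematic_name = c("YNL049C", "", NA, "YDL104C"),
  expression = c(-0.24, 0.28, 0.10, NA)
)

# Keep a row only if both mandatory variables carry a usable value
cleaned_brauer_demo <- melt_nut_demo %>%
  filter(
    !is.na(expression),
    !is.na(systematic_name),
    systematic_name != ""
  )

# Number of rows dropped by the filter
rows_removed <- nrow(melt_nut_demo) - nrow(cleaned_brauer_demo)
```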
Filter the gene with the name LEU1 and draw a line for each nutrient showing the expression as a function of the growth rate.
For this, we don’t need to filter by single gene names as the raw data provides us some information on the biological process for each gene.
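The LEU1 plot could be sketched as follows. The expression values are invented for illustration and the colour mapping is one choice among several for getting one line per nutrient.

```r
library(tibble)
library(dplyr)
library(ggplot2)

# Hypothetical tidy data: one gene, two nutrients, three growth rates
brauer_demo <- tibble(
  name = "LEU1",
  nutrient = rep(c("Glucose", "Leucine"), each = 3),
  rate = rep(c(0.05, 0.1, 0.3), times = 2),
  expression = c(-0.2, 0.1, 0.4, 2.1, 1.5, 0.3)
)

leu1_plot <- brauer_demo %>%
  filter(name == "LEU1") %>%
  # mapping colour to nutrient also groups the lines by nutrient
  ggplot(aes(x = rate, y = expression, colour = nutrient)) +
  geom_line()
```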
Filter the genes of a biological process of interest (column BP) and plot the expression as a function of the growth rate for each nutrient.
Let’s play with the graph a little more. These trends look vaguely linear.
Add linear regressions using the appropriate ggplot2 function and carefully adjust the method argument.
Once the dataset is tidy, it is very easy to switch to another biological process.
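One way to overlay a linear fit is geom_smooth() with method = "lm"; this is an assumption about which function the exercise has in mind. The data are hypothetical, and se = FALSE simply hides the confidence ribbon.

```r
library(tibble)
library(ggplot2)

# Hypothetical expression values for one gene under two nutrients
bp_demo <- tibble(
  name = "LEU1",
  nutrient = rep(c("Glucose", "Leucine"), each = 4),
  rate = rep(c(0.05, 0.1, 0.2, 0.3), times = 2),
  expression = c(-0.3, -0.1, 0.2, 0.4, 2.2, 1.6, 0.9, 0.3)
)

lm_plot <- ggplot(bp_demo, aes(x = rate, y = expression, colour = nutrient)) +
  geom_point() +
  # method = "lm" fits a straight line per colour group
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)
```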
You can combine the facet headers using + in facet_wrap(). Adding the systematic name provides a label when the gene name is missing.
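The combined facet header can be sketched like this. The second gene deliberately has an empty name so that only its systematic name labels the panel; all values are hypothetical.

```r
library(tibble)
library(ggplot2)

facet_demo <- tibble(
  name = rep(c("LEU1", ""), each = 3),            # "" mimics a missing gene name
  systematic_name = rep(c("SYS_A", "SYS_B"), each = 3),  # placeholder identifiers
  nutrient = "Glucose",
  rate = rep(c(0.05, 0.1, 0.3), times = 2),
  expression = c(-0.2, 0.1, 0.4, 0.5, 0.2, -0.1)
)

facet_plot <- ggplot(facet_demo, aes(x = rate, y = expression, colour = nutrient)) +
  geom_point() +
  # one panel per gene, with both variables shown in the strip header
  facet_wrap(~ name + systematic_name)
```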