stringr::str_trim(" Removing whitespaces at both ends ")
[1] "Removing whitespaces at both ends"
Aurelien Ginolhac, Eric Koncina
June 30, 2025
Objective: summarise a large transcriptomics study using linear regressions. Experience how wrangling real data can actually take 80% of the analysis job.
In 2008, Brauer et al. used microarrays to test the effect of starvation on the growth rate of yeast. For example, they limited the yeast’s supply of glucose (a sugar metabolized into energy), leucine (an essential amino acid), or ammonia (a source of nitrogen) and assessed how yeast cells reacted to this stress, that is, how they adapted the expression of certain genes. Brauer et al. tested several growth rates in their chemostat: the lower the growth rate, the more severe the starvation for a nutrient.
Place the file Brauer2008_DataSet1.tds inside a data sub-folder you should create.
Load the Brauer2008_DataSet1.tds file as a tibble named original_data. This is the exact data that was published with the paper (though for some reason the link on the journal’s page is broken). It thus serves as a good example of tidying a biological dataset “found in the wild”.
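A minimal sketch of the loading step, assuming the .tds file is tab-delimited (the extension suggests it, but this is an assumption) and that readr is installed. So that the snippet runs without the real download, it first writes a one-row stand-in file with invented values to a temporary folder; with the real data, only the read_tsv() call on data/Brauer2008_DataSet1.tds is needed.

```r
library(readr)

# Stand-in for the real file: invented values, tab-delimited (assumption).
data_dir <- file.path(tempdir(), "data")
dir.create(data_dir, showWarnings = FALSE)
toy_path <- file.path(data_dir, "toy_brauer.tds")
writeLines(
  c("GID\tYORF\tNAME\tGWEIGHT\tG0.05",
    "GENE1X\tA_06\tSFB2 || YNL049C\t1\t-0.24"),
  toy_path
)

# read_tsv() returns a tibble; with the real file, point it at
# "data/Brauer2008_DataSet1.tds" instead of toy_path.
original_data <- read_tsv(toy_path, show_col_types = FALSE)
original_data
```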
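The NAME column stores several variables in a single string, delimited by “||”. The last one is a number such as 1082129. We don’t know what this number means, and it’s not annotated in the paper. Oh, well, we will discard it eventually.
Use the appropriate tidyr function to separate these values and generate a column for each variable. Save as cleaned_data.
Preferred names for new columns are: name, BP, MF, systematic_name and number.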
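One possible way to do the split is tidyr's separate(), sketched here on a single invented row. The column names follow the header of the tibble printed further down; the BP and MF values are placeholders, not real annotations.

```r
library(tibble)
library(tidyr)

# Toy row mimicking the NAME layout (BP and MF values are invented placeholders).
toy <- tibble(
  GID  = "GENE1X",
  NAME = "SFB2 || process a || function a || YNL049C || 1082129"
)

cleaned_data <- separate(
  toy, NAME,
  into = c("name", "BP", "MF", "systematic_name", "number"),
  sep  = "\\|\\|"  # "|" is a regex metacharacter, so each one is escaped
)
cleaned_data
```

Note that the spaces around each "||" are kept in the new values, which is what the next step deals with.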
Once you have separated the variables delimited by “||”, take a closer look at the new values: you will see that in columns like systematic_name, BP etc. the values are surrounded by white-spaces, which can be inconvenient in subsequent steps.
For example, on the data below
# A tibble: 5,537 x 44
GID YORF name BP MF systematic_name number GWEIGHT G0.05 G0.1 [...]
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <int> <dbl> <dbl> [...]
1 GENE1… A_06… "SFB… " ER… " mo… " YNL049C " " 108… 1 -0.24 -0.13 [...]
2 GENE4… A_06… "" " bi… " mo… " YNL095C " " 108… 1 0.28 0.13 [...]
3 GENE4… A_06… "QRI… " pr… " me… " YDL104C " " 108… 1 -0.02 -0.27 [...]
[...]
the test systematic_name == "YNL049C" is FALSE while systematic_name == " YNL049C " is TRUE.
Remove the leading and trailing white-spaces in the columns name to number. Save as cleaned_ws_data.
dplyr allows us to apply a function to selected columns using across(). To remove these white-spaces, stringr provides a function called str_trim(). Let’s test how the function works:
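A minimal sketch of such a test, first on a single string and then on a range of columns with mutate() and across(). The toy tibble and its BP, MF and second number value are invented for illustration.

```r
library(tibble)
library(dplyr)
library(stringr)

# On a single string, str_trim() removes whitespace at both ends.
trimmed <- str_trim(" YNL049C ")
trimmed

# Toy tibble with padded values (invented), columns ordered name ... number.
toy <- tibble(
  name            = c("SFB2 ", ""),
  BP              = c(" process a ", " process b "),
  MF              = c(" function a ", " function b "),
  systematic_name = c(" YNL049C ", " YNL095C "),
  number          = c(" 1082129", " 0000000")
)

# across() applies str_trim() to every column from name to number.
cleaned_ws_data <- toy |>
  mutate(across(name:number, str_trim))

# The equality test now behaves as expected.
cleaned_ws_data$systematic_name == "YNL049C"
```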
Remove the columns we will not use: number, GID, YORF and GWEIGHT. Save as cleaned_ws_data.
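Dropping columns could look like the sketch below, which uses dplyr's select() with a negated selection on an invented one-row tibble. Other approaches (for example listing the columns to keep) would work as well.

```r
library(tibble)
library(dplyr)

# Toy row with the same column layout as the printed tibble (values partly invented).
toy <- tibble(
  GID = "GENE1X", YORF = "A_06", name = "SFB2",
  BP = "process a", MF = "function a",
  systematic_name = "YNL049C", number = "1082129",
  GWEIGHT = 1L, G0.05 = -0.24, G0.1 = -0.13
)

# A minus sign in select() drops the listed columns and keeps the rest.
cleaned_ws_data <- select(toy, -c(number, GID, YORF, GWEIGHT))
names(cleaned_ws_data)
```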
Reshape the sample columns into long format. Save as cleaned_data_melt.
At this point we are storing the sample name (will contain G0.05 …) as a new column sample and the values in a column named expression.
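As a sketch, the reshaping can be done with tidyr's pivot_longer(); this choice of function is an assumption, not something the exercise text names. The toy tibble reuses the first two rows and the two sample columns visible in the printed tibble above.

```r
library(tibble)
library(tidyr)

toy <- tibble(
  name            = c("SFB2", ""),
  systematic_name = c("YNL049C", "YNL095C"),
  G0.05           = c(-0.24, 0.28),
  G0.1            = c(-0.13, 0.13)
)

# Each sample column becomes a (sample, expression) pair of rows.
cleaned_data_melt <- pivot_longer(
  toy,
  cols      = G0.05:G0.1,
  names_to  = "sample",
  values_to = "expression"
)
cleaned_data_melt
```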
To inspect the sample column, print the unique values of this column.
We are again facing the problem that two variables are stored in a single column. The nutrient (G, N etc.) is the first character, followed by the growth rate.
Separate the sample column into two variables, nutrient and rate. Assign the name cleaned_data_melt_sp.
Use separate() and the appropriate delimitation in sep. Consider using the convert argument: it converts strings to numbers when relevant, as is the case here.
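A minimal sketch of this split on invented values: a numeric sep tells separate() to cut at a character position, here after the first character, and convert = TRUE turns the rate strings into numbers.

```r
library(tibble)
library(tidyr)

# Toy long-format data; the N0.3 sample and its value are invented.
toy <- tibble(
  sample     = c("G0.05", "G0.1", "N0.3"),
  expression = c(-0.24, -0.13, 0.10)
)
unique(toy$sample)

cleaned_data_melt_sp <- separate(
  toy, sample,
  into    = c("nutrient", "rate"),
  sep     = 1,     # split after the first character
  convert = TRUE   # "0.05" becomes the number 0.05
)
cleaned_data_melt_sp
```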
Right now, the nutrients are designated by a single letter. It would be nice to have the full word instead. One could chain nested if_else() calls, such as if_else(nutrient == "G", "Glucose", if_else(nutrient == "L", "Leucine", etc ...)), but that would be cumbersome.
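A flatter alternative is a lookup of old and new values. The sketch below uses dplyr::recode() with only the two correspondences spelled out in the if_else() example (G and L); the remaining letters would be added in the same way.

```r
library(dplyr)

nutrient <- c("G", "L", "G", "N")

# Each argument maps an old value (left) to its replacement (right).
# Values without a match, such as "N" here, are left unchanged.
nutrient_full <- recode(nutrient, G = "Glucose", L = "Leucine")
nutrient_full
```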
Using dplyr::recode(), recode all nutrient names with their full explicit names. Save as cleaned_data_melt_nut.
Here is the list of the correspondences:
Two variables must be present for the further analysis: expression and systematic_name.
Remove the rows with a missing value or an empty string ("") in any of the two mandatory variables. How many rows did you remove? Save as cleaned_brauer.
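A minimal sketch of this filtering on invented rows, assuming missing values are coded as NA. Comparing the row counts before and after gives the number of removed rows.

```r
library(tibble)
library(dplyr)

# Toy data: only the first row is complete (values invented).
toy <- tibble(
  systematic_name = c("YNL049C", "", "YDL104C", NA),
  expression      = c(-0.24, 0.28, NA, 0.10)
)

# Keep rows where both mandatory variables are present and non-empty.
cleaned_brauer <- toy |>
  filter(
    !is.na(expression),
    !is.na(systematic_name),
    systematic_name != ""
  )

n_removed <- nrow(toy) - nrow(cleaned_brauer)
n_removed
```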
Tidying the data is a crucial step that makes the data easy to handle and to plot.
Extract the data for the gene name LEU1 and draw a line for each nutrient showing the expression as a function of the growth rate.
We don’t need to filter by single gene names, as the raw data provides some information on the biological process of each gene.
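A sketch of the single-gene plot with ggplot2, on invented expression values. Mapping nutrient to colour is one way to get one line per nutrient.

```r
library(tibble)
library(dplyr)
library(ggplot2)

# Toy tidy data for one gene and two nutrients (expression values invented).
toy <- tibble(
  name       = "LEU1",
  nutrient   = rep(c("Glucose", "Leucine"), each = 3),
  rate       = rep(c(0.05, 0.10, 0.15), times = 2),
  expression = c(-0.2, -0.1, 0.0, 1.5, 1.0, 0.6)
)

p_leu1 <- toy |>
  filter(name == "LEU1") |>
  ggplot(aes(x = rate, y = expression, colour = nutrient)) +
  geom_line()
p_leu1
```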
Extract the genes of one biological process (BP) and plot the expression as a function of the growth rate for each nutrient.
Let’s play with the graph a little more. These trends look vaguely linear.
Add a linear regression line using the appropriate ggplot2 function and carefully adjust the method argument.
Once the dataset is tidy, it is very easy to switch to another biological process.
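One way to combine both steps is sketched below: filter on a BP value, then overlay points with a straight-line fit from geom_smooth(method = "lm"). The BP label "process a", the gene names and all values are invented; switching to another process only means changing the filter.

```r
library(tibble)
library(dplyr)
library(ggplot2)

# Toy tidy data: two genes in the same (hypothetical) biological process.
toy <- tibble(
  name       = rep(c("geneA", "geneB"), each = 6),
  BP         = "process a",
  nutrient   = rep(rep(c("Glucose", "Leucine"), each = 3), times = 2),
  rate       = rep(c(0.05, 0.10, 0.15), times = 4),
  expression = c(-0.2, -0.1, 0.1, 1.5, 1.0, 0.6,
                  0.3,  0.2, 0.0, 2.0, 1.4, 0.9)
)

p_bp <- toy |>
  filter(BP == "process a") |>
  ggplot(aes(x = rate, y = expression, colour = nutrient)) +
  geom_point() +
  # method = "lm" fits a linear model per group instead of the default smoother
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  facet_wrap(~ name)
p_bp
```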
You can combine the facet headers using + in facet_wrap(). Adding the systematic name provides a label when the gene name is missing.
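A minimal sketch of combined facet headers. The two genes come from the tibble printed earlier (one of them has an empty name); the rate and expression values are invented.

```r
library(tibble)
library(ggplot2)

toy <- tibble(
  name            = rep(c("SFB2", ""), each = 3),
  systematic_name = rep(c("YNL049C", "YNL095C"), each = 3),
  nutrient        = "Glucose",
  rate            = rep(c(0.05, 0.10, 0.15), times = 2),
  expression      = c(-0.24, -0.13, 0.0, 0.28, 0.13, 0.05)
)

# Each facet header shows both the gene name and the systematic name,
# so the second panel is still labelled even though its name is "".
p_facets <- ggplot(toy, aes(x = rate, y = expression, colour = nutrient)) +
  geom_line() +
  facet_wrap(~ name + systematic_name)
p_facets
```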