Data transformation

dplyr

Author

Milena Zizovic, Roland Krause

Published

April 29, 2025

Objective

This tutorial allows you to explore dplyr functionality based on the individual lectures. Every question can be answered with a combination of |> pipes

Those questions are optional

Part one - Basic dplyr

Import the judgments data from the website.

Assign to the name judgments

judgments <- read_tsv("https://biostat2.uni.lu/practicals/data/judgments.tsv",
                      show_col_types = FALSE)
judgments
# A tibble: 188 × 158
   start_date end_date  finished condition subject gender   age mood_pre
   <chr>      <chr>        <dbl> <chr>       <dbl> <chr>  <dbl>    <dbl>
 1 11/3/2014  11/3/2014        1 control         2 female    24       81
 2 11/3/2014  11/3/2014        1 stress          1 female    19       59
 3 11/3/2014  11/3/2014        1 stress          3 female    19       22
 4 11/3/2014  11/3/2014        1 stress          4 female    22       53
 5 11/3/2014  11/3/2014        1 control         7 female    22       48
 6 11/3/2014  11/3/2014        1 stress          6 female    22       73
 7 11/3/2014  11/3/2014        1 control         5 female    18       NA
 8 11/3/2014  11/3/2014        1 control         9 male      20      100
 9 11/3/2014  11/3/2014        1 stress         16 female    21       67
10 11/3/2014  11/3/2014        1 stress         13 female    19       30
# ℹ 178 more rows
# ℹ 150 more variables: mood_post <dbl>, STAI_pre_1_1 <dbl>,
#   STAI_pre_1_2 <dbl>, STAI_pre_1_3 <dbl>, STAI_pre_1_4 <dbl>,
#   STAI_pre_1_5 <dbl>, STAI_pre_1_6 <dbl>, STAI_pre_1_7 <dbl>,
#   STAI_pre_2_1 <dbl>, STAI_pre_2_2 <dbl>, STAI_pre_2_3 <dbl>,
#   STAI_pre_2_4 <dbl>, STAI_pre_2_5 <dbl>, STAI_pre_2_6 <dbl>,
#   STAI_pre_2_7 <dbl>, STAI_pre_3_1 <dbl>, STAI_pre_3_2 <dbl>, …
Select all columns that refer to the mood of participants.
  • Do you see any differences whether you retain column names that starts or contains "mood"?
select(judgments, starts_with("mood"))
# A tibble: 188 × 2
   mood_pre mood_post
      <dbl>     <dbl>
 1       81        NA
 2       59        42
 3       22        60
 4       53        68
 5       48        NA
 6       73        73
 7       NA        NA
 8      100        NA
 9       67        74
10       30        68
# ℹ 178 more rows
select(judgments, contains("mood"))
# A tibble: 188 × 2
   mood_pre mood_post
      <dbl>     <dbl>
 1       81        NA
 2       59        42
 3       22        60
 4       53        68
 5       48        NA
 6       73        73
 7       NA        NA
 8      100        NA
 9       67        74
10       30        68
# ℹ 178 more rows
# same selection
How many participants have no missing data in both their mood indicators?
Tip

From {tidyr}, the function drop_na() accepts {tidyselect} helpers

select(judgments, starts_with("mood")) |> 
  drop_na(starts_with("mood"))
# A tibble: 97 × 2
   mood_pre mood_post
      <dbl>     <dbl>
 1       59        42
 2       22        60
 3       53        68
 4       73        73
 5       67        74
 6       30        68
 7       55        57
 8       53        38
 9       79        59
10       70        38
# ℹ 87 more rows
# 97
Retrieve all subjects younger than 20 which are in the stress group.
Tip

The column for the group is named condition.

# we use a comma for an AND operator (& is an alternative)
filter(judgments,
       age < 20, condition == "stress")
# A tibble: 58 × 158
   start_date end_date  finished condition subject gender   age mood_pre
   <chr>      <chr>        <dbl> <chr>       <dbl> <chr>  <dbl>    <dbl>
 1 11/3/2014  11/3/2014        1 stress          1 female    19       59
 2 11/3/2014  11/3/2014        1 stress          3 female    19       22
 3 11/3/2014  11/3/2014        1 stress         13 female    19       30
 4 11/3/2014  11/3/2014        1 stress         18 female    19       55
 5 11/3/2014  11/3/2014        1 stress         14 female    18       53
 6 11/3/2014  11/3/2014        1 stress         17 female    18       79
 7 11/3/2014  11/3/2014        1 stress         15 female    17       70
 8 11/3/2014  11/3/2014        1 stress         25 female    19       60
 9 11/3/2014  11/3/2014        1 stress         22 female    18       13
10 11/3/2014  11/3/2014        1 stress         32 female    19       74
# ℹ 48 more rows
# ℹ 150 more variables: mood_post <dbl>, STAI_pre_1_1 <dbl>,
#   STAI_pre_1_2 <dbl>, STAI_pre_1_3 <dbl>, STAI_pre_1_4 <dbl>,
#   STAI_pre_1_5 <dbl>, STAI_pre_1_6 <dbl>, STAI_pre_1_7 <dbl>,
#   STAI_pre_2_1 <dbl>, STAI_pre_2_2 <dbl>, STAI_pre_2_3 <dbl>,
#   STAI_pre_2_4 <dbl>, STAI_pre_2_5 <dbl>, STAI_pre_2_6 <dbl>,
#   STAI_pre_2_7 <dbl>, STAI_pre_3_1 <dbl>, STAI_pre_3_2 <dbl>, …
# 58 participants
Arrange all observations by STAI_pre so that the subject with the lowest stress level is on top.

What is the subject in question?

arrange(judgments, STAI_pre) |> 
  relocate(STAI_pre, .before = finished)
# A tibble: 188 × 158
   start_date end_date STAI_pre finished condition subject gender   age mood_pre
   <chr>      <chr>       <dbl>    <dbl> <chr>       <dbl> <chr>  <dbl>    <dbl>
 1 11/6/2014  11/6/20…       20        1 stress        142 female    22       96
 2 11/3/2014  11/3/20…       21        1 control         9 male      20      100
 3 11/7/2014  11/7/20…       22        1 stress        162 male      19       77
 4 11/3/2014  11/3/20…       23        1 control        39 male      20       87
 5 11/5/2014  11/5/20…       23        1 control       105 female    21       93
 6 11/7/2014  11/7/20…       23        1 control       176 female    21       75
 7 11/7/2014  11/7/20…       24        1 control       179 male      23       66
 8 11/4/2014  11/4/20…       26        1 control        64 female    18       76
 9 11/6/2014  11/6/20…       26        1 control       143 female    17       82
10 11/5/2014  11/5/20…       27        1 control       106 female    19       83
# ℹ 178 more rows
# ℹ 149 more variables: mood_post <dbl>, STAI_pre_1_1 <dbl>,
#   STAI_pre_1_2 <dbl>, STAI_pre_1_3 <dbl>, STAI_pre_1_4 <dbl>,
#   STAI_pre_1_5 <dbl>, STAI_pre_1_6 <dbl>, STAI_pre_1_7 <dbl>,
#   STAI_pre_2_1 <dbl>, STAI_pre_2_2 <dbl>, STAI_pre_2_3 <dbl>,
#   STAI_pre_2_4 <dbl>, STAI_pre_2_5 <dbl>, STAI_pre_2_6 <dbl>,
#   STAI_pre_2_7 <dbl>, STAI_pre_3_1 <dbl>, STAI_pre_3_2 <dbl>, …
# subject 142
Count how many male/females in the condition column are present?
count(judgments, condition, gender, sort = TRUE)
# A tibble: 4 × 3
  condition gender     n
  <chr>     <chr>  <int>
1 stress    female    82
2 control   female    65
3 control   male      26
4 stress    male      15
Add the columns moral_judgment and moral_judgment_disgust, and sort this new column sum_moral in a descending order?
Tip

You can either relocate() the new column so it is displayed at the first place or at the creation step use the argument .before = 1L to achieve the same result.

judgments |> 
  mutate(sum_moral = moral_judgment + moral_judgment_disgust, .before = 1L) |> 
  arrange(desc(sum_moral))
# A tibble: 188 × 159
   sum_moral start_date end_date  finished condition subject gender   age
       <dbl> <chr>      <chr>        <dbl> <chr>       <dbl> <chr>  <dbl>
 1      17.4 11/6/2014  11/6/2014        1 stress        159 female    18
 2      17.3 11/6/2014  11/6/2014        1 control       132 female    18
 3      17.1 11/3/2014  11/3/2014        1 control        26 male      20
 4      17   11/5/2014  11/5/2014        1 stress        107 male      31
 5      17   11/6/2014  11/6/2014        1 control       128 female    19
 6      17   11/7/2014  11/7/2014        1 stress        171 female    18
 7      16.9 11/5/2014  11/5/2014        1 stress         88 female    18
 8      16.9 11/7/2014  11/7/2014        1 control       172 male      20
 9      16.7 11/3/2014  11/3/2014        1 stress         30 female    20
10      16.7 11/6/2014  11/6/2014        1 control       121 female    21
# ℹ 178 more rows
# ℹ 151 more variables: mood_pre <dbl>, mood_post <dbl>, STAI_pre_1_1 <dbl>,
#   STAI_pre_1_2 <dbl>, STAI_pre_1_3 <dbl>, STAI_pre_1_4 <dbl>,
#   STAI_pre_1_5 <dbl>, STAI_pre_1_6 <dbl>, STAI_pre_1_7 <dbl>,
#   STAI_pre_2_1 <dbl>, STAI_pre_2_2 <dbl>, STAI_pre_2_3 <dbl>,
#   STAI_pre_2_4 <dbl>, STAI_pre_2_5 <dbl>, STAI_pre_2_6 <dbl>,
#   STAI_pre_2_7 <dbl>, STAI_pre_3_1 <dbl>, STAI_pre_3_2 <dbl>, …
Rescale values in the REI group

Divide all entries in the REI questionnaire by 5, the maximal value, so the values will be between 0 and 1.

Tip
  • across() allows modification of multiple columns in one go. See additional slides
  • To control the naming of the created columns, you can bundle the division by 5 inside a list(name = xx) statement where:
    • name is the desired suffix
    • xx the lambda function to divide by 5
  • After the REI normalisation, you can drop the missing values for all those columns using drop_na()
judgments |> 
  mutate(across(contains("REI"), list(norm = \(x) x / 5))) |>
  drop_na(ends_with("_norm")) |> 
  relocate(ends_with("norm"))
# A tibble: 78 × 202
   REI_1_norm REI_2_norm REI_3_norm REI_4_norm REI_5_norm REI_6_norm REI_7_norm
        <dbl>      <dbl>      <dbl>      <dbl>      <dbl>      <dbl>      <dbl>
 1        1          0.8        1          0.8        0.8        1          0.6
 2        0.8        0.8        0.8        0.8        0.8        0.8        0.6
 3        0.8        1          1          1          1          1          1  
 4        0.4        0.4        0.8        0.4        0.6        0.4        0.6
 5        1          0.4        0.6        1          0.8        0.8        1  
 6        0.6        0.8        0.8        1          0.6        0.8        0.6
 7        0.6        0.8        0.8        0.4        0.4        0.8        0.8
 8        0.4        0.4        0.4        0.8        0.8        0.4        0.4
 9        0.8        0.8        0.6        0.8        1          0.4        0.8
10        0.8        0.8        0.8        1          1          0.8        0.6
# ℹ 68 more rows
# ℹ 195 more variables: REI_8_norm <dbl>, REI_9_norm <dbl>, REI_10_norm <dbl>,
#   REI_11_norm <dbl>, REI_12_norm <dbl>, REI_13_norm <dbl>, REI_14_norm <dbl>,
#   REI_15_norm <dbl>, REI_16_norm <dbl>, REI_17_norm <dbl>, REI_18_norm <dbl>,
#   REI_19_norm <dbl>, REI_20_norm <dbl>, REI_21_norm <dbl>, REI_22_norm <dbl>,
#   REI_23_norm <dbl>, REI_24_norm <dbl>, REI_25_norm <dbl>, REI_26_norm <dbl>,
#   REI_27_norm <dbl>, REI_28_norm <dbl>, REI_29_norm <dbl>, …


Part two - Grouping and summarizing

Find the average age of subject per gender, report also from how many participants the mean was computed
summarise(judgments, 
          ave_age = mean(age), 
          n = n(),
          .by = gender)
# A tibble: 2 × 3
  gender ave_age     n
  <chr>    <dbl> <int>
1 female    19.4   147
2 male      20.4    41

Compute the average of moral_judgment per gender and condition. Which gender appears with the higher moral judgment regardless of the condition?

  • How many participants are in the smallest of the 4 groups?
summarise(judgments, 
          ave_moral_judg = mean(moral_judgment), 
          n = n(),
          .by = c(gender, condition))
# A tibble: 4 × 4
  gender condition ave_moral_judg     n
  <chr>  <chr>              <dbl> <int>
1 female control             6.52    65
2 female stress              6.87    82
3 male   control             6.23    26
4 male   stress              6.15    15
# Females

# 15 in stressed males
Sort by groups
  1. Find the number of subjects by age, gender and condition, e.g. how many 20 years of age females are in the stress group.
  2. Sort the resulting tibble such that the condition that contains the most populous group is sorted first (i.e. stress or control appear together).
count(judgments, condition, gender, age, sort = TRUE)
# A tibble: 33 × 4
   condition gender   age     n
   <chr>     <chr>  <dbl> <int>
 1 stress    female    18    27
 2 control   female    18    25
 3 stress    female    19    19
 4 control   female    19    17
 5 stress    female    20    14
 6 stress    female    22     9
 7 control   female    21     7
 8 control   male      18     7
 9 control   male      20     6
10 control   male      22     5
# ℹ 23 more rows
# 14 females of 20 years old stressed


Part three - joins and pivots

Judgements reprise

Using across() we perform the following operations on several columns:

summarise(judgments,
          across(starts_with("moral_dilemma"),
                 list(
                   ave = mean,
                   sd = sd,
                   med = median ,
                   min = min,
                   max = max
                 )),
          .by = condition) -> judgments_condition_stats
judgments_condition_stats
# A tibble: 2 × 36
  condition moral_dilemma_dog_ave moral_dilemma_dog_sd moral_dilemma_dog_med
  <chr>                     <dbl>                <dbl>                 <dbl>
1 control                    7.24                 2.23                     8
2 stress                     7.45                 2.11                     8
# ℹ 32 more variables: moral_dilemma_dog_min <dbl>,
#   moral_dilemma_dog_max <dbl>, moral_dilemma_wallet_ave <dbl>,
#   moral_dilemma_wallet_sd <dbl>, moral_dilemma_wallet_med <dbl>,
#   moral_dilemma_wallet_min <dbl>, moral_dilemma_wallet_max <dbl>,
#   moral_dilemma_plane_ave <dbl>, moral_dilemma_plane_sd <dbl>,
#   moral_dilemma_plane_med <dbl>, moral_dilemma_plane_min <dbl>,
#   moral_dilemma_plane_max <dbl>, moral_dilemma_resume_ave <dbl>, …

In one operation, we computed 5 summary statistics applied to 35 columns.

Pivoting for readability, first convert to the long format

Use the object named judgments_condition_stats from before and convert it to the long format.

After this step, your tibble should contain three columns:

  • The condition
  • The name of the moral dilemma and stats (in one column, such as dog_mean, dog_sd …)
  • The values of the stats by moral dilemma.
Warning

If you don’t clean up the dilemma prefix, the names will be moral_dilemma_dog_mean, moral_dilemma_dog_sd etc… Use the option names_prefix = in pivot_longer() to get rid of the identical prefix moral_dilemma_ for every rows.

The expected tibble should be like

# A tibble: 70 × 3
   condition dilemma    stat_values
   <chr>     <chr>            <dbl>
 1 control   dog_ave           7.24
 2 control   dog_sd            2.23
 3 control   dog_med           8   
 4 control   dog_min           1 
[...]
pivot_longer(judgments_condition_stats,
             cols = -condition, # or starts_with("moral_dilemma"),
             names_to = "dilemma",
             names_prefix = "moral_dilemma_",
             values_to = "stat_values") 
# A tibble: 70 × 3
   condition dilemma    stat_values
   <chr>     <chr>            <dbl>
 1 control   dog_ave           7.24
 2 control   dog_sd            2.23
 3 control   dog_med           8   
 4 control   dog_min           1   
 5 control   dog_max           9   
 6 control   wallet_ave        6.87
 7 control   wallet_sd         2.16
 8 control   wallet_med        8   
 9 control   wallet_min        1   
10 control   wallet_max        9   
# ℹ 60 more rows
Continue from the previous step, to pivot to the wider format again.

The stats of each dilemma will be individual columns, should look like this:

# A tibble: 2 × 36
  condition dog_ave dog_sd dog_med dog_min dog_max wallet_ave  [...]
  <chr>       <dbl>  <dbl>   <dbl>   <dbl>   <dbl>      <dbl>  [...]
1 control      7.24   2.23       8       1       9       6.87  [...]
2 stress       7.45   2.11       8       1       9       7.40  [...]
judgments_condition_stats |> 
  pivot_longer(cols = -condition,
               names_to = "dilemma",
               names_prefix = "moral_dilemma_",
               values_to = "stat_values") |> 
  pivot_wider(names_from = dilemma,
              values_from = stat_values)
# A tibble: 2 × 36
  condition dog_ave dog_sd dog_med dog_min dog_max wallet_ave wallet_sd
  <chr>       <dbl>  <dbl>   <dbl>   <dbl>   <dbl>      <dbl>     <dbl>
1 control      7.24   2.23       8       1       9       6.87      2.16
2 stress       7.45   2.11       8       1       9       7.40      1.86
# ℹ 28 more variables: wallet_med <dbl>, wallet_min <dbl>, wallet_max <dbl>,
#   plane_ave <dbl>, plane_sd <dbl>, plane_med <dbl>, plane_min <dbl>,
#   plane_max <dbl>, resume_ave <dbl>, resume_sd <dbl>, resume_med <dbl>,
#   resume_min <dbl>, resume_max <dbl>, kitten_ave <dbl>, kitten_sd <dbl>,
#   kitten_med <dbl>, kitten_min <dbl>, kitten_max <dbl>, trolley_ave <dbl>,
#   trolley_sd <dbl>, trolley_med <dbl>, trolley_min <dbl>, trolley_max <dbl>,
#   control_ave <dbl>, control_sd <dbl>, control_med <dbl>, …
From judgments_condition_stats, convert to the longer format to the tidy format displayed
# A tibble: 70 × 4
   condition dilemma stats values
   <chr>     <chr>   <chr>  <dbl>
 1 control   dog     ave     7.24
 2 control   dog     sd      2.23
 3 control   dog     med     8   
 4 control   dog     min     1   
 5 control   dog     max     9   
 6 control   wallet  ave     6.87
 7 control   wallet  sd      2.16
 8 control   wallet  med     8   
 9 control   wallet  min     1   
10 control   wallet  max     9   
# ℹ 60 more rows
Tip

pivot_longer() can takes multiple strings in the names_to = with the relevant separator specified.

pivot_longer(judgments_condition_stats,
             cols = -condition,
             names_to = c("dilemma", "stats"),
             names_sep = "_",
             names_prefix = "moral_dilemma_",
             values_to = c("values"))
# A tibble: 70 × 4
   condition dilemma stats values
   <chr>     <chr>   <chr>  <dbl>
 1 control   dog     ave     7.24
 2 control   dog     sd      2.23
 3 control   dog     med     8   
 4 control   dog     min     1   
 5 control   dog     max     9   
 6 control   wallet  ave     6.87
 7 control   wallet  sd      2.16
 8 control   wallet  med     8   
 9 control   wallet  min     1   
10 control   wallet  max     9   
# ℹ 60 more rows

Genetic variants

The tribble (transposed tibble) below contains changes of the sequence of a gene. The format in the input is the expected nucleotide (the reference allele), the position and the variant, commonly called alternative allele.

In T6G, T is the reference allele, 6 is the position (along the gene) and G is the variant allele.

variants <- tribble(
  ~ sampleid, ~ var1, ~ var2, ~ var3,
  "S1",       "A3T",  "T5G",  "T6G",
  "S2",       "A3G",  "T5G",  NA,
  "S3",       "A3T",  "T6C",  "G10C",
  "S4",       "A3T",  "T6C",  "G10C"
)
variants
# A tibble: 4 × 4
  sampleid var1  var2  var3 
  <chr>    <chr> <chr> <chr>
1 S1       A3T   T5G   T6G  
2 S2       A3G   T5G   <NA> 
3 S3       A3T   T6C   G10C 
4 S4       A3T   T6C   G10C 

A second table, contains the annotation of variants:

variant_significance <- tribble(
  ~ variant, ~ significance,
  "A3T",     "unknown",
  "A3G",     "damaging",
  "T5G",     "benign",
  "T6G",     "damaging",
  "T6C",     "benign",
  "G10C",    "unknown"
)
variant_significance
# A tibble: 6 × 2
  variant significance
  <chr>   <chr>       
1 A3T     unknown     
2 A3G     damaging    
3 T5G     benign      
4 T6G     damaging    
5 T6C     benign      
6 G10C    unknown     
Identify the subjects in the table variants that carry variants labeled as damaging in variant_significance table.
Tip
  • You should ask yourself if variants is tidy, and if not, pivot accordingly.
  • Then, it is a join, paying attention that the matching key columns have different names. So specify the right correspondence in the by = join_by() argument.
  • Last you want to extract only the relevant rows
variants |>
  pivot_longer(
    cols = -sampleid,
    names_to = "varcol", 
    values_to = "mutation"
    ) |> 
  inner_join(variant_significance, 
             by = join_by(mutation == variant)) |> 
  filter(significance == "damaging")
# A tibble: 2 × 4
  sampleid varcol mutation significance
  <chr>    <chr>  <chr>    <chr>       
1 S1       var3   T6G      damaging    
2 S2       var1   A3G      damaging