Data transformation

dplyr

Author

Milena Zizovic, Roland Krause

Published

April 18, 2026

NoteObjective

This tutorial allows you to explore dplyr functionality based on the individual lectures. Every question can be answered with a combination of |> pipes

Those questions are optional

Part one - Basic dplyr

Import the judgments data from the website.

Assign to the name judgments

Select all columns that refer to the mood of participants.
  • Do you see any differences whether you retain column names that starts or contains "mood"?
How many participants have no missing data in both their mood indicators?
TipTip

From {tidyr}, the function drop_na() accepts {tidyselect} helpers

Retrieve all subjects younger than 20 which are in the stress group.
TipTip

The column for the group is named condition.

Arrange all observations by STAI_pre so that the subject with the lowest stress level is on top.

What is the subject in question?

Count how many male/females in the condition column are present?
Add the columns moral_judgment and moral_judgment_disgust, and sort this new column sum_moral in a descending order?
TipTip

You can either relocate() the new column so it is displayed at the first place or at the creation step use the argument .before = 1L to achieve the same result.

Rescale values in the REI group

Divide all entries in the REI questionnaire by 5, the maximal value, so the values will be between 0 and 1.

TipTip
  • across() allows modification of multiple columns in one go. See additional slides
  • To control the naming of the created columns, you can bundle the division by 5 inside a list(name = xx) statement where:
    • name is the desired suffix
    • xx the lambda function to divide by 5
  • After the REI normalisation, you can drop the missing values for all those columns using drop_na()


Part two - Grouping and summarizing

Find the average age of subject per gender, report also from how many participants the mean was computed

Compute the average of moral_judgment per gender and condition. Which gender appears with the higher moral judgment regardless of the condition?

  • How many participants are in the smallest of the 4 groups?
Sort by groups
  1. Find the number of subjects by age, gender and condition, e.g. how many 20 years of age females are in the stress group.
  2. Sort the resulting tibble such that the condition that contains the most populous group is sorted first (i.e. stress or control appear together).


Part three - joins and pivots

Judgements reprise

Using across() we perform the following operations on several columns:

summarise(judgments,
          across(starts_with("moral_dilemma"),
                 list(
                   ave = mean,
                   sd = sd,
                   med = median ,
                   min = min,
                   max = max
                 )),
          .by = condition) -> judgments_condition_stats
judgments_condition_stats
# A tibble: 2 × 36
  condition moral_dilemma_dog_ave moral_dilemma_dog_sd moral_dilemma_dog_med
  <chr>                     <dbl>                <dbl>                 <dbl>
1 control                    7.24                 2.23                     8
2 stress                     7.45                 2.11                     8
# ℹ 32 more variables: moral_dilemma_dog_min <dbl>,
#   moral_dilemma_dog_max <dbl>, moral_dilemma_wallet_ave <dbl>,
#   moral_dilemma_wallet_sd <dbl>, moral_dilemma_wallet_med <dbl>,
#   moral_dilemma_wallet_min <dbl>, moral_dilemma_wallet_max <dbl>,
#   moral_dilemma_plane_ave <dbl>, moral_dilemma_plane_sd <dbl>,
#   moral_dilemma_plane_med <dbl>, moral_dilemma_plane_min <dbl>,
#   moral_dilemma_plane_max <dbl>, moral_dilemma_resume_ave <dbl>, …

In one operation, we computed 5 summary statistics applied to 35 columns.

Pivoting for readability, first convert to the long format

Use the object named judgments_condition_stats from before and convert it to the long format.

After this step, your tibble should contain three columns:

  • The condition
  • The name of the moral dilemma and stats (in one column, such as dog_mean, dog_sd …)
  • The values of the stats by moral dilemma.
ImportantWarning

If you don’t clean up the dilemma prefix, the names will be moral_dilemma_dog_mean, moral_dilemma_dog_sd etc… Use the option names_prefix = in pivot_longer() to get rid of the identical prefix moral_dilemma_ for every rows.

The expected tibble should be like

# A tibble: 70 × 3
   condition dilemma    stat_values
   <chr>     <chr>            <dbl>
 1 control   dog_ave           7.24
 2 control   dog_sd            2.23
 3 control   dog_med           8   
 4 control   dog_min           1 
[...]
Continue from the previous step, to pivot to the wider format again.

The stats of each dilemma will be individual columns, should look like this:

# A tibble: 2 × 36
  condition dog_ave dog_sd dog_med dog_min dog_max wallet_ave  [...]
  <chr>       <dbl>  <dbl>   <dbl>   <dbl>   <dbl>      <dbl>  [...]
1 control      7.24   2.23       8       1       9       6.87  [...]
2 stress       7.45   2.11       8       1       9       7.40  [...]
From judgments_condition_stats, convert to the longer format to the tidy format displayed
# A tibble: 70 × 4
   condition dilemma stats values
   <chr>     <chr>   <chr>  <dbl>
 1 control   dog     ave     7.24
 2 control   dog     sd      2.23
 3 control   dog     med     8   
 4 control   dog     min     1   
 5 control   dog     max     9   
 6 control   wallet  ave     6.87
 7 control   wallet  sd      2.16
 8 control   wallet  med     8   
 9 control   wallet  min     1   
10 control   wallet  max     9   
# ℹ 60 more rows
TipTip

pivot_longer() can takes multiple strings in the names_to = with the relevant separator specified.

Genetic variants

The tribble (transposed tibble) below contains changes of the sequence of a gene. The format in the input is the expected nucleotide (the reference allele), the position and the variant, commonly called alternative allele.

In T6G, T is the reference allele, 6 is the position (along the gene) and G is the variant allele.

variants <- tribble(
  ~ sampleid, ~ var1, ~ var2, ~ var3,
  "S1",       "A3T",  "T5G",  "T6G",
  "S2",       "A3G",  "T5G",  NA,
  "S3",       "A3T",  "T6C",  "G10C",
  "S4",       "A3T",  "T6C",  "G10C"
)
variants
# A tibble: 4 × 4
  sampleid var1  var2  var3 
  <chr>    <chr> <chr> <chr>
1 S1       A3T   T5G   T6G  
2 S2       A3G   T5G   <NA> 
3 S3       A3T   T6C   G10C 
4 S4       A3T   T6C   G10C 

A second table, contains the annotation of variants:

variant_significance <- tribble(
  ~ variant, ~ significance,
  "A3T",     "unknown",
  "A3G",     "damaging",
  "T5G",     "benign",
  "T6G",     "damaging",
  "T6C",     "benign",
  "G10C",    "unknown"
)
variant_significance
# A tibble: 6 × 2
  variant significance
  <chr>   <chr>       
1 A3T     unknown     
2 A3G     damaging    
3 T5G     benign      
4 T6G     damaging    
5 T6C     benign      
6 G10C    unknown     
Identify the subjects in the table variants that carry variants labeled as damaging in variant_significance table.
TipTip
  • You should ask yourself if variants is tidy, and if not, pivot accordingly.
  • Then, it is a join, paying attention that the matching key columns have different names. So specify the right correspondence in the by = join_by() argument.
  • Last you want to extract only the relevant rows