Data wrangling

dplyr

Aurélien Ginolhac

University of Luxembourg

Friday, the 25^th of April, 2025

Learning objectives

You will learn to:

Data munging

Preparing data is the most time consuming part of data analysis.
Individual steps might look easy.
Essential part of understanding the data you’re working with.
Additional data preparation before modeling is impossible to avoid.

dplyr is a tool box for working with data in tibbles, offering a unified language for operations scattered through base .

Key operations

Data wrangling learning

Organisation

Split into 3 slide decks (1, 2, 3)

Learn the grammar to operate on rows and columns of a table
Selection and manipulation of (1)
- Observations,
- Variables and
- Values
Grouping and summarizing (2)
Joining and intersecting tibbles (3)

`dplyr` introduction: Cheatsheet

Companion real data for this course

Description

Van’t Veer & Sleegers. (2019) Psychology data from an exploration of the effect of anticipatory stress on disgust vs. non-disgust related moral judgments. J. of Open Psychology Data.

Data is (largely) tidy.
Typical data you might see in the wild.
Moral dilemma (trolley problem, survival after plane crash, etc. moral)
Standard questionnaires
- Private Body Consciousness (PBC, range 0 - 4, as the awareness of internal sensations)
- Rational-Experiential Inventory (REI, range 1 - 5, assesses tendencies to engage in rational and experiential information processing)
- Multidimensional Assessment of Interoceptive Awareness (MAIA, range 0 - 5, feeling the feels)
- State Trait Anxiety Inventory (STAI, feelings such as apprehension, tension, nervousness, and worry)

Your turn!

Load the data into your RStudio session if you wish to follow along

02:00

judgments <- read_tsv("https://biostat2.uni.lu/practicals/data/judgments.tsv", show_col_types = FALSE)
judgments

# A tibble: 188 × 158
   start_date end_date  finished condition subject gender   age mood_pre
   <chr>      <chr>        <dbl> <chr>       <dbl> <chr>  <dbl>    <dbl>
 1 11/3/2014  11/3/2014        1 control         2 female    24       81
 2 11/3/2014  11/3/2014        1 stress          1 female    19       59
 3 11/3/2014  11/3/2014        1 stress          3 female    19       22
 4 11/3/2014  11/3/2014        1 stress          4 female    22       53
 5 11/3/2014  11/3/2014        1 control         7 female    22       48
 6 11/3/2014  11/3/2014        1 stress          6 female    22       73
 7 11/3/2014  11/3/2014        1 control         5 female    18       NA
 8 11/3/2014  11/3/2014        1 control         9 male      20      100
 9 11/3/2014  11/3/2014        1 stress         16 female    21       67
10 11/3/2014  11/3/2014        1 stress         13 female    19       30
# ℹ 178 more rows
# ℹ 150 more variables: mood_post <dbl>, STAI_pre_1_1 <dbl>,
#   STAI_pre_1_2 <dbl>, STAI_pre_1_3 <dbl>, STAI_pre_1_4 <dbl>,
#   STAI_pre_1_5 <dbl>, STAI_pre_1_6 <dbl>, STAI_pre_1_7 <dbl>,
#   STAI_pre_2_1 <dbl>, STAI_pre_2_2 <dbl>, STAI_pre_2_3 <dbl>,
#   STAI_pre_2_4 <dbl>, STAI_pre_2_5 <dbl>, STAI_pre_2_6 <dbl>,
#   STAI_pre_2_7 <dbl>, STAI_pre_3_1 <dbl>, STAI_pre_3_2 <dbl>, …

Selecting columns

The columns you want: select(tibble, col1, ...)

select(judgments, gender, age, condition)

# A tibble: 188 × 3
   gender   age condition
   <chr>  <dbl> <chr>    
 1 female    24 control  
 2 female    19 stress   
 3 female    19 stress   
 4 female    22 stress   
 5 female    22 control  
 6 female    22 stress   
 7 female    18 control  
 8 male      20 control  
 9 female    21 stress   
10 female    19 stress   
# ℹ 178 more rows

Also in the order chosen

Warning

Other packages provide a select() function. If loaded, add dplyr::

dplyr::select(judgments, age, gender)

# A tibble: 188 × 2
     age gender
   <dbl> <chr> 
 1    24 female
 2    19 female
 3    19 female
 4    22 female
 5    22 female
 6    22 female
 7    18 female
 8    20 male  
 9    21 female
10    19 female
# ℹ 178 more rows

Rename when selecting

select(judgments, age, sex = gender)

# A tibble: 188 × 2
     age sex   
   <dbl> <chr> 
 1    24 female
 2    19 female
 3    19 female
 4    22 female
 5    22 female
 6    22 female
 7    18 female
 8    20 male  
 9    21 female
10    19 female
# ℹ 178 more rows

`tidyselect`

Helper functions

To select columns with names that:

contains() - a string
starts_with() - a string
ends_with() - a string
one_of() - names in a character vector
matches() - using regular expressions
everything() - all remaining columns
last_col() - last column

Danger

Avoid selecting columns by index!

To ensure reproducibility select columns by bare names

select(judgments, starts_with("moral"))

# A tibble: 188 × 10
   moral_dilemma_dog moral_dilemma_wallet moral_dilemma_plane
               <dbl>                <dbl>               <dbl>
 1                 9                    9                   8
 2                 9                    9                   9
 3                 8                    7                   8
 4                 8                    4                   8
 5                 3                    9                   9
 6                 9                    9                   9
 7                 9                    5                   7
 8                 9                    4                   1
 9                 6                    9                   3
10                 6                    8                   9
# ℹ 178 more rows
# ℹ 7 more variables: moral_dilemma_resume <dbl>, moral_dilemma_kitten <dbl>,
#   moral_dilemma_trolley <dbl>, moral_dilemma_control <dbl>,
#   moral_judgment <dbl>, moral_judgment_disgust <dbl>,
#   moral_judgment_non_disgust <dbl>

Combining helpers

Remark

Found in several functions (x)
- read_delim(... col_select = x)
- mutate(..., .by = x)
- summarise(..., .by = x)
- pivot_longer(..., .cols = x)
- nest(..., .by = x)
- across()
Helpers are evaluated from left to right, it matters for negative selection!

select(judgments, ends_with("date"), contains("dilemma")) |> 
  slice_head(n = 3)

# A tibble: 3 × 9
  start_date end_date moral_dilemma_dog moral_dilemma_wallet moral_dilemma_plane
  <chr>      <chr>                <dbl>                <dbl>               <dbl>
1 11/3/2014  11/3/20…                 9                    9                   8
2 11/3/2014  11/3/20…                 9                    9                   9
3 11/3/2014  11/3/20…                 8                    7                   8
# ℹ 4 more variables: moral_dilemma_resume <dbl>, moral_dilemma_kitten <dbl>,
#   moral_dilemma_trolley <dbl>, moral_dilemma_control <dbl>

select(judgments, -ends_with("date"), starts_with("start")) |> 
  slice_head(n = 3)

# A tibble: 3 × 157
  finished condition subject gender   age mood_pre mood_post STAI_pre_1_1
     <dbl> <chr>       <dbl> <chr>  <dbl>    <dbl>     <dbl>        <dbl>
1        1 control         2 female    24       81        NA            2
2        1 stress          1 female    19       59        42            3
3        1 stress          3 female    19       22        60            4
# ℹ 149 more variables: STAI_pre_1_2 <dbl>, STAI_pre_1_3 <dbl>,
#   STAI_pre_1_4 <dbl>, STAI_pre_1_5 <dbl>, STAI_pre_1_6 <dbl>,
#   STAI_pre_1_7 <dbl>, STAI_pre_2_1 <dbl>, STAI_pre_2_2 <dbl>,
#   STAI_pre_2_3 <dbl>, STAI_pre_2_4 <dbl>, STAI_pre_2_5 <dbl>,
#   STAI_pre_2_6 <dbl>, STAI_pre_2_7 <dbl>, STAI_pre_3_1 <dbl>,
#   STAI_pre_3_2 <dbl>, STAI_pre_3_3 <dbl>, STAI_pre_3_4 <dbl>,
#   STAI_pre_3_5 <dbl>, STAI_pre_3_6 <dbl>, STAI_post_1_1 <dbl>, …

Column start_date is gone with the first helper

Filtering for rows: `filter()`

Let’s take a look at all the data that were excluded.

filter(judgments, exclude == 1)

# A tibble: 3 × 158
  start_date end_date finished condition subject gender   age mood_pre mood_post
  <chr>      <chr>       <dbl> <chr>       <dbl> <chr>  <dbl>    <dbl>     <dbl>
1 11/3/2014  11/3/20…        1 stress         28 male      22       53        68
2 11/3/2014  11/3/20…        1 stress         32 female    19       74        77
3 11/7/2014  11/7/20…        1 stress        181 male      22       47        65
# ℹ 149 more variables: STAI_pre_1_1 <dbl>, STAI_pre_1_2 <dbl>,
#   STAI_pre_1_3 <dbl>, STAI_pre_1_4 <dbl>, STAI_pre_1_5 <dbl>,
#   STAI_pre_1_6 <dbl>, STAI_pre_1_7 <dbl>, STAI_pre_2_1 <dbl>,
#   STAI_pre_2_2 <dbl>, STAI_pre_2_3 <dbl>, STAI_pre_2_4 <dbl>,
#   STAI_pre_2_5 <dbl>, STAI_pre_2_6 <dbl>, STAI_pre_2_7 <dbl>,
#   STAI_pre_3_1 <dbl>, STAI_pre_3_2 <dbl>, STAI_pre_3_3 <dbl>,
#   STAI_pre_3_4 <dbl>, STAI_pre_3_5 <dbl>, STAI_pre_3_6 <dbl>, …

Test equality sign is ==
= is an assignment.

Filtering rows

Multiple conditions: AND

comma separated conditions are equivalent to & (AND).
Filter for females older than 20.

filter(judgments,
  age > 20,
  gender == "female")

# A tibble: 34 × 158
   start_date end_date  finished condition subject gender   age mood_pre
   <chr>      <chr>        <dbl> <chr>       <dbl> <chr>  <dbl>    <dbl>
 1 11/3/2014  11/3/2014        1 control         2 female    24       81
 2 11/3/2014  11/3/2014        1 stress          4 female    22       53
 3 11/3/2014  11/3/2014        1 control         7 female    22       48
 4 11/3/2014  11/3/2014        1 stress          6 female    22       73
 5 11/3/2014  11/3/2014        1 stress         16 female    21       67
 6 11/3/2014  11/3/2014        1 control        10 female    21       72
 7 11/3/2014  11/3/2014        1 control        23 female    23       78
 8 11/3/2014  11/3/2014        1 control        29 female    22       65
 9 11/3/2014  11/3/2014        1 stress         36 female    21       32
10 11/3/2014  11/3/2014        1 control        31 female    23       69
# ℹ 24 more rows
# ℹ 150 more variables: mood_post <dbl>, STAI_pre_1_1 <dbl>,
#   STAI_pre_1_2 <dbl>, STAI_pre_1_3 <dbl>, STAI_pre_1_4 <dbl>,
#   STAI_pre_1_5 <dbl>, STAI_pre_1_6 <dbl>, STAI_pre_1_7 <dbl>,
#   STAI_pre_2_1 <dbl>, STAI_pre_2_2 <dbl>, STAI_pre_2_3 <dbl>,
#   STAI_pre_2_4 <dbl>, STAI_pre_2_5 <dbl>, STAI_pre_2_6 <dbl>,
#   STAI_pre_2_7 <dbl>, STAI_pre_3_1 <dbl>, STAI_pre_3_2 <dbl>, …

Multiple conditions: OR

vertical bar (|) separated conditions are combined with OR.
Filter females or age > 20 (so males too)

filter(judgments,
  age > 20 |
  gender == "female")

# A tibble: 164 × 158
   start_date end_date  finished condition subject gender   age mood_pre
   <chr>      <chr>        <dbl> <chr>       <dbl> <chr>  <dbl>    <dbl>
 1 11/3/2014  11/3/2014        1 control         2 female    24       81
 2 11/3/2014  11/3/2014        1 stress          1 female    19       59
 3 11/3/2014  11/3/2014        1 stress          3 female    19       22
 4 11/3/2014  11/3/2014        1 stress          4 female    22       53
 5 11/3/2014  11/3/2014        1 control         7 female    22       48
 6 11/3/2014  11/3/2014        1 stress          6 female    22       73
 7 11/3/2014  11/3/2014        1 control         5 female    18       NA
 8 11/3/2014  11/3/2014        1 stress         16 female    21       67
 9 11/3/2014  11/3/2014        1 stress         13 female    19       30
10 11/3/2014  11/3/2014        1 stress         18 female    19       55
# ℹ 154 more rows
# ℹ 150 more variables: mood_post <dbl>, STAI_pre_1_1 <dbl>,
#   STAI_pre_1_2 <dbl>, STAI_pre_1_3 <dbl>, STAI_pre_1_4 <dbl>,
#   STAI_pre_1_5 <dbl>, STAI_pre_1_6 <dbl>, STAI_pre_1_7 <dbl>,
#   STAI_pre_2_1 <dbl>, STAI_pre_2_2 <dbl>, STAI_pre_2_3 <dbl>,
#   STAI_pre_2_4 <dbl>, STAI_pre_2_5 <dbl>, STAI_pre_2_6 <dbl>,
#   STAI_pre_2_7 <dbl>, STAI_pre_3_1 <dbl>, STAI_pre_3_2 <dbl>, …

Filtering out rows

Row vs column selection

filter() acts on rows
select() acts on columns

Remove excluded participants (initially 188 rows)
Combine with relocate() to place mood columns first

filter(judgments, exclude == 0) |> 
  relocate(contains("mood"))

# A tibble: 185 × 158
   mood_pre mood_post start_date end_date  finished condition subject gender
      <dbl>     <dbl> <chr>      <chr>        <dbl> <chr>       <dbl> <chr> 
 1       81        NA 11/3/2014  11/3/2014        1 control         2 female
 2       59        42 11/3/2014  11/3/2014        1 stress          1 female
 3       22        60 11/3/2014  11/3/2014        1 stress          3 female
 4       53        68 11/3/2014  11/3/2014        1 stress          4 female
 5       48        NA 11/3/2014  11/3/2014        1 control         7 female
 6       73        73 11/3/2014  11/3/2014        1 stress          6 female
 7       NA        NA 11/3/2014  11/3/2014        1 control         5 female
 8      100        NA 11/3/2014  11/3/2014        1 control         9 male  
 9       67        74 11/3/2014  11/3/2014        1 stress         16 female
10       30        68 11/3/2014  11/3/2014        1 stress         13 female
# ℹ 175 more rows
# ℹ 150 more variables: age <dbl>, STAI_pre_1_1 <dbl>, STAI_pre_1_2 <dbl>,
#   STAI_pre_1_3 <dbl>, STAI_pre_1_4 <dbl>, STAI_pre_1_5 <dbl>,
#   STAI_pre_1_6 <dbl>, STAI_pre_1_7 <dbl>, STAI_pre_2_1 <dbl>,
#   STAI_pre_2_2 <dbl>, STAI_pre_2_3 <dbl>, STAI_pre_2_4 <dbl>,
#   STAI_pre_2_5 <dbl>, STAI_pre_2_6 <dbl>, STAI_pre_2_7 <dbl>,
#   STAI_pre_3_1 <dbl>, STAI_pre_3_2 <dbl>, STAI_pre_3_3 <dbl>, …

Set operations with `filter()`

For larger operations use filtering joins such as semi_join().
Below, tidyselect helper: for a range of columns

judgments |>
  filter(is.element(start_date, c("11/3/2014", "11/5/2014"))) |>
  select(start_date:age)

# A tibble: 79 × 7
   start_date end_date  finished condition subject gender   age
   <chr>      <chr>        <dbl> <chr>       <dbl> <chr>  <dbl>
 1 11/3/2014  11/3/2014        1 control         2 female    24
 2 11/3/2014  11/3/2014        1 stress          1 female    19
 3 11/3/2014  11/3/2014        1 stress          3 female    19
 4 11/3/2014  11/3/2014        1 stress          4 female    22
 5 11/3/2014  11/3/2014        1 control         7 female    22
 6 11/3/2014  11/3/2014        1 stress          6 female    22
 7 11/3/2014  11/3/2014        1 control         5 female    18
 8 11/3/2014  11/3/2014        1 control         9 male      20
 9 11/3/2014  11/3/2014        1 stress         16 female    21
10 11/3/2014  11/3/2014        1 stress         13 female    19
# ℹ 69 more rows

judgments |>
  filter(start_date %in% c("11/3/2014", "11/5/2014")) |>
  select(start_date:age)

# A tibble: 79 × 7
   start_date end_date  finished condition subject gender   age
   <chr>      <chr>        <dbl> <chr>       <dbl> <chr>  <dbl>
 1 11/3/2014  11/3/2014        1 control         2 female    24
 2 11/3/2014  11/3/2014        1 stress          1 female    19
 3 11/3/2014  11/3/2014        1 stress          3 female    19
 4 11/3/2014  11/3/2014        1 stress          4 female    22
 5 11/3/2014  11/3/2014        1 control         7 female    22
 6 11/3/2014  11/3/2014        1 stress          6 female    22
 7 11/3/2014  11/3/2014        1 control         5 female    18
 8 11/3/2014  11/3/2014        1 control         9 male      20
 9 11/3/2014  11/3/2014        1 stress         16 female    21
10 11/3/2014  11/3/2014        1 stress         13 female    19
# ℹ 69 more rows

%in% is an important operator in

Filter out rows that are unique: `distinct()`

Do we have different start / end dates?

select(judgments, start_date, end_date)

# A tibble: 188 × 2
   start_date end_date 
   <chr>      <chr>    
 1 11/3/2014  11/3/2014
 2 11/3/2014  11/3/2014
 3 11/3/2014  11/3/2014
 4 11/3/2014  11/3/2014
 5 11/3/2014  11/3/2014
 6 11/3/2014  11/3/2014
 7 11/3/2014  11/3/2014
 8 11/3/2014  11/3/2014
 9 11/3/2014  11/3/2014
10 11/3/2014  11/3/2014
# ℹ 178 more rows

Too many identical rows.

Use `distinct()` to remove duplicated rows:

judgments |>
  filter(exclude == 0) |> 
  select(start_date, end_date) |> 
  distinct()

# A tibble: 5 × 2
  start_date end_date 
  <chr>      <chr>    
1 11/3/2014  11/3/2014
2 11/4/2014  11/4/2014
3 11/5/2014  11/5/2014
4 11/6/2014  11/6/2014
5 11/7/2014  11/7/2014

Also possible (except columns order):

judgments |>
  filter(exclude == 0) |> 
  distinct(start_date, end_date)

Sort columns: `arrange()`

A nested sorting example

Sort by mood_pre
Within each group of mood_pre, sort by mood_post

judgments |> 
  arrange(age, mood_post) |> 
  select(subject, age, mood_post)

# A tibble: 188 × 3
   subject   age mood_post
     <dbl> <dbl>     <dbl>
 1      81    17        11
 2      15    17        38
 3      58    17        60
 4     174    17        76
 5     147    17        82
 6      33    17        NA
 7      55    17        NA
 8     143    17        NA
 9     159    18         0
10     127    18        20
# ℹ 178 more rows

Reverse sort columns

Use arrange() with the helper function desc()
For example, oldest participant first

judgments |> 
  arrange(desc(age), mood_post) |> 
  select(subject, age, mood_post)

# A tibble: 188 × 3
   subject   age mood_post
     <dbl> <dbl>     <dbl>
 1     107    31        99
 2      61    27        64
 3      41    26        NA
 4     183    25        94
 5     115    24        79
 6      93    24       100
 7       2    24        NA
 8     137    24        NA
 9      86    23        62
10      49    23        75
# ℹ 178 more rows

Your turn!

Select all columns that refer to the STAI questionnaire.
Retrieve all subjects younger than 20 which are in the stress group. The column for the group is condition.
Arrange all observations by STAI_pre so that the subject with the lowest stress level is on top. What is the subject in question?

05:00

Solution

select(judgments, starts_with("STAI"))

# A tibble: 188 × 42
   STAI_pre_1_1 STAI_pre_1_2 STAI_pre_1_3 STAI_pre_1_4 STAI_pre_1_5 STAI_pre_1_6
          <dbl>        <dbl>        <dbl>        <dbl>        <dbl>        <dbl>
 1            2            1            2            2            2            2
 2            3            2            3            1            3            2
 3            4            3            3            3            4            2
 4            2            2            2            2            3            1
 5            1            1            1            1            2            1
 6            2            2            1            1            2            1
 7            2            2            1            1            2            1
 8            1            1            1            1            1            1
 9            2            2            1            1            2            1
10            4            2            3            3            3            1
# ℹ 178 more rows
# ℹ 36 more variables: STAI_pre_1_7 <dbl>, STAI_pre_2_1 <dbl>,
#   STAI_pre_2_2 <dbl>, STAI_pre_2_3 <dbl>, STAI_pre_2_4 <dbl>,
#   STAI_pre_2_5 <dbl>, STAI_pre_2_6 <dbl>, STAI_pre_2_7 <dbl>,
#   STAI_pre_3_1 <dbl>, STAI_pre_3_2 <dbl>, STAI_pre_3_3 <dbl>,
#   STAI_pre_3_4 <dbl>, STAI_pre_3_5 <dbl>, STAI_pre_3_6 <dbl>,
#   STAI_post_1_1 <dbl>, STAI_post_1_2 <dbl>, STAI_post_1_3 <dbl>, …

judgments |> 
  filter(age < 20, condition == "stress") |> 
  relocate(age, condition) |> 
  slice_head(n = 3)

# A tibble: 3 × 158
    age condition start_date end_date finished subject gender mood_pre mood_post
  <dbl> <chr>     <chr>      <chr>       <dbl>   <dbl> <chr>     <dbl>     <dbl>
1    19 stress    11/3/2014  11/3/20…        1       1 female       59        42
2    19 stress    11/3/2014  11/3/20…        1       3 female       22        60
3    19 stress    11/3/2014  11/3/20…        1      13 female       30        68
# ℹ 149 more variables: STAI_pre_1_1 <dbl>, STAI_pre_1_2 <dbl>,
#   STAI_pre_1_3 <dbl>, STAI_pre_1_4 <dbl>, STAI_pre_1_5 <dbl>,
#   STAI_pre_1_6 <dbl>, STAI_pre_1_7 <dbl>, STAI_pre_2_1 <dbl>,
#   STAI_pre_2_2 <dbl>, STAI_pre_2_3 <dbl>, STAI_pre_2_4 <dbl>,
#   STAI_pre_2_5 <dbl>, STAI_pre_2_6 <dbl>, STAI_pre_2_7 <dbl>,
#   STAI_pre_3_1 <dbl>, STAI_pre_3_2 <dbl>, STAI_pre_3_3 <dbl>,
#   STAI_pre_3_4 <dbl>, STAI_pre_3_5 <dbl>, STAI_pre_3_6 <dbl>, …

judgments |> 
 arrange(STAI_pre) |> 
  relocate(STAI_pre) |> 
  slice_head(n = 3)

# A tibble: 3 × 158
  STAI_pre start_date end_date  finished condition subject gender   age mood_pre
     <dbl> <chr>      <chr>        <dbl> <chr>       <dbl> <chr>  <dbl>    <dbl>
1       20 11/6/2014  11/6/2014        1 stress        142 female    22       96
2       21 11/3/2014  11/3/2014        1 control         9 male      20      100
3       22 11/7/2014  11/7/2014        1 stress        162 male      19       77
# ℹ 149 more variables: mood_post <dbl>, STAI_pre_1_1 <dbl>,
#   STAI_pre_1_2 <dbl>, STAI_pre_1_3 <dbl>, STAI_pre_1_4 <dbl>,
#   STAI_pre_1_5 <dbl>, STAI_pre_1_6 <dbl>, STAI_pre_1_7 <dbl>,
#   STAI_pre_2_1 <dbl>, STAI_pre_2_2 <dbl>, STAI_pre_2_3 <dbl>,
#   STAI_pre_2_4 <dbl>, STAI_pre_2_5 <dbl>, STAI_pre_2_6 <dbl>,
#   STAI_pre_2_7 <dbl>, STAI_pre_3_1 <dbl>, STAI_pre_3_2 <dbl>,
#   STAI_pre_3_3 <dbl>, STAI_pre_3_4 <dbl>, STAI_pre_3_5 <dbl>, …

Transforming columns

Changing column names

`rename(data, new_name = old_name)`

To remember the order of appearance, consider = as “was”.

rename(judgments, 
       done = finished,
       sex = gender)

# A tibble: 188 × 158
   start_date end_date   done condition subject sex      age mood_pre mood_post
   <chr>      <chr>     <dbl> <chr>       <dbl> <chr>  <dbl>    <dbl>     <dbl>
 1 11/3/2014  11/3/2014     1 control         2 female    24       81        NA
 2 11/3/2014  11/3/2014     1 stress          1 female    19       59        42
 3 11/3/2014  11/3/2014     1 stress          3 female    19       22        60
 4 11/3/2014  11/3/2014     1 stress          4 female    22       53        68
 5 11/3/2014  11/3/2014     1 control         7 female    22       48        NA
 6 11/3/2014  11/3/2014     1 stress          6 female    22       73        73
 7 11/3/2014  11/3/2014     1 control         5 female    18       NA        NA
 8 11/3/2014  11/3/2014     1 control         9 male      20      100        NA
 9 11/3/2014  11/3/2014     1 stress         16 female    21       67        74
10 11/3/2014  11/3/2014     1 stress         13 female    19       30        68
# ℹ 178 more rows
# ℹ 149 more variables: STAI_pre_1_1 <dbl>, STAI_pre_1_2 <dbl>,
#   STAI_pre_1_3 <dbl>, STAI_pre_1_4 <dbl>, STAI_pre_1_5 <dbl>,
#   STAI_pre_1_6 <dbl>, STAI_pre_1_7 <dbl>, STAI_pre_2_1 <dbl>,
#   STAI_pre_2_2 <dbl>, STAI_pre_2_3 <dbl>, STAI_pre_2_4 <dbl>,
#   STAI_pre_2_5 <dbl>, STAI_pre_2_6 <dbl>, STAI_pre_2_7 <dbl>,
#   STAI_pre_3_1 <dbl>, STAI_pre_3_2 <dbl>, STAI_pre_3_3 <dbl>, …

With a function: `rename_with()`

For the STAI columns convert names to lower case

rename_with(judgments, 
            stringr::str_to_lower, # functional prog, function as arg
            starts_with("STAI"))

# A tibble: 188 × 158
   start_date end_date  finished condition subject gender   age mood_pre
   <chr>      <chr>        <dbl> <chr>       <dbl> <chr>  <dbl>    <dbl>
 1 11/3/2014  11/3/2014        1 control         2 female    24       81
 2 11/3/2014  11/3/2014        1 stress          1 female    19       59
 3 11/3/2014  11/3/2014        1 stress          3 female    19       22
 4 11/3/2014  11/3/2014        1 stress          4 female    22       53
 5 11/3/2014  11/3/2014        1 control         7 female    22       48
 6 11/3/2014  11/3/2014        1 stress          6 female    22       73
 7 11/3/2014  11/3/2014        1 control         5 female    18       NA
 8 11/3/2014  11/3/2014        1 control         9 male      20      100
 9 11/3/2014  11/3/2014        1 stress         16 female    21       67
10 11/3/2014  11/3/2014        1 stress         13 female    19       30
# ℹ 178 more rows
# ℹ 150 more variables: mood_post <dbl>, stai_pre_1_1 <dbl>,
#   stai_pre_1_2 <dbl>, stai_pre_1_3 <dbl>, stai_pre_1_4 <dbl>,
#   stai_pre_1_5 <dbl>, stai_pre_1_6 <dbl>, stai_pre_1_7 <dbl>,
#   stai_pre_2_1 <dbl>, stai_pre_2_2 <dbl>, stai_pre_2_3 <dbl>,
#   stai_pre_2_4 <dbl>, stai_pre_2_5 <dbl>, stai_pre_2_6 <dbl>,
#   stai_pre_2_7 <dbl>, stai_pre_3_1 <dbl>, stai_pre_3_2 <dbl>, …

Adding columns: `mutate()`

Let’s create a new column mood_change that describes the change of the mood of the participant across the experiment.

New column name: mood_change
Computation: subtract mood_pre from mood_post

judgments |>
  mutate(mood_change = mood_post - mood_pre) |> 
  relocate(starts_with("mood"))

# A tibble: 188 × 159
   mood_pre mood_post mood_change start_date end_date finished condition subject
      <dbl>     <dbl>       <dbl> <chr>      <chr>       <dbl> <chr>       <dbl>
 1       81        NA          NA 11/3/2014  11/3/20…        1 control         2
 2       59        42         -17 11/3/2014  11/3/20…        1 stress          1
 3       22        60          38 11/3/2014  11/3/20…        1 stress          3
 4       53        68          15 11/3/2014  11/3/20…        1 stress          4
 5       48        NA          NA 11/3/2014  11/3/20…        1 control         7
 6       73        73           0 11/3/2014  11/3/20…        1 stress          6
 7       NA        NA          NA 11/3/2014  11/3/20…        1 control         5
 8      100        NA          NA 11/3/2014  11/3/20…        1 control         9
 9       67        74           7 11/3/2014  11/3/20…        1 stress         16
10       30        68          38 11/3/2014  11/3/20…        1 stress         13
# ℹ 178 more rows
# ℹ 151 more variables: gender <chr>, age <dbl>, STAI_pre_1_1 <dbl>,
#   STAI_pre_1_2 <dbl>, STAI_pre_1_3 <dbl>, STAI_pre_1_4 <dbl>,
#   STAI_pre_1_5 <dbl>, STAI_pre_1_6 <dbl>, STAI_pre_1_7 <dbl>,
#   STAI_pre_2_1 <dbl>, STAI_pre_2_2 <dbl>, STAI_pre_2_3 <dbl>,
#   STAI_pre_2_4 <dbl>, STAI_pre_2_5 <dbl>, STAI_pre_2_6 <dbl>,
#   STAI_pre_2_7 <dbl>, STAI_pre_3_1 <dbl>, STAI_pre_3_2 <dbl>, …

Within one mutate statement

Instant availability

Use new variables in the same function call right away!

judgments |>
  mutate(
    mood_change = mood_post - mood_pre,
    # remove missing data before computation
    mood_change_norm =
      abs(mood_change / mean(mood_change, na.rm = TRUE))) |>
  relocate(starts_with("mood")) |> 
  arrange(desc(mood_change_norm))

# A tibble: 188 × 160
   mood_pre mood_post mood_change mood_change_norm start_date end_date  finished
      <dbl>     <dbl>       <dbl>            <dbl> <chr>      <chr>        <dbl>
 1       66         0         -66             9.12 11/3/2014  11/3/2014        1
 2       77        22         -55             7.60 11/4/2014  11/4/2014        1
 3       47       100          53             7.32 11/5/2014  11/5/2014        1
 4       25        72          47             6.49 11/4/2014  11/4/2014        1
 5       22        69          47             6.49 11/5/2014  11/5/2014        1
 6       37        83          46             6.36 11/6/2014  11/6/2014        1
 7       20        62          42             5.80 11/6/2014  11/6/2014        1
 8       60       100          40             5.53 11/4/2014  11/4/2014        1
 9       22        60          38             5.25 11/3/2014  11/3/2014        1
10       30        68          38             5.25 11/3/2014  11/3/2014        1
# ℹ 178 more rows
# ℹ 153 more variables: condition <chr>, subject <dbl>, gender <chr>,
#   age <dbl>, STAI_pre_1_1 <dbl>, STAI_pre_1_2 <dbl>, STAI_pre_1_3 <dbl>,
#   STAI_pre_1_4 <dbl>, STAI_pre_1_5 <dbl>, STAI_pre_1_6 <dbl>,
#   STAI_pre_1_7 <dbl>, STAI_pre_2_1 <dbl>, STAI_pre_2_2 <dbl>,
#   STAI_pre_2_3 <dbl>, STAI_pre_2_4 <dbl>, STAI_pre_2_5 <dbl>,
#   STAI_pre_2_6 <dbl>, STAI_pre_2_7 <dbl>, STAI_pre_3_1 <dbl>, …

na.rm = TRUE for computing mean after removing missing data

Replacing columns

Update existing

Using existing columns updates their content.

Code lines 2 and 3

Warning

If not using names actions are used as names (avoid) - Code line 4

`mutate()` existing columns, centering `mood` columns

judgments |>
  mutate(mood_pre = mood_pre - mean(mood_pre, na.rm = TRUE),
         mood_post = mood_post - mean(mood_post, na.rm = TRUE),
         mood_pre - mean(mood_post, na.rm = TRUE)) |> 
  select(starts_with("mood"))

# A tibble: 188 × 3
   mood_pre mood_post `mood_pre - mean(mood_post, na.rm = TRUE)`
      <dbl>     <dbl>                                      <dbl>
 1   21.6       NA                                        21.6  
 2   -0.358    -19.8                                      -0.358
 3  -37.4       -1.76                                    -37.4  
 4   -6.36       6.24                                     -6.36 
 5  -11.4       NA                                       -11.4  
 6   13.6       11.2                                      13.6  
 7   NA         NA                                        NA    
 8   40.6       NA                                        40.6  
 9    7.64      12.2                                       7.64 
10  -29.4        6.24                                    -29.4  
# ℹ 178 more rows

Switch statement `case_when()`

Categorize mood_pre. Tests come sequentially.

judgments |>
  mutate(mood_pre_cat = case_when(
    mood_pre < 25  ~ "poor", 
    mood_pre < 50  ~ "mid", 
    mood_pre < 75 ~ "great",
    mood_pre <= 100 ~ "exceptional",
    .default = "missing data")) |> 
  select(mood_pre, mood_pre_cat)

# A tibble: 188 × 2
   mood_pre mood_pre_cat
      <dbl> <chr>       
 1       81 exceptional 
 2       59 great       
 3       22 poor        
 4       53 great       
 5       48 mid         
 6       73 great       
 7       NA missing data
 8      100 exceptional 
 9       67 great       
10       30 mid         
# ℹ 178 more rows

.default = is better than TRUE for the last category.

Your turn!

Create a new STAI_pre_category column.

Use case_when() to categorize values in STAI_pre as low, normal or high.

For values < 25 in STAI_pre assign low
For values > 64 assign high
For all other values assign normal.

Hint

To easily see the new column, use relocate() to move it to the first position of the tibble

10:00

Solution

judgments |>
  mutate(STAI_pre_category = case_when(
    STAI_pre < 25  ~ "low",
    STAI_pre < 65 ~ "normal",
    .default = "high")) |>
  relocate(STAI_pre_category, STAI_pre)

# A tibble: 188 × 159
   STAI_pre_category STAI_pre start_date end_date  finished condition subject
   <chr>                <dbl> <chr>      <chr>        <dbl> <chr>       <dbl>
 1 normal                  32 11/3/2014  11/3/2014        1 control         2
 2 normal                  49 11/3/2014  11/3/2014        1 stress          1
 3 high                    65 11/3/2014  11/3/2014        1 stress          3
 4 normal                  42 11/3/2014  11/3/2014        1 stress          4
 5 normal                  33 11/3/2014  11/3/2014        1 control         7
 6 normal                  34 11/3/2014  11/3/2014        1 stress          6
 7 normal                  32 11/3/2014  11/3/2014        1 control         5
 8 low                     21 11/3/2014  11/3/2014        1 control         9
 9 normal                  31 11/3/2014  11/3/2014        1 stress         16
10 normal                  60 11/3/2014  11/3/2014        1 stress         13
# ℹ 178 more rows
# ℹ 152 more variables: gender <chr>, age <dbl>, mood_pre <dbl>,
#   mood_post <dbl>, STAI_pre_1_1 <dbl>, STAI_pre_1_2 <dbl>,
#   STAI_pre_1_3 <dbl>, STAI_pre_1_4 <dbl>, STAI_pre_1_5 <dbl>,
#   STAI_pre_1_6 <dbl>, STAI_pre_1_7 <dbl>, STAI_pre_2_1 <dbl>,
#   STAI_pre_2_2 <dbl>, STAI_pre_2_3 <dbl>, STAI_pre_2_4 <dbl>,
#   STAI_pre_2_5 <dbl>, STAI_pre_2_6 <dbl>, STAI_pre_2_7 <dbl>, …

Before we stop

You learned to:

Selection and manipulation of
- observations,
- variables and
- values

Next step: Grouping and Summarizing

Acknowledgments 🙏 👏

Lise Vaudor nice blog
Allison Horst for the great ArtWork
Jenny Bryan
poorman by Nathan Eastwood, a re-implementation of dplyr in base only

Contributions

Milena Zizovic
Roland Krause
Veronica Codoni

Thank you for your attention!

Data wrangling

Learning objectives

Key operations

Data wrangling learning

Organisation

dplyr introduction: Cheatsheet

Companion real data for this course

Your turn!

Selecting columns

tidyselect

Helper functions

Combining helpers

Filtering for rows: filter()

Filtering rows

Multiple conditions: AND

Multiple conditions: OR

Filtering out rows

Row vs column selection

Set operations with filter()

Filter out rows that are unique: distinct()

Do we have different start / end dates?

Use distinct() to remove duplicated rows:

Sort columns: arrange()

A nested sorting example

Reverse sort columns

Your turn!

Solution

Transforming columns

Changing column names

rename(data, new_name = old_name)

With a function: rename_with()

Adding columns: mutate()

Within one mutate statement

Replacing columns

mutate() existing columns, centering mood columns

Switch statement case_when()

Your turn!

Solution

Before we stop

`dplyr` introduction: Cheatsheet

`tidyselect`

Filtering for rows: `filter()`

Set operations with `filter()`

Filter out rows that are unique: `distinct()`

Use `distinct()` to remove duplicated rows:

Sort columns: `arrange()`

`rename(data, new_name = old_name)`

With a function: `rename_with()`

Adding columns: `mutate()`

`mutate()` existing columns, centering `mood` columns

Switch statement `case_when()`