Principles of tidy data

and tidyr

Aurélien Ginolhac, DLSM

University of Luxembourg

Wednesday, the 26th of March, 2025

Learning objectives

You will learn to:

  • List the principles of tidy data to structure data in tables
  • Identify errors in existing data sets
  • Rearrangements
    • Split columns
    • Unite columns
    • Pivot to long format
    • Pivot to wider

A definition of tidy data

  • Variable: A quantity, quality, or property that you can measure.
  • Observation: A set of values that display the relationship between variables. To be an observation, values need to be measured under similar conditions, usually measured on the same observational unit at the same time.
  • Value: The state of a variable that you observe when you measure it.
  1. Each variable forms a column.
  2. Each observation forms a row.
  3. Each type of observational unit forms a table.

Bad data exercise

Questions

  • The following table lists data from two questionnaires –stai and rec– recorded in different languages.
  • What’s wrong with the Excel sheet?
  • Which problems are tidy issues?

Tidy errors

Error Tidy violation Comment
Person name No Data protection violation
Identical column names Yes Variable error
Inconsistent variables names No Bad practice
Non-English columns names No Bad practice
Color coding No The horror, the horror
Inconsistent dates No Use ISO8601
Multiple columns for one item Yes One observation per line
Redundant information Yes Each variable is in its own column
Repeated rows Yes Each observation is in its own row
Missing coding Yes/No Each value in its own cell
Unnecessary information (Birthdate, comments) No Bad practice
Name of the table No Bad practice

Basic rearrangements

tidyr

Splitting values into columns

Key-value pairs

# A tibble: 6 × 2
  subject_id gender_age
       <int> <chr>     
1       1001 m-40      
2       1002 f-42      
3       1003 m-37      
4       1004 f-45      
5       1005 m-55      
6       1006 f-24      
separate_wider_delim(patient,
                     cols = gender_age, 
                     delim = "-",
                     names = c("sex", "age"))
# A tibble: 6 × 3
  subject_id sex   age  
       <int> <chr> <chr>
1       1001 m     40   
2       1002 f     42   
3       1003 m     37   
4       1004 f     45   
5       1005 m     55   
6       1006 f     24   

Collating columns into one

Input tibble

data_value <-
  tibble(
    year = c(2015, 2014, 2014),
    month = c(11, 2, 4),
    day = c(23, 1, 30),
    value = c("high", "low", "low"))
data_value
# A tibble: 3 × 4
   year month   day value
  <dbl> <dbl> <dbl> <chr>
1  2015    11    23 high 
2  2014     2     1 low  
3  2014     4    30 low  

Unite columns into one

date_unite <-  unite(data_value, 
                     col = date, 
                     year, month, day, 
                     sep = "-") 
date_unite
# A tibble: 3 × 2
  date       value
  <chr>      <chr>
1 2015-11-23 high 
2 2014-2-1   low  
3 2014-4-30  low  

No need to clean up old columns.

mutate(date_unite,
       date2 = as_date(date), .before = date)
# A tibble: 3 × 3
  date2      date       value
  <date>     <chr>      <chr>
1 2015-11-23 2015-11-23 high 
2 2014-02-01 2014-2-1   low  
3 2014-04-30 2014-4-30  low  

Now data type is real date

Parsing dates

A gift from your collaborators

visit_times <- tribble(
  ~subject, ~visit_date,
  1,        "01/07/2001",
  2,        "01.MAY.2012",
  3,        "12-07-2015",
  4,        "4/5/14",
  5,        "12. Jun 1999"
)
visit_times
# A tibble: 5 × 2
  subject visit_date  
    <dbl> <chr>       
1       1 01/07/2001  
2       2 01.MAY.2012 
3       3 12-07-2015  
4       4 4/5/14      
5       5 12. Jun 1999

Mix of everything

lubridate to the rescue!

mutate(visit_times,
       good_date = dmy(visit_date))
# A tibble: 5 × 3
  subject visit_date   good_date 
    <dbl> <chr>        <date>    
1       1 01/07/2001   2001-07-01
2       2 01.MAY.2012  2012-05-01
3       3 12-07-2015   2015-07-12
4       4 4/5/14       2014-05-04
5       5 12. Jun 1999 1999-06-12

lubridate has a range of functions for parsing ill-formatted dates and times.

Separate rows with multiple entries

Multiple values per cell

patient_df <- tibble(
    subject_id = 1001:1003, 
    visit_id = c("1,2, 3", "1|2", "1"),
    measured = c("9,0, 11", "11, 3", "12"))
patient_df
# A tibble: 3 × 3
  subject_id visit_id measured
       <int> <chr>    <chr>   
1       1001 1,2, 3   9,0, 11 
2       1002 1|2      11, 3   
3       1003 1        12      

Note the incoherent white space and separators.

Combinations of variables

separate_rows(patient_df, 
              visit_id, 
              measured,
              convert = TRUE)  # parse as integers if possible
# A tibble: 6 × 3
  subject_id visit_id measured
       <int>    <int>    <int>
1       1001        1        9
2       1001        2        0
3       1001        3       11
4       1002        1       11
5       1002        2        3
6       1003        1       12

Now, in tidy format

Pivot, each variable in its own column

Chromosome lengths

Is it in tidy format?

tibble(org = c("yeast", "mouse"),
       chr1 = c(230218, 195154279),
       chr2 = c(813184, 181755017),
       MT = c(85779, 16299)) -> chr_lg
chr_lg
# A tibble: 2 × 4
  org        chr1      chr2    MT
  <chr>     <dbl>     <dbl> <dbl>
1 yeast    230218    813184 85779
2 mouse 195154279 181755017 16299

Pivot to the longer format

Act on all columns but org

pivot_longer(chr_lg, cols = -org,
             names_to = "chromosome",
             values_to = "length_bp") -> chr_lg_long
chr_lg_long
# A tibble: 6 × 3
  org   chromosome length_bp
  <chr> <chr>          <dbl>
1 yeast chr1          230218
2 yeast chr2          813184
3 yeast MT             85779
4 mouse chr1       195154279
5 mouse chr2       181755017
6 mouse MT             16299

The org IDs are replicated by the number of pivoted columns (3)

Pivot longer is also called melt

Pivot wider, useful for some computation

Compute the absolute difference

Of chromosome lengths between mouse and yeast?

Hint

We need the org in their own column

# A tibble: 6 × 3
  org   chromosome length_bp
  <chr> <chr>          <dbl>
1 yeast chr1          230218
2 yeast chr2          813184
3 yeast MT             85779
4 mouse chr1       195154279
5 mouse chr2       181755017
6 mouse MT             16299

Pivot wider, ids are now chromosomes

pivot_wider(chr_lg_long, id_cols = chromosome,
            names_from = org,
            values_from = length_bp)
# A tibble: 3 × 3
  chromosome  yeast     mouse
  <chr>       <dbl>     <dbl>
1 chr1       230218 195154279
2 chr2       813184 181755017
3 MT          85779     16299

Now do the absolute diff

pivot_wider(chr_lg_long, id_cols = chromosome,
            names_from = org,
            values_from = length_bp) |> 
  mutate(abs_diff = abs(mouse - yeast))
# A tibble: 3 × 4
  chromosome  yeast     mouse  abs_diff
  <chr>       <dbl>     <dbl>     <dbl>
1 chr1       230218 195154279 194924061
2 chr2       813184 181755017 180941833
3 MT          85779     16299     69480

Longer vs Wider

Before we stop

You learned to:

  • What is the tidy format
  • Split and Unite columns
  • Pivot long / wide

Further reading 📚

Acknowledgments 🙏 👏

  • Hadley Wickham
  • Roland Krause
  • Alison Hill
  • Artwork by Allison Horst

Thank you for your attention!