and tidyr
University of Luxembourg
Wednesday, the 26th of March, 2025
You will learn to:
Credit: Artwork by Allison Horst
Source: Garrett Grolemund and vignette("tidy-data")
Questions
Error | Tidy violation | Comment |
---|---|---|
Person name | No | Data protection violation |
Identical column names | Yes | Variable error |
Inconsistent variable names | No | Bad practice |
Non-English column names | No | Bad practice |
Color coding | No | The horror, the horror |
Inconsistent dates | No | Use ISO 8601 |
Multiple columns for one item | Yes | One observation per line |
Redundant information | Yes | Each variable is in its own column |
Repeated rows | Yes | Each observation is in its own row |
Missing coding | Yes/No | Each value in its own cell |
Unnecessary information (Birthdate, comments) | No | Bad practice |
Name of the table | No | Bad practice |
tidyr
# A tibble: 6 × 2
subject_id gender_age
<int> <chr>
1 1001 m-40
2 1002 f-42
3 1003 m-37
4 1004 f-45
5 1005 m-55
6 1006 f-24
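The gender_age column above holds two variables in one cell. A minimal sketch of how it could be split with tidyr, assuming the separator is always "-"; the object names subjects and subjects_tidy are made up for illustration:

```r
library(tibble)
library(tidyr)

# Rebuild the table printed above
subjects <- tibble(
  subject_id = 1001:1006,
  gender_age = c("m-40", "f-42", "m-37", "f-45", "m-55", "f-24")
)

# Split gender_age at "-" into two columns;
# convert = TRUE turns the age strings into integers
subjects_tidy <- separate(
  subjects,
  col = gender_age,
  into = c("gender", "age"),
  sep = "-",
  convert = TRUE
)
subjects_tidy
```

separate() drops the original gender_age column by default. Recent tidyr versions also offer separate_wider_delim() for the same job.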
# A tibble: 3 × 2
date value
<chr> <chr>
1 2015-11-23 high
2 2014-2-1 low
3 2014-4-30 low
No need to clean up old columns.
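The date column above looks like the result of pasting separate year, month and day columns together. A sketch with unite(), assuming an input table of that shape; dates_split and its columns are hypothetical, reconstructed from the printed output:

```r
library(tibble)
library(tidyr)

# Hypothetical input: one column per date component
dates_split <- tibble(
  year  = c(2015L, 2014L, 2014L),
  month = c(11L, 2L, 4L),
  day   = c(23L, 1L, 30L),
  value = c("high", "low", "low")
)

# Paste year, month and day into one character column;
# the three input columns are removed by default (remove = TRUE)
dates_united <- unite(dates_split, col = "date", year, month, day, sep = "-")
dates_united
```

The united column is a character vector, not a date, which is why the months and days are not zero-padded.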
visit_times <- tribble(
~subject, ~visit_date,
1, "01/07/2001",
2, "01.MAY.2012",
3, "12-07-2015",
4, "4/5/14",
5, "12. Jun 1999"
)
visit_times
# A tibble: 5 × 2
subject visit_date
<dbl> <chr>
1 1 01/07/2001
2 2 01.MAY.2012
3 3 12-07-2015
4 4 4/5/14
5 5 12. Jun 1999
Mix of everything
# A tibble: 5 × 3
subject visit_date good_date
<dbl> <chr> <date>
1 1 01/07/2001 2001-07-01
2 2 01.MAY.2012 2012-05-01
3 3 12-07-2015 2015-07-12
4 4 4/5/14 2014-05-04
5 5 12. Jun 1999 1999-06-12
lubridate has a range of functions for parsing ill-formatted dates and times.
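One way the good_date column might be obtained is with lubridate's dmy(), assuming every entry in visit_date is in day-month-year order (which the printed result suggests) and an English locale for the month names. The name visit_clean is made up for illustration:

```r
library(tibble)
library(dplyr)
library(lubridate)

visit_times <- tribble(
  ~subject, ~visit_date,
  1, "01/07/2001",
  2, "01.MAY.2012",
  3, "12-07-2015",
  4, "4/5/14",
  5, "12. Jun 1999"
)

# dmy() tolerates different separators and returns a Date vector;
# entries it cannot parse become NA with a warning
visit_clean <- mutate(visit_times, good_date = dmy(visit_date))
visit_clean
```

If a column mixes day-first and month-first dates, no parser can guess the order reliably, so the ambiguity has to be resolved at the source.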
patient_df <- tibble(
subject_id = 1001:1003,
visit_id = c("1,2, 3", "1|2", "1"),
measured = c("9,0, 11", "11, 3", "12"))
patient_df
# A tibble: 3 × 3
subject_id visit_id measured
<int> <chr> <chr>
1 1001 1,2, 3 9,0, 11
2 1002 1|2 11, 3
3 1003 1 12
Note the inconsistent white space and separators.
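A sketch of how such cells could be expanded into one row per visit with separate_rows(), relying on its default separator, which splits on any run of characters that is neither alphanumeric nor a dot. The name patient_long is made up for illustration:

```r
library(tibble)
library(tidyr)

patient_df <- tibble(
  subject_id = 1001:1003,
  visit_id = c("1,2, 3", "1|2", "1"),
  measured = c("9,0, 11", "11, 3", "12"))

# Split both columns in parallel, one row per visit;
# the default sep handles ",", ", " and "|" alike,
# and convert = TRUE turns the pieces into integers
patient_long <- separate_rows(patient_df, visit_id, measured, convert = TRUE)
patient_long
```

Both columns must contain the same number of pieces in each row, otherwise separate_rows() stops with an error. Recent tidyr versions provide separate_longer_delim() as a stricter alternative with an explicit delimiter.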
Is it in tidy format?
Pivot every column except org
# cols = -org selects all columns but org
chr_lg_long <- pivot_longer(chr_lg, cols = -org,
  names_to = "chromosome",
  values_to = "length_bp")
chr_lg_long
# A tibble: 6 × 3
org chromosome length_bp
<chr> <chr> <dbl>
1 yeast chr1 230218
2 yeast chr2 813184
3 yeast MT 85779
4 mouse chr1 195154279
5 mouse chr2 181755017
6 mouse MT 16299
The org IDs are repeated once for each pivoted column (3).
Pivoting longer is also known as melting.
Of chromosome lengths between mouse and yeast?
Hint
We need each org in its own column
# A tibble: 6 × 3
org chromosome length_bp
<chr> <chr> <dbl>
1 yeast chr1 230218
2 yeast chr2 813184
3 yeast MT 85779
4 mouse chr1 195154279
5 mouse chr2 181755017
6 mouse MT 16299
# A tibble: 3 × 3
chromosome yeast mouse
<chr> <dbl> <dbl>
1 chr1 230218 195154279
2 chr2 813184 181755017
3 MT 85779 16299
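The wide table above can be produced from the long one with pivot_wider(). A minimal sketch, with chr_lg_long rebuilt here from the printed output so the example runs on its own; chr_lg_wide is a name chosen for illustration:

```r
library(tibble)
library(tidyr)

# Long table, rebuilt from the printed output
chr_lg_long <- tibble(
  org = rep(c("yeast", "mouse"), each = 3),
  chromosome = rep(c("chr1", "chr2", "MT"), times = 2),
  length_bp = c(230218, 813184, 85779, 195154279, 181755017, 16299)
)

# One column per organism, one row per chromosome
chr_lg_wide <- pivot_wider(
  chr_lg_long,
  names_from = org,
  values_from = length_bp
)
chr_lg_wide
```

With yeast and mouse side by side, the lengths can be compared row by row, for example with a ratio column.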
Credit: Artwork by Allison Horst
You learned to:
Further reading 📚
Acknowledgments 🙏 👏
Thank you for your attention!