Data import

using readr

Author

A. Ginolhac, V. Codoni

Published

March 17, 2025

Objective

In this practical, you’ll learn how to import flat files using the readr package.

Before you start

To perform reproducible research, it is good practice to store files in a standardized location. We will take advantage of RStudio projects and store data files in a sub-folder called data. This tutorial is meant to be completed as part of the repository that you use for all practicals of the R tidyverse workshop.

Prepare your project’s folder
  1. Check that the project is active: the name you chose should appear on the top-right corner.

  2. Create a folder named data within your project’s folder. Use the Files pane in the lower-right RStudio panel or your favorite file browser.

  3. Download the file blood_fat.csv and place it in the data sub-folder you just created.

  4. Add a setup code chunk to this Quarto document to load the packages with library(tidyverse). You don’t need to install the packages if this line runs without error. Such a chunk looks like this (the include option is set to false to hide the messages in the rendered document):

```{r}
#| label: setup
#| include: false

library(tidyverse)
```
Warning

If you load the library only in the console and forget to place a chunk to load it, the render process will fail. Indeed, when you click on the Render button, the chunks are evaluated in a new and fresh environment.

Use readr to load your first file

Read the blood_fat file using read_delim()
Tip

If you followed the preliminary steps above and downloaded the CSV into the data sub-folder of your RStudio project, the relative path "data/blood_fat.csv" is safe to use.

For example, your folder structure could look like the one below (the names depend on what you picked). Here:

  • RStudio project is for example basv53-practicals
  • Quarto document is 03_import_answer.qmd
.
├── 03_import_answer.qmd
├── basv53-practicals.Rproj
└── data
    └── blood_fat.csv
read_delim("https://biostat2.uni.lu/practicals/data/blood_fat.csv")
Which delimiter was guessed? Does it fit the file extension?

A comma was found and it fits the comma-separated-values (csv) extension.
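To see the guessing at work on something other than a comma, here is a minimal sketch on a small semicolon-separated file. The file content is made up for illustration (it is not the real blood_fat.csv), and omitting delim assumes readr 2.0 or later, where read_delim() guesses the delimiter.

```r
library(readr)

# Toy data with made-up values, written to a temporary file
tmp <- tempfile(fileext = ".txt")
writeLines(c("id;group;age", "1;A;46", "2;B;20"), tmp)

# Let read_delim() guess the delimiter (assumes readr >= 2.0)
guessed <- read_delim(tmp)

# State the delimiter explicitly instead of relying on the guess
explicit <- read_delim(tmp, delim = ";")

names(guessed)
names(explicit)
```

Stating delim explicitly is the safer choice when you already know the format, since the guess is only based on the first lines of the file.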

Load again the same file, silencing the read_delim() message
Tip

When it runs, read_delim() reports the dimensions of the file, along with the guessed delimiter and the data type of each column.

If we are happy with the guessed delimiter and the column names and types, we can silence this report with the argument show_col_types = FALSE.

read_delim("data/blood_fat.csv", show_col_types = FALSE)
# A tibble: 25 × 5
      id group weight   age   fat
   <dbl> <chr>  <dbl> <dbl> <dbl>
 1     1 A         84    46  354.
 2     2 A         73    20  190.
 3     3 A         65    52  406.
 4     4 A         70    30  264.
 5     5 A         76    57  452.
 6     6 A         69    25  302.
 7     7 A         63    28  288.
 8     8 A         72    36  386.
 9     9 A         79    57  402.
10    10 A         75    44  366.
# ℹ 15 more rows
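Silencing the message does not discard the information. As a small sketch, the guessed column specification can still be retrieved afterwards with spec(). The toy file below uses made-up values and only stands in for blood_fat.csv.

```r
library(readr)

# Toy data with made-up values, written to a temporary file
tmp <- tempfile(fileext = ".csv")
writeLines(c("id,group,fat", "1,A,354.5", "2,B,190.25"), tmp)

# Read without the column-type message
toy <- read_delim(tmp, delim = ",", show_col_types = FALSE)

# The column specification is still attached to the result
spec(toy)
```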

The tibble

read_delim() loads the data as a tibble. The main advantage of tibbles over regular data frames is the way they are printed.

  • Tibbles show some useful information such as the number of rows and columns:
    • Look at the top of the tibble and find the line “# A tibble: rows × cols”.
    • How many rows are in the tibble?
  • The columns of a tibble report their type:
    • Look at the tibble header: the type of each column is reported just below its name.
    • What is the type of the age column?

  • The tibble has 25 rows and 5 columns.
  • age is read as a double (<dbl>).

Actually, both age and id are integers, and should be read as such.

Read the blood_fat.csv specifying the data types of age and id as integers
Tip

Inside col_types = cols(...), refer to each column by its bare name and set its data type either with the long form, such as col_integer(), or with the shortcut "i".

read_delim("data/blood_fat.csv", 
           col_types = cols(age = "i",
                            id = "i"))
# A tibble: 25 × 5
      id group weight   age   fat
   <int> <chr>  <dbl> <int> <dbl>
 1     1 A         84    46  354.
 2     2 A         73    20  190.
 3     3 A         65    52  406.
 4     4 A         70    30  264.
 5     5 A         76    57  452.
 6     6 A         69    25  302.
 7     7 A         63    28  288.
 8     8 A         72    36  386.
 9     9 A         79    57  402.
10    10 A         75    44  366.
# ℹ 15 more rows
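For comparison, here is a sketch of two other ways to write the same specification: the long form with col_integer(), and a compact string with one letter per column in file order. The toy file reuses the column names of blood_fat.csv, but its values are made up.

```r
library(readr)

# Toy data with made-up values, same column names as blood_fat.csv
tmp <- tempfile(fileext = ".csv")
writeLines(c("id,group,weight,age,fat",
             "1,A,84,46,354.5",
             "2,B,73,20,190.25"), tmp)

# Long form: one col_*() function per named column, others are guessed
long_form <- read_delim(tmp, delim = ",",
                        col_types = cols(id = col_integer(),
                                         age = col_integer()))

# Compact string: one letter per column, in file order
# i = integer, c = character, d = double
compact <- read_delim(tmp, delim = ",", col_types = "icdid")

sapply(long_form, class)
sapply(compact, class)
```

The compact string is short but must list every column in order, so the named cols() form tends to be easier to maintain when the file layout changes.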
Read the blood_fat.csv specifying the data types of age and id as integers, skipping weight
read_delim("data/blood_fat.csv", 
           col_types = cols(age = "i",
                            weight = "_",
                            id = "i"))
# A tibble: 25 × 4
      id group   age   fat
   <int> <chr> <int> <dbl>
 1     1 A        46  354.
 2     2 A        20  190.
 3     3 A        52  406.
 4     4 A        30  264.
 5     5 A        57  452.
 6     6 A        25  302.
 7     7 A        28  288.
 8     8 A        36  386.
 9     9 A        57  402.
10    10 A        44  366.
# ℹ 15 more rows
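Two related options are sketched below: col_skip(), the long form of the "_" shortcut, and cols_only(), which keeps only the columns it names. The toy file uses made-up values and is an assumption for illustration.

```r
library(readr)

# Toy data with made-up values, same column names as blood_fat.csv
tmp <- tempfile(fileext = ".csv")
writeLines(c("id,group,weight,age,fat",
             "1,A,84,46,354.5",
             "2,B,73,20,190.25"), tmp)

# col_skip() drops one column, the others are still read
no_weight <- read_delim(tmp, delim = ",",
                        col_types = cols(id = "i",
                                         age = "i",
                                         weight = col_skip()))

# cols_only() reads just the named columns and drops everything else
only_two <- read_delim(tmp, delim = ",",
                       col_types = cols_only(id = "i", fat = "d"))

names(no_weight)
names(only_two)
```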
Read blood_fat.csv using the relevant readr variant for CSV files and assign the name blood_fat
Tip

An assignment (<- or -> operators) prints nothing; it only creates the object in the Global Environment. Add a line containing just blood_fat to call the object and display the associated tibble.

blood_fat <- read_csv("data/blood_fat.csv", show_col_types = FALSE)
blood_fat
# A tibble: 25 × 5
      id group weight   age   fat
   <dbl> <chr>  <dbl> <dbl> <dbl>
 1     1 A         84    46  354.
 2     2 A         73    20  190.
 3     3 A         65    52  406.
 4     4 A         70    30  264.
 5     5 A         76    57  452.
 6     6 A         69    25  302.
 7     7 A         63    28  288.
 8     8 A         72    36  386.
 9     9 A         79    57  402.
10    10 A         75    44  366.
# ℹ 15 more rows
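As a quick check that read_csv() behaves like read_delim() with a comma delimiter, the sketch below reads the same toy file both ways and compares the columns. The file content is made up, and the comparison only holds for simple input like this one, since the two functions differ in some defaults (for example whitespace trimming).

```r
library(readr)

# Toy data with made-up values, written to a temporary file
tmp <- tempfile(fileext = ".csv")
writeLines(c("id,group,fat", "1,A,354.5", "2,B,190.25"), tmp)

via_csv   <- read_csv(tmp, show_col_types = FALSE)
via_delim <- read_delim(tmp, delim = ",", show_col_types = FALSE)

# Compare the two results column by column
same_columns <- all(mapply(identical, via_csv, via_delim))
same_columns
```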