Import data

From flat / Excel files

Aurélien Ginolhac, DHML

University of Luxembourg

Tuesday 17 March, 2026

Learning objectives

You will learn to:

Download a data file to a dedicated data sub-folder
Learn about tibbles
Use readr to import flat-files data into
Use the interactive RStudio interface to visualise your data import
Use readxl to import excel files into

Importing data

Getting started

Represents probably the first step of your work
R can handle multiple data types
- Flat files (.csv, .tsv, …)
- Excel files (.xls, .xlsx)
- Foreign statistical formats
  - .sas from SAS
  - .sav from SPSS
  - .dta from Stata
- databases (SQL, SQLite …)

Tidyverse implementation

R base already provides functions for text files
- read.csv()
- read.delim()
- …
Tidyverse redefines these functions:
- Speed
- Characters are not coerced to factors by default (R base neither since >=4.0)
- Generates tibbles

Tibbles

Reports the table dimensions (rows x columns)
Shows only the first 10 rows.
Each column reports its type (dbl, int, lgl, chr, date)
Show columns that fits your width (list the rest with types).
Have NO rownames

Tip

Use as_tibble() to convert a data.frame to a tibble.

dbl: doubles
int: integers
lgl: logicals (booleans)
chr: characters
date: dates
list: list-column

`data.frame` vs `tibble`

Data frame

swiss

             Fertility Agriculture Examination Education Catholic
Courtelary        80.2        17.0          15        12     9.96
Delemont          83.1        45.1           6         9    84.84
Franches-Mnt      92.5        39.7           5         5    93.40
Moutier           85.8        36.5          12         7    33.77
Neuveville        76.9        43.5          17        15     5.16
Porrentruy        76.1        35.3           9         7    90.57
Broye             83.8        70.2          16         7    92.85
Glane             92.4        67.8          14         8    97.16
Gruyere           82.4        53.3          12         7    97.67
Sarine            82.9        45.2          16        13    91.38
Veveyse           87.1        64.5          14         6    98.61
Aigle             64.1        62.0          21        12     8.52
Aubonne           66.9        67.5          14         7     2.27
Avenches          68.9        60.7          19        12     4.43
Cossonay          61.7        69.3          22         5     2.82
Echallens         68.3        72.6          18         2    24.20
Grandson          71.7        34.0          17         8     3.30
Lausanne          55.7        19.4          26        28    12.11
La Vallee         54.3        15.2          31        20     2.15
Lavaux            65.1        73.0          19         9     2.84
Morges            65.5        59.8          22        10     5.23
Moudon            65.0        55.1          14         3     4.52
Nyone             56.6        50.9          22        12    15.14
Orbe              57.4        54.1          20         6     4.20
Oron              72.5        71.2          12         1     2.40
Payerne           74.2        58.1          14         8     5.23
Paysd'enhaut      72.0        63.5           6         3     2.56
Rolle             60.5        60.8          16        10     7.72
Vevey             58.3        26.8          25        19    18.46
Yverdon           65.4        49.5          15         8     6.10
Conthey           75.5        85.9           3         2    99.71
Entremont         69.3        84.9           7         6    99.68
Herens            77.3        89.7           5         2   100.00
Martigwy          70.5        78.2          12         6    98.96
Monthey           79.4        64.9           7         3    98.22
St Maurice        65.0        75.9           9         9    99.06
Sierre            92.2        84.6           3         3    99.46
Sion              79.3        63.1          13        13    96.83
Boudry            70.4        38.4          26        12     5.62
La Chauxdfnd      65.7         7.7          29        11    13.79
Le Locle          72.7        16.7          22        13    11.22
Neuchatel         64.4        17.6          35        32    16.92
Val de Ruz        77.6        37.6          15         7     4.97
ValdeTravers      67.6        18.7          25         7     8.65
V. De Geneve      35.0         1.2          37        53    42.34
Rive Droite       44.7        46.6          16        29    50.43
Rive Gauche       42.8        27.7          22        29    58.33
             Infant.Mortality
Courtelary               22.2
Delemont                 22.2
Franches-Mnt             20.2
Moutier                  20.3
Neuveville               20.6
Porrentruy               26.6
Broye                    23.6
Glane                    24.9
Gruyere                  21.0
Sarine                   24.4
Veveyse                  24.5
Aigle                    16.5
Aubonne                  19.1
Avenches                 22.7
Cossonay                 18.7
Echallens                21.2
Grandson                 20.0
Lausanne                 20.2
La Vallee                10.8
Lavaux                   20.0
Morges                   18.0
Moudon                   22.4
Nyone                    16.7
Orbe                     15.3
Oron                     21.0
Payerne                  23.8
Paysd'enhaut             18.0
Rolle                    16.3
Vevey                    20.9
Yverdon                  22.5
Conthey                  15.1
Entremont                19.8
Herens                   18.3
Martigwy                 19.4
Monthey                  20.2
St Maurice               17.8
Sierre                   16.3
Sion                     18.1
Boudry                   20.3
La Chauxdfnd             20.5
Le Locle                 18.9
Neuchatel                23.0
Val de Ruz               20.0
ValdeTravers             19.5
V. De Geneve             18.0
Rive Droite              18.2
Rive Gauche              19.3

Tibble

as_tibble(swiss)

# A tibble: 47 × 6
   Fertility Agriculture Examination Education Catholic Infant.Mortality
       <dbl>       <dbl>       <int>     <int>    <dbl>            <dbl>
 1      80.2        17            15        12     9.96             22.2
 2      83.1        45.1           6         9    84.8              22.2
 3      92.5        39.7           5         5    93.4              20.2
 4      85.8        36.5          12         7    33.8              20.3
 5      76.9        43.5          17        15     5.16             20.6
 6      76.1        35.3           9         7    90.6              26.6
 7      83.8        70.2          16         7    92.8              23.6
 8      92.4        67.8          14         8    97.2              24.9
 9      82.4        53.3          12         7    97.7              21  
10      82.9        45.2          16        13    91.4              24.4
# ℹ 37 more rows

We lost the province names (rownames of swiss)

Tibbles characteristics

tibbles never use rownames. They are lost unless you assign them to a dedicated column

# library(tibble) or dplyr or tidyverse
as_tibble(swiss, rownames = "Province")

# A tibble: 47 × 7
   Province     Fertility Agriculture Examination Education Catholic
   <chr>            <dbl>       <dbl>       <int>     <int>    <dbl>
 1 Courtelary        80.2        17            15        12     9.96
 2 Delemont          83.1        45.1           6         9    84.8 
 3 Franches-Mnt      92.5        39.7           5         5    93.4 
 4 Moutier           85.8        36.5          12         7    33.8 
 5 Neuveville        76.9        43.5          17        15     5.16
 6 Porrentruy        76.1        35.3           9         7    90.6 
 7 Broye             83.8        70.2          16         7    92.8 
 8 Glane             92.4        67.8          14         8    97.2 
 9 Gruyere           82.4        53.3          12         7    97.7 
10 Sarine            82.9        45.2          16        13    91.4 
# ℹ 37 more rows
# ℹ 1 more variable: Infant.Mortality <dbl>

Other enhancements

No characters to factors (R base neither since >=4.0)
Colnames can be repaired, "minimal", "unique", "universal", "check_unique", see details

The tidyverse packages to import your data

readr

read_csv(): comma separated (,)
read_csv2(): semi-colon separated (;)
read_tsv(): tab separated
read_delim(): general delimited files, auto-guesses delimiter
read_fwf(): fixed width files
read_table(): columns separated by white-space(s)

readxl

To import Excel files

read_excel()
- read_xls()
- read_xlsx()

haven

read_sas() for SAS
read_sav() for SPSS
read_dta() for Stata

Follow along: downloading flat files (= text)

Example: swiss.csv

Use a RStudio project
Download the swiss.csv file to your project folder, in the sub-folder data (if ou have one)
- Right-click on the link
- Choose Save Link As

You MUST have created a RStudio project
Use Finder/Explorer to select this project folder
If not present New Folder called data
Go inside this sub-folder data
Hit Save

It’s always useful to check in your file browser.

Check if file is present and preview it

In RStudio:

Use the Files panel
Go in data and check if swiss.csv is present
Click, menu should be:

If swiss.csv is present in the data folder
You can View File to see its content

What is the delimiter?
Does it fits the file extension (.csv)?
What is the decimal mark?

Follow along: Import the file with `read_delim()`

The file says .csv but could be wrong

# library(readr) or 
# library(tidyverse)
read_delim("data/swiss.csv")

Rows: 47 Columns: 6                                                                          
── Column specification ──────────────────────────────────
Delimiter: ","
dbl (6): Fertility, Agriculture, Examination, Educatio...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 47 × 6
   Fertility Agriculture Examination Education Catholic
       <dbl>       <dbl>       <dbl>     <dbl>    <dbl>
 1      80.2        17            15        12     9.96
 2      83.1        45.1           6         9    84.8 
 3      92.5        39.7           5         5    93.4 
 4      85.8        36.5          12         7    33.8 
 5      76.9        43.5          17        15     5.16
 6      76.1        35.3           9         7    90.6 
 7      83.8        70.2          16         7    92.8 
 8      92.4        67.8          14         8    97.2 
 9      82.4        53.3          12         7    97.7 
10      82.9        45.2          16        13    91.4 
# ℹ 37 more rows
# ℹ 1 more variable: Infant.Mortality <dbl>
# ℹ Use `print(n = ...)` to see more rows

What do we observe with `read_delim()`?

Guesses automatically the field separator: ,
Reports the dimensions of the table read
Also guessed the column types (all 6 double)
Returns a tibble by default

Watch out!

We didn’t assign the tibble to a name!

To do so:

swiss_db <- read_delim("data/swiss.csv")

`readr` features

Column type guessing
Delimiter guessing
Reporting parsing problems
Progress bar
Reading from urls
Reading compressed files
FAST!
- Multi-threaded
- Use ALTREP
Column selection
Merge multiple files into one tibble

Catch and report reading issues

Toy example with 2 created issues

read_delim("data/swiss_ray.csv")

Rows: 5 Columns: 6                                      
── Column specification ──────────────────────────────────
Delimiter: ","
chr (1): fer
dbl (4): agri, exa, edu, cath
lgl (1): xray

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 5 × 6
  fer    agri   exa   edu  cath xray 
  <chr> <dbl> <dbl> <dbl> <dbl> <lgl>
1 80.2   17      15    12 NA    NA   
2 a      45.1     6     9 84.8  NA   
3 92.5   39.7     5     5 NA    NA   
4 85.8   36.5    12     7 33.8  NA   
5 76.9   43.5    17    15  5.16 NA

Two columns with missing data
fer is no longer of type double
Empty xray column imported as logical!

Specifying column types

readr record issues, use problems() to see them

sw2 <- read_delim("data/swiss_ray.csv",
        col_types = cols(fer = col_double(),
                         agri = col_double(),
                         exa = col_integer(),
                         edu = col_integer(),
                         cath = col_double(),
                         xray = col_skip()))
sw2

Warning message:                                        
One or more parsing issues, call `problems()` on your
data frame for details, e.g.:
  dat <- vroom(...)
  problems(dat)

problems(sw2)

# A tibble: 1 × 5
    row   col expected actual file 
  <int> <int> <chr>    <chr>  <chr>
1     3     1 a double a      "data/swiss_ray.csv"

Choosing which columns to read

Lighter column type definitions

We want to skip the wrong 3 columns

And let read_delim guess the agri type (?)

read_delim("data/swiss_ray.csv",
  col_types = cols(fer = "_",  # skip
                   agri = "?", # guess
                   exa = "i",  # int
                   edu = "i",
                   cath = "_",
                   xray = "_"))

Columns selection

tidyselect syntax

read_delim("data/swiss_ray.csv",
           col_select = c(agri, edu))

# A tibble: 5 × 2
   agri   edu
  <dbl> <dbl>
1  17      12
2  45.1     9
3  39.7     5
4  36.5     7
5  43.5    15

Column selection with `tidyselect`

This package is the backend for programatic column selection in the tidyverse.

Features are described in this article.

Operators

c() for combining names
: for selecting a range
- for negating a selection

Helpers

everything() all columns
last_col() last column

Using patterns

starts_with() with quoted prefix
ends_with() with quoted suffix
contains() with quoted string

read_delim("data/swiss_ray.csv",
    col_select = starts_with("e"), n_max = 1)

# A tibble: 1 × 2
    exa   edu
  <dbl> <dbl>
1    15    12

read_delim("data/swiss_ray.csv", 
    col_select = contains("e"), n_max = 1)

# A tibble: 1 × 3
    fer   exa   edu
  <dbl> <dbl> <dbl>
1  80.2    15    12

read_delim("data/swiss_ray.csv",
      col_select = fer:edu, n_max = 1)

# A tibble: 1 × 4
    fer  agri   exa   edu
  <dbl> <dbl> <dbl> <dbl>
1  80.2    17    15    12

Header, what to do when absent

Example

Using the example.csv
This CSV toy data contains 3 columns

Content is:

dog,red,1
cat,blue,2
chicken,green,6

Naive approach

exa <- "https://biostat2.uni.lu/practicals/data/example.csv"
read_delim(exa)

# A tibble: 2 × 3
  dog     red     `1`
  <chr>   <chr> <dbl>
1 cat     blue      2
2 chicken green     6

Satisfying?

Provide the header

Read all lines

Colnames are self-created
show_col_types = FALSE suppresses the verbose mode

read_delim(exa, col_names = FALSE, 
           show_col_types = FALSE)

# A tibble: 3 × 3
  X1      X2       X3
  <chr>   <chr> <dbl>
1 dog     red       1
2 cat     blue      2
3 chicken green     6

Satisfying?

Supply the columns names

read_delim(exa, 
           col_names = c("animal", "color", "value"),
           show_col_types = FALSE)

# A tibble: 3 × 3
  animal  color value
  <chr>   <chr> <dbl>
1 dog     red       1
2 cat     blue      2
3 chicken green     6

Better

Exercise

Override the column types

Import example.csv but

Skip the colour column
Read in the value column as integer

Desired output

# A tibble: 3 × 2
  animal  value
  <chr>   <int>
1 dog         1
2 cat         2
3 chicken     6

Hints

Column types are specified using col_types = cols()

Function	Short	Description
`col_logical()`	l	TRUE/FALSE, 1/0
`col_integer()`	i	integers
`col_double()`	d	floating point values
`col_character()`	c	everything else
`col_skip()`	`_`	skip this column
`col_guess()`	?	guess

Solution

Using one-letter shorcuts

read_delim(exa, 
    col_names = c("animal", "color", "value"),
    col_types = cols(
      animal = "c",
      color = "_",
      value = "i"
    ))

# A tibble: 3 × 2
  animal  value
  <chr>   <int>
1 dog         1
2 cat         2
3 chicken     6

Even more compact

read_delim(exa, 
      col_names = c("animal", "color", "value"),
      col_types = "c_i")

# A tibble: 3 × 2
  animal  value
  <chr>   <int>
1 dog         1
2 cat         2
3 chicken     6

Use the single character code in the column order

Skipping lines

Our current example

animal  color value
dog     red       1
cat     blue      2
chicken green     6

Skip one line

read_delim(exa,
  skip = 1,
  col_names = FALSE,
  show_col_types = FALSE)

# A tibble: 2 × 3
  X1      X2       X3
  <chr>   <chr> <dbl>
1 cat     blue      2
2 chicken green     6

Read only 2 lines (w/o header)

read_delim(exa,
  n_max = 2,
  col_names = FALSE,
  show_col_types = FALSE)

# A tibble: 2 × 3
  X1    X2       X3
  <chr> <chr> <dbl>
1 dog   red       1
2 cat   blue      2

Importing Excel files

readxl author: Hadley Wickham and Jenny Bryan
read_excel() reads both xls and xlsx files and detects the format from the extension. Otherwise:
- read_xls()
- read_xlsx()
Return tibbles
Column type guessing
Discovers the minimal data rectangle and returns that, by default
Exert more control with range, skip, and n_max
excel_sheets() returns the sheet names
No external library dependencies, e.g., Java or Perl

Before we stop

You learned to:

Appreciate the tibble features
Learn set-up working directory with RStudio projects
Learn the tidyselect syntax
Use readr::read_delim to import your flat files
Adjust the imported data types
Use readxl to import excel files

Further reading

R for Data Science

Acknowledgments 🙏 👏

Development

Eric Koncina (initial installment)
Veronica Codoni (swiss data set)
Roland Krause (readr 2.0 update)

Input and inspiration

Jim Hester (vroom, readr development)
Jenny Bryan (advice with project organization)
Hadley Wickham
Nicholas Tierney

Thank you for your attention!