Introduction to R

with the tidyverse

Aurélien Ginolhac, DLSM

University of Luxembourg

Tuesday, the 25th of February, 2025

Learning objectives

You will learn to:

R specificities
- Community
- Package ecosystem
- Vectorization
Opinionated tidyverse
Data types
Data structures

What is R really?

R is shorthand for “GNU R”:

An interactive programming language derived from S (J. Chambers, Bell Labs, 1976)
Appeared in 1995, created by Ross Ihaka and Robert Gentleman, University of Auckland, NZ
20th birthday of v1.0.0, released on the 29th of Feb. 2000
Focus on data analysis and plotting
R is also shorthand for the ecosystem around this language
- Book authors
- Package developers
- Ordinary useRs

Learning to use R will make you more efficient and facilitate the use of advanced data analysis tools

Why use R?

It’s free! and open-source
Easy to install / maintain
Multi-platform ( Windows, macOS, GNU/Linux)
Can process big files and analyse huge amounts of data (db tools)
Integrated data visualization tools, even dynamic ones with shiny
Fast, and even faster with C++ integration via Rcpp or cpp11.
Easy to get help, welcoming community
- community on the web
- Stack Overflow, with a lot of tags like r, ggplot2 etc.
- R-bloggers
- R-Ladies

About “R is slow”

Execution speed is easy to measure but what about development speed?
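
A minimal sketch of that trade-off, using made-up numbers: both versions below compute the same squares, but the vectorized one is a single expression to write and read. The variable names are illustrative only.

```r
# Hypothetical input values
values <- c(1.5, 2, 3.25, 4)

# Explicit loop: pre-allocate the result, then fill it element by element
squares_loop <- numeric(length(values))
for (i in seq_along(values)) {
  squares_loop[i] <- values[i]^2
}

# Vectorized: the operator is applied to every element at once
squares_vec <- values^2

# Both approaches give the same result
identical(squares_loop, squares_vec)
```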

Constant trend in use for interactive stats environments

Source: Touchon & McCoy. Ecosphere. 2016

Source: D. Robinson, StackOverflow blog

The bad news is that whenever you learn a new skill you’re going to suck. It’s going to be frustrating. The good news is that is typical and happens to everyone and it is only temporary. You can’t go from knowing nothing to becoming an expert without going through a period of great frustration and great suckiness.

— Hadley Wickham

Hard to learn

Base R is complex, has a long history and many contributors

Unhelpful help: ?print
Generic methods: print.data.frame
Too many commands: colnames, names
Inconsistent names: read.csv, load, readRDS
Non-strict syntax: R was designed for interactive usage
Too many ways to select variables: df$x, df$"x", df[,"x"], df[[1]]
[…] see r4stats’ post for the full list
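
To make the "too many ways" point concrete, here is a small sketch on a hypothetical data frame `df` (the column names and values are invented for illustration). All four forms return the same vector, provided `x` is the first column:

```r
# Hypothetical data frame used only for illustration
df <- data.frame(x = c(10, 20, 30), y = c("a", "b", "c"))

by_dollar   <- df$x       # name after $
by_quoted   <- df$"x"     # quoted name after $
by_matrix   <- df[, "x"]  # matrix-style indexing, column selected by name
by_position <- df[[1]]    # list-style indexing, first column

# All four selections are the same numeric vector
identical(by_dollar, by_quoted)
identical(by_dollar, by_matrix)
identical(by_dollar, by_position)
```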
The tidyverse curse

Navigating the balance between base R and the tidyverse is a challenge to learn

— Robert A. Muenchen

Help pages

Two ways to open a manual page:

?log
help(log)

Sadly, manual pages are often unhelpful; vignettes or articles describe workflows better (the readxl website is one example).

In RStudio, the help page can be viewed in the bottom right pane
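
A short sketch of related lookups, using only functions from the `utils` package that ships with R. Assigning the result of `help()` keeps the page from opening right away; the choice of the `grid` package is just an example.

```r
# help() returns an object; printing that object is what displays the page
log_help <- help("log")
class(log_help)

# When the exact function name is unknown, search the installed documentation
# (same as typing ??logarithm at the console); commented out to avoid
# opening a results window
# help.search("logarithm")

# List the long-form vignettes shipped with a package
grid_vignettes <- vignette(package = "grid")
```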

The ambiguity [of the S language] is real and goes to a key objective: we wanted users to be able to begin in an interactive environment, where they did not consciously think of themselves as programming. Then as their needs became clearer and their sophistication increased, they should be able to slide gradually into programming, when the language and system aspects would become more important.

— John Chambers, “Stages in the Evolution of S”

Tidyverse origin

Hadley Wickham is Chief Scientist at Posit

Coined the term tidyverse at the useR! meeting in 2016
Developed and maintains most of the core tidyverse packages
25-year history written down in this document

We think the tidyverse is better, especially for beginners. It is:

Relatively recent (both an issue and an advantage)
Becoming stable
Allows doing powerful things quickly
Unified (see bookdown on tidyverse design)
Consistent, one way to do things
Gives you the strength to learn base R

Tidyverse, core packages

2022: lubridate joined the core

Tidyverse features introduced to base

Construct	Tidyverse	Base	Version
Strings kept as characters (not factors)	`tibbles`	Default	v4.0
`c(factor("a"), factor("b"))`	`[1] a b`	Was `[1] 1 1`	v4.1
Pipe	`%>%`	`|>`	v4.1
Lambda	`~ .x`	`\(x)`	v4.1
Placeholder in pipe	`.`	`_`	v4.2
Unnamed placeholder	`list(a = 1) %>% .$a`	`list(a = 1) |> _$a`	v4.3
`NULL` assignment	`rlang::%||%`	`%||%`	v4.4
Dataset	`palmerpenguins::penguins`	`penguins`	v4.5
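
A few of these base constructs in action, as a sketch that assumes R 4.2 or later (it uses the pipe, the lambda shorthand and the placeholder, but none of the 4.3+ additions). `mtcars` is a dataset shipped with R; the model itself is only there to show the placeholder.

```r
# Base pipe (R >= 4.1): the left-hand side becomes the first argument
roots <- c(1, 4, 9) |> sqrt()

# Lambda shorthand (R >= 4.1): \(x) is short for function(x)
doubled <- sapply(roots, \(x) x * 2)

# Placeholder (R >= 4.2): _ must be passed to a named argument
fit <- mtcars |> lm(mpg ~ wt, data = _)

roots
doubled
class(fit)
```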

Naming symbols for this course

= equal

. dot

, comma

~ tilde

* star (asterisk)

- hyphen

_ underscore

" double quotation marks

' single quotation marks

` backticks

# hash

| (vertical) bar

/ (forward) slash

\ backslash

Enclosures

() parentheses

[] (square) brackets

{} (curly) braces

<> chevrons

R-specific operators

<- assignment (left)

-> right assignment

|> (base) pipe

Using `library()`, ensure the function’s origin

With only `base` loaded

x <- 1:10
filter(x, rep(1, 3))

Time Series:
Start = 1 
End = 10 
Frequency = 1 
 [1] NA  6  9 12 15 18 21 24 27 NA

Conflict: 2 packages export the same function

The most recently loaded package wins

library(dplyr)
filter(x, rep(1, 3))

Error in UseMethod("filter") : 
  no applicable method for 'filter' applied to an object of class "c('integer', 'numeric')"

Solution: prefix with :: to call functions from a specific package

x <- 1:10
stats::filter(x, rep(1, 3))

Time Series:
Start = 1 
End = 10 
Frequency = 1 
 [1] NA  6  9 12 15 18 21 24 27 NA

Or use the conflicted package
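
Before reaching for `::`, it can help to check where a bare name currently resolves. A small sketch using base tools only (`utils::find()` and `environmentName()`); what `find()` reports depends on which packages are attached in your session.

```r
# Every attached package that provides something called "filter",
# in search-path order: the first entry is the one a bare call would use
providers <- utils::find("filter")
providers

# The namespace in which a given function was defined
origin <- environmentName(environment(stats::filter))
origin
```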

Data types and structures

Data types

3 main types + extras

Type	Example
character (strings)	`"tidyverse!"`
logical (boolean)	`TRUE` / `FALSE` (`T`/`F` not protected)
numeric	integer (`2L`), double (`2.34`)
date (also doubles)	`2024-03-04` (`Sys.Date()`)
datetime	`2024-03-04 09:12:24 CET` (`Sys.time()`)
complex	`2+0i`

Missing data and special cases

NA   # not available, missing data
NA_real_
NA_integer_
NA_character_
NA_complex_
NULL # empty
-Inf # negative infinity
Inf  # positive infinity
NaN # Not a Number

2L

[1] 2

typeof(2L)

[1] "integer"

2.34

[1] 2.34

typeof(2.34)

[1] "double"

"tidyverse!"

[1] "tidyverse!"

typeof(TRUE)

[1] "logical"

Sys.time()

[1] "2025-06-10 13:22:25 UTC"

2+0i

[1] 2+0i
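
The missing-data values and special cases listed earlier behave as follows. This is a sketch with a made-up `measurements` vector:

```r
# A bare NA is logical; the typed variants carry their own type
typeof(NA)
typeof(NA_real_)
typeof(NA_character_)

# Hypothetical measurements with one missing value
measurements <- c(1, NA, 3)
is.na(measurements)               # flags the missing element
mean(measurements)                # NA: missingness propagates
mean(measurements, na.rm = TRUE)  # drop NA before computing

# NULL is the empty object, NaN and Inf come from arithmetic
length(NULL)
is.nan(0 / 0)
is.infinite(1 / 0)
```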

Structures

Vectors

c() is the function to concatenate

[1] 4

c(43, 5.6, 2.90)

[1] 43.0  5.6  2.9

Factors

Convert strings to factors; the levels are the dictionary of unique values

factor(c("AA", "BB", "AA", "CC"))

[1] AA BB AA CC
Levels: AA BB CC
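
To see the "dictionary" idea, a sketch on the same strings: each element is stored as an integer code pointing into the levels. The names `genotypes` and `reordered` are illustrative.

```r
genotypes <- factor(c("AA", "BB", "AA", "CC"))

levels(genotypes)      # the dictionary, sorted alphabetically by default
as.integer(genotypes)  # the codes: position of each element in the levels

# The dictionary order can be set explicitly, which changes the codes
reordered <- factor(c("AA", "BB", "AA", "CC"), levels = c("CC", "BB", "AA"))
as.integer(reordered)
```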

Lists

Can contain any data type, including other structures.

list(f = factor(c("AA", "AA")),
     v = c(43, 5.6, 2.90),
     s = 4L)

$f
[1] AA AA
Levels: AA

$v
[1] 43.0  5.6  2.9

$s
[1] 4
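
Elements of such a list can be pulled out by name. A minimal sketch, assuming the list is stored under the hypothetical name `my_list`:

```r
my_list <- list(f = factor(c("AA", "AA")),
                v = c(43, 5.6, 2.90),
                s = 4L)

my_list$v        # the element itself: a numeric vector
my_list[["s"]]   # same idea with double brackets: the integer 4
my_list["s"]     # single brackets keep the container: a list of length 1

class(my_list[["s"]])
class(my_list["s"])
```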

Data frames are special lists

`data.frame`

Same as a list, but all elements must have the same length

data.frame(
  f = factor(c("AA", "AA", "BB")),
  v = c(43, 5.6, 2.90),
  s = rep(4, 3))

   f    v s
1 AA 43.0 4
2 AA  5.6 4
3 BB  2.9 4

Example, missing one element in `v`

data.frame(
  f = factor(c("AA", "AA", "BB")),
  v = c(43, 5.6),
  s = rep(4, 3))

Error in data.frame(f = factor(c("AA", "AA", "BB")), v = c(43, 5.6), s = rep(4, : arguments imply differing number of rows: 3, 2
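
Because a data frame is a list of equal-length columns, the usual list tools apply to it. A sketch on the valid data frame from this slide, stored under the assumed name `df`:

```r
df <- data.frame(
  f = factor(c("AA", "AA", "BB")),
  v = c(43, 5.6, 2.90),
  s = rep(4, 3))

is.list(df)        # a data frame is a list underneath
length(df)         # number of columns (list elements)
nrow(df)           # the common length of all columns
df[["v"]]          # list-style extraction of one column
sapply(df, class)  # class of each column
```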

Concatenate atomic elements

Collection of simple things

Things are the smallest elements: atomic
Must be of same mode: automatic coercion
Indexed, from 1 to length(vector)
Created with the c() function

c(2, TRUE, "a string")

[1] "2"        "TRUE"     "a string"

Tidyverse prevents those coercions

vctrs::vec_c(2, TRUE, "a string")

Error in `vctrs::vec_c()`:
! Can't combine `..1` <double> and `..3` <character>.

Manual coercion

as.character(c(2, TRUE, "a string"))

[1] "2"        "TRUE"     "a string"

as.double(c(2, TRUE, "a string"))

[1]  2 NA NA

as.double(c(2, 2.456, "a string"))

[1] 2.000 2.456    NA
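
Automatic coercion follows a fixed order, from least to most general: logical, integer, double, character. A short sketch of what `c()` produces at each step:

```r
typeof(c(TRUE, 2L))   # logical + integer   -> integer
typeof(c(2L, 2.5))    # integer + double    -> double
typeof(c(2.5, "a"))   # double + character  -> character

# A useful consequence: logicals count as 0/1 in arithmetic
sum(c(TRUE, FALSE, TRUE))
```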

Assignment

The operator is <-: it associates a name with an object. The right-assignment version -> is also valid

my_vec <- c(3, 4, 1:3)
my_vec

[1] 3 4 1 2 3

Say: the vector gets assigned the name my_vec

Tip

RStudio has the built-in shortcut Alt+- for <-

Rationale

If you don’t assign a name to a created object, it is only temporary. Assigning a name lets you save the object and re-use it in a downstream step.

Binding names to values: an object has no name

In #rstats, it's surprisingly important to realise that names have objects; objects don't have names
— Hadley Wickham (@hadleywickham) May 16, 2016

Vector binds to the name `x`

x <- 1:3

The same vector is also bound to the name `y`

y <- x

Efficient management of the memory. Names are pointers to memory addresses.
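
A sketch of what this sharing means in practice: the two names refer to the same vector until one of them is modified, at which point R copies the data (copy-on-modify), so the other name is unaffected.

```r
x <- 1:3
y <- x        # no copy yet: both names are bound to the same vector

y[1] <- 10L   # modifying through y triggers a copy

x             # still 1 2 3
y             # 10 2 3
```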

Hierarchy

is.vector(c("a", "c"))

[1] TRUE

is.vector(list(a = 1))

[1] TRUE

is.atomic(list(a = 1))

[1] FALSE

is.data.frame(list(a = 1))

[1] FALSE

Subsetting vectors

Important!

Unlike Python or Perl, R vectors use 1-based indexing!

The : operator generates integer sequences

3:10

[1]  3  4  5  6  7  8  9 10

Subset elements

Select elements from position 3 to 10:

LETTERS[3:10]

[1] "C" "D" "E" "F" "G" "H" "I" "J"

Break in sequence, use `c()`

LETTERS[c(2:5, 7)]

[1] "B" "C" "D" "E" "G"

Negative selection

LETTERS[-(2:21)]

[1] "A" "V" "W" "X" "Y" "Z"

Difference Python / R

b <- c(1, 3, 5)
b[2] <- 5
b

[1] 1 5 5

b[5] # does not exist

[1] NA

b[5] <- 6 # assign to out of bound
b

[1]  1  5  5 NA  6

R extends the vector! And uses missing values NA

>>> b=[1,3,5]
>>> b
[1, 3, 5]
>>> b[1]=5
>>> b
[1, 5, 5]
>>> b[5]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IndexError: list index out of range

Practise subsetting

LETTERS is a built-in vector of the 26 upper-case letters.

Subset LETTERS to obtain A, B, C, D, E

LETTERS[1:5]

[1] "A" "B" "C" "D" "E"

Subset LETTERS to obtain B, D and F

LETTERS[c(2, 4, 6)]

[1] "B" "D" "F"

Several solutions exist.

Remove Z from LETTERS

length(x) returns the number of items in vector x

LETTERS[-length(LETTERS)]

 [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q" "R" "S"
[20] "T" "U" "V" "W" "X" "Y"

Keep the letters at even positions (B, D, F … Z)

%% is the modulo operator: the remainder of integer division

LETTERS[1:length(LETTERS) %% 2 == 0]

 [1] "B" "D" "F" "H" "J" "L" "N" "P" "R" "T" "V" "X" "Z"
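
As a quick preview, the pieces of that expression can be inspected one at a time. The intermediate names below are illustrative; note that `:` binds more tightly than `%%`, so the positions are built first.

```r
positions  <- 1:length(LETTERS)  # 1, 2, ..., 26
remainders <- positions %% 2     # 1 for odd positions, 0 for even ones
is_even    <- remainders == 0    # logical vector: TRUE at even positions

# Subsetting with a logical vector keeps the elements where it is TRUE
even_letters <- LETTERS[is_even]
even_letters
```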

We will detail this expression later

Before we stop

You learned to:

Introduction to R
The tidyverse rationale
Data types
Data structures
Subsetting vectors

Acknowledgments 🙏 👏

Hadley Wickham
Robert Muenchen
Romain François
David Gohel
Jenny Bryan
James J. Balamuta, quarto-webr

Thank you for your attention!
