Introduction to R

with the tidyverse

Aurélien Ginolhac, DLSM

University of Luxembourg

Tuesday, the 25th of February, 2025

Learning objectives

You will learn about:

  • R specificities
    • Community
    • Package ecosystem
    • Vectorization
  • Opinionated tidyverse
  • Data types
  • Data structures

What is R really?

R is a shorthand for “GNU R”:

  • An interactive programming language derived from S (J. Chambers, Bell Labs, 1976)
  • Appeared in 1995, created by Ross Ihaka and Robert Gentleman, University of Auckland, NZ
  • v1.0.0 was released on 29 Feb. 2000 (20th birthday in 2020)
  • Focus on data analysis and plotting
  • R is also shorthand for the ecosystem around this language
    • Book authors
    • Package developers
    • Ordinary useRs

Learning to use R will make you more efficient and facilitate the use of advanced data analysis tools

Why use R?

  • It’s free and open-source!
  • Easy to install / maintain
  • Multi-platform (Windows, macOS, GNU/Linux)
  • Can process big files and analyse huge amounts of data (db tools)
  • Integrated data visualization tools, even dynamic ones (shiny)
  • Fast, and even faster with C++ integration via Rcpp or cpp11.
  • Easy to get help, welcoming community

About “R is slow”

Execution speed is easy to measure but what about development speed?

Constant trend in use for interactive stats environments

The bad news is that whenever you learn a new skill you’re going to suck. It’s going to be frustrating. The good news is that is typical and happens to everyone and it is only temporary. You can’t go from knowing nothing to becoming an expert without going through a period of great frustration and great suckiness.

Hadley Wickham

Hard to learn

Base R is complex, has a long history and many contributors

  • Unhelpful help: ?print
  • Generic methods: print.data.frame
  • Too many commands: colnames, names
  • Inconsistent names: read.csv, load, readRDS
  • Non-strict syntax: R was designed for interactive usage
  • Too many ways to select variables: df$x, df$"x", df[,"x"], df[[1]]
  • […] see r4stats’ post for the full list
  • The tidyverse curse

Navigating the balance between base R and the tidyverse is a challenge to learn

Robert A. Muenchen

Help pages

Two ways to open a manual page:

?log
help(log)

Sadly, manual pages are often unhelpful; vignettes and articles describe workflows better (for example, those on the readxl website).

In RStudio, the help page can be viewed in the bottom right pane
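Beyond ? and help(), base R ships a few other helpers for finding documentation. The lines below are a minimal sketch assuming a standard R installation; the search pattern and the topics are only illustrations, and the interactive calls are commented out.

```r
# Operators and reserved words must be quoted
# ?"%in%"
# ?"if"

# apropos() lists the objects on the search path whose name matches a pattern
read_functions <- apropos("^read\\.")
read_functions

# help.search("linear model")  # search the help pages, same as ??"linear model"
# vignette(package = "utils")  # list the vignettes shipped with a package
# example(log)                 # run the examples of a help page
```

apropos() is handy when you remember only part of a function name.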

The ambiguity [of the S language] is real and goes to a key objective: we wanted users to be able to begin in an interactive environment, where they did not consciously think of themselves as programming. Then as their needs became clearer and their sophistication increased, they should be able to slide gradually into programming, when the language and system aspects would become more important.

John Chambers, “Stages in the Evolution of S”

Tidyverse origin

Hadley Wickham is Chief Scientist at Posit

We think the tidyverse is better, especially for beginners. It is:

Tidyverse, core packages

2022: lubridate joined the core

Tidyverse features introduced to base

Construct                         Tidyverse                 Base R              Version
Strings not converted to factors  tibbles                   Default             v4.0
c(factor("a"), factor("b"))       [1] a b                   Was [1] 1 1         v4.1
Pipe                              %>%                       |>                  v4.1
Lambda                            ~ .x                      \(x)                v4.1
Placeholder in pipe               .                         _                   v4.2
Unnamed placeholder               list(a = 1) %>% .$a       list(a = 1) |> _$a  v4.3
NULL assignment                   rlang::`%||%`             %||%                v4.4
Dataset                           palmerpenguins::penguins  penguins            v4.5
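A few of the base R constructs listed above can be tried without loading any package. This is a small sketch that assumes R 4.2 or later; the object names are made up for the illustration.

```r
# Native pipe (R >= 4.1): the left-hand side becomes the first argument
total <- c(1, 4, 9) |> sqrt() |> sum()
total
# [1] 6

# Lambda shorthand (R >= 4.1): \(x) is equivalent to function(x)
squares <- sapply(1:3, \(x) x^2)
squares
# [1] 1 4 9

# Placeholder (R >= 4.2): _ must be passed to a named argument
fit <- mtcars |> lm(mpg ~ wt, data = _)
class(fit)
# [1] "lm"

# Factors are combined as factors since R 4.1
combined <- c(factor("a"), factor("b"))
levels(combined)
# [1] "a" "b"
```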

Naming symbols for this course

= equal

. dot

, comma

~ tilde

* star (asterisk)

- hyphen

_ underscore

" double quotation marks

' single quotation marks

` backticks

# hash

| (vertical) bar

/ (forward) slash

\ backslash

Enclosures

() parentheses

[] (square) brackets

{} (curly) braces

<> chevrons

R-specific operators

<- assignment (left)

-> right assignment

|> (base) pipe

Using library(): check a function’s origin

With only base R loaded

x <- 1:10
filter(x, rep(1, 3))
Time Series:
Start = 1 
End = 10 
Frequency = 1 
 [1] NA  6  9 12 15 18 21 24 27 NA

Conflict: 2 packages export the same function

The most recently loaded package wins

library(dplyr)
filter(x, rep(1, 3))
Error in UseMethod("filter") : 
  no applicable method for 'filter' applied to an object of class "c('integer', 'numeric')"

Solution: prefix with :: to call functions from a specific package

x <- 1:10
stats::filter(x, rep(1, 3))
Time Series:
Start = 1 
End = 10 
Frequency = 1 
 [1] NA  6  9 12 15 18 21 24 27 NA

Or use the conflicted package
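To check where a function comes from before calling it, base R offers a couple of helpers. The sketch below assumes a fresh session in which dplyr is not attached; the last line is left commented out because it needs the conflicted package installed.

```r
x <- 1:10

# find() lists the attached packages that provide a given name
find("filter")

# environmentName() reports the namespace a function was defined in
environmentName(environment(stats::filter))
# [1] "stats"

# An explicit prefix always calls the intended function
moving_sum <- stats::filter(x, rep(1, 3))
as.numeric(moving_sum)
#  [1] NA  6  9 12 15 18 21 24 27 NA

# With conflicted, a preference can be declared once per session:
# conflicted::conflicts_prefer(dplyr::filter)
```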

Data types and structures

Data types

3 main types + extras

Type                  Example
character (strings)   "tidyverse!"
logical (boolean)     TRUE / FALSE (T/F not protected)
numeric               integer (2L), double (2.34)
date (also doubles)   2024-03-04 (Sys.Date())
datetime              2024-03-04 09:12:24 CET (Sys.time())
complex               2+0i
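Two entries of this table deserve a quick check: dates are stored as doubles, and T / F are not protected. A minimal sketch, where the date is just an example value:

```r
# A date is stored as a double: the number of days since 1970-01-01
release <- as.Date("2000-02-29")
typeof(release)
# [1] "double"
class(release)
# [1] "Date"
unclass(release)
# [1] 11016

# TRUE and FALSE are reserved words, but T and F are ordinary variables
T <- 0
isTRUE(T)
# [1] FALSE
rm(T) # remove our variable, T refers to TRUE again
isTRUE(T)
# [1] TRUE
```

This is why writing TRUE and FALSE in full is the safer habit.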

Missing data and special cases

NA   # not available, missing data
NA_real_
NA_integer_
NA_character_
NA_complex_
NULL # empty
-Inf; Inf # infinite values (negative and positive)
NaN # Not a Number
2L
[1] 2
typeof(2L)
[1] "integer"
2.34
[1] 2.34
typeof(2.34)
[1] "double"
"tidyverse!"
[1] "tidyverse!"
typeof(TRUE)
[1] "logical"
Sys.time()
[1] "2025-06-10 13:22:25 UTC"
2+0i
[1] 2+0i
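Missing values behave differently from ordinary values, which is easier to see on a small vector. The sketch below uses a made-up vector named values.

```r
# NA is logical by default; the typed variants keep the type of a vector
typeof(NA)
# [1] "logical"
typeof(NA_character_)
# [1] "character"
typeof(c(1.5, NA))
# [1] "double"

# NA propagates through computations: test with is.na(), not with ==
values <- c(4, NA, 10)
values == NA
# [1] NA NA NA
is.na(values)
# [1] FALSE  TRUE FALSE
mean(values, na.rm = TRUE)
# [1] 7

# Special numeric values and the empty object
1 / 0
# [1] Inf
0 / 0
# [1] NaN
length(NULL)
# [1] 0
```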

Structures

Vectors

c() is the function to concatenate (combine) values

4
[1] 4
c(43, 5.6, 2.90)
[1] 43.0  5.6  2.9

Factors

Convert strings to factors; the levels are the dictionary

factor(c("AA", "BB", "AA", "CC"))
[1] AA BB AA CC
Levels: AA BB CC
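The "dictionary" idea can be made visible by looking at what a factor stores. A short sketch, with genotypes as a hypothetical name:

```r
genotypes <- factor(c("AA", "BB", "AA", "CC"))

# The dictionary: unique values, sorted alphabetically by default
levels(genotypes)
# [1] "AA" "BB" "CC"

# Each element is stored as an integer pointing to a level
as.integer(genotypes)
# [1] 1 2 1 3
typeof(genotypes)
# [1] "integer"

# Count the occurrences of each level
table(genotypes)
```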

Lists

Can contain any other data type.

list(f = factor(c("AA", "AA")),
     v = c(43, 5.6, 2.90),
     s = 4L)
$f
[1] AA AA
Levels: AA

$v
[1] 43.0  5.6  2.9

$s
[1] 4
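Once a list is created, its elements can be extracted by name or by position. A minimal sketch reusing the list above, stored under the hypothetical name my_list:

```r
my_list <- list(f = factor(c("AA", "AA")),
                v = c(43, 5.6, 2.90),
                s = 4L)

# $ and [[ ]] extract the element itself
my_list$v
# [1] 43.0  5.6  2.9
my_list[["v"]]
# [1] 43.0  5.6  2.9
my_list[[3]]
# [1] 4

# Single brackets return a (smaller) list
class(my_list["s"])
# [1] "list"
names(my_list)
# [1] "f" "v" "s"
```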

Data frames are special lists

data.frame

Same as a list, but all elements must have the same length

data.frame(
  f = factor(c("AA", "AA", "BB")),
  v = c(43, 5.6, 2.90),
  s = rep(4, 3))
   f    v s
1 AA 43.0 4
2 AA  5.6 4
3 BB  2.9 4

Example: one element is missing in v

data.frame(
  f = factor(c("AA", "AA", "BB")),
  v = c(43, 5.6),
  s = rep(4, 3))
Error in data.frame(f = factor(c("AA", "AA", "BB")), v = c(43, 5.6), s = rep(4, : arguments imply differing number of rows: 3, 2
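One exception to the same-length rule is worth knowing: data.frame() recycles a single value to fill a column. A small sketch, with scores as a hypothetical name:

```r
scores <- data.frame(
  f = factor(c("AA", "AA", "BB")),
  v = c(43, 5.6, 2.90),
  s = 4) # length 1: recycled to the 3 rows

dim(scores)
# [1] 3 3
scores$s
# [1] 4 4 4

# Every column is a vector of the same length
sapply(scores, length)
# f v s
# 3 3 3
```

The tidyverse equivalent, tibble(), also recycles only vectors of length one.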

Concatenate atomic elements

Collection of simple things

  • Things are the smallest elements: atomic
  • Must all be of the same type, otherwise automatic coercion applies
  • Indexed, from 1 to length(vector)
  • Created with the c() function
c(2, TRUE, "a string")
[1] "2"        "TRUE"     "a string"

Tidyverse prevents those coercions

vctrs::vec_c(2, TRUE, "a string")
Error in `vctrs::vec_c()`:
! Can't combine `..1` <double> and `..3` <character>.

Manual coercion

as.character(c(2, TRUE, "a string"))
[1] "2"        "TRUE"     "a string"
as.double(c(2, TRUE, "a string"))
[1]  2 NA NA
as.double(c(2, 2.456, "a string"))
[1] 2.000 2.456    NA
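Automatic coercion follows a fixed order, from the least to the most flexible type: logical, integer, double, character. A minimal sketch to verify it with typeof():

```r
typeof(c(TRUE, 2L))
# [1] "integer"
typeof(c(2L, 2.5))
# [1] "double"
typeof(c(2.5, "a string"))
# [1] "character"

# Coercing logicals to numbers is often useful: TRUE counts as 1
sum(c(TRUE, FALSE, TRUE))
# [1] 2
mean(c(TRUE, FALSE, TRUE, TRUE))
# [1] 0.75
```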

Assignment

The operator is <-: it associates a name with an object. The right-hand version -> is also valid

my_vec <- c(3, 4, 1:3)
my_vec
[1] 3 4 1 2 3

Say: the vector gets assigned the name my_vec

Tip

RStudio has the built-in shortcut Alt+- for <-

Rationale

If you don’t assign a name to a created object, it is only temporary. Assigning a name lets you save the object and re-use it in a downstream step.
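The difference between a temporary result and a named object can be sketched as follows; doubled is a hypothetical name, and the first line uses the right assignment for illustration only.

```r
# Right assignment: same result as my_vec <- c(3, 4, 1:3)
c(3, 4, 1:3) -> my_vec
my_vec
# [1] 3 4 1 2 3

# Without assignment the result is printed, then lost
my_vec * 2
# [1] 6 8 2 4 6

# With assignment the result can be re-used downstream
doubled <- my_vec * 2
sum(doubled)
# [1] 26
```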

Binding names to values: an object has no name

The vector is bound to the name x

x <- 1:3

The same vector is also bound to the name y

y <- x

Efficient management of the memory. Names are pointers to memory addresses.
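Sharing one object between two names is safe because R copies the object as soon as one of the names modifies it (copy-on-modify). A minimal sketch:

```r
x <- 1:3
y <- x # no copy yet: both names point to the same vector

# Modifying through y triggers a copy, x is left untouched
y[1] <- 10L
x
# [1] 1 2 3
y
# [1] 10  2  3
```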

Hierarchy

is.vector(c("a", "c"))
[1] TRUE
is.vector(list(a = 1))
[1] TRUE
is.atomic(list(a = 1))
[1] FALSE
is.data.frame(list(a = 1))
[1] FALSE
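The same test functions show where data frames sit in this hierarchy. A short sketch, with df as a hypothetical one-column data frame:

```r
df <- data.frame(a = 1)

# Atomic vectors hold elements of a single type
is.atomic(c("a", "c"))
# [1] TRUE

# A data frame is a list with extra attributes (class, row names)
is.list(df)
# [1] TRUE
is.data.frame(df)
# [1] TRUE

# is.vector() is FALSE as soon as attributes other than names are present
is.vector(df)
# [1] FALSE
```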

Subsetting vectors

Important!

Unlike Python or Perl, R vectors use 1-based indexing!

The : operator generates integer sequences

3:10
[1]  3  4  5  6  7  8  9 10

Subset elements

Select elements from position 3 to 10:

LETTERS[3:10]
[1] "C" "D" "E" "F" "G" "H" "I" "J"

Break in sequence, use c()

LETTERS[c(2:5, 7)]
[1] "B" "C" "D" "E" "G"

Negative selection

LETTERS[-(2:21)]
[1] "A" "V" "W" "X" "Y" "Z"
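The parentheses in -(2:21) matter because the unary minus is applied before the : operator. A small sketch of what happens without them; mixed is a hypothetical name and tryCatch() is used only to keep the script running.

```r
# Unary minus binds tighter than :
-2:2
# [1] -2 -1  0  1  2
-(2:4)
# [1] -2 -3 -4

# Mixing negative and positive positions is an error
mixed <- tryCatch(LETTERS[-2:2], error = function(e) "error")
mixed
# [1] "error"

# Several positions can be dropped with c()
LETTERS[-c(1, 26)] # everything but A and Z
```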

Difference Python / R

b <- c(1, 3, 5)
b[2] <- 5
b
[1] 1 5 5
b[5] # does not exist
[1] NA
b[5] <- 6 # assign to out of bound
b
[1]  1  5  5 NA  6

R extends the vector and fills the gap with missing values (NA)!

>>> b=[1,3,5]
>>> b
[1, 3, 5]
>>> b[1]=5
>>> b
[1, 5, 5]
>>> b[5]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IndexError: list index out of range

Practise subsetting

LETTERS is a built-in vector of the 26 upper-case letters.

Subset LETTERS to obtain A, B, C, D, E

LETTERS[1:5]
[1] "A" "B" "C" "D" "E"

Subset LETTERS to obtain B, D and F

LETTERS[c(2, 4, 6)]
[1] "B" "D" "F"

Several solutions exist.

Remove Z from LETTERS

  • length(x) returns the number of items in vector x
LETTERS[-length(LETTERS)]
 [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q" "R" "S"
[20] "T" "U" "V" "W" "X" "Y"

Keep letters at even positions (B, D, F … Z)

  • %% is the modulo operator: the remainder of an integer division
LETTERS[1:length(LETTERS) %% 2 == 0]
 [1] "B" "D" "F" "H" "J" "L" "N" "P" "R" "T" "V" "X" "Z"

We will detail this expression later
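As a short preview, the expression can be split into intermediate steps. The names positions, remainders and is_even are made up for this sketch, and the last two lines show other possible solutions.

```r
# Step 1: positions 1 to 26 (: is evaluated before %%)
positions <- 1:length(LETTERS)

# Step 2: remainder of the division by 2, element by element
remainders <- positions %% 2
head(remainders)
# [1] 1 0 1 0 1 0

# Step 3: TRUE where the position is even
is_even <- remainders == 0
head(is_even)
# [1] FALSE  TRUE FALSE  TRUE FALSE  TRUE

# Step 4: a logical vector keeps the elements where it is TRUE
LETTERS[is_even]

# Alternatives: explicit positions, or a recycled logical vector
LETTERS[seq(2, length(LETTERS), by = 2)]
LETTERS[c(FALSE, TRUE)]
```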

Before we stop

You learned about:

  • Introduction to R
  • The tidyverse rationale
  • Data types
  • Data structures
  • Subsetting vectors

Acknowledgments 🙏 👏

  • Hadley Wickham
  • Robert Muenchen
  • Romain François
  • David Gohel
  • Jenny Bryan
  • James J. Balamuta, quarto-webr

Thank you for your attention!