String manipulation

regular expressions and stringr

Aurélien Ginolhac

University of Luxembourg

Saturday, the 17th of May, 2025

Learning objectives

You will learn to:

Perform pattern matching and string manipulation:
- Detection
- Extraction
- Counting
- Sub-setting
Use stringr, which simplifies and unifies string operations
Choose between matching engines, some of which provide locale-sensitive matches

Regular expressions

Matching and substituting of strings with meta-characters
See R for data science chapter

String examples in Base R

Strings are character objects

# A character object aka "string"
my_string <- "cat"
my_string

[1] "cat"

# single quotes work, but double quotes are advised
my_other_string <- 'catastrophe'
my_other_string

[1] "catastrophe"

not_so_numeric <- as.character(3.1415)
not_so_numeric

[1] "3.1415"

Also as atomic vector

my_string_vec <- c("atg", "ttg", "tga")
my_string_vec

[1] "atg" "ttg" "tga"

Examples of `stringr` usage

Inputs

pattern <- "r"
my_words <- c( "cat", "cart",
               "carrot", "catastrophe",
               "dog", "rat", "bet")

Outputs, functions are vectorized

# indices of matches
str_which(my_words, pattern)

[1] 2 3 4 6

# only elements that match the pattern
str_subset(my_words, pattern)

[1] "cart"        "carrot"      "catastrophe" "rat"

# return a substring of each element
str_sub(my_words, 1, 3)

[1] "cat" "car" "car" "cat" "dog" "rat" "bet"

# aka Search and Replace
str_replace(my_words, pattern, "R")

[1] "cat"         "caRt"        "caRrot"      "catastRophe" "dog"        
[6] "Rat"         "bet"

Why use `stringr`?

Consistency
Less typing and looking up things
All functions in stringr start with str_
All take a vector of strings as the first argument
- (“data first”)
- Pipes work as expected
All functions properly vectorised
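
As a minimal sketch of the "data first" point, assuming stringr is installed and R is at least 4.1 (for the native pipe), the word vector from the earlier slide can flow through several str_ functions in a row:

```r
library(stringr)

my_words <- c("cat", "cart", "carrot", "catastrophe", "dog", "rat", "bet")

# Data first: the vector is the first argument of every str_ function,
# so it can be piped from one step to the next
lengths_of_r_words <- my_words |>
  str_subset("r") |>  # keeps "cart", "carrot", "catastrophe", "rat"
  str_length()        # number of characters in each kept word

lengths_of_r_words
# [1]  4  6 11  3
```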

Useful additions

Viewing matches with str_view()
Matches are highlighted in color in the console and enclosed in chevrons (< >)

str_view(my_words, pattern)

[2] │ ca<r>t
[3] │ ca<r><r>ot
[4] │ catast<r>ophe
[6] │ <r>at

Well documented

https://stringr.tidyverse.org HTML cheatsheet

Matching strings

Return logical

my_words

[1] "cat"         "cart"        "carrot"      "catastrophe" "dog"        
[6] "rat"         "bet"

str_detect(my_words, "a")

[1]  TRUE  TRUE  TRUE  TRUE FALSE  TRUE FALSE

Retrieving (only) matching strings

str_subset(my_words, "a")

[1] "cat"         "cart"        "carrot"      "catastrophe" "rat"

Inverting matches with `negate = TRUE`

str_subset(my_words, "a", negate = TRUE)

[1] "dog" "bet"

str_detect(my_words, "a", negate = TRUE)

[1] FALSE FALSE FALSE FALSE  TRUE FALSE  TRUE

How long is my string?

Length of each item in a character vector

str_length(my_words)

[1]  3  4  6 11  3  3  3

Warning

length(my_words)

[1] 7

length() returns the length of the vector, i.e. the number of items
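
Counting is listed among the learning objectives. A short sketch with str_count(), which returns the number of matches of a pattern within each string (as opposed to the number of characters or the number of items), could look like this:

```r
library(stringr)

my_words <- c("cat", "cart", "carrot", "catastrophe", "dog", "rat", "bet")

# Number of "a" found in each word, one value per element
a_counts <- str_count(my_words, "a")
a_counts
# [1] 1 1 1 2 0 1 0
```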

Elements of strings

Substrings

my_words

[1] "cat"         "cart"        "carrot"      "catastrophe" "dog"        
[6] "rat"         "bet"

str_sub(my_words, 2, 4)

[1] "at"  "art" "arr" "ata" "og"  "at"  "et"

Replace

str_replace(my_words, "a", "#")

[1] "c#t"         "c#rt"        "c#rrot"      "c#tastrophe" "dog"        
[6] "r#t"         "bet"

Check c#tastrophe!

Only the first a is replaced

`str_replace_all()`

str_replace_all(my_words, "a", "#")

[1] "c#t"         "c#rt"        "c#rrot"      "c#t#strophe" "dog"        
[6] "r#t"         "bet"
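
str_replace_all() also accepts a named vector of the form c(pattern = replacement) to perform several replacements in one call. The sketch below is only an illustration; the replacement symbols are arbitrary:

```r
library(stringr)

my_words <- c("cat", "cart", "carrot", "catastrophe", "dog", "rat", "bet")

# Each name is a pattern, each value its replacement
replaced <- str_replace_all(my_words, c("a" = "#", "t" = "+"))
replaced
# [1] "c#+"         "c#r+"        "c#rro+"      "c#+#s+rophe" "dog"
# [6] "r#+"         "be+"
```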

Splitting strings

Splitting leads to complex output

c("2013-06-26_BRAFWTNEGASSAY_Plasmid-Cellline100-1Mutant_A01.csv",
  "2013-06-26_BRAFWTNEGASSAY_Plasmid-Cellline100-1Mutant_H02.csv",
  "2013-06-26_BRAFWTNEGASSAY_Plasmid-Cellline100-1Mutant_H03.csv",
  "2013-06-27_BRAFWTNEGASSAY_FFPEDNA-CRC-1-41_A01.csv",
  "2013-06-27_BRAFWTNEGASSAY_FFPEDNA-CRC-1-41_A02.csv",
  "2013-06-27_BRAFWTNEGASSAY_FFPEDNA-CRC-1-41_A03.csv") -> filenames
str_split(filenames, "_")

[[1]]
[1] "2013-06-26"                  "BRAFWTNEGASSAY"             
[3] "Plasmid-Cellline100-1Mutant" "A01.csv"                    

[[2]]
[1] "2013-06-26"                  "BRAFWTNEGASSAY"             
[3] "Plasmid-Cellline100-1Mutant" "H02.csv"                    

[[3]]
[1] "2013-06-26"                  "BRAFWTNEGASSAY"             
[3] "Plasmid-Cellline100-1Mutant" "H03.csv"                    

[[4]]
[1] "2013-06-27"       "BRAFWTNEGASSAY"   "FFPEDNA-CRC-1-41" "A01.csv"         

[[5]]
[1] "2013-06-27"       "BRAFWTNEGASSAY"   "FFPEDNA-CRC-1-41" "A02.csv"         

[[6]]
[1] "2013-06-27"       "BRAFWTNEGASSAY"   "FFPEDNA-CRC-1-41" "A03.csv"

str_split() returns a list, which is harder to work with

Underscores delimit fields
Hyphens delimit words within fields
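
To illustrate why the list output takes extra work, here is a hedged sketch that pulls one field out of each list element with base R's vapply(), then splits a field again on hyphens. Only two of the file names are used to keep it short:

```r
library(stringr)

filenames <- c(
  "2013-06-26_BRAFWTNEGASSAY_Plasmid-Cellline100-1Mutant_A01.csv",
  "2013-06-27_BRAFWTNEGASSAY_FFPEDNA-CRC-1-41_A01.csv"
)

# One list element per file name, one character vector of fields in each
parts <- str_split(filenames, "_")

# Take the first field (the date) of every element
dates <- vapply(parts, function(p) p[1], character(1))
dates
# [1] "2013-06-26" "2013-06-27"

# Take the third field, then split it into words on hyphens
third_field <- vapply(parts, function(p) p[3], character(1))
words <- str_split(third_field, "-")
words[[2]]
# [1] "FFPEDNA" "CRC"     "1"       "41"
```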

Simplification to a matrix

str_split_fixed(filenames, "[_\\.]", 5)

     [,1]         [,2]             [,3]                          [,4]  [,5] 
[1,] "2013-06-26" "BRAFWTNEGASSAY" "Plasmid-Cellline100-1Mutant" "A01" "csv"
[2,] "2013-06-26" "BRAFWTNEGASSAY" "Plasmid-Cellline100-1Mutant" "H02" "csv"
[3,] "2013-06-26" "BRAFWTNEGASSAY" "Plasmid-Cellline100-1Mutant" "H03" "csv"
[4,] "2013-06-27" "BRAFWTNEGASSAY" "FFPEDNA-CRC-1-41"            "A01" "csv"
[5,] "2013-06-27" "BRAFWTNEGASSAY" "FFPEDNA-CRC-1-41"            "A02" "csv"
[6,] "2013-06-27" "BRAFWTNEGASSAY" "FFPEDNA-CRC-1-41"            "A03" "csv"

"[_\\.]" is a regular expression: any underscores or periods
\\ marks a literal period rather than the metacharacter that matches any character (inside [ ] the escape is optional)
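
The following sketch compares the escaped and unescaped period inside and outside a character class. The file name is hypothetical and used only for illustration:

```r
library(stringr)

# Hypothetical file name, for illustration only
x <- "2013-06-26_ASSAY_A01.csv"

# Inside [ ] the period is literal with or without the escape
with_escape    <- str_split_fixed(x, "[_\\.]", 4)
without_escape <- str_split_fixed(x, "[_.]", 4)
same_result <- identical(with_escape, without_escape)
same_result
# [1] TRUE

# Outside [ ] the unescaped period matches any character
dot_unescaped <- str_detect("abc", ".")    # TRUE, although there is no period
dot_escaped   <- str_detect("abc", "\\.")  # FALSE
```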

Splitting in a tibble context

Using delim function (with regex)

tibble(files = filenames) |>
  separate_wider_delim(
    cols = files,
    # regular expression: _ or .
    delim = regex("[_\\.]"), 
    # NA to discard extension .csv
    names = c("date", "assay", "line", "well", NA)
  )

# A tibble: 6 × 4
  date       assay          line                        well 
  <chr>      <chr>          <chr>                       <chr>
1 2013-06-26 BRAFWTNEGASSAY Plasmid-Cellline100-1Mutant A01  
2 2013-06-26 BRAFWTNEGASSAY Plasmid-Cellline100-1Mutant H02  
3 2013-06-26 BRAFWTNEGASSAY Plasmid-Cellline100-1Mutant H03  
4 2013-06-27 BRAFWTNEGASSAY FFPEDNA-CRC-1-41            A01  
5 2013-06-27 BRAFWTNEGASSAY FFPEDNA-CRC-1-41            A02  
6 2013-06-27 BRAFWTNEGASSAY FFPEDNA-CRC-1-41            A03

Using specialised regex function

tibble(files = filenames) |>
  separate_wider_regex(
    cols = files,
    # a named vector pairing column name and content via regex
    patterns = c(date = "^\\d{4}-\\d{2}-\\d{2}", # date ISO8601
                 "_",   # unnamed so discarded
                 assay = "\\w+", # word letters, at least one
                 "_",   # unnamed so discarded
                 line = "[[:alnum:]\\-]+", # numbers, letters + hyphen
                 "_",   # unnamed so discarded
                 well = "[A-Z]\\d{2}", # could be [A-H]
                 ".+$"  # anything that ends but not empty
  ))

# A tibble: 6 × 4
  date       assay          line                        well 
  <chr>      <chr>          <chr>                       <chr>
1 2013-06-26 BRAFWTNEGASSAY Plasmid-Cellline100-1Mutant A01  
2 2013-06-26 BRAFWTNEGASSAY Plasmid-Cellline100-1Mutant H02  
3 2013-06-26 BRAFWTNEGASSAY Plasmid-Cellline100-1Mutant H03  
4 2013-06-27 BRAFWTNEGASSAY FFPEDNA-CRC-1-41            A01  
5 2013-06-27 BRAFWTNEGASSAY FFPEDNA-CRC-1-41            A02  
6 2013-06-27 BRAFWTNEGASSAY FFPEDNA-CRC-1-41            A03

Getting started with regular expressions (= regex)

Higher aims

Extract particular characters, e.g. numbers only
Extract characters, e.g. at the beginning
Define the occurrences e.g. at least one digit
Express characters following / preceding patterns e.g. followed by _ and a letter
Express character case (upper/lower)
Matching any character
Not matching particular characters

Translation as regular expressions

Describe a range of digits: [0-9]
Anchors: ^[0-9]
Occurrence: ^[0-9]+
Followed by: ^[0-9]+_[a-z]
Case: ^[0-9]+_[a-zA-Z]
Any character is .: ^[0-9]+_[a-zA-Z].
Not matching characters (not followed by a lowercase a, b or c):
- ^[0-9]+_[a-zA-Z].[^abc]

str_view(c("3_a#d", "45_AAA", "45_AAa"), 
         pattern = "^[0-9]+_[a-zA-Z].[^abc]")

[1] │ <3_a#d>
[2] │ <45_AAA>
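
One way to see what each translation step adds is to test every intermediate pattern against the same strings. This is a sketch; the fourth test string and the step names are made up for illustration:

```r
library(stringr)

# "x1_a" is an extra, hypothetical string that does not start with a digit
x <- c("3_a#d", "45_AAA", "45_AAa", "x1_a")

# The patterns from the list above, in order (names are illustrative)
steps <- c(
  digit     = "[0-9]",
  anchored  = "^[0-9]",
  lower     = "^[0-9]+_[a-z]",
  any_case  = "^[0-9]+_[a-zA-Z]",
  not_abc   = "^[0-9]+_[a-zA-Z].[^abc]"
)

# One row per string, one column per pattern
hits <- sapply(steps, function(p) str_detect(x, p))
rownames(hits) <- x
hits
```

Each column shows which strings still match once the pattern becomes more specific.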

Flexible matching through metacharacters

Metacharacters

Symbols matching groups of characters

. (dot) represents any character except for the newline character (\n)

str_view(my_words, ".at")

[1] │ <cat>
[4] │ <cat>astrophe
[6] │ <rat>

. (dot) on its own matches exactly one character

str_subset(my_words, "c..t")

[1] "cart"

+ (plus) represents one or more occurrences

str_subset(my_words, "c.r+")

[1] "cart"   "carrot"

* (star) represents zero or more occurrences

str_subset(my_words, "c.r*")

[1] "cat"         "cart"        "carrot"      "catastrophe"

Quantifying a number of matches

Quantifiers

The preceding item will be matched …

? at most once (zero or one time).
* zero or more times.
+ one or more times.
{n} exactly ‘n’ times.
{n,} ‘n’ or more times.
{n,m} at least ‘n’ times, but not more than ‘m’ times.

Examples

?

dna <- "ATGGTAACCGGTAGGTAGTAAAGGTCCC"
str_view(dna, "AA?")

[1] │ <A>TGGT<AA>CCGGT<A>GGT<A>GT<AA><A>GGTCCC

+

str_view(dna, "AA+")

[1] │ ATGGT<AA>CCGGTAGGTAGT<AAA>GGTCCC

{3,}

str_view(dna, "A{3,}")

[1] │ ATGGTAACCGGTAGGTAGT<AAA>GGTCCC
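
The {n} and {n,m} forms are not shown above. A small sketch on the same sequence, using str_extract_all() to list every match:

```r
library(stringr)

dna <- "ATGGTAACCGGTAGGTAGTAAAGGTCCC"

# {2}: exactly two As per match; the run of three As gives one match of two
exactly_two <- str_extract_all(dna, "A{2}")[[1]]
exactly_two
# [1] "AA" "AA"

# {2,3}: two or three As, taking as many as possible (greedy)
two_to_three <- str_extract_all(dna, "A{2,3}")[[1]]
two_to_three
# [1] "AA"  "AAA"

# Number of runs of one or more As
n_runs <- str_count(dna, "A+")
n_runs
# [1] 5
```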

Anchors

`^` Start of string

my_words

[1] "cat"         "cart"        "carrot"      "catastrophe" "dog"        
[6] "rat"         "bet"

str_subset(my_words , "^c")

[1] "cat"         "cart"        "carrot"      "catastrophe"

Means:

c (lower case) is the first character

`$` End of string

str_subset(my_words, "r.$")

[1] "cart"

Means:

second-to-last character is r
last character is any character
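
Both anchors can be combined to constrain the whole string. A minimal sketch with the same word vector:

```r
library(stringr)

my_words <- c("cat", "cart", "carrot", "catastrophe", "dog", "rat", "bet")

# Starts with c, ends with t, anything (or nothing) in between
c_to_t <- str_subset(my_words, "^c.*t$")
c_to_t
# [1] "cat"    "cart"   "carrot"

# Exactly three characters from start to end
three_chars <- str_subset(my_words, "^.{3}$")
three_chars
# [1] "cat" "dog" "rat" "bet"
```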

Escaping: to match metacharacters

Selecting only the strings that contain a full stop

We saw special characters such as ., +, * or $ having special meaning in regular expressions.

vec2 <- c("YKL045W-A", "12+45=57", "$1200.00", "ID2.2")

str_subset(vec2 , ".")

[1] "YKL045W-A" "12+45=57"  "$1200.00"  "ID2.2"

Not what we wanted!

Use an escape character

str_subset(vec2, "\.")

Error: '\.' is an unrecognized escape in character string (<input>:1:20)

Error: we don’t want either!

Need two backslashes: the first one escapes the second inside the R string!

str_subset(vec2, "\\.")

[1] "$1200.00" "ID2.2"

Implicit conversion

stringr wraps pattern strings as regular expressions (via regex()) without any explicit action from the user. R first reads the string literal "\\." as the two characters \. and the regex engine then interprets \. as a literal period.
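
The two levels of interpretation can be inspected directly. The sketch below assumes R 4.0 or later for the raw string syntax; fixed() is the stringr helper that switches off regex interpretation:

```r
library(stringr)

vec2 <- c("YKL045W-A", "12+45=57", "$1200.00", "ID2.2")

# The R string "\\." holds two characters: a backslash and a period
n_chars <- nchar("\\.")
n_chars
# [1] 2
writeLines("\\.")
# \.

# Raw strings (R >= 4.0) avoid the doubled backslash
same_string <- identical("\\.", r"(\.)")
same_string
# [1] TRUE

# fixed() matches the pattern literally, so no escaping is needed
with_regex <- str_subset(vec2, "\\.")
with_fixed <- str_subset(vec2, fixed("."))
with_fixed
# [1] "$1200.00" "ID2.2"
```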

Character classes

Special characters

Pattern	Matches	Complement	Complement matches
`\\d`	Digit	`\\D`	Non-digit
`\\s`	Whitespace	`\\S`	Non-whitespace
`\\w`	Word character (letter, digit or underscore)	`\\W`	Non-word character
`\\b`	Word boundary	`\\B`	Not a word boundary

Examples

uniprot <- c("Q6QU88_CALBL", "CO1A2_HUMAN", 
             "SAMH1_HUMAN", "CRC4_MYO5B", 
             "NPRL2_DROME", "GLUC_HUMAN")

Only the first digit encountered

str_extract(uniprot, "\\d")

[1] "6" "1" "1" "4" "2" NA

Only first TWO consecutive NON digit characters

str_extract(uniprot, "\\D{2}")

[1] "QU" "CO" "SA" "CR" "NP" "GL"

str_view(uniprot, "\\D{2}")

[1] │ Q6<QU>88<_C><AL><BL>
[2] │ <CO>1A2<_H><UM><AN>
[3] │ <SA><MH>1<_H><UM><AN>
[4] │ <CR>C4<_M><YO>5B
[5] │ <NP><RL>2<_D><RO><ME>
[6] │ <GL><UC><_H><UM><AN>
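
The examples above cover \\d and \\D only. Here is a hedged sketch of \\b, \\w and \\s; the sentence and the spaced string are made up for illustration:

```r
library(stringr)

# Hypothetical sentence, for illustration only
x <- "the cat sat on a catalog"

# "cat" occurs twice as a substring, but only once as a whole word
n_substring  <- str_count(x, "cat")
n_whole_word <- str_count(x, "\\bcat\\b")
c(n_substring, n_whole_word)
# [1] 2 1

# \\w+ extracts runs of word characters, i.e. the words
words <- str_extract_all(x, "\\w+")[[1]]
words
# [1] "the"     "cat"     "sat"     "on"      "a"       "catalog"

# \\s+ matches runs of whitespace (spaces, tabs), here collapsed to one space
squished <- str_replace_all("a  b\tc", "\\s+", " ")
squished
# [1] "a b c"
```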

Grouping using parentheses and backreferences

Vertical bar for OR

str_extract(my_words, "(at)|(og)")

[1] "at" NA   NA   "at" "og" "at" NA

str_view(my_words, "(at)|(og)")

[1] │ c<at>
[4] │ c<at>astrophe
[5] │ d<og>
[6] │ r<at>

Backreferences

\1, \2 etc. refer to groups matched with ().

Here, swap the two substrings separated by the underscore:

uniprot

[1] "Q6QU88_CALBL" "CO1A2_HUMAN"  "SAMH1_HUMAN"  "CRC4_MYO5B"   "NPRL2_DROME" 
[6] "GLUC_HUMAN"

str_replace(uniprot, "(\\S+)_(\\S+)", "\\2: \\1")

[1] "CALBL: Q6QU88" "HUMAN: CO1A2"  "HUMAN: SAMH1"  "MYO5B: CRC4"  
[5] "DROME: NPRL2"  "HUMAN: GLUC"
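
Groups can also be retrieved as separate columns with str_match(), and a backreference can be used inside the pattern itself. The sketch below uses two of the UniProt identifiers and the word vector from earlier slides:

```r
library(stringr)

uniprot <- c("Q6QU88_CALBL", "CO1A2_HUMAN")
my_words <- c("cat", "cart", "carrot", "catastrophe", "dog", "rat", "bet")

# str_match() returns a matrix: full match first, then one column per group
m <- str_match(uniprot, "(\\S+)_(\\S+)")
m[, 2]
# [1] "Q6QU88" "CO1A2"
m[, 3]
# [1] "CALBL" "HUMAN"

# Inside a pattern, \\1 repeats whatever group 1 matched:
# any character followed by the same character again
doubled <- str_subset(my_words, "(.)\\1")
doubled
# [1] "carrot"
```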

Helpers

`regexplain`

Interactive addin for RStudio by Garrick Aden-Buie

Test regular expressions on the fly
Reference library
Cheatsheet
Test it live

remotes::install_github("gadenbuie/regexplain")

regexplain::regexplain_gadget()

Before we stop

You learned:

What a regular expression is
How to manipulate strings with stringr

Further reading

stringi – General implementation of regular expressions
stringr – Wrapper for vectorisation and convenience functions
Strings in R for Data Science
How to Name Things

Acknowledgments 🙏 👏

Jennifer Bryan
Charlotte Wickham
Hadley Wickham
Marek Gagolewski (Author of stringi)

Contributions

Roland Krause

Thank you for your attention!
