String manipulation

regular expressions and stringr

Aurélien Ginolhac

University of Luxembourg

Saturday, the 17th of May, 2025

Learning objectives

You will learn to:

  • Perform pattern matching and string manipulation:
    • Detection
    • Extraction
    • Counting
    • Sub-setting
  • Use stringr, which simplifies and unifies string operations
  • Recognise that different matching engines exist, some providing locale-sensitive matches

Regular expressions

  • Matching and substituting of strings with meta-characters
  • See the relevant chapter of R for Data Science

String examples in Base R

Strings are character objects

# A character object aka "string"
my_string <- "cat"
my_string
[1] "cat"
# single quotes work, but double quotes are advised
my_other_string <- 'catastrophe'
my_other_string
[1] "catastrophe"
not_so_numeric <- as.character(3.1415)
not_so_numeric
[1] "3.1415"

Also as an atomic vector

my_string_vec <- c("atg", "ttg", "tga")
my_string_vec
[1] "atg" "ttg" "tga"

Examples of stringr usage

Inputs

pattern <- "r"
my_words <- c( "cat", "cart",
               "carrot", "catastrophe",
               "dog", "rat", "bet")

Outputs, functions are vectorized

# indices of matches
str_which(my_words, pattern)
[1] 2 3 4 6
# only elements that match the pattern
str_subset(my_words, pattern)
[1] "cart"        "carrot"      "catastrophe" "rat"        
# return a substring of each element
str_sub(my_words, 1, 3)
[1] "cat" "car" "car" "cat" "dog" "rat" "bet"
# aka Search and Replace
str_replace(my_words, pattern, "R")
[1] "cat"         "caRt"        "caRrot"      "catastRophe" "dog"        
[6] "Rat"         "bet"        

Why use stringr?

  • Consistency

  • Less typing and looking up things

  • All functions in stringr start with str_

  • All take a vector of strings as the first argument

    • (“data first”)
    • Pipes work as expected
  • All functions properly vectorised
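A minimal sketch of the "data first" point, reusing the my_words vector from the earlier slide. Because every str_ function takes the character vector as its first argument, calls chain with the native pipe (R 4.1 or later is assumed). The name r_words_upper is only illustrative.

```r
library(stringr)

my_words <- c("cat", "cart", "carrot", "catastrophe", "dog", "rat", "bet")

# Keep the words containing "r", then convert them to upper case.
# Each step receives the character vector as its first argument.
r_words_upper <- my_words |>
  str_subset("r") |>
  str_to_upper()

r_words_upper
```

The same chain written with nested calls, str_to_upper(str_subset(my_words, "r")), has to be read inside out.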

Useful additions

Viewing matches: highlighted in color, with each match enclosed in chevrons (< >)

str_view(my_words, pattern)
[2] │ ca<r>t
[3] │ ca<r><r>ot
[4] │ catast<r>ophe
[6] │ <r>at

Matching strings

Return logical

my_words
[1] "cat"         "cart"        "carrot"      "catastrophe" "dog"        
[6] "rat"         "bet"        
str_detect(my_words, "a")
[1]  TRUE  TRUE  TRUE  TRUE FALSE  TRUE FALSE

Retrieving (only) matching strings

str_subset(my_words, "a")
[1] "cat"         "cart"        "carrot"      "catastrophe" "rat"        

Inverting the match with negate = TRUE

str_subset(my_words, "a", negate = TRUE)
[1] "dog" "bet"
str_detect(my_words, "a", negate = TRUE)
[1] FALSE FALSE FALSE FALSE  TRUE FALSE  TRUE
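Since detection returns a logical vector, counting follows directly. The sketch below assumes the same my_words vector; the variable names are illustrative. It also shows str_count(), which counts matches inside each string rather than the number of matching strings.

```r
library(stringr)

my_words <- c("cat", "cart", "carrot", "catastrophe", "dog", "rat", "bet")

has_a <- str_detect(my_words, "a")

# TRUE counts as 1, so sum() gives the number of words containing "a"
n_with_a <- sum(has_a)

# Number of "a" inside each word
a_per_word <- str_count(my_words, "a")

# negate = TRUE is equivalent to inverting the logical vector
same_as_negate <- identical(!has_a, str_detect(my_words, "a", negate = TRUE))

n_with_a
a_per_word
same_as_negate
```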

How long is my string?

Length of each item in a character vector

str_length(my_words)
[1]  3  4  6 11  3  3  3

Warning

length(my_words)
[1] 7

length() is the length of the vector, i.e. the number of items
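To make the difference concrete, here is a small sketch contrasting the two functions on the same vector. The variable names are illustrative only.

```r
library(stringr)

my_words <- c("cat", "cart", "carrot", "catastrophe", "dog", "rat", "bet")

# Number of items in the vector
n_items <- length(my_words)

# Number of characters in each item, then summed over the vector
chars_per_item <- str_length(my_words)
total_chars <- sum(chars_per_item)

# A missing value stays missing
len_na <- str_length(NA)

n_items
total_chars
len_na
```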

Elements of strings

Substrings

my_words
[1] "cat"         "cart"        "carrot"      "catastrophe" "dog"        
[6] "rat"         "bet"        
str_sub(my_words, 2, 4)
[1] "at"  "art" "arr" "ata" "og"  "at"  "et" 
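str_sub() also accepts negative positions, which count from the end of each string. A short sketch with the same vector (variable names are illustrative):

```r
library(stringr)

my_words <- c("cat", "cart", "carrot", "catastrophe", "dog", "rat", "bet")

# Last three characters of each word: -1 is the last character
last_three <- str_sub(my_words, -3, -1)

# From the second character to the end (end defaults to -1)
from_second <- str_sub(my_words, 2)

last_three
from_second
```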

Replace

str_replace(my_words, "a", "#")
[1] "c#t"         "c#rt"        "c#rrot"      "c#tastrophe" "dog"        
[6] "r#t"         "bet"        

Check c#tastrophe!

Only the first a is replaced

str_replace_all()

str_replace_all(my_words, "a", "#")
[1] "c#t"         "c#rt"        "c#rrot"      "c#t#strophe" "dog"        
[6] "r#t"         "bet"        
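str_replace_all() can also perform several replacements in one call when given a named character vector, where names are patterns and values are replacements. A sketch with an arbitrary second replacement (o to 0) chosen only for illustration:

```r
library(stringr)

my_words <- c("cat", "cart", "carrot", "catastrophe", "dog", "rat", "bet")

# Names are the patterns, values are the replacements
masked <- str_replace_all(my_words, c("a" = "#", "o" = "0"))

masked
```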

Splitting strings

Splitting leads to complex output

filenames <- c(
  "2013-06-26_BRAFWTNEGASSAY_Plasmid-Cellline100-1Mutant_A01.csv",
  "2013-06-26_BRAFWTNEGASSAY_Plasmid-Cellline100-1Mutant_H02.csv",
  "2013-06-26_BRAFWTNEGASSAY_Plasmid-Cellline100-1Mutant_H03.csv",
  "2013-06-27_BRAFWTNEGASSAY_FFPEDNA-CRC-1-41_A01.csv",
  "2013-06-27_BRAFWTNEGASSAY_FFPEDNA-CRC-1-41_A02.csv",
  "2013-06-27_BRAFWTNEGASSAY_FFPEDNA-CRC-1-41_A03.csv"
)
str_split(filenames, "_")
[[1]]
[1] "2013-06-26"                  "BRAFWTNEGASSAY"             
[3] "Plasmid-Cellline100-1Mutant" "A01.csv"                    

[[2]]
[1] "2013-06-26"                  "BRAFWTNEGASSAY"             
[3] "Plasmid-Cellline100-1Mutant" "H02.csv"                    

[[3]]
[1] "2013-06-26"                  "BRAFWTNEGASSAY"             
[3] "Plasmid-Cellline100-1Mutant" "H03.csv"                    

[[4]]
[1] "2013-06-27"       "BRAFWTNEGASSAY"   "FFPEDNA-CRC-1-41" "A01.csv"         

[[5]]
[1] "2013-06-27"       "BRAFWTNEGASSAY"   "FFPEDNA-CRC-1-41" "A02.csv"         

[[6]]
[1] "2013-06-27"       "BRAFWTNEGASSAY"   "FFPEDNA-CRC-1-41" "A03.csv"         

str_split() returns a list, which is harder to work with

  • Underscores delimit fields
  • Hyphens delimit words within fields
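One possible way to work with the list output, sketched on two of the file names above: pull a given field out of every element with vapply(), then split a field further on hyphens. The helper variable names are illustrative.

```r
library(stringr)

two_files <- c(
  "2013-06-26_BRAFWTNEGASSAY_Plasmid-Cellline100-1Mutant_A01.csv",
  "2013-06-27_BRAFWTNEGASSAY_FFPEDNA-CRC-1-41_A02.csv"
)

# One list element per file name, one vector entry per field
pieces <- str_split(two_files, "_")

# First field (date) and fourth field (well, extension removed)
dates <- vapply(pieces, function(x) x[1], character(1))
wells <- str_remove(vapply(pieces, function(x) x[4], character(1)), "\\.csv$")

# Words within the third field of the second file
line_words <- str_split(pieces[[2]][3], "-")[[1]]

dates
wells
line_words
```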

Simplification to a matrix

str_split_fixed(filenames, "[_\\.]", 5)
     [,1]         [,2]             [,3]                          [,4]  [,5] 
[1,] "2013-06-26" "BRAFWTNEGASSAY" "Plasmid-Cellline100-1Mutant" "A01" "csv"
[2,] "2013-06-26" "BRAFWTNEGASSAY" "Plasmid-Cellline100-1Mutant" "H02" "csv"
[3,] "2013-06-26" "BRAFWTNEGASSAY" "Plasmid-Cellline100-1Mutant" "H03" "csv"
[4,] "2013-06-27" "BRAFWTNEGASSAY" "FFPEDNA-CRC-1-41"            "A01" "csv"
[5,] "2013-06-27" "BRAFWTNEGASSAY" "FFPEDNA-CRC-1-41"            "A02" "csv"
[6,] "2013-06-27" "BRAFWTNEGASSAY" "FFPEDNA-CRC-1-41"            "A03" "csv"
  • "[_\\.]" is a regular expression: any underscore or period
  • \\ marks the period as a literal character, not the metacharacter meaning any character

Splitting in a tibble context

Using the delimiter function (with a regex)

tibble(files = filenames) |>
  separate_wider_delim(
    cols = files,
    # regular expression: _ or .
    delim = regex("[_\\.]"), 
    # NA to discard extension .csv
    names = c("date", "assay", "line", "well", NA)
  )
# A tibble: 6 × 4
  date       assay          line                        well 
  <chr>      <chr>          <chr>                       <chr>
1 2013-06-26 BRAFWTNEGASSAY Plasmid-Cellline100-1Mutant A01  
2 2013-06-26 BRAFWTNEGASSAY Plasmid-Cellline100-1Mutant H02  
3 2013-06-26 BRAFWTNEGASSAY Plasmid-Cellline100-1Mutant H03  
4 2013-06-27 BRAFWTNEGASSAY FFPEDNA-CRC-1-41            A01  
5 2013-06-27 BRAFWTNEGASSAY FFPEDNA-CRC-1-41            A02  
6 2013-06-27 BRAFWTNEGASSAY FFPEDNA-CRC-1-41            A03  

Using the specialised regex function

tibble(files = filenames) |>
  separate_wider_regex(
    cols = files,
    # a named vector pairing column name and content via regex
    patterns = c(date = "^\\d{4}-\\d{2}-\\d{2}", # date ISO8601
                 "_",   # unnamed so discarded
                 assay = "\\w+", # word letters, at least one
                 "_",   # unnamed so discarded
                 line = "[[:alnum:]\\-]+", # numbers, letters + hyphen
                 "_",   # unnamed so discarded
                 well = "[A-Z]\\d{2}", # could be [A-H]
                 ".+$"  # rest of the string (at least one character), unnamed so discarded
  ))
# A tibble: 6 × 4
  date       assay          line                        well 
  <chr>      <chr>          <chr>                       <chr>
1 2013-06-26 BRAFWTNEGASSAY Plasmid-Cellline100-1Mutant A01  
2 2013-06-26 BRAFWTNEGASSAY Plasmid-Cellline100-1Mutant H02  
3 2013-06-26 BRAFWTNEGASSAY Plasmid-Cellline100-1Mutant H03  
4 2013-06-27 BRAFWTNEGASSAY FFPEDNA-CRC-1-41            A01  
5 2013-06-27 BRAFWTNEGASSAY FFPEDNA-CRC-1-41            A02  
6 2013-06-27 BRAFWTNEGASSAY FFPEDNA-CRC-1-41            A03  

Getting started with regular expressions (= regex)

Higher aims

  • Extract particular characters, e.g. numbers only
  • Extract characters, e.g. at the beginning
  • Define the occurrences e.g. at least one digit
  • Express characters following / preceding patterns e.g. followed by _ and a letter
  • Express character case (upper/lower)
  • Matching any character
  • Not matching particular characters

Translation as regular expressions

  • Describe a range of digits: [0-9]
  • Anchors: ^[0-9]
  • Occurrence: ^[0-9]+
  • Followed by: ^[0-9]+_[a-z]
  • Case: ^[0-9]+_[a-zA-Z]
  • Any character is .: ^[0-9]+_[a-zA-Z].
  • Not matching characters (not followed by lowercase a, b or c):
    • ^[0-9]+_[a-zA-Z].[^abc]
str_view(c("3_a#d", "45_AAA", "45_AAa"), 
         pattern = "^[0-9]+_[a-zA-Z].[^abc]")
[1] │ <3_a#d>
[2] │ <45_AAA>
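The build-up above can be followed step by step. This sketch applies each intermediate pattern to the same three strings and records what str_extract() returns; the object names steps and results are illustrative.

```r
library(stringr)

x <- c("3_a#d", "45_AAA", "45_AAa")

steps <- c(
  "[0-9]",                   # one digit anywhere
  "^[0-9]+",                 # one or more digits at the start
  "^[0-9]+_[a-z]",           # then underscore and a lowercase letter
  "^[0-9]+_[a-zA-Z]",        # letter of either case
  "^[0-9]+_[a-zA-Z].",       # then any character
  "^[0-9]+_[a-zA-Z].[^abc]"  # then anything except a, b or c
)

# One result vector per pattern; NA means no match
results <- lapply(steps, function(p) str_extract(x, p))
names(results) <- steps

results
```

Only the lowercase-only step rejects the two strings starting with 45_A, and only the final step rejects 45_AAa.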

Flexible matching through metacharacters

Metacharacters

Symbols matching groups of characters

. (dot) represents any character except for the newline character (\n)

str_view(my_words, ".at")
[1] │ <cat>
[4] │ <cat>astrophe
[6] │ <rat>

. (dot) on its own matches exactly one occurrence

str_subset(my_words, "c..t")
[1] "cart"

+ (plus) represents one or more occurrences

str_subset(my_words, "c.r+")
[1] "cart"   "carrot"

* (star) represents zero or more occurrences

str_subset(my_words, "c.r*")
[1] "cat"         "cart"        "carrot"      "catastrophe"

Quantifying a number of matches

Quantifiers

The preceding item will be matched …

  • ? at most once.
  • * zero or more times.
  • + one or more times.
  • {n} exactly ‘n’ times.
  • {n,} ‘n’ or more times.
  • {n,m} at least ‘n’ times, but not more than ‘m’ times.

Examples

?

dna <- "ATGGTAACCGGTAGGTAGTAAAGGTCCC"
str_view(dna, "AA?")
[1] │ <A>TGGT<AA>CCGGT<A>GGT<A>GT<AA><A>GGTCCC

+

str_view(dna, "AA+")
[1] │ ATGGT<AA>CCGGTAGGTAGT<AAA>GGTCCC

{3,}

str_view(dna, "A{3,}")
[1] │ ATGGTAACCGGTAGGTAGT<AAA>GGTCCC
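The bounded forms {n} and {n,m} are not shown above, so here is a brief sketch on the same dna string. str_extract_all() returns a list with one element per input string, hence the [[1]]. Variable names are illustrative.

```r
library(stringr)

dna <- "ATGGTAACCGGTAGGTAGTAAAGGTCCC"

# Runs of two or three A (quantifiers are greedy: AAA is taken whole)
runs_2_to_3 <- str_extract_all(dna, "A{2,3}")[[1]]

# Exactly two A: the run of three contributes only its first two
exactly_two <- str_extract_all(dna, "A{2}")[[1]]

# Total number of A in the sequence
n_a <- str_count(dna, "A")

runs_2_to_3
exactly_two
n_a
```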

Anchors

^ Start of string

my_words
[1] "cat"         "cart"        "carrot"      "catastrophe" "dog"        
[6] "rat"         "bet"        
str_subset(my_words , "^c")
[1] "cat"         "cart"        "carrot"      "catastrophe"

Means:

  • c (lower case) is the first character

$ End of string

str_subset(my_words, "r.$")
[1] "cart"

Means:

  • the second-to-last character is r
  • last character is any character
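Both anchors can be combined to constrain the whole string. A minimal sketch with the same my_words vector (variable names are illustrative):

```r
library(stringr)

my_words <- c("cat", "cart", "carrot", "catastrophe", "dog", "rat", "bet")

# Starts with c, ends with t, anything in between
c_to_t <- str_subset(my_words, "^c.*t$")

# Exactly three characters from start to end
three_letters <- str_subset(my_words, "^.{3}$")

c_to_t
three_letters
```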

Escaping: to match metacharacters

Selecting only strings that contain a full stop

We saw that characters such as ., +, * or $ have a special meaning in regular expressions.

vec2 <- c("YKL045W-A", "12+45=57", "$1200.00", "ID2.2")

str_subset(vec2 , ".")
[1] "YKL045W-A" "12+45=57"  "$1200.00"  "ID2.2"    

Not what we wanted!

Use an escape character

str_subset(vec2, "\.")
Error: '\.' is an unrecognized escape in character string (<input>:1:20)

An error: we don’t want that either!

Need to escape with \ twice!

str_subset(vec2, "\\.")
[1] "$1200.00" "ID2.2"   

Implicit conversion

stringr wraps patterns supplied as strings into regular expressions without explicit intervention by the user. R first parses the string literal, turning \\ into a single backslash, so the regex engine receives \. and matches a literal period.
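The two layers of interpretation can be checked directly, and there are alternatives to the double backslash. The sketch below assumes R 4.0 or later for the raw string syntax r"(...)"; variable names are illustrative.

```r
library(stringr)

vec2 <- c("YKL045W-A", "12+45=57", "$1200.00", "ID2.2")

# The string "\\." holds two characters: a backslash and a period
n_chars <- nchar("\\.")
writeLines("\\.")

# Alternative 1: fixed() matches the text literally, no regex involved
with_fixed <- str_subset(vec2, fixed("."))

# Alternative 2: inside a character class the period is literal
with_class <- str_subset(vec2, "[.]")

# Alternative 3: a raw string avoids doubling the backslash
with_raw <- str_subset(vec2, r"(\.)")

with_fixed
```

All three alternatives select the same strings as "\\.".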

Character classes

Special characters

Pattern  Matches          Complement  Matches
\\d      Digit            \\D         No digit
\\s      Whitespace       \\S         No whitespace
\\w      Word characters  \\W         No word character
\\b      Word boundaries  \\B         Within words

Examples

uniprot <- c("Q6QU88_CALBL", "CO1A2_HUMAN", 
             "SAMH1_HUMAN", "CRC4_MYO5B", 
             "NPRL2_DROME", "GLUC_HUMAN")
  • Only the first digit encountered
str_extract(uniprot, "\\d") 
[1] "6" "1" "1" "4" "2" NA 
  • Only first TWO consecutive NON digit characters
str_extract(uniprot, "\\D{2}")
[1] "QU" "CO" "SA" "CR" "NP" "GL"
str_view(uniprot, "\\D{2}")
[1] │ Q6<QU>88<_C><AL><BL>
[2] │ <CO>1A2<_H><UM><AN>
[3] │ <SA><MH>1<_H><UM><AN>
[4] │ <CR>C4<_M><YO>5B
[5] │ <NP><RL>2<_D><RO><ME>
[6] │ <GL><UC><_H><UM><AN>
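The examples above cover \\d and \\D. As a sketch for the remaining classes, the inputs below are made up for illustration, apart from the UniProt name taken from the vector above.

```r
library(stringr)

# \\b: match cat only as a whole word
x <- c("cat", "concatenate", "a cat sat")
whole_word <- str_detect(x, "\\bcat\\b")

# \\s: split on any run of whitespace (spaces or tabs)
codons <- str_split("atg  ttg\ttga", "\\s+")[[1]]

# \\w: letters, digits and underscore are all word characters
word_run <- str_extract("CO1A2_HUMAN", "\\w+")

whole_word
codons
word_run
```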

Grouping using parentheses and backreferences

Vertical bar for OR

str_extract(my_words, "(at)|(og)")
[1] "at" NA   NA   "at" "og" "at" NA  
str_view(my_words, "(at)|(og)")
[1] │ c<at>
[4] │ c<at>astrophe
[5] │ d<og>
[6] │ r<at>

Backreferences

\1, \2 etc. refer to groups matched with (); in an R string they are written \\1, \\2.

Here, swap the two substrings separated by the underscore:

uniprot
[1] "Q6QU88_CALBL" "CO1A2_HUMAN"  "SAMH1_HUMAN"  "CRC4_MYO5B"   "NPRL2_DROME" 
[6] "GLUC_HUMAN"  
str_replace(uniprot, "(\\S+)_(\\S+)", "\\2: \\1")
[1] "CALBL: Q6QU88" "HUMAN: CO1A2"  "HUMAN: SAMH1"  "MYO5B: CRC4"  
[5] "DROME: NPRL2"  "HUMAN: GLUC"  
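Two related uses of groups, sketched under the same setup: a backreference inside the pattern itself finds a repeated character, and str_match() returns the captured groups as matrix columns. Variable names are illustrative.

```r
library(stringr)

my_words <- c("cat", "cart", "carrot", "catastrophe", "dog", "rat", "bet")

# (.)\\1 means: any character immediately followed by itself
doubled <- str_subset(my_words, "(.)\\1")

# Column 1 is the full match, columns 2 and 3 are the two groups
parts <- str_match("CO1A2_HUMAN", "(\\S+)_(\\S+)")

doubled
parts
```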

Helpers

regexplain

Interactive addin for RStudio by Garrick Aden-Buie

  • Test regular expressions on the fly
  • Reference library
  • Cheatsheet
  • Test it live
remotes::install_github("gadenbuie/regexplain")

regexplain::regexplain_gadget()

Before we stop

You learned:

  • What a regular expression is
  • How to manipulate strings with stringr

Further reading

Acknowledgments 🙏 👏

  • Jennifer Bryan
  • Charlotte Wickham
  • Hadley Wickham
  • Marek Gagolewski (Author of stringi)

Contributions

  • Roland Krause

Thank you for your attention!