Data visualisation

ggplot2

Author

Aurelien Ginolhac

Published

May 8, 2025

Objective

This practical aims to build exploratory plots layer by layer so that you become familiar with the grammar of graphics. You will also practice sorting and collapsing the levels of factors, which is essential for displaying categorical variables with ggplot2.

Those questions are optional

Scatter plots of penguins

The penguins dataset is provided by the palmerpenguins R package.

If not done already, install the package palmerpenguins and load it.
Warning once installed

Either use the chunk option #| eval: false or comment out the installation with a leading #; otherwise the package gets installed each time you render the Quarto document.
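A third option is to guard the installation so it only runs when the package is missing. This is a minimal sketch, not part of the original solution; the repository URL is an assumption (the CRAN cloud mirror) and can be replaced by any mirror you prefer.

```r
# Install palmerpenguins only when it is missing, so that rendering
# the Quarto document does not trigger a new installation each time.
# Assumption: the CRAN cloud mirror is used as the repository.
if (!requireNamespace("palmerpenguins", quietly = TRUE)) {
  install.packages("palmerpenguins", repos = "https://cloud.r-project.org")
}

library(palmerpenguins)  # provides the penguins data frame
library(ggplot2)         # plotting functions
```

With this guard the chunk can stay evaluated, since install.packages() is skipped once the package is available.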

Plot the body mass on the y axis and the bill length on the x axis.
penguins |>
  ggplot(aes(x = bill_length_mm, 
             y = body_mass_g)) +
  geom_point()
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_point()`).

Plot the body mass on the y axis and the bill length on the x axis again, this time coloured by species.
penguins |>
  ggplot(aes(x = bill_length_mm, 
             y = body_mass_g, 
             colour = species)) +
  geom_point()
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_point()`).

The geom_smooth() layer can be used to add a trend line. Try to overlay it on your scatter plot.
Tip

By default, geom_smooth() fits a loess regression (when there are fewer than 1,000 points) and draws a ribbon for the standard error of the fit.

  • The method argument changes the regression, for example to a linear one: method = "lm"
  • To disable the standard error ribbon, set se = FALSE

Be careful where the aesthetics are placed, so that the linear trend lines are also coloured per species.

penguins |>
  ggplot(aes(x = bill_length_mm, 
             y = body_mass_g, 
             colour = species)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)
`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 2 rows containing non-finite outside the scale range
(`stat_smooth()`).
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_point()`).
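To see why the location of the colour aesthetic matters, the sketch below builds the plot twice and counts how many trend lines the smooth layer computes. It uses layer_data() from ggplot2 to inspect the second layer; the object names (peng, p_global, p_local) are made up for this illustration.

```r
library(ggplot2)
library(palmerpenguins)

# Drop incomplete rows first so that no warnings are emitted
peng <- subset(penguins, !is.na(bill_length_mm) & !is.na(body_mass_g))

# colour mapped in ggplot(): inherited by the points and the trend lines
p_global <- ggplot(peng, aes(x = bill_length_mm, y = body_mass_g,
                             colour = species)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, formula = y ~ x)

# colour mapped in geom_point() only: the smooth layer sees no grouping
p_local <- ggplot(peng, aes(x = bill_length_mm, y = body_mass_g)) +
  geom_point(aes(colour = species)) +
  geom_smooth(method = "lm", se = FALSE, formula = y ~ x)

# layer_data(plot, 2) returns the data computed for the second layer
n_lines_global <- length(unique(layer_data(p_global, 2)$group))
n_lines_local  <- length(unique(layer_data(p_local, 2)$group))
c(global = n_lines_global, local = n_lines_local)
```

With the mapping in ggplot(), one regression is fitted per species (three groups); with the mapping in geom_point() only, a single regression is fitted across all penguins.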

Adjust the aesthetics of the points so that:
  • The shape is mapped to the island of origin
  • The size is fixed at 3
  • The transparency is 40% (alpha = 0.6)
Tip

You should still have only 3 coloured linear trend lines. Otherwise, check to which layer you are adding the shape aesthetic. Remember that fixed parameters are defined outside aes().

penguins |>
  ggplot(aes(x = bill_length_mm, y = body_mass_g, 
             colour = species)) +
  geom_point(aes(shape = island), size = 3, alpha = 0.6) +
  geom_smooth(method = "lm", se = FALSE)
`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 2 rows containing non-finite outside the scale range
(`stat_smooth()`).
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_point()`).
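The difference between a fixed parameter and a mapped aesthetic can be checked on a tiny example. This sketch uses a made-up data frame (df) and inspects the colours actually drawn with layer_data().

```r
library(ggplot2)

# Small made-up data frame, for illustration only
df <- data.frame(x = 1:3, y = c(2, 4, 3))

# Fixed parameter: outside aes(), applied as is to every point
p_fixed <- ggplot(df, aes(x, y)) +
  geom_point(colour = "blue", size = 3)

# Same value inside aes(): treated as a one-level variable,
# sent through the colour scale and given a legend
p_mapped <- ggplot(df, aes(x, y)) +
  geom_point(aes(colour = "blue"), size = 3)

colour_fixed  <- unique(layer_data(p_fixed)$colour)
colour_mapped <- unique(layer_data(p_mapped)$colour)
colour_fixed
colour_mapped
```

Inside aes(), the string "blue" is handled as data, so the points get the first colour of the default discrete palette rather than blue.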

Make sure the colour aesthetic is set in the ggplot() call so that it propagates to both the points and the regression lines.
  • Try the scale colour viridis for discrete scale (scale_colour_viridis_d())
  • Try to change the default theme to theme_bw()
penguins |>
  ggplot(aes(x = bill_length_mm, y = body_mass_g, 
             colour = species)) +
  geom_point(aes(shape = island), size = 3, alpha = 0.6) +
  geom_smooth(method = "lm", se = FALSE, formula = "y ~ x") +
  scale_colour_viridis_d() +
  theme_bw()
Warning: Removed 2 rows containing non-finite outside the scale range
(`stat_smooth()`).
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_point()`).

Reproduce the following plot:

penguins |>
  # avoid the missing / non finite values
  drop_na(bill_length_mm, body_mass_g) |> 
  ggplot(aes(x = bill_length_mm, y = body_mass_g, 
             colour = species)) +
  geom_point(aes(shape = island), size = 3, alpha = 0.6) +
  geom_smooth(method = "lm", se = FALSE, formula = "y ~ x") +
  scale_colour_viridis_d() +
  theme_bw(14) +
  theme(plot.caption.position = "plot",
        plot.caption = element_text(face = "italic"),
        plot.subtitle = element_text(size = 11)) +
  scale_y_continuous(labels = scales::comma) +
  labs(title = "Penguin bills and body mass",
       caption = "Horst AM, Hill AP, Gorman KB (2020)",
       subtitle = "Dimensions for male/female Adelie, Chinstrap and Gentoo Penguins\nat Palmer Station LTER",
       x = "Bill length (mm)",
       y = "Body mass (g)",
       color = "Penguin species")

Tip

Remember that:

  • All aesthetics defined in the ggplot(aes()) command will be inherited by all following layers
  • The aes() of an individual geom applies to that layer only (and overrides the global definition if present)
  • labs() controls the plot annotations
  • theme() allows you to tweak the plot; for example, theme(plot.caption = element_text(face = "italic")) renders the caption in italic
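As an illustration of a layer overriding the global definition, the sketch below removes the inherited colour mapping in the smooth layer only, which yields one overall trend line instead of one per species. This variation is not one of the exercises; the object names are made up.

```r
library(ggplot2)
library(palmerpenguins)

peng <- subset(penguins, !is.na(bill_length_mm) & !is.na(body_mass_g))

p_override <- ggplot(peng, aes(x = bill_length_mm, y = body_mass_g,
                               colour = species)) +
  # inherits colour = species from ggplot()
  geom_point(alpha = 0.6) +
  # aes(colour = NULL) removes the inherited mapping for this layer only,
  # so a single black trend line is fitted across all species
  geom_smooth(aes(colour = NULL), colour = "black",
              method = "lm", se = FALSE, formula = y ~ x) +
  labs(title = "One overall trend line",
       colour = "Penguin species") +
  theme_bw()

n_point_colours <- length(unique(layer_data(p_override, 1)$colour))
n_trend_lines   <- length(unique(layer_data(p_override, 2)$group))
```

The points keep their three species colours, while the smooth layer no longer splits the data by species.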

Categorical data

We are going to use a dataset from the TidyTuesday initiative. Several datasets on the theme of deforestation were released in April 2021; we will focus on the CSV file called brazil_loss.csv. The dataset columns are described in the linked README, and the CSV file is directly available at the URL used in the code below.

Load the brazil_loss.csv file, remove the first two columns (entity and code, since all rows are for Brazil) and assign the result to the name brazil_loss.
Tip

Set the data type of year to character with col_character()

brazil_url <- "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-04-06/brazil_loss.csv"
brazil_loss <- read_csv(brazil_url, 
                        col_select = -c(entity, code),
                        col_types = cols(year = col_character()))
brazil_loss
# A tibble: 13 × 12
   year  commercial_crops flooding_due_to_dams natural_disturbances pasture
   <chr>            <dbl>                <dbl>                <dbl>   <dbl>
 1 2001            280000                    0                    0 1520000
 2 2002            415000                79000                35000 2568000
 3 2003            550000                    0                35000 2761000
 4 2004            747000                26000                22000 2564000
 5 2005            328000                17000                26000 2665000
 6 2006            188000                17000                26000 1861000
 7 2007             79000                 9000                22000 1577000
 8 2008             52000                    0                17000 1345000
 9 2009             57000                 9000                31000  847000
10 2010            100000                    0                44000  616000
11 2011             52000                17000                87000  738000
12 2012            118000                17000                52000  546000
13 2013             87000                    0                13000  695000
# ℹ 7 more variables: selective_logging <dbl>, fire <dbl>, mining <dbl>,
#   other_infrastructure <dbl>, roads <dbl>,
#   tree_plantations_including_palm <dbl>, small_scale_clearing <dbl>
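The same read_csv() arguments can be tried without a network connection on a few lines of made-up CSV content. This sketch assumes readr 2.0 or later, where I() marks a string as literal data and col_select is available; toy_csv and toy_loss are hypothetical names.

```r
library(readr)

# Made-up CSV content with the same leading columns as brazil_loss.csv;
# I() tells read_csv() to treat the string as literal data, not as a path
toy_csv <- I("entity,code,year,pasture,fire\nBrazil,BRA,2001,10,1\nBrazil,BRA,2002,20,2\n")

toy_loss <- read_csv(toy_csv,
                     # drop the two constant columns
                     col_select = -c(entity, code),
                     # read year as text, guess the other columns
                     col_types = cols(year = col_character()))
toy_loss
```

The result keeps year as a character column and the remaining columns as numbers, mirroring the structure of brazil_loss.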
Is this dataset tidy?

No, the reasons for deforestation are in the wide format. Columns commercial_crops to small_scale_clearing should be in the long format.

Pivot the deforestation reasons (columns commercial_crops to small_scale_clearing) to the long format. Values are areas in hectares (area_ha is a good column name). Save as brazil_loss_long
pivot_longer(brazil_loss,
             cols = commercial_crops:small_scale_clearing,
             names_to = "reasons",
             values_to = "area_ha") -> brazil_loss_long
brazil_loss_long
# A tibble: 143 × 3
   year  reasons                         area_ha
   <chr> <chr>                             <dbl>
 1 2001  commercial_crops                 280000
 2 2001  flooding_due_to_dams                  0
 3 2001  natural_disturbances                  0
 4 2001  pasture                         1520000
 5 2001  selective_logging                 96000
 6 2001  fire                              26000
 7 2001  mining                             9000
 8 2001  other_infrastructure               9000
 9 2001  roads                             13000
10 2001  tree_plantations_including_palm   44000
# ℹ 133 more rows
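The row count follows from the pivot: 13 years times 11 reason columns gives 143 rows. The same arithmetic can be checked on a small made-up table (toy_wide is hypothetical, with 2 years and 3 reasons).

```r
library(tidyr)
library(tibble)

# Made-up wide table: one row per year, one column per reason
toy_wide <- tibble(
  year    = c("2001", "2002"),
  pasture = c(10, 20),
  fire    = c(1, 2),
  roads   = c(0, 3)
)

# Same call as for brazil_loss, on the toy columns
toy_long <- pivot_longer(toy_wide,
                         cols = pasture:roads,
                         names_to = "reasons",
                         values_to = "area_ha")
toy_long
dim(toy_long)
```

Here 2 years times 3 reasons give 6 rows, and the three reason columns are replaced by the reasons and area_ha columns.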
Plot the deforestation areas per year as bars
Tip
  • year needs to be categorical data. If you didn’t read this column as character, you can convert it with factor()
  • geom_col() requires 2 aesthetics
    • x must be categorical / discrete (see first item)
    • y must be continuous
brazil_loss_long |>
  ggplot(aes(x = year, y = area_ha)) +
  geom_col()

Same as the plot above, but with the bars filled by the reasons for deforestation
brazil_loss_long |>
  ggplot(aes(x = year, y = area_ha, fill = reasons)) +
  geom_col()

Even though there are too many categories, we can compare the amount of natural_disturbances with the reasons induced by humans.

Lump the reasons for deforestation, keeping only the top 5 reasons and lumping the rest as “Other”
Tip
  • Use the function fct_lump_n() for this operation. Be careful to weight the categories with the appropriate continuous variable.
  • The legend title of the fill colours can be renamed, or suppressed if the plot title is explicit
brazil_loss_long |>
  ggplot(aes(x = year, y = area_ha, 
             fill = fct_lump_n(reasons, n = 5, w = area_ha))) +
  geom_col() +
  labs(title = "The 5 main reasons for deforestation in Brazil",
       fill = NULL)
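The weighting matters because every reason appears the same number of times in brazil_loss_long (once per year), so counting rows cannot rank them. The sketch below shows the effect on made-up values; the vectors are hypothetical.

```r
library(forcats)

# Made-up example: each reason appears once, with very different areas
reasons <- factor(c("pasture", "mining", "fire", "roads"))
area_ha <- c(100, 1, 50, 2)

# Unweighted: all levels are tied on their counts, nothing is lumped
lumped_unweighted <- fct_lump_n(reasons, n = 2)

# Weighted by area: the 2 largest reasons are kept, the rest become "Other"
lumped_weighted <- fct_lump_n(reasons, n = 2, w = area_ha)

levels(lumped_unweighted)
levels(lumped_weighted)
```

Without weights all four levels are kept; with w = area_ha only fire and pasture remain, followed by Other.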

Optimize the previous plot by sorting the main deforestation reasons
Tip

Since forcats v1.0.0, fct_infreq() has a weight argument, w.

You can play with the ordered argument: ggplot2 uses a viridis colour scale by default for ordered factors.

brazil_loss_long |>
  ggplot(aes(x = year, y = area_ha, 
             fill = fct_infreq(reasons, w = area_ha, ordered = TRUE))) +
  geom_col() +
  labs(title = "Sorted reasons for deforestation in Brazil",
       fill = NULL)
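The effect of the w and ordered arguments can be checked on a small made-up factor. This sketch assumes forcats 1.0.0 or later; the values are hypothetical.

```r
library(forcats)

# Made-up example: every reason appears twice, areas differ
reasons <- factor(c("fire", "pasture", "roads", "fire", "pasture", "roads"))
area_ha <- c(50, 100, 2, 30, 120, 3)

# Levels sorted by decreasing summed weight instead of by row counts
by_area <- fct_infreq(reasons, w = area_ha, ordered = TRUE)

levels(by_area)
is.ordered(by_area)
```

The levels are sorted by total area (pasture 220, fire 80, roads 5), and the result is an ordered factor.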

Optimize the previous plot by sorting the 5 main deforestation reasons
Tip

One solution is to extract the top 5 reasons using dplyr statements.

Then use this vector to recode the reasons: keep the reason name when it is part of the top 5, and use "other" otherwise. fct_reorder(reasons2, area_ha) then reorders the levels by area. You might want to use fct_rev() to have the sorting from top to bottom in the legend.

brazil_loss_long |>
  summarise(sum = sum(area_ha), .by = reasons) |>
  arrange(desc(sum)) |>
  slice_head(n = 5) |>
  pull(reasons) -> top_5_reasons
top_5_reasons
[1] "pasture"              "small_scale_clearing" "commercial_crops"    
[4] "fire"                 "selective_logging"   
brazil_loss_long |>
  mutate(reasons2 = if_else(reasons %in% top_5_reasons, reasons, "other")) |>
  ggplot(aes(x = year, y = area_ha, 
             fill = fct_reorder(reasons2, area_ha) |> fct_rev())) +
  geom_col() +
  scale_fill_brewer(type = "qual", palette = "Set1") +
  labs(title = "The 5 main reasons for deforestation in Brazil",
       fill = NULL)
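Note that fct_reorder() summarises the second variable with the median by default. If the intent is to order by total area, as in the top_5_reasons computation, passing .fun = sum is one option. This is a suggestion rather than part of the original solution, and the values below are made up to show a case where the two orderings differ.

```r
library(forcats)

# Made-up example: "fire" has one large value, "roads" is constant
reasons2 <- c("fire", "fire", "fire", "roads", "roads", "roads")
area_ha  <- c(1, 1, 10, 3, 3, 3)

# Default: ordered by median (fire = 1, roads = 3), then reversed
by_median <- fct_rev(fct_reorder(reasons2, area_ha))

# Alternative: ordered by sum (fire = 12, roads = 9), then reversed
by_sum <- fct_rev(fct_reorder(reasons2, area_ha, .fun = sum))

levels(by_median)
levels(by_sum)
```

With the median, roads comes first; with the sum, fire comes first. In the plot, this would translate to fct_reorder(reasons2, area_ha, .fun = sum) inside the fill aesthetic.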