Data visualisation

ggplot2

Author

Aurelien Ginolhac

Published

May 8, 2025

Objective

This practical walks you through exploratory plots, built layer by layer, so that you become familiar with the grammar of graphics. You will also practice sorting and collapsing factor levels, which is essential for displaying categorical variables with ggplot2.

Those questions are optional

Scatter plots of penguins

The penguins dataset is provided by the palmerpenguins R package.

If not done already, install the package `palmerpenguins` and load it.

Warning once installed

Either use the chunk option `#| eval: false` or comment out the installation line with a leading `#`; otherwise the package gets reinstalled each time you render the Quarto document.

Plot the body mass on the y axis and the bill length on the x axis.

Solution

penguins |>
  ggplot(aes(x = bill_length_mm, 
             y = body_mass_g)) +
  geom_point()

Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_point()`).

Plot the body mass on the y axis and the bill length on the x axis again, this time coloured by `species`.

Solution

penguins |>
  ggplot(aes(x = bill_length_mm, 
             y = body_mass_g, 
             colour = species)) +
  geom_point()

Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_point()`).

The `geom_smooth()` layer can be used to add a trend line. Try to overlay it on your scatter plot.

Tip

By default, `geom_smooth()` uses a loess regression (for fewer than 1,000 points) and adds standard error intervals.

The `method` argument can be used to change the regression to a linear one: `method = "lm"`.
To disable the ribbon of standard errors, set `se = FALSE`.

Be careful where the aesthetics are located, so that the linear trend lines are also coloured per species.
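To see why the location of the colour aesthetic matters, here is a minimal sketch on a made-up data frame (`toy`, with hypothetical columns `x`, `y` and `grp`; it is not the penguins data). It compares a plot where `colour` is set in `ggplot(aes())` with one where it is set only in `geom_point(aes())`, and counts the resulting trend lines with `layer_data()`. The helper `count_trend_lines()` is introduced here for illustration only.

```r
library(ggplot2)

# Toy data (an assumption for illustration): 3 groups with different slopes
set.seed(1)
toy <- data.frame(
  x   = rep(1:10, times = 3),
  grp = rep(c("a", "b", "c"), each = 10)
)
slopes <- c(a = 1, b = 2, c = 3)
toy$y <- toy$x * unname(slopes[toy$grp]) + rnorm(nrow(toy))

# colour defined globally: inherited by both geom_point() and geom_smooth()
p_global <- ggplot(toy, aes(x = x, y = y, colour = grp)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, formula = y ~ x)

# colour defined only for the points: geom_smooth() sees a single group
p_local <- ggplot(toy, aes(x = x, y = y)) +
  geom_point(aes(colour = grp)) +
  geom_smooth(method = "lm", se = FALSE, formula = y ~ x)

# Hypothetical helper: number of groups drawn by the second layer (geom_smooth)
count_trend_lines <- function(p) length(unique(layer_data(p, 2)$group))
count_trend_lines(p_global)
count_trend_lines(p_local)
```

The first call should report one trend line per group and the second a single line for all points, which is the difference to watch for in your own plot.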

Solution

penguins |>
  ggplot(aes(x = bill_length_mm, 
             y = body_mass_g, 
             colour = species)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)

`geom_smooth()` using formula = 'y ~ x'

Warning: Removed 2 rows containing non-finite outside the scale range
(`stat_smooth()`).

Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_point()`).

Adjust the aesthetics of the points so that:

The shape is mapped to the island of origin
The size is fixed to 3
The transparency is 40%

Tip

You should still have only 3 coloured linear trend lines. Otherwise, check which layer you are adding the shape aesthetic to. Remember that fixed parameters are defined outside `aes()`.
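The difference between a mapped aesthetic and a fixed parameter can be sketched on a small made-up data frame (`toy`, with a hypothetical `site` column standing in for the island). Here `shape` is mapped inside `aes()` of the point layer only, while `size` and `alpha` are constants set outside `aes()`; `layer_data()` shows what the layer actually draws.

```r
library(ggplot2)

# Toy data (an assumption for illustration)
toy <- data.frame(
  x    = 1:4,
  y    = c(2, 4, 3, 5),
  site = c("a", "a", "b", "b")
)

p_fixed <- ggplot(toy, aes(x = x, y = y)) +
  # shape depends on the data, size and alpha are the same for every point
  geom_point(aes(shape = site), size = 3, alpha = 0.6)

# Inspect the values computed for the point layer
drawn <- layer_data(p_fixed, 1)
unique(drawn$size)           # one constant size
unique(drawn$alpha)          # one constant alpha
length(unique(drawn$shape))  # one shape per site
```

Writing `aes(size = 3)` instead would treat 3 as data to be scaled and would add a size legend, which is usually not what is wanted for a constant.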

Solution

penguins |>
  ggplot(aes(x = bill_length_mm, y = body_mass_g, 
             colour = species)) +
  geom_point(aes(shape = island), size = 3, alpha = 0.6) +
  geom_smooth(method = "lm", se = FALSE)

`geom_smooth()` using formula = 'y ~ x'

Warning: Removed 2 rows containing non-finite outside the scale range
(`stat_smooth()`).

Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_point()`).

Adjust the colour aesthetic in the `ggplot()` call so that it propagates to both the points and the regression lines.

Try the viridis colour scale for discrete data (`scale_colour_viridis_d()`)
Try changing the default theme to `theme_bw()`

Solution

penguins |>
  ggplot(aes(x = bill_length_mm, y = body_mass_g, 
             colour = species)) +
  geom_point(aes(shape = island), size = 3, alpha = 0.6) +
  geom_smooth(method = "lm", se = FALSE, formula = "y ~ x") +
  scale_colour_viridis_d() +
  theme_bw()

Warning: Removed 2 rows containing non-finite outside the scale range
(`stat_smooth()`).

Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_point()`).

Reproduce the following plot:

Solution

penguins |>
  # avoid the missing / non finite values
  drop_na(bill_length_mm, body_mass_g) |> 
  ggplot(aes(x = bill_length_mm, y = body_mass_g, 
             colour = species)) +
  geom_point(aes(shape = island), size = 3, alpha = 0.6) +
  geom_smooth(method = "lm", se = FALSE, formula = "y ~ x") +
  scale_colour_viridis_d() +
  theme_bw(14) +
  theme(plot.caption.position = "plot",
        plot.caption = element_text(face = "italic"),
        plot.subtitle = element_text(size = 11)) +
  scale_y_continuous(labels = scales::comma) +
  labs(title = "Penguin bills and body mass",
       caption = "Horst AM, Hill AP, Gorman KB (2020)",
       subtitle = "Dimensions for male/female Adelie, Chinstrap and Gentoo Penguins\nat Palmer Station LTER",
       x = "Bill length (mm)",
       y = "Body mass (g)",
       color = "Penguin species")

Tip

Remember that:

All aesthetics defined in the `ggplot(aes())` call are inherited by all following layers.
`aes()` of individual geoms are specific to that layer (and override the global definition if present).
`labs()` controls the plot annotations (title, subtitle, caption, axis and legend titles).
`theme()` tweaks the plot appearance; for example, `theme(plot.caption = element_text(face = "italic"))` renders the caption in italic.

Categorical data

We are going to use a dataset from the TidyTuesday initiative. Several datasets on the theme of deforestation were released in April 2021; we will focus on the CSV file called `brazil_loss.csv`. The dataset columns are described in the accompanying README, and the CSV is directly available at https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-04-06/brazil_loss.csv

Load the `brazil_loss.csv` file, remove the first 2 columns (`entity` and `code`, since the data are all for Brazil) and assign the result to the name `brazil_loss`.

Tip

Set the data type of `year` to character (`col_character()`).

Solution

brazil_url <- "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-04-06/brazil_loss.csv"
brazil_loss <- read_csv(brazil_url, 
                        col_select = -c(entity, code),
                        col_types = cols(year = col_character()))
brazil_loss

# A tibble: 13 × 12
   year  commercial_crops flooding_due_to_dams natural_disturbances pasture
   <chr>            <dbl>                <dbl>                <dbl>   <dbl>
 1 2001            280000                    0                    0 1520000
 2 2002            415000                79000                35000 2568000
 3 2003            550000                    0                35000 2761000
 4 2004            747000                26000                22000 2564000
 5 2005            328000                17000                26000 2665000
 6 2006            188000                17000                26000 1861000
 7 2007             79000                 9000                22000 1577000
 8 2008             52000                    0                17000 1345000
 9 2009             57000                 9000                31000  847000
10 2010            100000                    0                44000  616000
11 2011             52000                17000                87000  738000
12 2012            118000                17000                52000  546000
13 2013             87000                    0                13000  695000
# ℹ 7 more variables: selective_logging <dbl>, fire <dbl>, mining <dbl>,
#   other_infrastructure <dbl>, roads <dbl>,
#   tree_plantations_including_palm <dbl>, small_scale_clearing <dbl>

Is this dataset tidy?

Solution

No, the reasons for deforestation are in the wide format. Columns `commercial_crops` to `small_scale_clearing` should be in the long format: one column for the reason and one for the area.
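As a minimal sketch of what going from wide to long means, the following uses a made-up table (`toy_wide`, with hypothetical columns `crops`, `fire` and `roads`; the values are invented). Each reason column becomes a row, so 2 years times 3 reasons give 6 rows.

```r
library(tidyr)

# Toy wide table (an assumption for illustration): one column per reason
toy_wide <- data.frame(
  year  = c("2001", "2002"),
  crops = c(10, 20),
  fire  = c(1, 2),
  roads = c(5, 0)
)

# Gather the reason columns into a name column and a value column
toy_long <- pivot_longer(toy_wide,
                         cols = crops:roads,
                         names_to = "reasons",
                         values_to = "area_ha")
toy_long
dim(toy_long)
```

In the long table, every row is one observation (a year, a reason and an area), which is the shape ggplot2 expects when a variable is mapped to `fill` or `colour`.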

Pivot the deforestation reasons (columns `commercial_crops` to `small_scale_clearing`) to the long format. Values are areas in hectares (`area_ha` is a good column name). Save as `brazil_loss_long`

Solution

pivot_longer(brazil_loss,
             cols = commercial_crops:small_scale_clearing,
             names_to = "reasons",
             values_to = "area_ha") -> brazil_loss_long
brazil_loss_long

# A tibble: 143 × 3
   year  reasons                         area_ha
   <chr> <chr>                             <dbl>
 1 2001  commercial_crops                 280000
 2 2001  flooding_due_to_dams                  0
 3 2001  natural_disturbances                  0
 4 2001  pasture                         1520000
 5 2001  selective_logging                 96000
 6 2001  fire                              26000
 7 2001  mining                             9000
 8 2001  other_infrastructure               9000
 9 2001  roads                             13000
10 2001  tree_plantations_including_palm   44000
# ℹ 133 more rows

Plot the deforestation areas per year as bars

Tip

`year` needs to be categorical. If you didn’t read this column as character, you can convert it with `factor()`.
`geom_col()` requires 2 aesthetics:
- `x` should be categorical / discrete here (see first item)
- `y` must be continuous
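The two points above can be sketched on a made-up long table (`toy_long`, with invented values): a numeric `year` is converted with `factor()`, and `geom_col()` stacks all rows that share the same `x`, so each bar reaches the yearly total.

```r
library(ggplot2)

# Toy long table (an assumption for illustration); year is numeric on purpose
toy_long <- data.frame(
  year    = c(2001, 2001, 2001, 2002, 2002, 2002),
  area_ha = c(10, 1, 5, 20, 2, 0)
)

# Convert year to a categorical variable
toy_long$year <- factor(toy_long$year)
levels(toy_long$year)

p <- ggplot(toy_long, aes(x = year, y = area_ha)) +
  geom_col()

# Height of the tallest stacked bar, here the 2002 total
tallest_bar <- max(layer_data(p, 1)$ymax)
tallest_bar
```

Because the rows are stacked, there is no need to summarise the areas per year before plotting.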

Solution

brazil_loss_long |>
  ggplot(aes(x = year, y = area_ha)) +
  geom_col()

Same plot as above, but with the bars filled by the reasons for deforestation

Solution

brazil_loss_long |>
  ggplot(aes(x = year, y = area_ha, fill = reasons)) +
  geom_col()

Even though there are too many categories, we can compare the share of `natural_disturbances` with the reasons induced by humans.

Lump the reasons for deforestation, keeping only the top 5 reasons and lumping the rest as “Other”

Tip

Use the function `fct_lump_n()` for this operation. Be careful to weight the categories with the appropriate continuous variable.
The legend title for the fill colours can be renamed, or suppressed if the plot title is explicit.
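Why the weights matter can be sketched with two made-up vectors (`reasons` and `area`, with invented values). Every reason appears the same number of times, as in a table with one row per year and reason, so counting rows cannot rank them; weighting by the area can.

```r
library(forcats)

# Toy vectors (an assumption for illustration): each reason appears twice
reasons <- c("pasture", "fire", "roads", "mining",
             "pasture", "fire", "roads", "mining")
area    <- c(100, 30, 5, 1,
             120, 20, 4, 2)

# Without weights all levels are tied, so nothing is lumped
unweighted <- fct_lump_n(reasons, n = 2)
levels(unweighted)

# With weights, the 2 reasons with the largest total area are kept
weighted <- fct_lump_n(reasons, n = 2, w = area)
levels(weighted)
```

With the weights, `pasture` and `fire` are kept and the remaining reasons are collapsed into `Other`, which is placed last.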

Solution

brazil_loss_long |>
  ggplot(aes(x = year, y = area_ha, 
             fill = fct_lump_n(reasons, n = 5, w = area_ha))) +
  geom_col() +
  labs(title = "The 5 main reasons for deforestation in Brazil",
       fill = NULL)

Optimize the previous plot by sorting the main deforestation reasons

Tip

Since forcats v1.0.0, `fct_infreq()` has a weight argument (`w`).

You can play with the `ordered` argument to get a viridis colour scale (the ggplot2 default for ordered factors).
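A minimal sketch of the weighted sorting, on made-up vectors (`reasons` and `area`, with invented values) and assuming forcats 1.0.0 or later: the levels are ordered by decreasing total area, and `ordered = TRUE` returns an ordered factor.

```r
library(forcats)

# Toy vectors (an assumption for illustration)
reasons <- c("pasture", "fire", "roads", "mining",
             "pasture", "fire", "roads", "mining")
area    <- c(100, 30, 5, 1,
             120, 20, 4, 2)

# Sort the levels by decreasing weighted frequency (total area per reason)
sorted <- fct_infreq(reasons, w = area, ordered = TRUE)
levels(sorted)
is.ordered(sorted)
```

The level order drives the stacking order of the bars and the order of the legend entries.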

Solution

brazil_loss_long |>
  ggplot(aes(x = year, y = area_ha, 
             fill = fct_infreq(reasons, w = area_ha, ordered = TRUE))) +
  geom_col() +
  labs(title = "Sorted reasons for deforestation in Brazil",
       fill = NULL)

Optimize the previous plot by sorting the 5 main deforestation reasons

Tip

One solution is to extract the top 5 reasons using dplyr statements.

Then use this vector to recode `reasons`: keep the reason name when it is part of the top 5, or use "other" if not. `fct_reorder(reasons2, area_ha)` then does the reordering. You might want to use `fct_rev()` so that the legend is sorted from top to bottom.
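The reordering step can be sketched on made-up vectors (`reasons2` and `area`, with invented values). By default `fct_reorder()` summarises the second variable with the median; the sketch below passes `.fun = sum` instead, which is an assumption made here to sort explicitly by total area and is not what the tip prescribes.

```r
library(forcats)

# Toy vectors (an assumption for illustration), already recoded with "other"
reasons2 <- c("pasture", "fire", "other",
              "pasture", "fire", "other")
area     <- c(100, 30, 6,
              120, 20, 3)

# Levels sorted by increasing total area
ascending <- fct_reorder(reasons2, area, .fun = sum)
levels(ascending)

# Reverse so that the largest reason comes first in the legend
descending <- fct_rev(ascending)
levels(descending)
```

On these toy values the median and the sum give the same order; they can differ when a category has a few very large values.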

Solution

brazil_loss_long |>
  summarise(sum = sum(area_ha), .by = reasons) |>
  arrange(desc(sum)) |>
  slice_head(n = 5) |>
  pull(reasons) -> top_5_reasons
top_5_reasons

[1] "pasture"              "small_scale_clearing" "commercial_crops"    
[4] "fire"                 "selective_logging"

brazil_loss_long |>
  mutate(reasons2 = if_else(reasons %in% top_5_reasons, reasons, "other")) |>
  ggplot(aes(x = year, y = area_ha, 
             fill = fct_reorder(reasons2, area_ha) |> fct_rev())) +
  geom_col() +
  scale_fill_brewer(type = "qual", palette = "Set1") +
  labs(title = "The 5 main reasons for deforestation in Brazil",
       fill = NULL)
