penguins |>
  ggplot(aes(x = bill_length_mm,
             y = body_mass_g)) +
  geom_point()
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_point()`).
ggplot2
Aurelien Ginolhac
May 8, 2025
This practical aims at producing exploratory plots and building them layer by layer, so that you become familiar with the grammar of graphics. You will also practise sorting and collapsing the levels of factors, which is essential for displaying categorical variables in ggplot2.
The penguins dataset is provided by the palmerpenguins R package.
Install palmerpenguins and load it. Either use the chunk option #| eval: false or comment out the installation with a leading #, otherwise the package gets installed each time you render the Quarto document.
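A minimal sketch of such a setup chunk, using the commenting approach. It assumes the package comes from CRAN with install.packages(), which the practical does not state explicitly:

```r
# Installation is commented out so that rendering does not reinstall the package.
# Assumption: the package is installed from CRAN; run this line once interactively.
# install.packages("palmerpenguins")

library(palmerpenguins)

# Quick check that the dataset is available
dim(penguins)
```

The alternative is to keep the install.packages() call in its own chunk that starts with #| eval: false.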
Map species to the colour aesthetic.
The geom_smooth() layer can be used to add a trend line. Try to overlay it on your scatter plot.
By default, geom_smooth() uses a loess regression (for fewer than 1,000 points) and adds standard error intervals.
The method argument can be used to change the regression to a linear one: method = "lm".
Set se = FALSE to hide the standard error intervals.
Be careful where the aesthetics are placed, so that the linear trend lines are also coloured per species.
penguins |>
  ggplot(aes(x = bill_length_mm,
             y = body_mass_g,
             colour = species)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)
`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 2 rows containing non-finite outside the scale range
(`stat_smooth()`).
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_point()`).
Adjust the points: map shape to the island of origin and use a fixed size of 3.
You should still have only 3 coloured linear trend lines. Otherwise, check to which layer you are adding the shape aesthetic. Remember that fixed parameters are defined outside aes().
penguins |>
  ggplot(aes(x = bill_length_mm, y = body_mass_g,
             colour = species)) +
  geom_point(aes(shape = island), size = 3, alpha = 0.6) +
  geom_smooth(method = "lm", se = FALSE)
`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 2 rows containing non-finite outside the scale range
(`stat_smooth()`).
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_point()`).
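To see why the placement of the shape aesthetic matters, the sketch below compares the number of groups used by the smoothing layer when shape is mapped in geom_point() only versus in the global ggplot() call. The helper n_trend_lines() is a hypothetical name introduced for this illustration; it relies on layer_data() from ggplot2.

```r
library(ggplot2)
library(palmerpenguins)

# Hypothetical helper: number of groups drawn by the smoothing layer (layer 2)
n_trend_lines <- function(p) length(unique(layer_data(p, 2)$group))

# shape mapped only in geom_point(): geom_smooth() is grouped by species alone
p_local <- ggplot(penguins, aes(x = bill_length_mm, y = body_mass_g,
                                colour = species)) +
  geom_point(aes(shape = island), na.rm = TRUE) +
  geom_smooth(method = "lm", se = FALSE, formula = y ~ x, na.rm = TRUE)

# shape mapped globally: geom_smooth() inherits it and is also split by island
p_global <- ggplot(penguins, aes(x = bill_length_mm, y = body_mass_g,
                                 colour = species, shape = island)) +
  geom_point(na.rm = TRUE) +
  geom_smooth(method = "lm", se = FALSE, formula = y ~ x, na.rm = TRUE)

n_trend_lines(p_local)
n_trend_lines(p_global)
```

The second count is larger because every discrete aesthetic inherited by a layer contributes to its grouping, even if the geom does not draw that aesthetic.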
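Set the colour aesthetic in the ggplot() call to propagate it to both the points and the regression lines.
Try the viridis colour scale for discrete values (scale_colour_viridis_d()).
Try the theme_bw() theme.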
penguins |>
  ggplot(aes(x = bill_length_mm, y = body_mass_g,
             colour = species)) +
  geom_point(aes(shape = island), size = 3, alpha = 0.6) +
  geom_smooth(method = "lm", se = FALSE, formula = "y ~ x") +
  scale_colour_viridis_d() +
  theme_bw()
Warning: Removed 2 rows containing non-finite outside the scale range
(`stat_smooth()`).
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_point()`).
penguins |>
  # avoid the missing / non finite values
  drop_na(bill_length_mm, body_mass_g) |>
  ggplot(aes(x = bill_length_mm, y = body_mass_g,
             colour = species)) +
  geom_point(aes(shape = island), size = 3, alpha = 0.6) +
  geom_smooth(method = "lm", se = FALSE, formula = "y ~ x") +
  scale_colour_viridis_d() +
  theme_bw(14) +
  theme(plot.caption.position = "plot",
        plot.caption = element_text(face = "italic"),
        plot.subtitle = element_text(size = 11)) +
  scale_y_continuous(labels = scales::comma) +
  labs(title = "Penguin bills and body mass",
       caption = "Horst AM, Hill AP, Gorman KB (2020)",
       subtitle = "Dimensions for male/female Adelie, Chinstrap and Gentoo Penguins\nat Palmer Station LTER",
       x = "Bill length (mm)",
       y = "Body mass (g)",
       color = "Penguin species")
Remember that:
- Aesthetics defined in the ggplot(aes()) command are inherited by all following layers.
- aes() of individual geoms are specific (and overwrite the global definition if present).
- labs() controls the plot annotations.
- theme() allows you to tweak the plot, for example theme(plot.caption = element_text(face = "italic")) renders the caption in italic.

We are going to use a dataset from the TidyTuesday initiative. Several datasets on the theme of deforestation were released in April 2021; we will focus on the CSV file called brazil_loss.csv. The dataset columns are described in the linked README, and the CSV is directly available at the URL used in the code below.
Read the brazil_loss.csv file, remove the first 2 columns (entity and code, since it is all Brazil) and assign the name brazil_loss.
Set the data type of year to character().
brazil_url <- "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-04-06/brazil_loss.csv"
brazil_loss <- read_csv(brazil_url,
                        col_select = -c(entity, code),
                        col_types = cols(year = col_character()))
brazil_loss
# A tibble: 13 × 12
year commercial_crops flooding_due_to_dams natural_disturbances pasture
<chr> <dbl> <dbl> <dbl> <dbl>
1 2001 280000 0 0 1520000
2 2002 415000 79000 35000 2568000
3 2003 550000 0 35000 2761000
4 2004 747000 26000 22000 2564000
5 2005 328000 17000 26000 2665000
6 2006 188000 17000 26000 1861000
7 2007 79000 9000 22000 1577000
8 2008 52000 0 17000 1345000
9 2009 57000 9000 31000 847000
10 2010 100000 0 44000 616000
11 2011 52000 17000 87000 738000
12 2012 118000 17000 52000 546000
13 2013 87000 0 13000 695000
# ℹ 7 more variables: selective_logging <dbl>, fire <dbl>, mining <dbl>,
# other_infrastructure <dbl>, roads <dbl>,
# tree_plantations_including_palm <dbl>, small_scale_clearing <dbl>
No, the reasons for deforestation are in the wide format. Columns commercial_crops to small_scale_clearing should be in the long format.
Pivot the columns with the reasons for deforestation (commercial_crops to small_scale_clearing) to the long format. Values are areas in hectares (area_ha is a good column name). Save as brazil_loss_long.
pivot_longer(brazil_loss,
             cols = commercial_crops:small_scale_clearing,
             names_to = "reasons",
             values_to = "area_ha") -> brazil_loss_long
brazil_loss_long
# A tibble: 143 × 3
year reasons area_ha
<chr> <chr> <dbl>
1 2001 commercial_crops 280000
2 2001 flooding_due_to_dams 0
3 2001 natural_disturbances 0
4 2001 pasture 1520000
5 2001 selective_logging 96000
6 2001 fire 26000
7 2001 mining 9000
8 2001 other_infrastructure 9000
9 2001 roads 13000
10 2001 tree_plantations_including_palm 44000
# ℹ 133 more rows
- year needs to be categorical data. If you didn't read this column as character, you can convert it with factor().
- geom_col() requires 2 aesthetics:
  - x must be categorical / discrete (see first item)
  - y must be continuous

Even if we have too many categories, we can appreciate the amount of natural_disturbances versus the reasons induced by humans.
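The two requirements above can be sketched as follows. To keep the example self-contained, it uses a small excerpt shaped like brazil_loss_long, with values copied from the tables printed earlier. The names loss_excerpt and p_bar are introduced here for illustration, and mapping reasons to fill is an assumption about the intended plot.

```r
library(ggplot2)
library(tibble)

# Small excerpt shaped like brazil_loss_long (values copied from the output above)
loss_excerpt <- tribble(
  ~year,  ~reasons,               ~area_ha,
  "2001", "commercial_crops",       280000,
  "2001", "natural_disturbances",        0,
  "2001", "pasture",               1520000,
  "2002", "commercial_crops",       415000,
  "2002", "natural_disturbances",    35000,
  "2002", "pasture",               2568000
)

# x is discrete (year is character), y is continuous; bars are stacked by reason
p_bar <- ggplot(loss_excerpt, aes(x = year, y = area_ha, fill = reasons)) +
  geom_col()
p_bar
```

Replacing loss_excerpt with brazil_loss_long should give the full plot with all years and reasons.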
fct_lump_n() can be used for this operation. Be careful to weight the categories with the appropriate continuous variable.
Since v1.0.0, fct_infreq() also has a weight argument.
You can play with the ordered argument to get a viridis binned colour scale.
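A minimal sketch of the weighted lumping, assuming the goal is to keep the 5 largest reasons by area. It uses the 2001 rows printed earlier as a stand-in for brazil_loss_long. The names loss_2001 and lumped are introduced for illustration.

```r
library(dplyr)
library(forcats)
library(tibble)

# The 2001 rows shown in the brazil_loss_long output above
loss_2001 <- tribble(
  ~year,  ~reasons,                          ~area_ha,
  "2001", "commercial_crops",                  280000,
  "2001", "flooding_due_to_dams",                   0,
  "2001", "natural_disturbances",                   0,
  "2001", "pasture",                          1520000,
  "2001", "selective_logging",                  96000,
  "2001", "fire",                               26000,
  "2001", "mining",                              9000,
  "2001", "other_infrastructure",                9000,
  "2001", "roads",                              13000,
  "2001", "tree_plantations_including_palm",    44000
)

lumped <- loss_2001 |>
  mutate(
    # keep the 5 reasons with the largest total area, lump the rest as "Other";
    # without w = area_ha every reason would simply count as one row
    reasons2 = fct_lump_n(reasons, n = 5, w = area_ha),
    # order the levels by decreasing total area (w requires forcats >= 1.0.0)
    reasons2 = fct_infreq(reasons2, w = area_ha)
  )

levels(lumped$reasons2)
```

Adding ordered = TRUE to the fct_infreq() call returns an ordered factor, which is what the hint about the ordered argument refers to.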
One solution is to extract the top 5 reasons using dplyr statements.
Then use this vector to recode reasons: keep the reason name when it is part of the top 5, and use other otherwise. Then fct_reorder(reasons2, area_ha) does the correct reordering. You might want to use fct_rev() to have the sorting from top to bottom in the legend.
brazil_loss_long |>
  summarise(sum = sum(area_ha), .by = reasons) |>
  arrange(desc(sum)) |>
  slice_head(n = 5) |>
  pull(reasons) -> top_5_reasons
top_5_reasons
[1] "pasture" "small_scale_clearing" "commercial_crops"
[4] "fire" "selective_logging"
brazil_loss_long |>
  mutate(reasons2 = if_else(reasons %in% top_5_reasons, reasons, "other")) |>
  ggplot(aes(x = year, y = area_ha,
             fill = fct_reorder(reasons2, area_ha) |> fct_rev())) +
  geom_col() +
  scale_fill_brewer(type = "qual", palette = "Set1") +
  labs(title = "The 5 main reasons for deforestation in Brazil",
       fill = NULL)
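As a side note on the reordering step, fct_reorder() summarises the second variable with the median by default (its .fun argument). The toy vectors below are made up purely for illustration and show how the level order can differ when .fun = sum is used instead, and what fct_rev() does to the result.

```r
library(forcats)

# Made-up toy data, not taken from the Brazil dataset
f <- c("a", "a", "b", "b", "b", "c")
x <- c(4, 4, 0, 0, 9, 5)

by_median <- fct_reorder(f, x)               # default: .fun = median
by_sum    <- fct_reorder(f, x, .fun = sum)   # order by the total per level

levels(by_median)          # "b" "a" "c" (medians 0, 4, 5)
levels(by_sum)             # "c" "a" "b" (sums 5, 8, 9)
levels(fct_rev(by_sum))    # "b" "a" "c" (largest total first)
```

Which summary is appropriate depends on the plot. For stacked bars of areas, ordering by total area is one reasonable option, not necessarily the one intended in the solution above.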