Hands-on DuckDB & dplyr

The dataset

ARCTIC SHOREBIRD DEMOGRAPHICS NETWORK https://doi.org/10.18739/A2222R68W

Data set hosted by the NSF Arctic Data Center (https://arcticdata.io)

Field data on shorebird ecology and environmental conditions were collected from 1993-2014 at 16 field sites in Alaska, Canada, and Russia.

Shorebird, copyright NYT

Data were not collected in every year at all sites. Studies of the population ecology of these birds included nest-monitoring to determine timing of reproduction and reproductive success; live capture of birds to collect blood samples, feathers, and fecal samples for investigations of population structure and pathogens; banding of birds to determine annual survival rates; resighting of color-banded birds to determine space use and site fidelity; and use of light-sensitive geolocators to investigate migratory movements.

Data on climatic conditions, prey abundance, and predators were also collected. Environmental data included weather stations that recorded daily climatic conditions, surveys of seasonal snowmelt, weekly sampling of terrestrial and aquatic invertebrates that are prey of shorebirds, live trapping of small mammals (alternate prey for shorebird predators), and daily counts of potential predators (jaegers, falcons, foxes). Detailed field methods for each year are available in the ASDN_protocol_201X.pdf files. All research was conducted under permits from relevant federal, state and university authorities.

See 01_ASDN_Readme.txt provided in the data folder for full metadata information about this data set.

Analyzing the bird dataset using csv files (raw data)

Load the necessary packages. DuckDB has its own R package, which provides the driver used by DBI and dbplyr to talk to the database.

library(tidyverse)

Import the csv files with the bird species information:

# Import the species
species_csv <- read_csv("data/species.csv")

glimpse(species_csv)
Rows: 99
Columns: 4
$ Code            <chr> "agsq", "amcr", "amgp", "arfo", "arte", "basa", "bbis"…
$ Common_name     <chr> "Arctic ground squirrel", "American Crow", "American G…
$ Scientific_name <chr> "Spermophilus parryii", "Corvus brachyrhynchos", "Pluv…
$ Relevance       <chr> "Potential predator (eggs; mammal)", "Potential predat…

Let’s explore what is in the Relevance attribute/column:

species_csv %>% 
  group_by(Relevance) %>%
  summarize(num_species = n())
# A tibble: 7 × 2
  Relevance                                num_species
  <chr>                                          <int>
1 Incidental monitoring                             18
2 Microtine (alternate prey for predators)           5
3 Potential predator (avian)                        25
4 Potential predator (eggs; mammal)                  2
5 Potential predator (mammal)                        6
6 Study species                                     41
7 Study species; potential predator (eggs)           2

We are interested in the Study species because, according to the metadata, those are the species included in the data sets for banding, resighting, and/or nest monitoring.

Let us extract the species and sort them in alphabetical order:

# list of the bird species included in the study
species_study <- species_csv %>%
  filter(Relevance=="Study species") %>%
  select(Scientific_name, Code) %>%
  arrange(Scientific_name)

species_study
# A tibble: 41 × 2
   Scientific_name      Code 
   <chr>                <chr>
 1 Actitis macularius   spsa 
 2 Calidris acuminata   spts 
 3 Calidris alba        sand 
 4 Calidris alpina      dunl 
 5 Calidris bairdii     basa 
 6 Calidris canutus     rekn 
 7 Calidris falcinellus bbis 
 8 Calidris ferruginea  cusa 
 9 Calidris fuscicollis wrsa 
10 Calidris himantopus  stsa 
# ℹ 31 more rows

Average egg volume

Analysis

We would like to know the average egg size for each of those bird species. How would we do that?

We will need more information than what we have in our species table. We will also need to retrieve information from the nest and egg monitoring tables.

An egg is in a nest, and a nest is associated with a species

# information about the nests
nests_csv <- read_csv("data/ASDN_Bird_nests.csv")

# information about the eggs
eggs_csv <- read_csv("data/ASDN_Bird_eggs.csv")

How do we join those tables?

glimpse(eggs_csv)
Rows: 406
Columns: 7
$ Book_page <chr> "b14.6", "b14.6", "b14.6", "b14.6", "b14.6", "b14.6", "b14.6…
$ Year      <dbl> 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, …
$ Site      <chr> "eaba", "eaba", "eaba", "eaba", "eaba", "eaba", "eaba", "eab…
$ Nest_ID   <chr> "14eabaage01", "14eabaage01", "14eabaage01", "14eabaagl01", …
$ Egg_num   <dbl> 1, 2, 3, 1, 2, 3, 4, 1, 2, 3, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, …
$ Length    <dbl> 39.14, 41.51, 48.29, 47.56, 48.13, 49.62, 47.16, 51.07, 48.7…
$ Width     <dbl> 33.00, 33.39, 33.40, 32.36, 32.40, 32.40, 34.44, 32.22, 34.6…

Nest_ID seems like a promising foreign key!

glimpse(nests_csv)
Rows: 1,547
Columns: 11
$ Book_page  <chr> "b14.6", "b11.7", "b11.6", "b11.6", "b11.6", "b11.7", "b11.…
$ Year       <dbl> 2014, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011,…
$ Site       <chr> "chur", "eaba", "eaba", "eaba", "eaba", "eaba", "eaba", "ea…
$ Nest_ID    <chr> "14HPE1", "11eaba", "11eabaagc01", "11eabaagv01", "11eababb…
$ Species    <chr> "sepl", "wrsa", "amgp", "amgp", "bbpl", "wrsa", "dunl", "du…
$ Observer   <chr> "vloverti", "bhill", "dkessler", "dkessler", "dkessler", "b…
$ Date_found <date> 2014-06-14, 2011-07-10, 2011-06-24, 2011-06-25, 2011-06-24…
$ how_found  <chr> NA, "searcher", "searcher", "searcher", "searcher", "search…
$ Clutch_max <dbl> 3, 4, 4, 3, 4, 4, 3, 4, 4, 3, 4, 4, 4, 3, 4, 4, 4, 4, 4, 4,…
$ floatAge   <dbl> NA, NA, 6, 3, 4, 2, 2, 5, 4, 4, 6, 5, 4, 4, 4, 5, 4, 12, 3,…
$ ageMethod  <chr> NA, NA, "float", "float", "float", "float", "float", "float…

Species is probably the field we will use to join the nests table to the species table.
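
Before we do, a quick sanity check (a sketch, not part of the original analysis) can confirm these keys line up: anti_join() returns the rows of the left table that have no match in the right table, so zero rows means every key value is covered.

# Sanity check the keys: 0 rows means every egg has a matching nest,
# and every nest has a matching species code
eggs_csv %>%
  anti_join(nests_csv, by = "Nest_ID") %>%
  nrow()

nests_csv %>%
  anti_join(species_csv, by = c("Species" = "Code")) %>%
  nrow()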

OK let’s do it:

First, we need to compute the volume of each egg. We can use the following formula:

\(Volume = \frac{\pi}{6} W^2 L\)

where W is the width and L is the length of the egg.
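
For example, the first egg listed above has Length = 39.14 and Width = 33.00, which gives \(\frac{\pi}{6} \times 33.00^2 \times 39.14 \approx 22{,}318\).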

We can use mutate to do so:

eggs_volume_df <- eggs_csv %>%
  mutate(egg_volume = pi/6*Width^2*Length)

Now let’s join this information to the nest table and average by species:

species_egg_volume_avg <- left_join(eggs_volume_df, nests_csv, by="Nest_ID") %>%
  group_by(Species) %>%
  summarise(egg_volume_avg = mean(egg_volume, na.rm = TRUE)) %>%
  arrange(desc(egg_volume_avg)) %>%
  drop_na()

species_egg_volume_avg
# A tibble: 7 × 2
  Species egg_volume_avg
  <chr>            <dbl>
1 bbpl            33975.
2 amgp            28545.
3 rutu            18094.
4 dunl            11777.
5 wrsa            10111.
6 sepl             9903.
7 reph             8444.

Ideally we would like the scientific names…

species_egg_vol_avg <- species_study %>%
  inner_join(species_egg_volume_avg, by = join_by(Code == Species)) 

species_egg_vol_avg
# A tibble: 6 × 3
  Scientific_name         Code  egg_volume_avg
  <chr>                   <chr>          <dbl>
1 Calidris alpina         dunl          11777.
2 Calidris fuscicollis    wrsa          10111.
3 Charadrius semipalmatus sepl           9903.
4 Phalaropus fulicarius   reph           8444.
5 Pluvialis dominica      amgp          28545.
6 Pluvialis squatarola    bbpl          33975.

Let’s connect to our first database

library(dbplyr)       # to query databases in a tidyverse style manner
library(DBI)          # to connect to databases
# install.packages("duckdb")  # install this package to get duckDB API
library(duckdb)       # Specific to duckDB

Load the bird database

This database was built from the csv files we just analyzed, so the data should be very similar. Note we did not say identical; more on this in the last section:

conn <- dbConnect(duckdb::duckdb(), dbdir = "./data/bird_database.duckdb")

List all the tables present in the database:

dbListTables(conn)
[1] "Bird_eggs"       "Bird_nests"      "Camp_assignment" "Personnel"      
[5] "Site"            "Species"        

Let’s have a look at the Species table

species_db <- tbl(conn, "Species")
species_db
# Source:   table<Species> [?? x 4]
# Database: DuckDB v1.0.0 [unknown@Linux 6.5.0-1022-azure:R 4.3.3//home/runner/work/intro-database-r/intro-database-r/data/bird_database.duckdb]
   Code  Common_name             Scientific_name        Relevance               
   <chr> <chr>                   <chr>                  <chr>                   
 1 agsq  Arctic ground squirrel  Spermophilus parryii   Potential predator (egg…
 2 amcr  American Crow           Corvus brachyrhynchos  Potential predator (avi…
 3 amgp  American Golden-Plover  Pluvialis dominica     Study species           
 4 arfo  Arctic fox              Alopex lagopus         Potential predator (mam…
 5 arte  Arctic Tern             Sterna paradisaea      Incidental monitoring   
 6 basa  Baird's Sandpiper       Calidris bairdii       Study species           
 7 bbis  Broad-billed Sandpiper  Calidris falcinellus   Study species           
 8 bbpl  Black-bellied Plover    Pluvialis squatarola   Study species           
 9 bbsa  Buff-breasted Sandpiper Calidris subruficollis Study species           
10 besw  Bewick's Swan           Cygnus columbianus     Incidental monitoring   
# ℹ more rows

You can filter the data and select columns:

species_db %>%
  filter(Relevance=="Study species") %>%
  select(Scientific_name) %>%
  arrange(Scientific_name) %>%
  head(3)
# Source:     SQL [3 x 1]
# Database:   DuckDB v1.0.0 [unknown@Linux 6.5.0-1022-azure:R 4.3.3//home/runner/work/intro-database-r/intro-database-r/data/bird_database.duckdb]
# Ordered by: Scientific_name
  Scientific_name   
  <chr>             
1 Actitis macularius
2 Calidris acuminata
3 Calidris alba     
Note

Note that these are not data frames but references to tables in the database. What dbplyr is actually doing behind the scenes is translating all those dplyr operations into SQL, sending the SQL to the database, retrieving the results, and so on.

How can I get a “real” data frame?

You add collect() to your query.

species_db %>%
  filter(Relevance=="Study species") %>%
  select(Scientific_name) %>%
  arrange(Scientific_name) %>%
  head(3) %>% 
  collect()
# A tibble: 3 × 1
  Scientific_name   
  <chr>             
1 Actitis macularius
2 Calidris acuminata
3 Calidris alba     

Note this means the full query is going to be run and the result saved in your R environment. This might slow things down, so you generally want to call collect() on the smallest table you can.
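
To make that concrete, here is a small sketch (assuming the connection from above is still open) contrasting the two orders of operations:

species_db %>%
  filter(Relevance == "Study species") %>%  # the database does the filtering
  collect()                                 # only the 41 matching rows travel to R

species_db %>%
  collect() %>%                             # the whole table is pulled into R first
  filter(Relevance == "Study species")      # then R does the filtering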

How can you see the SQL query?

Adding show_query() at the end of your code block lets you see the SQL code that dbplyr generates to query the database.

# Add show_query() to the end to see what SQL it is sending!
species_db %>%
  filter(Relevance=="Study species") %>%
  select(Scientific_name) %>%
  arrange(Scientific_name) %>%
  head(3) %>% 
  show_query()
<SQL>
SELECT Scientific_name
FROM Species
WHERE (Relevance = 'Study species')
ORDER BY Scientific_name
LIMIT 3

This is a great way to start getting familiar with SQL syntax, because although you can do a lot with dbplyr, you cannot do everything that SQL can do. So at some point you might want to start using SQL directly.

Here is how you could run the query using the SQL code directly:

# query the database using SQL
dbGetQuery(conn, "SELECT Scientific_name FROM Species WHERE (Relevance = 'Study species') ORDER BY Scientific_name LIMIT 3")
     Scientific_name
1 Actitis macularius
2 Calidris acuminata
3      Calidris alba

You can do pretty much anything with these quasi-tables, including grouping, summarization, joins, etc.

Let’s count how many species there are per Relevance category:

species_db %>%
  group_by(Relevance) %>%
  summarize(num_species = n())
# Source:   SQL [7 x 2]
# Database: DuckDB v1.0.0 [unknown@Linux 6.5.0-1022-azure:R 4.3.3//home/runner/work/intro-database-r/intro-database-r/data/bird_database.duckdb]
  Relevance                                num_species
  <chr>                                          <dbl>
1 Potential predator (eggs; mammal)                  2
2 Microtine (alternate prey for predators)           5
3 Incidental monitoring                             18
4 Study species                                     41
5 Potential predator (avian)                        25
6 Potential predator (mammal)                        6
7 Study species; potential predator (eggs)           2

Does that code look familiar? This time, though, here is the query that was actually used to retrieve this information:

species_db %>%
  group_by(Relevance) %>%
  summarize(num_species = n()) %>%
  show_query()
<SQL>
SELECT Relevance, COUNT(*) AS num_species
FROM Species
GROUP BY Relevance

Average egg volume analysis

Let’s reproduce the egg volume analysis we just did. We can calculate the average bird egg volume per species directly on the database:

# loading all the necessary tables
eggs_db <- tbl(conn, "Bird_eggs")
nests_db <- tbl(conn, "Bird_nests")

Compute the volume using the same code as before. Yes, you can use mutate to create new columns on these table objects:

# Compute the egg volume
eggs_volume_db <- eggs_db %>%
  mutate(egg_volume = pi/6*Width^2*Length)
Caution

Limitation: there is no way to add or update data in the database this way; dbplyr is read-only (the mutate above only adds a computed column to the query, it does not modify the underlying table). If you want to add or update data, you’ll need to use the DBI package functions.
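
For instance, here is a minimal sketch (not part of this lesson) of writing to the database with DBI; the table name Bird_eggs_copy is made up for illustration:

# Hypothetical example: write an R data frame into a new table with DBI
dbWriteTable(conn, "Bird_eggs_copy", eggs_volume_df, overwrite = TRUE)

# ...or run an arbitrary SQL statement to modify data in place
dbExecute(conn, "UPDATE Bird_eggs_copy SET Book_page = trim(Book_page)")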

Now let’s join this information to the nest table and average by species:

# Join the egg and nest tables to compute average
species_egg_volume_avg_db <- left_join(nests_db, eggs_volume_db, by="Nest_ID") %>%
  group_by(Species) %>%
  summarise(egg_volume_avg = mean(egg_volume, na.rm = TRUE)) %>%
  arrange(desc(egg_volume_avg)) %>% 
  collect() %>%
  drop_na()

species_egg_volume_avg_db
# A tibble: 7 × 2
  Species egg_volume_avg
  <chr>            <dbl>
1 bbpl            33975.
2 amgp            28545.
3 rutu            18094.
4 dunl            11777.
5 wrsa            10111.
6 sepl             9903.
7 reph             8444.

What does this SQL query look like?

left_join(eggs_volume_db, nests_db, by="Nest_ID") %>%
  group_by(Species) %>%
  summarise(egg_volume_avg = mean(egg_volume, na.rm = TRUE)) %>%
  arrange(desc(egg_volume_avg)) %>% 
  show_query()
<SQL>
SELECT Species, AVG(egg_volume) AS egg_volume_avg
FROM (
  SELECT
    LHS.Book_page AS "Book_page.x",
    LHS."Year" AS "Year.x",
    LHS.Site AS "Site.x",
    LHS.Nest_ID AS Nest_ID,
    Egg_num,
    Length,
    Width,
    egg_volume,
    Bird_nests.Book_page AS "Book_page.y",
    Bird_nests."Year" AS "Year.y",
    Bird_nests.Site AS "Site.y",
    Species,
    Observer,
    Date_found,
    how_found,
    Clutch_max,
    floatAge,
    ageMethod
  FROM (
    SELECT
      Bird_eggs.*,
      ((3.14159265358979 / 6.0) * (POW(Width, 2.0))) * Length AS egg_volume
    FROM Bird_eggs
  ) LHS
  LEFT JOIN Bird_nests
    ON (LHS.Nest_ID = Bird_nests.Nest_ID)
) q01
GROUP BY Species
ORDER BY egg_volume_avg DESC
Question

Why does the SQL query include the volume computation?

Disconnecting from the database

Before we close our session, it is good practice to disconnect from the database first:

DBI::dbDisconnect(conn, shutdown = TRUE)

How did we create this database

You might be wondering how we created this database from our csv files. Most databases provide functions to import data from csv and other types of files. It is also possible to load data into the database programmatically from within R, one row at a time, using insert statements, but it is more common to load data from csv files. Note that since there is little data modeling within a csv file (the data does not have to be normalized or tidy), and no data type or value constraints can be enforced, a lot of things can go wrong. Putting data in a database is thus a great opportunity to implement QA/QC and helps you keep your data clean and tidy moving forward as new data are collected.
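
As a hedged illustration of that programmatic route (assuming an open connection; the species row below is invented), DBI can generate the insert statements for you:

# Hypothetical example: append one made-up row to the Species table with DBI
new_species <- data.frame(
  Code            = "xxsp",
  Common_name     = "Example bird",
  Scientific_name = "Avis exemplaris",
  Relevance       = "Study species"
)
dbAppendTable(conn, "Species", new_species)  # issues an INSERT statement under the hood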

To look at one example, below is the SQL code that was used to create the Bird_eggs table:

CREATE TABLE Bird_eggs (
    Book_page VARCHAR,
    Year INTEGER NOT NULL CHECK (Year BETWEEN 1950 AND 2015),
    Site VARCHAR NOT NULL,
    FOREIGN KEY (Site) REFERENCES Site (Code),
    Nest_ID VARCHAR NOT NULL,
    FOREIGN KEY (Nest_ID) REFERENCES Bird_nests (Nest_ID),
    Egg_num INTEGER NOT NULL CHECK (Egg_num BETWEEN 1 AND 20),
    Length FLOAT NOT NULL CHECK (Length > 0 AND Length < 100),
    Width FLOAT NOT NULL CHECK (Width > 0 AND Width < 100),
    PRIMARY KEY (Nest_ID, Egg_num)
);

COPY Bird_eggs FROM 'ASDN_Bird_eggs.csv' (header TRUE);

DuckDB’s COPY SQL command reads a csv file into an existing database table. We could also have skipped the explicit declaration and let DuckDB create the table directly from the csv file, guessing at column names and data types. But by explicitly declaring the table, we are able to add more characterization to the data. Notable in the above:

  • NOT NULL indicates that missing values are not allowed.
  • Constraints (e.g., Egg_num BETWEEN 1 and 20) express our expectations about the data.
  • A FOREIGN KEY declares that a value must refer to an existing value in another table, i.e., it must be a reference.
  • A PRIMARY KEY identifies a column (or combination of columns) whose value must be unique for each row and serves as a row identifier.

Understand that a table declaration serves as more than documentation; the database actually enforces constraints.
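
As a hedged illustration (again assuming an open connection; the values are made up), an insert that violates one of these constraints is rejected with an error rather than stored:

# Hypothetical example: Egg_num = 99 violates the CHECK constraint above,
# so DuckDB raises a constraint error and the row is not inserted
dbExecute(conn, "
  INSERT INTO Bird_eggs (Book_page, Year, Site, Nest_ID, Egg_num, Length, Width)
  VALUES ('b99.9', 2014, 'eaba', '14eabaage01', 99, 45.0, 33.0)
")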


This work is licensed under a Creative Commons Attribution 4.0 International License
