library(tidyverse)
Hands-on DuckDB & dplyr
The dataset
ARCTIC SHOREBIRD DEMOGRAPHICS NETWORK https://doi.org/10.18739/A2222R68W
Data set hosted by the NSF Arctic Data Center (https://arcticdata.io)
Field data on shorebird ecology and environmental conditions were collected from 1993-2014 at 16 field sites in Alaska, Canada, and Russia.
Data were not collected in every year at all sites. Studies of the population ecology of these birds included nest-monitoring to determine timing of reproduction and reproductive success; live capture of birds to collect blood samples, feathers, and fecal samples for investigations of population structure and pathogens; banding of birds to determine annual survival rates; resighting of color-banded birds to determine space use and site fidelity; and use of light-sensitive geolocators to investigate migratory movements. Data on climatic conditions, prey abundance, and predators were also collected. Environmental data included weather stations that recorded daily climatic conditions, surveys of seasonal snowmelt, weekly sampling of terrestrial and aquatic invertebrates that are prey of shorebirds, live trapping of small mammals (alternate prey for shorebird predators), and daily counts of potential predators (jaegers, falcons, foxes). Detailed field methods for each year are available in the ASDN_protocol_201X.pdf files. All research was conducted under permits from relevant federal, state and university authorities.
See 01_ASDN_Readme.txt
provided in the data
folder for full metadata information about this data set.
Analyzing the bird dataset using csv files (raw data)
Loading the necessary packages. DuckDB has its own R package that is mostly a wrapper around dbplyr and DBI.
Import the csv files with the bird species information:
# Import the species
<- read_csv("data/species.csv")
species_csv
glimpse(species_csv)
Rows: 99
Columns: 4
$ Code <chr> "agsq", "amcr", "amgp", "arfo", "arte", "basa", "bbis"…
$ Common_name <chr> "Arctic ground squirrel", "American Crow", "American G…
$ Scientific_name <chr> "Spermophilus parryii", "Corvus brachyrhynchos", "Pluv…
$ Relevance <chr> "Potential predator (eggs; mammal)", "Potential predat…
Let’s explore what is in the Relevance
attribute/column:
%>%
species_csv group_by(Relevance) %>%
summarize(num_species = n())
# A tibble: 7 × 2
Relevance num_species
<chr> <int>
1 Incidental monitoring 18
2 Microtine (alternate prey for predators) 5
3 Potential predator (avian) 25
4 Potential predator (eggs; mammal) 2
5 Potential predator (mammal) 6
6 Study species 41
7 Study species; potential predator (eggs) 2
We are interested in the Study species
because according to the metadata, they are the species that are included in the data sets for banding, resighting, and/or nest monitoring.
Let us extract the species and sort them in alphabetical order:
# list of the bird species included in the study
<- species_csv %>%
species_study filter(Relevance=="Study species") %>%
select(Scientific_name, Code) %>%
arrange(Scientific_name)
species_study
# A tibble: 41 × 2
Scientific_name Code
<chr> <chr>
1 Actitis macularius spsa
2 Calidris acuminata spts
3 Calidris alba sand
4 Calidris alpina dunl
5 Calidris bairdii basa
6 Calidris canutus rekn
7 Calidris falcinellus bbis
8 Calidris ferruginea cusa
9 Calidris fuscicollis wrsa
10 Calidris himantopus stsa
# ℹ 31 more rows
Average egg volume
We would like to know what is the average egg size for each of those bird species. How would we do that?
We will need more information that what we have in our species table. Actually we will need to also retrieve information from the nests and eggs monitoring table.
An egg is in a nest, and a nest is associated with a species
# information about the nests
<- read_csv("data/ASDN_Bird_nests.csv")
nests_csv
# information about the
<- read_csv("data/ASDN_Bird_eggs.csv") eggs_csv
How do we join those tables?
glimpse(eggs_csv)
Rows: 406
Columns: 7
$ Book_page <chr> "b14.6", "b14.6", "b14.6", "b14.6", "b14.6", "b14.6", "b14.6…
$ Year <dbl> 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, …
$ Site <chr> "eaba", "eaba", "eaba", "eaba", "eaba", "eaba", "eaba", "eab…
$ Nest_ID <chr> "14eabaage01", "14eabaage01", "14eabaage01", "14eabaagl01", …
$ Egg_num <dbl> 1, 2, 3, 1, 2, 3, 4, 1, 2, 3, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, …
$ Length <dbl> 39.14, 41.51, 48.29, 47.56, 48.13, 49.62, 47.16, 51.07, 48.7…
$ Width <dbl> 33.00, 33.39, 33.40, 32.36, 32.40, 32.40, 34.44, 32.22, 34.6…
Nest_Id
seems like promising as a foreign key!!
glimpse(nests_csv)
Rows: 1,547
Columns: 11
$ Book_page <chr> "b14.6", "b11.7", "b11.6", "b11.6", "b11.6", "b11.7", "b11.…
$ Year <dbl> 2014, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011,…
$ Site <chr> "chur", "eaba", "eaba", "eaba", "eaba", "eaba", "eaba", "ea…
$ Nest_ID <chr> "14HPE1", "11eaba", "11eabaagc01", "11eabaagv01", "11eababb…
$ Species <chr> "sepl", "wrsa", "amgp", "amgp", "bbpl", "wrsa", "dunl", "du…
$ Observer <chr> "vloverti", "bhill", "dkessler", "dkessler", "dkessler", "b…
$ Date_found <date> 2014-06-14, 2011-07-10, 2011-06-24, 2011-06-25, 2011-06-24…
$ how_found <chr> NA, "searcher", "searcher", "searcher", "searcher", "search…
$ Clutch_max <dbl> 3, 4, 4, 3, 4, 4, 3, 4, 4, 3, 4, 4, 4, 3, 4, 4, 4, 4, 4, 4,…
$ floatAge <dbl> NA, NA, 6, 3, 4, 2, 2, 5, 4, 4, 6, 5, 4, 4, 4, 5, 4, 12, 3,…
$ ageMethod <chr> NA, NA, "float", "float", "float", "float", "float", "float…
Species
is probably the field we will use to join nest to the species
OK let’s do it:
First, we need to compute the average of the volume of an egg. We can use the following formula:
\(Volume=\frac{\Pi}6W^2L\)
Where W is the width and L the length of the egg
We can use mutate to do so:
<- eggs_csv %>%
eggs_volume_df mutate(egg_volume = pi/6*Width^2*Length)
Now let’s join this information to the nest table, and average by species
<- left_join(eggs_volume_df, nests_csv, by="Nest_ID") %>%
species_egg_volume_avg group_by(Species) %>%
summarise(egg_volume_avg = mean(egg_volume, na.rm = TRUE)) %>%
arrange(desc(egg_volume_avg)) %>%
drop_na()
species_egg_volume_avg
# A tibble: 7 × 2
Species egg_volume_avg
<chr> <dbl>
1 bbpl 33975.
2 amgp 28545.
3 rutu 18094.
4 dunl 11777.
5 wrsa 10111.
6 sepl 9903.
7 reph 8444.
Ideally we would like the scientific names…
<- species_study %>%
species_egg_vol_avg inner_join(species_egg_volume_avg, by = join_by(Code == Species))
species_egg_vol_avg
# A tibble: 6 × 3
Scientific_name Code egg_volume_avg
<chr> <chr> <dbl>
1 Calidris alpina dunl 11777.
2 Calidris fuscicollis wrsa 10111.
3 Charadrius semipalmatus sepl 9903.
4 Phalaropus fulicarius reph 8444.
5 Pluvialis dominica amgp 28545.
6 Pluvialis squatarola bbpl 33975.
Let’s connect to our first database
library(dbplyr) # to query databases in a tidyverse style manner
library(DBI) # to connect to databases
# install.packages("duckdb") # install this package to get duckDB API
library(duckdb) # Specific to duckDB
Load the bird database
This database has been built from the csv files we just analyzed, so the data should be very similar - note we did not say identical more on this in the last section:
<- dbConnect(duckdb::duckdb(), dbdir = "./data/bird_database.duckdb") conn
List all the tables present in the database:
dbListTables(conn)
[1] "Bird_eggs" "Bird_nests" "Camp_assignment" "Personnel"
[5] "Site" "Species"
Let’s have a look at the Species table
<- tbl(conn, "Species")
species_db species_db
# Source: table<Species> [?? x 4]
# Database: DuckDB v1.0.0 [unknown@Linux 6.5.0-1022-azure:R 4.3.3//home/runner/work/intro-database-r/intro-database-r/data/bird_database.duckdb]
Code Common_name Scientific_name Relevance
<chr> <chr> <chr> <chr>
1 agsq Arctic ground squirrel Spermophilus parryii Potential predator (egg…
2 amcr American Crow Corvus brachyrhynchos Potential predator (avi…
3 amgp American Golden-Plover Pluvialis dominica Study species
4 arfo Arctic fox Alopex lagopus Potential predator (mam…
5 arte Arctic Tern Sterna paradisaea Incidental monitoring
6 basa Baird's Sandpiper Calidris bairdii Study species
7 bbis Broad-billed Sandpiper Calidris falcinellus Study species
8 bbpl Black-bellied Plover Pluvialis squatarola Study species
9 bbsa Buff-breasted Sandpiper Calidris subruficollis Study species
10 besw Bewick's Swan Cygnus columbianus Incidental monitoring
# ℹ more rows
You can filter the data and select columns:
%>%
species_db filter(Relevance=="Study species") %>%
select(Scientific_name) %>%
arrange(Scientific_name) %>%
head(3)
# Source: SQL [3 x 1]
# Database: DuckDB v1.0.0 [unknown@Linux 6.5.0-1022-azure:R 4.3.3//home/runner/work/intro-database-r/intro-database-r/data/bird_database.duckdb]
# Ordered by: Scientific_name
Scientific_name
<chr>
1 Actitis macularius
2 Calidris acuminata
3 Calidris alba
Note that those are not data frames but tables. What dbplyr
is actually doing behind the scenes is translating all those dplyr operations into SQL, sending the SQL code to query the database, retrieving results, etc.
How can I get a “real” data frame?
You add collect()
to your query.
%>%
species_db filter(Relevance=="Study species") %>%
select(Scientific_name) %>%
arrange(Scientific_name) %>%
head(3) %>%
collect()
# A tibble: 3 × 1
Scientific_name
<chr>
1 Actitis macularius
2 Calidris acuminata
3 Calidris alba
Note it means the full query is going to be ran and save in your R environment. This might slow things down, so you generally want to collect on the smallest data frame you can.
How can you see the SQL query?
Adding show_query()
at the end of your code block will let you see the SQL code that has been used to query the database.
# Add show_query() to the end to see what SQL it is sending!
%>%
species_db filter(Relevance=="Study species") %>%
select(Scientific_name) %>%
arrange(Scientific_name) %>%
head(3) %>%
show_query()
<SQL>
SELECT Scientific_name
FROM Species
WHERE (Relevance = 'Study species')
ORDER BY Scientific_name
LIMIT 3
This is a great way to start getting familiar with the SQL syntax, because although you can do a lot with dbplyr
you can not do everything that SQL can do. So at some point you might want to start using SQL directly.
Here is how you could run the query using the SQL code directly:
# query the database using SQL
dbGetQuery(conn, "SELECT Scientific_name FROM Species WHERE (Relevance = 'Study species') ORDER BY Scientific_name LIMIT 3")
Scientific_name
1 Actitis macularius
2 Calidris acuminata
3 Calidris alba
You can do pretty much anything with these quasi-tables, including grouping, summarization, joins, etc.
Let’s count how many species there are per Relevance categories:
%>%
species_db group_by(Relevance) %>%
summarize(num_species = n())
# Source: SQL [7 x 2]
# Database: DuckDB v1.0.0 [unknown@Linux 6.5.0-1022-azure:R 4.3.3//home/runner/work/intro-database-r/intro-database-r/data/bird_database.duckdb]
Relevance num_species
<chr> <dbl>
1 Potential predator (eggs; mammal) 2
2 Microtine (alternate prey for predators) 5
3 Incidental monitoring 18
4 Study species 41
5 Potential predator (avian) 25
6 Potential predator (mammal) 6
7 Study species; potential predator (eggs) 2
Does that code looks familiar? But this time, here is really the query that was used to retrieve this information:
%>%
species_db group_by(Relevance) %>%
summarize(num_species = n()) %>%
show_query()
<SQL>
SELECT Relevance, COUNT(*) AS num_species
FROM Species
GROUP BY Relevance
Average egg volume analysis
Let’s reproduce the egg volume analysis we just did. We can calculate the average bird eggs volume per species directly on the database:
# loading all the necessary tables
<- tbl(conn, "Bird_eggs")
eggs_db <- tbl(conn, "Bird_nests") nests_db
Compute the volume using the same code as previously!! Yes, you can use mutate to create new columns on the tables object
# Compute the egg volume
<- eggs_db %>%
eggs_volume_db mutate(egg_volume = pi/6*Width^2*Length)
Limitation: no way to add or update data in the database, dbplyr
is view only. If you want to add or update data, you’ll need to use the DBI
package functions.
Now let’s join this information to the nest table, and average by species
# Join the egg and nest tables to compute average
<- left_join(nests_db, eggs_volume_db, by="Nest_ID") %>%
species_egg_volume_avg_db group_by(Species) %>%
summarise(egg_volume_avg = mean(egg_volume, na.rm = TRUE)) %>%
arrange(desc(egg_volume_avg)) %>%
collect() %>%
drop_na()
species_egg_volume_avg_db
# A tibble: 7 × 2
Species egg_volume_avg
<chr> <dbl>
1 bbpl 33975.
2 amgp 28545.
3 rutu 18094.
4 dunl 11777.
5 wrsa 10111.
6 sepl 9903.
7 reph 8444.
What does this SQL query looks like?
<- left_join(eggs_volume_db, nests_db, by="Nest_ID") %>%
species_egg_volume_avg_db group_by(Species) %>%
summarise(egg_volume_avg = mean(egg_volume, na.rm = TRUE)) %>%
arrange(desc(egg_volume_avg)) %>%
show_query()
<SQL>
SELECT Species, AVG(egg_volume) AS egg_volume_avg
FROM (
SELECT
LHS.Book_page AS "Book_page.x",
LHS."Year" AS "Year.x",
LHS.Site AS "Site.x",
LHS.Nest_ID AS Nest_ID,
Egg_num,
Length,
Width,
egg_volume,
Bird_nests.Book_page AS "Book_page.y",
Bird_nests."Year" AS "Year.y",
Bird_nests.Site AS "Site.y",
Species,
Observer,
Date_found,
how_found,
Clutch_max,
floatAge,
ageMethod
FROM (
SELECT
Bird_eggs.*,
((3.14159265358979 / 6.0) * (POW(Width, 2.0))) * Length AS egg_volume
FROM Bird_eggs
) LHS
LEFT JOIN Bird_nests
ON (LHS.Nest_ID = Bird_nests.Nest_ID)
) q01
GROUP BY Species
ORDER BY egg_volume_avg DESC
Why does the SQL query include the volume computation?
Disconnecting from the database
Before we close our session, it is good practice to disconnect from the database first
::dbDisconnect(conn, shutdown = TRUE) DBI
How did we create this database
You might be wondering how we created this database from our csv files. Most databases provide functions to import data from csv and other types of files. It is also possible to load data into the database programmatically from within R, one row at a time, using insert statements, but it is more common to load data from csv files. Note that since there is little data modeling within a csv file (the data does not have to be normalized or tidy), and no data type or value constraints can be enforced, a lot things can go wrong. Putting data in a database is thus a great opportunity to implement QA/QC and help you keep your data clean and tidy moving forward as new data are collected.
To look at one example, below is the SQL code that was used to create the Bird_eggs
table:
CREATE TABLE Bird_eggs (
VARCHAR,
Book_page Year INTEGER NOT NULL CHECK (Year BETWEEN 1950 AND 2015),
VARCHAR NOT NULL,
Site FOREIGN KEY (Site) REFERENCES Site (Code),
VARCHAR NOT NULL,
Nest_ID FOREIGN KEY (Nest_ID) REFERENCES Bird_nests (Nest_ID),
INTEGER NOT NULL CHECK (Egg_num BETWEEN 1 AND 20),
Egg_num Length FLOAT NOT NULL CHECK (Length > 0 AND Length < 100),
FLOAT NOT NULL CHECK (Width > 0 AND Width < 100),
Width PRIMARY KEY (Nest_ID, Egg_num)
);
COPY Bird_eggs FROM 'ASDN_Bird_eggs.csv' (header TRUE);
DuckDB’s COPY
SQL command reads a csv file into a database table. Had we not already created the table in the previous statement, DuckDB would have created it automatically and guessed at column names and data types. But by explicitly declaring the table, we are able to add more characterization to the data. Notable in the above:
NOT NULL
indicates that missing values are not allowed.- Constraints (e.g.,
Egg_num BETWEEN 1 and 20
) express our expectations about the data. - A
FOREIGN KEY
declares that a value must refer to an existing value in another table, i.e., it must be a reference. - A
PRIMARY KEY
identifies a quantity that should be unique within each row, and that serves as a row identifier.
Understand that a table declaration serves as more than documentation; the database actually enforces constraints.