Week 4 - SQL and DuckDB

Learning objectives

Continued exploration of SQL concepts, including:

nesting SQL queries
different type of joins

Monday Session

Let’s open our ASDN database:

duckdb database.duckdb

Let’s check we are in the right database

D .table

┌────────────────────┐┌───────────────────┐┌───────────────────┐┌──────────────────┐┌─────────────────────────┐
│     Bird_nests     ││     Bird_eggs     ││       Site        ││ Camp_assignment  ││         Species         │
│                    ││                   ││                   ││                  ││                         │
│ Book_page  varchar ││ Book_page varchar ││ Code      varchar ││ Year     integer ││ Code            varchar │
│ Year       integer ││ Year      integer ││ Site_name varchar ││ Site     varchar ││ Common_name     varchar │
│ Site       varchar ││ Site      varchar ││ Location  varchar ││ Observer varchar ││ Scientific_name varchar │
│ Nest_ID    varchar ││ Nest_ID   varchar ││ Latitude  float   ││ Start    date    ││ Relevance       varchar │
│ Species    varchar ││ Egg_num   integer ││ Longitude float   ││ End      date    ││                         │
│ Observer   varchar ││ Length    float   ││ Area      float   ││                  ││         99 rows         │
│ Date_found date    ││ Width     float   ││                   ││     441 rows     │└─────────────────────────┘
│ how_found  varchar ││                   ││      16 rows      │└──────────────────┘
│ Clutch_max integer ││     406 rows      │└───────────────────┘                    
│ floatAge   float   │└───────────────────┘                                         
│ ageMethod  varchar │                                                              
│                    │                                                              
│     1547 rows      │                                                              
└────────────────────┘                                                              
┌──────────────────────┐
│      Personnel       │
│                      │
│ Abbreviation varchar │
│ Name         varchar │
│                      │
│       269 rows       │
└──────────────────────┘

Tri-value logic

Expressions can return TRUE, FALSE, or NULL
NULL is infectious
In filtering, only TRUE counts

SELECT COUNT(*) FROM Bird_nests
  WHERE floatAge < 7 OR floatAge >= 7;

Only returns a fraction of the rows due to floatAge being NULL

In SQL, if you want to ask (or not ask) for NULLs: IS NULL, or IS NOT NULL

Nesting queries for more complex analysis

Everything returned by an SQL query is a table
Even SELECT COUNT(*) returns a table
Wherever a table is requested, you can insert a parenthesized query; similarly to the tidyverse piping

What species are not represented in the Bird_nests table?

SELECT * FROM Species WHERE
  Code NOT IN (SELECT DISTINCT Species FROM Bird_nests);

Which site are located in Canada and have an area larger than 200?

SELECT Location, MAX(Area) AS Max_area
  FROM Site
  WHERE Location LIKE '%Canada'
  GROUP BY Location
  HAVING Max_area > 200;

Note: HAVING is used to filter data on condition in combination of GROUP BY, whereas WHERE is used to filter rows.

Now let’s imagine there was no HAVING clause. How would you answer the above question?

SELECT Location, Max_area FROM
    (SELECT Location, MAX(Area) AS Max_area
      FROM Site
      WHERE Location LIKE '%Canada'
      GROUP BY Location)
  WHERE Max_area > 200;

Joins

For this part, we are using the toy.duckdb. Make sure you sync your fork and pull the latest version of our data repository.

duckdb toy.duckdb

D .table

 ─────────────────────────────────── toy ──────────────────────────────────── 
 ─────────────────────────────────── main ─────────────────────────────────── 
┌────────────────────┐┌────────────────────┐┌───────────────┐┌───────────────┐
│      Student       ││       House        ││       B       ││       A       │
│                    ││                    ││               ││               │
│ Student_ID integer ││ House_ID   integer ││ bcol1 integer ││ acol1 integer │
│ Name       varchar ││ Name       varchar ││ bcol2 varchar ││ acol2 varchar │
│ Birthdate  date    ││ Headmaster varchar ││ bcol3 varchar ││               │
│ House_ID   integer ││                    ││               ││    3 rows     │
│                    ││       3 rows       ││    3 rows     │└───────────────┘
│       4 rows       │└────────────────────┘└───────────────┘
└────────────────────┘

Cartesian product

Conceptually, the database performs a Cartesian product of the tables and then applies the join condition to filter
In some databases, can do a Cartesian product saying JOIN with no join condition
But in DuckDB, need to use CROSS JOIN

Let’s have a look at out tables

SELECT * FROM A;

┌───────┬───────────┐
│ acol1 │   acol2   │
│ int32 │  varchar  │
├───────┼───────────┤
│     1 │ apple     │
│     2 │ banana    │
│     3 │ cranberry │
└───────┴───────────┘

SELECT * FROM B;

┌───────┬─────────┬──────────┐
│ bcol1 │  bcol2  │  bcol3   │
│ int32 │ varchar │ varchar  │
├───────┼─────────┼──────────┤
│     1 │ pie     │ streusel │
│     2 │ cake    │ frosting │
│     3 │ cookie  │ NULL     │
└───────┴─────────┴──────────┘

SELECT * FROM A CROSS JOIN B;

┌───────┬───────────┬───────┬─────────┬──────────┐
│ acol1 │   acol2   │ bcol1 │  bcol2  │  bcol3   │
│ int32 │  varchar  │ int32 │ varchar │ varchar  │
├───────┼───────────┼───────┼─────────┼──────────┤
│     1 │ apple     │     1 │ pie     │ streusel │
│     1 │ apple     │     2 │ cake    │ frosting │
│     1 │ apple     │     3 │ cookie  │ NULL     │
│     2 │ banana    │     1 │ pie     │ streusel │
│     2 │ banana    │     2 │ cake    │ frosting │
│     2 │ banana    │     3 │ cookie  │ NULL     │
│     3 │ cranberry │     1 │ pie     │ streusel │
│     3 │ cranberry │     2 │ cake    │ frosting │
│     3 │ cranberry │     3 │ cookie  │ NULL     │
└───────┴───────────┴───────┴─────────┴──────────┘

As a result of a cross-join between to table you have a table that has the following characteristics:

Number of rows: product of number of rows from tables A & B
Number of columns: sum of number of columns from tables A & B

Cross-joins with a condition

When there is a condition added, JOIN defaults to inner join

SELECT * FROM A JOIN B ON acol1 < bcol1;

┌───────┬─────────┬───────┬─────────┬──────────┐
│ acol1 │  acol2  │ bcol1 │  bcol2  │  bcol3   │
│ int32 │ varchar │ int32 │ varchar │ varchar  │
├───────┼─────────┼───────┼─────────┼──────────┤
│     1 │ apple   │     2 │ cake    │ frosting │
│     1 │ apple   │     3 │ cookie  │ NULL     │
│     2 │ banana  │     3 │ cookie  │ NULL     │
└───────┴─────────┴───────┴─────────┴──────────┘

Left or right outer join: selects rows not included, supplies NULL for values in other table

SELECT * FROM A RIGHT JOIN B ON acol1 < bcol1;

───────┬─────────┬───────┬─────────┬──────────┐
│ acol1 │  acol2  │ bcol1 │  bcol2  │  bcol3   │
│ int32 │ varchar │ int32 │ varchar │ varchar  │
├───────┼─────────┼───────┼─────────┼──────────┤
│     1 │ apple   │     2 │ cake    │ frosting │
│     1 │ apple   │     3 │ cookie  │ NULL     │
│     2 │ banana  │     3 │ cookie  │ NULL     │
│  NULL │ NULL    │     1 │ pie     │ streusel │
└───────┴─────────┴───────┴─────────┴──────────┘

Notice how get B’s row 1 now (because B is the “right table”)
Similar for left side

SELECT * FROM A LEFT JOIN B ON acol1 < bcol1;

┌───────┬───────────┬───────┬─────────┬──────────┐
│ acol1 │   acol2   │ bcol1 │  bcol2  │  bcol3   │
│ int32 │  varchar  │ int32 │ varchar │ varchar  │
├───────┼───────────┼───────┼─────────┼──────────┤
│     1 │ apple     │     2 │ cake    │ frosting │
│     1 │ apple     │     3 │ cookie  │ NULL     │
│     2 │ banana    │     3 │ cookie  │ NULL     │
│     3 │ cranberry │  NULL │ NULL    │ NULL     │
└───────┴───────────┴───────┴─────────┴──────────┘

Now we we have the missing row from the A (left) table
Do a FULL OUTER JOIN to get both missing sides (rare)

Relational join

In the case of a join on a foreign key, the effect is to add columns to the “many” table

SELECT * FROM House;

┌──────────┬────────────┬────────────────────┐
│ House_ID │    Name    │     Headmaster     │
│  int32   │  varchar   │      varchar       │
├──────────┼────────────┼────────────────────┤
│        1 │ Gryffindor │ Minerva McGonagall │
│        2 │ Hufflepuff │ Pomona Sprout      │
│        3 │ Slytherin  │ Severus Snape      │
└──────────┴────────────┴────────────────────┘

SELECT * FROM Student;

┌────────────┬──────────────────┬────────────┬──────────┐
│ Student_ID │       Name       │ Birthdate  │ House_ID │
│   int32    │     varchar      │    date    │  int32   │
├────────────┼──────────────────┼────────────┼──────────┤
│          1 │ Harry Potter     │ 1980-07-31 │        1 │
│          2 │ Hermione Granger │ 1979-09-19 │        1 │
│          3 │ Draco Malfoy     │ 1980-06-05 │        3 │
│          4 │ Ginny Weasley    │ 1981-08-11 │        1 │
└────────────┴──────────────────┴────────────┴──────────┘

D .schema

SELECT * FROM Student AS S JOIN House AS H ON S.House_ID = H.House_ID;

┌────────────┬──────────────────┬────────────┬──────────┬──────────┬────────────┬────────────────────┐
│ Student_ID │       Name       │ Birthdate  │ House_ID │ House_ID │    Name    │     Headmaster     │
│   int32    │     varchar      │    date    │  int32   │  int32   │  varchar   │      varchar       │
├────────────┼──────────────────┼────────────┼──────────┼──────────┼────────────┼────────────────────┤
│          1 │ Harry Potter     │ 1980-07-31 │        1 │        1 │ Gryffindor │ Minerva McGonagall │
│          2 │ Hermione Granger │ 1979-09-19 │        1 │        1 │ Gryffindor │ Minerva McGonagall │
│          3 │ Draco Malfoy     │ 1980-06-05 │        3 │        3 │ Slytherin  │ Severus Snape      │
│          4 │ Ginny Weasley    │ 1981-08-11 │        1 │        1 │ Gryffindor │ Minerva McGonagall │
└────────────┴──────────────────┴────────────┴──────────┴──────────┴────────────┴────────────────────┘

more compact syntax, since the columns used to join have the same name in each table:

SELECT * FROM Student JOIN House USING (House_ID);

┌────────────┬──────────────────┬────────────┬──────────┬────────────┬────────────────────┐
│ Student_ID │       Name       │ Birthdate  │ House_ID │    Name    │     Headmaster     │
│   int32    │     varchar      │    date    │  int32   │  varchar   │      varchar       │
├────────────┼──────────────────┼────────────┼──────────┼────────────┼────────────────────┤
│          1 │ Harry Potter     │ 1980-07-31 │        1 │ Gryffindor │ Minerva McGonagall │
│          2 │ Hermione Granger │ 1979-09-19 │        1 │ Gryffindor │ Minerva McGonagall │
│          3 │ Draco Malfoy     │ 1980-06-05 │        3 │ Slytherin  │ Severus Snape      │
│          4 │ Ginny Weasley    │ 1981-08-11 │        1 │ Gryffindor │ Minerva McGonagall │
└────────────┴──────────────────┴────────────┴──────────┴────────────┴────────────────────┘

Note that the USING clause also keep only one of key columns!

Let’s switch back to our ASDN database

duckdb database.duckdb

D .mode line

DuckDB Modes

DuckDB’s different output formats: https://duckdb.org/docs/current/clients/cli/output_formats

Let’s have a look at the Bird_eggs and Bird_nests tables:

SELECT COUNT(*) FROM Bird_eggs;
SELECT * FROM Bird_eggs LIMIT 1;
SELECT * FROM Bird_nests LIMIT 1;

We can use Nest_ID to join those two tables

SELECT * FROM Bird_eggs
  JOIN Bird_nests USING (Nest_ID)
  LIMIT 1;

Let’s also join the Species table to bring more readable names:

SELECT * FROM Bird_eggs
  JOIN Bird_nests USING (Nest_ID)
  JOIN Species ON Species = Code
  LIMIT 1;

When to order rows when joining tables?

Ordering is assuredly lost in a join. An entirely new table is formed.

This is pointless:

SELECT * FROM
  (SELECT * FROM Bird_eggs ORDER BY Width)
  JOIN Bird_nests
  USING (Nest_ID);

GOTCHA WITH GROUPING

DuckDB not smart enough to recognize primary keys uniqueness rule in joins

SELECT Nest_ID, COUNT(*)
  FROM Bird_nests JOIN Bird_eggs USING (Nest_ID)
  GR`OUP BY Nest_ID;

Can we add another column from Bird_nests? Other RDBMs’s say yes
DuckDB: no
So follow DuckDB’s suggestions

SELECT Nest_ID, ANY_VALUE(Species), COUNT(*)
  FROM Bird_nests JOIN Bird_eggs USING (Nest_ID)
  GROUP BY Nest_ID;

SELECT Nest_ID, Species, COUNT(*)
 FROM Bird_nests JOIN Bird_eggs USING (Nest_ID)
 GROUP BY Nest_ID, Species;

Wednesday session

POSIT WORKBENCH

Beware of the lock with active sessions!!
Always quit DuckDB before leaving workbench

If you can not find the session, you can kill the process running it on the server. In the DuckDB error message, the PID of that process owning the lock will be provided. You can use it to kill this process and remove the lock:

kill PID

Monday recap

Let’s go back to our toy database to review the different types of joins we talked about on Monday.

duckdb toy.duckdb

Let’s check we are in the right database

D .table

Quick look at our A & B Tables

SELECT * FROM A;
SELECT * FROM B;

Cross joins

A cross join will compute the Cartesian product of all the possible combinations of rows between two tables. The resulting table will have the sum of the columns and the product of the row of two tables we are joining.

SELECT * FROM A CROSS JOIN B;

Side notes

SELECT always select from the output of the query on the right side of the FROM
```
SELECT acol1, acol2 FROM (SELECT * FROM A CROSS JOIN B);
```

COUNT(*) vs COUNT(column) Although those functions look similar, they behave very differently; COUNT(*) count the number of rows in a table, group, … without caring about the values. COUNT(col) does look at the values and will thus ignore NULL values.

--- Remember that DuckDB does not like the missmatch of number of values even if it is the same value; 
--- We need to add the any_value or add the column in the Group by clause
SELECT acol1, acol2, COUNT(*) FROM ( SELECT * FROM A CROSS JOIN B ) GROUP BY acol1; 

--- This will count the number of rows in each group
SELECT acol1, ANY_VALUE(acol2),  COUNT(*) FROM ( SELECT * FROM A CROSS JOIN B ) GROUP BY acol1; 

--- This will count the number of non-NULL values in the column bcol3 in each group
SELECT acol1, ANY_VALUE(acol2), COUNT(bcol3) FROM ( SELECT * FROM A CROSS JOIN B ) GROUP BY acol1;

Using a condition on two different tables

For example joining A & B tables on the condition that the aclo1 is smaller that the bcol1

SELECT * FROM A JOIN B ON acol1 < bcol1;

Self-join

Useful for answering questions like finding all pairs that meet some criteria.

For example, we might want to bake something that has two different fruits and would dlike to have a list of all our possible options from the table A:

SELECT * FROM A AS A1 JOIN A AS A2 ON A1.acol1 <> A2.acol1;

┌───────┬───────────┬───────┬───────────┐
│ acol1 │   acol2   │ acol1 │   acol2   │
│ int32 │  varchar  │ int32 │  varchar  │
├───────┼───────────┼───────┼───────────┤
│     2 │ banana    │     1 │ apple     │
│     3 │ cranberry │     1 │ apple     │
│     1 │ apple     │     2 │ banana    │
│     3 │ cranberry │     2 │ banana    │
│     1 │ apple     │     3 │ cranberry │
│     2 │ banana    │     3 │ cranberry │
└───────┴───────────┴───────┴───────────┘

Inner & outer joins

For those, you need a relationship between the two tables captured by an association of keys. Let’s switch to our Harry Potter example

SELECT * FROM Student;
SELECT * FROM House;

`INNER JOIN` is the Default join

SELECT * FROM Student AS S JOIN House AS H ON S.House_ID = H.House_ID;

or since we have the same column name in both tables:

SELECT * FROM Student JOIN House USING (House_ID);

Note that there is only one occurence the House)ID column in this case

OUTER JOINS

SELECT * FROM Student FULL JOIN House USING (House_ID);

SELECT * FROM Student LEFT JOIN House USING (House_ID);

SELECT * FROM Student RIGHT JOIN House USING (House_ID);

Note: Would aCROSS JOIN` make any sense here? Well we can still do it :)

SELECT * FROM Student CROSS JOIN House;

Add a new table

As we mentioned, you can use SQL to add a new table to your database. Now that we have cleaned our snow data we might want to add this table to our database.

CREATE TABLE Snow_cover (
    Site VARCHAR NOT NULL,
    Year INTEGER NOT NULL CHECK (Year BETWEEN 1990 AND 2018),
    Date DATE NOT NULL,
    Plot VARCHAR NOT NULL,
    Location VARCHAR NOT NULL,
    Snow_cover REAL CHECK (Snow_cover BETWEEN 0 AND 130),
    Water_cover REAL CHECK (Water_cover BETWEEN 0 AND 130),
    Land_cover REAL CHECK (Land_cover BETWEEN 0 AND 130),
    Total_cover REAL CHECK (Total_cover BETWEEN 0 AND 130),
    Observer VARCHAR,
    Notes VARCHAR,
    PRIMARY KEY (Site, Plot, Location, Date),
    FOREIGN KEY (Site) REFERENCES Site (Code)
);

Let’s check if the table was added to our database:

D .table

Our table should have no data in it:

SELECT * FROM Snow_cover LIMIT 5;

Now we can import the data from our csv file into our database:

COPY Snow_cover FROM "../ASDN_csv/snow_survey_fixed.csv" (header TRUE, nullstr "NA");

Let’s check:

SELECT * FROM Snow_cover LIMIT 5;

🎉 import successful!!

Challenge

Ask 1: What is the average snow cover at each site?
Ask2: Top 5 most snowy sites

Temporary table

Since every query output is a table, we can save any query as new table!

We can for example create a backup of an entire table:

CREATE TEMP TABLE Camp_assignment_copy AS
   SELECT * FROM Camp_assignment;

Let’s check things:

D .table

Of course it can be a more complex query. Let’s say we often need to know which person worked at a specific site, We could use this query:

SELECT Year, Site, Name 
   FROM Camp_assignment_copy JOIN Personnel ON Observer = Abbreviation;

Since we need this query often we might want to save it… and since each SQL quey table returns a table, we can simply save it as a new table:

CREATE TEMP TABLE Camp_personnel_tmp AS
   SELECT Year, Site, Name 
   FROM Camp_assignment_copy JOIN Personnel ON Observer = Abbreviation;

Let’s check things:

D .table

Note that it does not have to be a JOIN, it can be any operations such as filetering the rows with a WHERE clause!

What is special about a TEMP TABLE vs a TABLE? It is deleted when you exit the database. In other words, temporary tables are only available to the current session.

However a TEMP TABLE is materialized and thus static.

But wait there is even better…

Views

a view looks like a table
virtual, not materialized like CREATE TABLE AS
virtual implies always reflects current data in source tables, i.e., view is (re-)executed each time it is referenced
can make normalized tables more palatable; reduce the number of joins one has to do manually

CREATE VIEW Camp_personnel_v AS
   SELECT Year, Site, Name 
   FROM Camp_assignment_copy JOIN Personnel ON Observer = Abbreviation;

A VIEW is also stored in the database!

D .table

Also try:

D .schema

Note that the VIEWs are mixed with the regular TABLEs. Why do you think?

How to list all the views in DuckDB:

SELECT view_name FROM duckdb_views;

Stressing the differences between a VIEW and a (TMP) TABLE:

DANGER ZONE

Let’s say the site bylo has been remove from our study. We thus want to delete those rows. Since it is a destructive action, it is a good idea to first run the query as a SELECT then switch to the DELETE

SELECT * FROM Camp_assignment_copy WHERE Site == 'bylo';

OK, we are ready to delete:

DELETE FROM Camp_assignment_copy WHERE Site == 'bylo';

Let’s check:

Did the temporary table update?

SELECT * FROM Camp_personnel_tmp LIMIT 10;

Did the view update?

SELECT * FROM Camp_personnel_v LIMIT 10;

Dropping things

You can also use DROP to delete a table, temp table, view, … and more:

DROP TABLE Camp_personnel_tmp;

Note the lack of TEMP when dropping a temporary table!

DROP VIEW Camp_personnel_v;

Challenge

Close the database
Reopen the database

What tables do you see?

SET OPERATIONS

UNION, INTERSECT, EXCEPT
last one is set difference

3rd way to get species that do not have nest data: use EXCEPT

First way:

SELECT Code FROM Species WHERE Code NOT IN (SELECT DISTINCT Species FROM Bird_nests);

Second way:

SELECT BN.Species, S.Code FROM Bird_nests BN RIGHT JOIN Species S ON BN.Species = S.Code WHERE BN.Species IS NULL;

Third way

SELECT Code FROM Species
  EXCEPT
  SELECT DISTINCT Species FROM Bird_nests;

example of UNION: suppose book b14 measured in inches, not millimeters, let’s adjust on fly

  SELECT Book_page, Nest_ID, Egg_num, Length*25.4, Width*25.4
      FROM Bird_eggs
      WHERE Book_page LIKE 'b14%'
  UNION
  SELECT Book_page, Nest_ID, Egg_num, Length, Width
      FROM Bird_eggs
      WHERE Book_page NOT LIKE 'b14%';

what if Book_page is NULL?
caution: not really looking at names of columns, just #columns and data types
so is possible to union nonsensical things together
UNION throws out duplicates, UNION ALL keeps all

Perhaps better example of UNION: let’s create a table of all bird nests and egg counts, including nests that have no egg data recorded.

SELECT Nest_ID, COUNT(*) AS Num_eggs
  FROM Bird_eggs
  GROUP BY Nest_ID

UNION

SELECT Nest_ID, 0 AS Num_eggs
  FROM Bird_nests
  WHERE Nest_ID NOT IN (SELECT DISTINCT Nest_ID FROM Bird_eggs);

Resources

Lecture notes:

Data:

Class data GitHub repository

Learning objectives

Monday Session

Tri-value logic

Nesting queries for more complex analysis

What species are not represented in the Bird_nests table?

Which site are located in Canada and have an area larger than 200?

Joins

Cartesian product

Cross-joins with a condition

Relational join

Let’s switch back to our ASDN database

When to order rows when joining tables?

GOTCHA WITH GROUPING

Wednesday session

Monday recap

Cross joins

Using a condition on two different tables

Self-join

Inner & outer joins

INNER JOIN is the Default join

OUTER JOINS

Add a new table

Temporary table

Views

Stressing the differences between a VIEW and a (TMP) TABLE:

Dropping things

SET OPERATIONS

3rd way to get species that do not have nest data: use EXCEPT

Resources

`INNER JOIN` is the Default join