Week 4 - SQL and DuckDB
Learning objectives
Continued exploration of SQL concepts, including:
- nesting SQL queries
- different type of joins
Monday Session
Let’s open our ASDN database:
duckdb database.duckdbLet’s check we are in the right database
D .table
┌────────────────────┐┌───────────────────┐┌───────────────────┐┌──────────────────┐┌─────────────────────────┐
│ Bird_nests ││ Bird_eggs ││ Site ││ Camp_assignment ││ Species │
│ ││ ││ ││ ││ │
│ Book_page varchar ││ Book_page varchar ││ Code varchar ││ Year integer ││ Code varchar │
│ Year integer ││ Year integer ││ Site_name varchar ││ Site varchar ││ Common_name varchar │
│ Site varchar ││ Site varchar ││ Location varchar ││ Observer varchar ││ Scientific_name varchar │
│ Nest_ID varchar ││ Nest_ID varchar ││ Latitude float ││ Start date ││ Relevance varchar │
│ Species varchar ││ Egg_num integer ││ Longitude float ││ End date ││ │
│ Observer varchar ││ Length float ││ Area float ││ ││ 99 rows │
│ Date_found date ││ Width float ││ ││ 441 rows │└─────────────────────────┘
│ how_found varchar ││ ││ 16 rows │└──────────────────┘
│ Clutch_max integer ││ 406 rows │└───────────────────┘
│ floatAge float │└───────────────────┘
│ ageMethod varchar │
│ │
│ 1547 rows │
└────────────────────┘
┌──────────────────────┐
│ Personnel │
│ │
│ Abbreviation varchar │
│ Name varchar │
│ │
│ 269 rows │
└──────────────────────┘
Tri-value logic
- Expressions can return TRUE, FALSE, or NULL
- NULL is infectious
- In filtering, only TRUE counts
SELECT COUNT(*) FROM Bird_nests
WHERE floatAge < 7 OR floatAge >= 7;Only returns a fraction of the rows due to floatAge being NULL
In SQL, if you want to ask (or not ask) for NULLs: IS NULL, or IS NOT NULL
Nesting queries for more complex analysis
- Everything returned by an SQL query is a table
- Even SELECT COUNT(*) returns a table
- Wherever a table is requested, you can insert a parenthesized query; similarly to the tidyverse piping
What species are not represented in the Bird_nests table?
SELECT * FROM Species WHERE
Code NOT IN (SELECT DISTINCT Species FROM Bird_nests);Which site are located in Canada and have an area larger than 200?
SELECT Location, MAX(Area) AS Max_area
FROM Site
WHERE Location LIKE '%Canada'
GROUP BY Location
HAVING Max_area > 200;Note: HAVING is used to filter data on condition in combination of GROUP BY, whereas WHERE is used to filter rows.
Now let’s imagine there was no HAVING clause. How would you answer the above question?
SELECT Location, Max_area FROM
(SELECT Location, MAX(Area) AS Max_area
FROM Site
WHERE Location LIKE '%Canada'
GROUP BY Location)
WHERE Max_area > 200;Joins
For this part, we are using the toy.duckdb. Make sure you sync your fork and pull the latest version of our data repository.
duckdb toy.duckdbD .table
─────────────────────────────────── toy ────────────────────────────────────
─────────────────────────────────── main ───────────────────────────────────
┌────────────────────┐┌────────────────────┐┌───────────────┐┌───────────────┐
│ Student ││ House ││ B ││ A │
│ ││ ││ ││ │
│ Student_ID integer ││ House_ID integer ││ bcol1 integer ││ acol1 integer │
│ Name varchar ││ Name varchar ││ bcol2 varchar ││ acol2 varchar │
│ Birthdate date ││ Headmaster varchar ││ bcol3 varchar ││ │
│ House_ID integer ││ ││ ││ 3 rows │
│ ││ 3 rows ││ 3 rows │└───────────────┘
│ 4 rows │└────────────────────┘└───────────────┘
└────────────────────┘
Cartesian product
- Conceptually, the database performs a Cartesian product of the tables and then applies the join condition to filter
- In some databases, can do a Cartesian product saying JOIN with no join condition
- But in DuckDB, need to use
CROSS JOIN
Let’s have a look at out tables
SELECT * FROM A;┌───────┬───────────┐
│ acol1 │ acol2 │
│ int32 │ varchar │
├───────┼───────────┤
│ 1 │ apple │
│ 2 │ banana │
│ 3 │ cranberry │
└───────┴───────────┘
SELECT * FROM B;┌───────┬─────────┬──────────┐
│ bcol1 │ bcol2 │ bcol3 │
│ int32 │ varchar │ varchar │
├───────┼─────────┼──────────┤
│ 1 │ pie │ streusel │
│ 2 │ cake │ frosting │
│ 3 │ cookie │ NULL │
└───────┴─────────┴──────────┘
SELECT * FROM A CROSS JOIN B;┌───────┬───────────┬───────┬─────────┬──────────┐
│ acol1 │ acol2 │ bcol1 │ bcol2 │ bcol3 │
│ int32 │ varchar │ int32 │ varchar │ varchar │
├───────┼───────────┼───────┼─────────┼──────────┤
│ 1 │ apple │ 1 │ pie │ streusel │
│ 1 │ apple │ 2 │ cake │ frosting │
│ 1 │ apple │ 3 │ cookie │ NULL │
│ 2 │ banana │ 1 │ pie │ streusel │
│ 2 │ banana │ 2 │ cake │ frosting │
│ 2 │ banana │ 3 │ cookie │ NULL │
│ 3 │ cranberry │ 1 │ pie │ streusel │
│ 3 │ cranberry │ 2 │ cake │ frosting │
│ 3 │ cranberry │ 3 │ cookie │ NULL │
└───────┴───────────┴───────┴─────────┴──────────┘
As a result of a cross-join between to table you have a table that has the following characteristics:
- Number of rows: product of number of rows from tables A & B
- Number of columns: sum of number of columns from tables A & B
Cross-joins with a condition
When there is a condition added, JOIN defaults to inner join
SELECT * FROM A JOIN B ON acol1 < bcol1;┌───────┬─────────┬───────┬─────────┬──────────┐
│ acol1 │ acol2 │ bcol1 │ bcol2 │ bcol3 │
│ int32 │ varchar │ int32 │ varchar │ varchar │
├───────┼─────────┼───────┼─────────┼──────────┤
│ 1 │ apple │ 2 │ cake │ frosting │
│ 1 │ apple │ 3 │ cookie │ NULL │
│ 2 │ banana │ 3 │ cookie │ NULL │
└───────┴─────────┴───────┴─────────┴──────────┘
- Left or right outer join: selects rows not included, supplies NULL for values in other table
SELECT * FROM A RIGHT JOIN B ON acol1 < bcol1;───────┬─────────┬───────┬─────────┬──────────┐
│ acol1 │ acol2 │ bcol1 │ bcol2 │ bcol3 │
│ int32 │ varchar │ int32 │ varchar │ varchar │
├───────┼─────────┼───────┼─────────┼──────────┤
│ 1 │ apple │ 2 │ cake │ frosting │
│ 1 │ apple │ 3 │ cookie │ NULL │
│ 2 │ banana │ 3 │ cookie │ NULL │
│ NULL │ NULL │ 1 │ pie │ streusel │
└───────┴─────────┴───────┴─────────┴──────────┘
- Notice how get B’s row 1 now (because B is the “right table”)
- Similar for left side
SELECT * FROM A LEFT JOIN B ON acol1 < bcol1;┌───────┬───────────┬───────┬─────────┬──────────┐
│ acol1 │ acol2 │ bcol1 │ bcol2 │ bcol3 │
│ int32 │ varchar │ int32 │ varchar │ varchar │
├───────┼───────────┼───────┼─────────┼──────────┤
│ 1 │ apple │ 2 │ cake │ frosting │
│ 1 │ apple │ 3 │ cookie │ NULL │
│ 2 │ banana │ 3 │ cookie │ NULL │
│ 3 │ cranberry │ NULL │ NULL │ NULL │
└───────┴───────────┴───────┴─────────┴──────────┘
Now we we have the missing row from the A (left) table
Do a FULL OUTER JOIN to get both missing sides (rare)
Relational join
- In the case of a join on a foreign key, the effect is to add columns to the “many” table
SELECT * FROM House;┌──────────┬────────────┬────────────────────┐
│ House_ID │ Name │ Headmaster │
│ int32 │ varchar │ varchar │
├──────────┼────────────┼────────────────────┤
│ 1 │ Gryffindor │ Minerva McGonagall │
│ 2 │ Hufflepuff │ Pomona Sprout │
│ 3 │ Slytherin │ Severus Snape │
└──────────┴────────────┴────────────────────┘
SELECT * FROM Student;┌────────────┬──────────────────┬────────────┬──────────┐
│ Student_ID │ Name │ Birthdate │ House_ID │
│ int32 │ varchar │ date │ int32 │
├────────────┼──────────────────┼────────────┼──────────┤
│ 1 │ Harry Potter │ 1980-07-31 │ 1 │
│ 2 │ Hermione Granger │ 1979-09-19 │ 1 │
│ 3 │ Draco Malfoy │ 1980-06-05 │ 3 │
│ 4 │ Ginny Weasley │ 1981-08-11 │ 1 │
└────────────┴──────────────────┴────────────┴──────────┘
D .schema
SELECT * FROM Student AS S JOIN House AS H ON S.House_ID = H.House_ID;┌────────────┬──────────────────┬────────────┬──────────┬──────────┬────────────┬────────────────────┐
│ Student_ID │ Name │ Birthdate │ House_ID │ House_ID │ Name │ Headmaster │
│ int32 │ varchar │ date │ int32 │ int32 │ varchar │ varchar │
├────────────┼──────────────────┼────────────┼──────────┼──────────┼────────────┼────────────────────┤
│ 1 │ Harry Potter │ 1980-07-31 │ 1 │ 1 │ Gryffindor │ Minerva McGonagall │
│ 2 │ Hermione Granger │ 1979-09-19 │ 1 │ 1 │ Gryffindor │ Minerva McGonagall │
│ 3 │ Draco Malfoy │ 1980-06-05 │ 3 │ 3 │ Slytherin │ Severus Snape │
│ 4 │ Ginny Weasley │ 1981-08-11 │ 1 │ 1 │ Gryffindor │ Minerva McGonagall │
└────────────┴──────────────────┴────────────┴──────────┴──────────┴────────────┴────────────────────┘
- more compact syntax, since the columns used to join have the same name in each table:
SELECT * FROM Student JOIN House USING (House_ID);┌────────────┬──────────────────┬────────────┬──────────┬────────────┬────────────────────┐
│ Student_ID │ Name │ Birthdate │ House_ID │ Name │ Headmaster │
│ int32 │ varchar │ date │ int32 │ varchar │ varchar │
├────────────┼──────────────────┼────────────┼──────────┼────────────┼────────────────────┤
│ 1 │ Harry Potter │ 1980-07-31 │ 1 │ Gryffindor │ Minerva McGonagall │
│ 2 │ Hermione Granger │ 1979-09-19 │ 1 │ Gryffindor │ Minerva McGonagall │
│ 3 │ Draco Malfoy │ 1980-06-05 │ 3 │ Slytherin │ Severus Snape │
│ 4 │ Ginny Weasley │ 1981-08-11 │ 1 │ Gryffindor │ Minerva McGonagall │
└────────────┴──────────────────┴────────────┴──────────┴────────────┴────────────────────┘
Note that the USING clause also keep only one of key columns!
Let’s switch back to our ASDN database
duckdb database.duckdbD .mode line
DuckDB’s different output formats: https://duckdb.org/docs/current/clients/cli/output_formats
Let’s have a look at the Bird_eggs and Bird_nests tables:
SELECT COUNT(*) FROM Bird_eggs;
SELECT * FROM Bird_eggs LIMIT 1;
SELECT * FROM Bird_nests LIMIT 1;We can use Nest_ID to join those two tables
SELECT * FROM Bird_eggs
JOIN Bird_nests USING (Nest_ID)
LIMIT 1;Let’s also join the Species table to bring more readable names:
SELECT * FROM Bird_eggs
JOIN Bird_nests USING (Nest_ID)
JOIN Species ON Species = Code
LIMIT 1;When to order rows when joining tables?
Ordering is assuredly lost in a join. An entirely new table is formed.
This is pointless:
SELECT * FROM
(SELECT * FROM Bird_eggs ORDER BY Width)
JOIN Bird_nests
USING (Nest_ID);GOTCHA WITH GROUPING
DuckDB not smart enough to recognize primary keys uniqueness rule in joins
SELECT Nest_ID, COUNT(*)
FROM Bird_nests JOIN Bird_eggs USING (Nest_ID)
GR`OUP BY Nest_ID;- Can we add another column from Bird_nests? Other RDBMs’s say yes
- DuckDB: no
- So follow DuckDB’s suggestions
SELECT Nest_ID, ANY_VALUE(Species), COUNT(*)
FROM Bird_nests JOIN Bird_eggs USING (Nest_ID)
GROUP BY Nest_ID;SELECT Nest_ID, Species, COUNT(*)
FROM Bird_nests JOIN Bird_eggs USING (Nest_ID)
GROUP BY Nest_ID, Species;Wednesday session
- Beware of the lock with active sessions!!
- Always quit DuckDB before leaving workbench
If you can not find the session, you can kill the process running it on the server. In the DuckDB error message, the PID of that process owning the lock will be provided. You can use it to kill this process and remove the lock:
kill PIDMonday recap
Let’s go back to our toy database to review the different types of joins we talked about on Monday.
duckdb toy.duckdbLet’s check we are in the right database
D .table
Quick look at our A & B Tables
SELECT * FROM A;
SELECT * FROM B;Cross joins
A cross join will compute the Cartesian product of all the possible combinations of rows between two tables. The resulting table will have the sum of the columns and the product of the row of two tables we are joining.
SELECT * FROM A CROSS JOIN B;SELECTalways select from the output of the query on the right side of the FROMSELECT acol1, acol2 FROM (SELECT * FROM A CROSS JOIN B);COUNT(*)vsCOUNT(column)Although those functions look similar, they behave very differently;COUNT(*)count the number of rows in a table, group, … without caring about the values.COUNT(col)does look at the values and will thus ignore NULL values.--- Remember that DuckDB does not like the missmatch of number of values even if it is the same value; --- We need to add the any_value or add the column in the Group by clause SELECT acol1, acol2, COUNT(*) FROM ( SELECT * FROM A CROSS JOIN B ) GROUP BY acol1; --- This will count the number of rows in each group SELECT acol1, ANY_VALUE(acol2), COUNT(*) FROM ( SELECT * FROM A CROSS JOIN B ) GROUP BY acol1; --- This will count the number of non-NULL values in the column bcol3 in each group SELECT acol1, ANY_VALUE(acol2), COUNT(bcol3) FROM ( SELECT * FROM A CROSS JOIN B ) GROUP BY acol1;
Using a condition on two different tables
For example joining A & B tables on the condition that the aclo1 is smaller that the bcol1
SELECT * FROM A JOIN B ON acol1 < bcol1;Self-join
Useful for answering questions like finding all pairs that meet some criteria.
For example, we might want to bake something that has two different fruits and would dlike to have a list of all our possible options from the table A:
SELECT * FROM A AS A1 JOIN A AS A2 ON A1.acol1 <> A2.acol1;┌───────┬───────────┬───────┬───────────┐
│ acol1 │ acol2 │ acol1 │ acol2 │
│ int32 │ varchar │ int32 │ varchar │
├───────┼───────────┼───────┼───────────┤
│ 2 │ banana │ 1 │ apple │
│ 3 │ cranberry │ 1 │ apple │
│ 1 │ apple │ 2 │ banana │
│ 3 │ cranberry │ 2 │ banana │
│ 1 │ apple │ 3 │ cranberry │
│ 2 │ banana │ 3 │ cranberry │
└───────┴───────────┴───────┴───────────┘
Inner & outer joins
For those, you need a relationship between the two tables captured by an association of keys. Let’s switch to our Harry Potter example
SELECT * FROM Student;
SELECT * FROM House;INNER JOIN is the Default join
SELECT * FROM Student AS S JOIN House AS H ON S.House_ID = H.House_ID;or since we have the same column name in both tables:
SELECT * FROM Student JOIN House USING (House_ID);Note that there is only one occurence the House)ID column in this case
OUTER JOINS
SELECT * FROM Student FULL JOIN House USING (House_ID);
SELECT * FROM Student LEFT JOIN House USING (House_ID);
SELECT * FROM Student RIGHT JOIN House USING (House_ID);Note: Would aCROSS JOIN` make any sense here? Well we can still do it :)
SELECT * FROM Student CROSS JOIN House;Add a new table
As we mentioned, you can use SQL to add a new table to your database. Now that we have cleaned our snow data we might want to add this table to our database.
CREATE TABLE Snow_cover (
Site VARCHAR NOT NULL,
Year INTEGER NOT NULL CHECK (Year BETWEEN 1990 AND 2018),
Date DATE NOT NULL,
Plot VARCHAR NOT NULL,
Location VARCHAR NOT NULL,
Snow_cover REAL CHECK (Snow_cover BETWEEN 0 AND 130),
Water_cover REAL CHECK (Water_cover BETWEEN 0 AND 130),
Land_cover REAL CHECK (Land_cover BETWEEN 0 AND 130),
Total_cover REAL CHECK (Total_cover BETWEEN 0 AND 130),
Observer VARCHAR,
Notes VARCHAR,
PRIMARY KEY (Site, Plot, Location, Date),
FOREIGN KEY (Site) REFERENCES Site (Code)
);Let’s check if the table was added to our database:
D .table
Our table should have no data in it:
SELECT * FROM Snow_cover LIMIT 5;Now we can import the data from our csv file into our database:
COPY Snow_cover FROM "../ASDN_csv/snow_survey_fixed.csv" (header TRUE, nullstr "NA");Let’s check:
SELECT * FROM Snow_cover LIMIT 5;🎉 import successful!!
Ask 1: What is the average snow cover at each site?
Ask2: Top 5 most snowy sites
Temporary table
Since every query output is a table, we can save any query as new table!
We can for example create a backup of an entire table:
CREATE TEMP TABLE Camp_assignment_copy AS
SELECT * FROM Camp_assignment; Let’s check things:
D .table
Of course it can be a more complex query. Let’s say we often need to know which person worked at a specific site, We could use this query:
SELECT Year, Site, Name
FROM Camp_assignment_copy JOIN Personnel ON Observer = Abbreviation;Since we need this query often we might want to save it… and since each SQL quey table returns a table, we can simply save it as a new table:
CREATE TEMP TABLE Camp_personnel_tmp AS
SELECT Year, Site, Name
FROM Camp_assignment_copy JOIN Personnel ON Observer = Abbreviation;Let’s check things:
D .table
Note that it does not have to be a JOIN, it can be any operations such as filetering the rows with a WHERE clause!
What is special about a TEMP TABLE vs a TABLE? It is deleted when you exit the database. In other words, temporary tables are only available to the current session.
However a TEMP TABLE is materialized and thus static.
But wait there is even better…
Views
- a view looks like a table
- virtual, not materialized like CREATE TABLE AS
- virtual implies always reflects current data in source tables, i.e., view is (re-)executed each time it is referenced
- can make normalized tables more palatable; reduce the number of joins one has to do manually
CREATE VIEW Camp_personnel_v AS
SELECT Year, Site, Name
FROM Camp_assignment_copy JOIN Personnel ON Observer = Abbreviation;A VIEW is also stored in the database!
D .table
Also try:
D .schema
Note that the VIEWs are mixed with the regular TABLEs. Why do you think?
How to list all the views in DuckDB:
SELECT view_name FROM duckdb_views;Stressing the differences between a VIEW and a (TMP) TABLE:
DANGER ZONE
Let’s say the site bylo has been remove from our study. We thus want to delete those rows. Since it is a destructive action, it is a good idea to first run the query as a SELECT then switch to the DELETE
SELECT * FROM Camp_assignment_copy WHERE Site == 'bylo';OK, we are ready to delete:
DELETE FROM Camp_assignment_copy WHERE Site == 'bylo';Let’s check:
Did the temporary table update?
SELECT * FROM Camp_personnel_tmp LIMIT 10;Did the view update?
SELECT * FROM Camp_personnel_v LIMIT 10;Dropping things
You can also use DROP to delete a table, temp table, view, … and more:
DROP TABLE Camp_personnel_tmp;Note the lack of TEMP when dropping a temporary table!
DROP VIEW Camp_personnel_v;- Close the database
- Reopen the database
What tables do you see?
SET OPERATIONS
- UNION, INTERSECT, EXCEPT
- last one is set difference
3rd way to get species that do not have nest data: use EXCEPT
- First way:
SELECT Code FROM Species WHERE Code NOT IN (SELECT DISTINCT Species FROM Bird_nests);- Second way:
SELECT BN.Species, S.Code FROM Bird_nests BN RIGHT JOIN Species S ON BN.Species = S.Code WHERE BN.Species IS NULL;- Third way
SELECT Code FROM Species
EXCEPT
SELECT DISTINCT Species FROM Bird_nests;- example of UNION: suppose book b14 measured in inches, not millimeters, let’s adjust on fly
SELECT Book_page, Nest_ID, Egg_num, Length*25.4, Width*25.4
FROM Bird_eggs
WHERE Book_page LIKE 'b14%'
UNION
SELECT Book_page, Nest_ID, Egg_num, Length, Width
FROM Bird_eggs
WHERE Book_page NOT LIKE 'b14%';- what if Book_page is NULL?
- caution: not really looking at names of columns, just #columns and data types
- so is possible to union nonsensical things together
- UNION throws out duplicates, UNION ALL keeps all
Perhaps better example of UNION: let’s create a table of all bird nests and egg counts, including nests that have no egg data recorded.
SELECT Nest_ID, COUNT(*) AS Num_eggs
FROM Bird_eggs
GROUP BY Nest_ID
UNION
SELECT Nest_ID, 0 AS Num_eggs
FROM Bird_nests
WHERE Nest_ID NOT IN (SELECT DISTINCT Nest_ID FROM Bird_eggs);Resources
Lecture notes:
- lecture-notes-04-mon.txt
- in-class-script-04-mon.sql
- lecture-notes-04-wed.txt
- lecture-notes-data-management-io.sql
Data:
