Week 5 - I/O, data management & more
Learning objectives
- Data management SQL statements
Monday
RECAP: VIEW & TEMP TABLE
Views
- a view looks like a table
- stored in the database
- executed each time it is referenced
- handy way to make SQL more palatable
- ergo, a view similar to a function in a programming language
Note: a view can also be used to subset attributes from a table that might store sensitive data, think SSN, that you do not want to expose to certain users, like a web application.
Ex: suppose we always want to see species names, not codes
CREATE VIEW Nest_view AS
SELECT Book_page, Year, Site, Nest_ID, Scientific_name, Observer
FROM Bird_nests JOIN Species
ON Species = Code;Looks just like Bird_nests but with code replaced by scientific name:
SELECT * FROM Nest_view LIMIT 1;┌───────────┬───────┬─────────┬─────────┬─────────────────────────┬──────────┐
│ Book_page │ Year │ Site │ Nest_ID │ Scientific_name │ Observer │
│ varchar │ int32 │ varchar │ varchar │ varchar │ varchar │
├───────────┼───────┼─────────┼─────────┼─────────────────────────┼──────────┤
│ b14.6 │ 2014 │ chur │ 14HPE1 │ Charadrius semipalmatus │ vloverti │
└───────────┴───────┴─────────┴─────────┴─────────────────────────┴──────────┘
Now use as usual:
SELECT Nest_ID, ANY_VALUE(Scientific_name), COUNT(*) AS Num_eggs
FROM Nest_view JOIN Bird_eggs
USING (Nest_ID)
GROUP BY Nest_ID;Temp tables
- Temp table is like a variable in a programming language: you compute and store the value. It is static, you need manually to recompute it if the data changes
- As name suggests, a temp table is automatically deleted when database connection is closed
WITH clause
Another option is to use aWITH clause, which creates a view for just that statement
Ex: take previous table, use it input to another query:
WITH x AS (
SELECT Nest_ID, ANY_VALUE(Scientific_name) AS Scientific_name,
COUNT(*) AS Num_eggs
FROM Nest_view JOIN Bird_eggs
USING (Nest_ID)
GROUP BY Nest_ID
)
SELECT Scientific_name, AVG(Num_eggs) AS Avg_num_eggs FROM x
GROUP BY Scientific_name;┌─────────────────────────┬────────────────────┐
│ Scientific_name │ Avg_num_eggs │
│ varchar │ double │
├─────────────────────────┼────────────────────┤
│ Pluvialis dominica │ 3.3333333333333335 │
│ Pluvialis squatarola │ 4.0 │
│ Phalaropus fulicarius │ 4.0 │
│ Calidris fuscicollis │ 3.6153846153846154 │
│ Arenaria interpres │ 3.8333333333333335 │
│ Charadrius semipalmatus │ 3.757142857142857 │
│ Calidris alpina │ 3.6666666666666665 │
└─────────────────────────┴────────────────────┘
SET OPERATIONS
- UNION, INTERSECT, EXCEPT
- last one is set difference
- these are set operations, so UNION eliminates duplicates
- to preserve duplicate rows, UNION ALL
UNION
Example of UNION: let’s create a table of all bird nests and egg counts, including nests that have no egg data recorded. In the quiz last week, we did this by using an outer join:
SELECT Nest_ID, COUNT(Egg_num) as Num_eggs
FROM Bird_nests LEFT JOIN Bird_eggs
USING (Nest_ID)
GROUP BY Nest_ID;Using UNION:
SELECT Nest_ID, COUNT(*) AS Num_eggs
FROM Bird_eggs
GROUP BY Nest_ID
UNION
SELECT Nest_ID, 0 AS Num_eggs
FROM Bird_nests
WHERE Nest_ID NOT IN (SELECT DISTINCT Nest_ID FROM Bird_eggs);- Caution: UNION not really looking at names of columns, just number of columns and data types
- So is possible to union nonsensical things together (think
rbindin the R world)
EXCEPT
Example of EXCEPT: 3rd way to get species that do not have nest data
- First way - WHERE clause:
SELECT Code FROM Species
WHERE Code NOT IN (SELECT DISTINCT Species FROM Bird_nests);- Second way - Outer JOIN:
SELECT Species, Code
FROM Bird_nests RIGHT JOIN Species
ON Species = Code
WHERE Species IS NULL;- Third way - EXCEPT
SELECT Code FROM Species
EXCEPT
SELECT DISTINCT Species FROM Bird_nests;DATA MANAGEMENT STATEMENTS
- There’s way more than just SELECT in SQL, see https://www.sqlite.org/lang.html fro more
- Already seen CREATE, INSERT
- Can modify tables, schemas using ALTER
- DROP is counterpart to CREATE
UPDATE table SET ... WHERE ...
DELETE FROM table WHERE ...- Caution!!!: these operate on whole tables if unqualified
Let’s have a look aht the first 10 rows of the Bird_nesets table:
SELECT * FROM Bird_nests LIMIT 10;Good news, we found new data about the floatAge for the Nest 14HPE1. Let’s update the table:
UPDATE Bird_nests
SET floatAge = 4.5, ageMethod = 'float'
WHERE Nest_ID = '14HPE1';- DELETE is similar
- Oops, what happens when accidentally do?:
UPDATE Bird_nests SET floatAge = 4.5, ageMethod = ‘float’;
- There’s no UNDO in databases
- In this class we can recover using git:
D .exit
git restore database.duckdb
duckdb database.duckdbBut in general, won’t be a possibility, databases not typically under git control, usually stored on a server somewhere.
Strategies to avoid catastrophes
- Just subconsciouly be careful, like holding a kitchen knife
- Do SELECT first, then replace SELECT with DELETE <- allows inspection of what’s about to be deleted
- Put comment in front: – DELETE FROM …, then remove comment
- Tweak table name, put x in front, then remove
- Create a temporary table (with the relevant subset when large table)
Resources
Monday
Wednesday
- Bash programming:
- Bash essentials
- CLI pitch aka why should I care?!: https://eds-214.github.io/eds214-handson-cli/cli-pitch.html
- CLI practice: https://eds-214.github.io/eds214-handson-cli/cli-handson-files.html
- CLI advanced topics: https://eds-214.github.io/EDS-214-analytical-workflows/day2-cli_advanced.html
- bash manual: https://www.gnu.org/software/bash/manual/
- vim editor cheat sheet: https://brunj7.github.io/EDS-214-analytical-workflows/vim.html
Our bash script count_lines.sh counting the number of line in csv files within a folder:
#!/bin/bash
# Usage: sh count_lines.sh /path/to/
csv_dir=$1
for file in $csv_dir/*.csv; do
echo "$file has $(wc -l < $file) lines"
doneData:
How to sync your fork
Readings
Jeffrey D. Ullman and Jennifer Widom (2008). A First Course in Database Systems. 3rd ed. Upper Saddle River, NJ: Pearson/Prentice Hall.
Access via Library Catalog- Complete but theoretical introduction to relational databases, data modeling, and relational algebra.
