Week 5 - I/O, data management & more

Learning objectives

  • Data management SQL statements

Monday

RECAP: VIEW & TEMP TABLE

Views

  • a view looks like a table
  • stored in the database
  • executed each time it is referenced
  • handy way to make SQL more palatable
  • ergo, a view similar to a function in a programming language

Note: a view can also be used to subset attributes from a table that might store sensitive data, think SSN, that you do not want to expose to certain users, like a web application.

Ex: suppose we always want to see species names, not codes

CREATE VIEW Nest_view AS
    SELECT Book_page, Year, Site, Nest_ID, Scientific_name, Observer
    FROM Bird_nests JOIN Species
    ON Species = Code;

Looks just like Bird_nests but with code replaced by scientific name:

SELECT * FROM Nest_view LIMIT 1;
┌───────────┬───────┬─────────┬─────────┬─────────────────────────┬──────────┐
│ Book_page │ Year  │  Site   │ Nest_ID │     Scientific_name     │ Observer │
│  varchar  │ int32 │ varchar │ varchar │         varchar         │ varchar  │
├───────────┼───────┼─────────┼─────────┼─────────────────────────┼──────────┤
│ b14.6     │  2014 │ chur    │ 14HPE1  │ Charadrius semipalmatus │ vloverti │
└───────────┴───────┴─────────┴─────────┴─────────────────────────┴──────────┘

Now use as usual:

SELECT Nest_ID, ANY_VALUE(Scientific_name), COUNT(*) AS Num_eggs
    FROM Nest_view JOIN Bird_eggs
    USING (Nest_ID)
    GROUP BY Nest_ID;

Temp tables

  • Temp table is like a variable in a programming language: you compute and store the value. It is static, you need manually to recompute it if the data changes
  • As name suggests, a temp table is automatically deleted when database connection is closed

WITH clause

Another option is to use aWITH clause, which creates a view for just that statement

Ex: take previous table, use it input to another query:

WITH x AS (
  SELECT Nest_ID, ANY_VALUE(Scientific_name) AS Scientific_name,
    COUNT(*) AS Num_eggs
    FROM Nest_view JOIN Bird_eggs
    USING (Nest_ID)
    GROUP BY Nest_ID
  )
SELECT Scientific_name, AVG(Num_eggs) AS Avg_num_eggs FROM x
GROUP BY Scientific_name;
┌─────────────────────────┬────────────────────┐
│     Scientific_name     │    Avg_num_eggs    │
│         varchar         │       double       │
├─────────────────────────┼────────────────────┤
│ Pluvialis dominica      │ 3.3333333333333335 │
│ Pluvialis squatarola    │                4.0 │
│ Phalaropus fulicarius   │                4.0 │
│ Calidris fuscicollis    │ 3.6153846153846154 │
│ Arenaria interpres      │ 3.8333333333333335 │
│ Charadrius semipalmatus │  3.757142857142857 │
│ Calidris alpina         │ 3.6666666666666665 │
└─────────────────────────┴────────────────────┘

SET OPERATIONS

  • UNION, INTERSECT, EXCEPT
  • last one is set difference
  • these are set operations, so UNION eliminates duplicates
  • to preserve duplicate rows, UNION ALL

UNION

Example of UNION: let’s create a table of all bird nests and egg counts, including nests that have no egg data recorded. In the quiz last week, we did this by using an outer join:

SELECT Nest_ID, COUNT(Egg_num) as Num_eggs
  FROM Bird_nests LEFT JOIN Bird_eggs
  USING (Nest_ID)
  GROUP BY Nest_ID;

Using UNION:

SELECT Nest_ID, COUNT(*) AS Num_eggs
  FROM Bird_eggs
  GROUP BY Nest_ID

UNION

SELECT Nest_ID, 0 AS Num_eggs
  FROM Bird_nests
  WHERE Nest_ID NOT IN (SELECT DISTINCT Nest_ID FROM Bird_eggs);
  • Caution: UNION not really looking at names of columns, just number of columns and data types
  • So is possible to union nonsensical things together (think rbind in the R world)

EXCEPT

Example of EXCEPT: 3rd way to get species that do not have nest data

  • First way - WHERE clause:
SELECT Code FROM Species
  WHERE Code NOT IN (SELECT DISTINCT Species FROM Bird_nests);
  • Second way - Outer JOIN:
SELECT Species, Code
  FROM Bird_nests RIGHT JOIN Species
  ON Species = Code
  WHERE Species IS NULL;
  • Third way - EXCEPT
SELECT Code FROM Species
  EXCEPT
  SELECT DISTINCT Species FROM Bird_nests;

DATA MANAGEMENT STATEMENTS

  • There’s way more than just SELECT in SQL, see https://www.sqlite.org/lang.html fro more
  • Already seen CREATE, INSERT
  • Can modify tables, schemas using ALTER
  • DROP is counterpart to CREATE
UPDATE table SET ... WHERE ...
DELETE FROM table WHERE ...
Warning
  • Caution!!!: these operate on whole tables if unqualified

Let’s have a look aht the first 10 rows of the Bird_nesets table:

SELECT * FROM Bird_nests LIMIT 10;

Good news, we found new data about the floatAge for the Nest 14HPE1. Let’s update the table:

UPDATE Bird_nests
  SET floatAge = 4.5, ageMethod = 'float'
  WHERE Nest_ID = '14HPE1';
  • DELETE is similar
  • Oops, what happens when accidentally do?:

UPDATE Bird_nests SET floatAge = 4.5, ageMethod = ‘float’;

  • There’s no UNDO in databases
  • In this class we can recover using git:
D .exit
git restore database.duckdb
duckdb database.duckdb

But in general, won’t be a possibility, databases not typically under git control, usually stored on a server somewhere.

Strategies to avoid catastrophes

  • Just subconsciouly be careful, like holding a kitchen knife
  • Do SELECT first, then replace SELECT with DELETE <- allows inspection of what’s about to be deleted
  • Put comment in front: – DELETE FROM …, then remove comment
  • Tweak table name, put x in front, then remove
  • Create a temporary table (with the relevant subset when large table)

Resources

Monday

Wednesday

Our bash script count_lines.sh counting the number of line in csv files within a folder:

#!/bin/bash
# Usage: sh count_lines.sh /path/to/

csv_dir=$1

for file in $csv_dir/*.csv; do
    echo "$file has $(wc -l < $file) lines"
done

Data:

How to sync your fork

Readings

  1. Jeffrey D. Ullman and Jennifer Widom (2008). A First Course in Database Systems. 3rd ed. Upper Saddle River, NJ: Pearson/Prentice Hall.
    Access via Library Catalog

    • Complete but theoretical introduction to relational databases, data modeling, and relational algebra.


This work is licensed under CC BY 4.0

UCSB logo