RECAP
- open DuckDB on database file
- .tables, .schema to find out what's inside
- SELECT DISTINCT Location
      FROM Site
      ORDER BY Location
      LIMIT 3;
- write query in file, use .read to execute it
- good idea to capture schema in SQL file to recreate database,
  for documentation

FILTERING
- SELECT * FROM Site WHERE Area < 200;
- ...AND Latitude > 60, etc.
- not equals: classic is <>, but everybody supports != nowadays
- string matching
-- double up single quotes to create a single quote: 'I don''t like this'
-- LIKE '...', % for wildcard
-- string matching might be case (in)sensitive, check database settings
-- DuckDB: ILIKE for case-insensitive
-- regexp usually supported via functions

EXPRESSIONS
- SELECT Site_name, Area*2.47 FROM Site;
- expr AS name to give name to computed column
- look at DuckDB function list, lots of 'em
- || to concatenate strings

AGGREGATION & GROUPING
- SELECT COUNT(*) FROM Species;
- number of rows
- SELECT COUNT(Scientific_name) FROM Species;
- number of non-NULL values
- SELECT COUNT(DISTINCT Relevance) FROM Species;
- same as SELECT DISTINCT, but see how many rows are returned
- SELECT AVG(Area) FROM Site;
- MIN, MAX, etc.
- careful: SELECT Location, AVG(Area) FROM Site;
-- !! picks arbitrary value for location
-- But DuckDB prevents this (SQLite doesn't)
- enter grouping
- SQL will return one row per group
- SELECT Location, MAX(Area)
      FROM Site
      GROUP BY Location;
- SELECT Location, COUNT(*)... : counts #rows in each group
- WHERE to filter rows going into grouping operation
- HAVING to filter rows coming out of grouping operation
- SELECT Location, MAX(Area) AS Max_area
      FROM Site
      WHERE Location LIKE '%Canada'
      GROUP BY Location
      HAVING Max_area > 200
      ORDER BY Max_area DESC;

RELATIONAL ALGEBRA
- result of every operation is another table
- even SELECT COUNT(*) returns a little baby table
- CREATE TEMP TABLE AS SELECT... to save query
- can nest queries
-- SELECT ... FROM (SELECT ...)
-- another place: WHERE column IN (SELECT ...)
--- example: SELECT * FROM Bird_nests WHERE Observer IN (SELECT Abbreviation FROM Personnel WHERE Abbreviation LIKE 'b%')
--- example: SELECT * FROM Species WHERE Code NOT IN (SELECT DISTINCT Species FROM Bird_nests);
-- there are others
-- order is lost when query incorporated as subquery, part of join, etc.
-- So, generally, ORDER BY last part of outermost query

NULL PROCESSING
- NULL means unknown
- SELECT COUNT(*) FROM Bird_nests WHERE floatAge > 5;
- or <= 5
- still missing rows, because a NULL value is neither > 5 nor <= 5
- must use IS NULL or IS NOT NULL
- aggregate functions omit NULLs like R does

JOINS, THE HEART AND SOUL OF RDBMS
- SELECT * FROM Camp_assignment;
- but we want full names, not abbreviations
- so follow linkage shown in ER diagram
- SELECT * FROM Camp_assignment JOIN Personnel
      ON Observer = Abbreviation
      LIMIT 10;
- .mode csv, see all columns
- return is a mega-table with all columns
- (database doesn't actually do that unless required, is more efficient)
- more commonly, ask for selected columns
- notice denormalization that is happening
- which is the idea: normalize for data organization/compactness
- temporarily denormalize to relate data back together
- most databases support , in place of JOIN but not DuckDB
- or INNER JOIN to be verbose
- may need to qualify by table name if ambiguous
- SELECT * FROM Camp_assignment JOIN Personnel
      ON Camp_assignment.Observer = Personnel.Abbreviation
      LIMIT 10;
- SELECT * FROM Camp_assignment AS ca JOIN Personnel p
      ON ca.Observer = p.Abbreviation
      LIMIT 10;

- 3 way join:
- SELECT * FROM Camp_assignment CA JOIN Personnel P
        ON CA.Observer = P.Abbreviation
        JOIN Site S ON CA.Site = S.Code
        LIMIT 10;
- can add WHERE clauses, etc.: WHERE CA.Observer = 'lmckinnon'
- order of operations follows order of clauses in statement

- another example: how many bird eggs are in each nest?
- SELECT Nest_ID, COUNT(*) FROM Bird_eggs GROUP BY Nest_ID;
- can get away with USING since same column name
- SELECT Nest_id, COUNT(*)
      FROM Bird_eggs JOIN Bird_nests
      USING (Nest_ID)
      GROUP BY Nest_ID;
- let's add in some nicer species names; need to add Species to the join
- SELECT Nest_id, Species, COUNT(*)
      FROM Bird_eggs JOIN Bird_nests
      USING (Nest_ID)
      GROUP BY Nest_ID;

- Argh@#)*$&@*#)$& f-ing DuckDB won't let us
- There's only one Species per Nest_ID (Nest_ID is primary key of Bird_nests)
- so safe to use ANY_VALUE(Species), or add Species to the GROUP BY
- I prefer the former, since the latter is confusing

- Add nice scientific names
- SELECT Nest_ID, ANY_VALUE(Scientific_name), COUNT(*) AS Num_eggs
  FROM Bird_eggs JOIN Bird_nests USING (Nest_ID)
  JOIN Species On Species = Code
  GROUP BY Nest_ID;