Week 5 - Create a test harness
Part 1
A common task is to create a script — a “test harness” — that will allow you to test something (some code, an algorithm, a model, etc.) by running it repeatedly, perhaps under varying conditions. Example applications include machine learning using different hyperparameters and Monte Carlo simulation using different random seeds. In this assignment you will create a Bash script that will allow you to time how long it takes to run an SQL query. You will be using this script later in class to explore the performance effects of indexes, so hang on to it.
Your script will be invoked like so:
% bash query_timer.sh label num_reps query db_file csv_file
Arguments:
label: explanatory label that will be output
num_reps: number of repetitions
query: SQL query to run
db_file: database file
csv_file: CSV file to create or append to
For example, you might run:
% bash query_timer.sh with_index_a 1000 'SELECT COUNT(*) FROM Bird_nests' \
database.db timings.csv
In this example your script would run the given query (i.e., SELECT COUNT(*) FROM Bird_nests) on the given database (database.db) 1000 times. If the total time that took was 3 seconds, your script would divide that time by the number of repetitions (1000) and compute that each DuckDB invocation took 0.003 seconds. Finally, your script would append the following record to the CSV file timings.csv:
with_index_a,0.003
Conceptually, your script will look something like this (this is pseudocode):
get current time and store it
loop num_reps times
    duckdb db_file query
end loop
get current time
compute elapsed time
divide elapsed time by num_reps
write output
I would like you to follow the above pseudocode because this exercise is fundamentally about using certain Bash features.
(BTW, is this a fair way to time queries? Yes and no. Clearly it’s a bit unfair that we are counting the overhead of repeatedly firing up DuckDB and opening the database file. And there may be other processing going on on the machine that affects the wall-clock time that we’re measuring here. On the other hand, databases are almost always I/O-bound and not compute-bound, that is, their performance is primarily limited by the time it takes to read data into memory from disk, and because of that, wall-clock time can be a better measure than CPU time.)
We’re doing this in Bash, so every one of these steps is a challenge (it is okay to hate Bash). Here are a lot of tips and hints:
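First, a note on arguments: inside the script the five arguments arrive as the positional parameters $1 through $5, which you may want to copy into named variables near the top. A minimal sketch (the variable names are only suggestions):
label=$1      # explanatory label that will be output
num_reps=$2   # number of repetitions
query=$3      # SQL query to run
db_file=$4    # database file
csv_file=$5   # CSV file to create or append to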
Getting the current time: try
date +%s
or use the magic SECONDS variable (do a man bash to read about it).
Looping: see below.
In executing DuckDB, be sure to appropriately quote the query.
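For instance, double-quoting the variables keeps a query containing spaces as a single argument. A sketch, assuming the arguments were copied into db_file and query as above:
duckdb "$db_file" "$query"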
Computing elapsed time: use Bash arithmetic.
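Putting the two together, the timing skeleton might look roughly like this (a sketch using date +%s; with SECONDS you would read that variable before and after the loop instead):
start=$(date +%s)            # seconds since the epoch, before the loop
# ... run the loop here ...
end=$(date +%s)              # seconds since the epoch, after the loop
elapsed=$((end - start))     # Bash integer arithmetic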
Division. Bash does not support floating point numbers, so you’ll have to use a helper program. Take yer pick depending on what’s available to you. Let’s say you want to divide 10 by 3 (you will want to reference variables in your computation, but here I’m just illustrating dividing two literal numbers). You might say something like:
elapsed=$(echo "scale=7; 10/3" | bc)
elapsed=$(echo "10/3" | awk -F / '{print $1/$2}')
elapsed=$(python -c "print(10/3)")
elapsed=$(perl -l -e "print 10/3")
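In your script the literals become variable references; for example, the bc variant might turn into something like this (assuming elapsed and num_reps hold the values computed earlier, and per_rep is just an illustrative name):
per_rep=$(echo "scale=7; $elapsed/$num_reps" | bc)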
Output: be sure to use the appropriate I/O redirection.
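In particular, since the CSV file may already exist, you will want to append (>>) rather than overwrite (>). A sketch, reusing the illustrative variable names from above:
echo "$label,$per_rep" >> "$csv_file"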
Two approaches to looping. If you want to do something 10 times, you can use a while loop that increments a counter:
i=0
while [ $i -lt 10 ]; do
    echo "this is loop iteration $i"
    i=$((i+1))
done
Or, you can use seq. Generally you can use a for loop to loop through a list of items like so:
for i in 0 1 2 3 4 5 6 7 8 9; do
    echo "this is loop iteration $i"
done
But you can use seq to generate a list of numbers of a desired length (note that seq 10 produces 1 through 10 rather than 0 through 9):
for i in $(seq 10); do
    echo "this is loop iteration $i"
done
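In your script the loop count comes from an argument rather than a literal, so the seq form might look like this (a sketch, assuming num_reps, db_file, and query as above):
for i in $(seq "$num_reps"); do
    duckdb "$db_file" "$query"
done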
And some more advice. The reason for making the number of repetitions an argument to this script, as opposed to a fixed constant, is that you may have to adjust the number of repetitions depending on how fast the query is. The date command and the SECONDS variable have a resolution of only 1 second, so if running the query 10 times still fits in under 1 second, your elapsed time will show up as 0. You may need to run a query 100 or 1000 times or more to get positive elapsed times, and to get more precision.
Also, we don’t actually care about the output from the query here. I would keep the output while you’re debugging your script (so that you can verify that DuckDB is being run repeatedly), but once you feel your script is working you can redirect DuckDB’s output and error streams to /dev/null.
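For example, discarding both streams might look like this (2>&1 sends standard error to the same place as standard output):
duckdb "$db_file" "$query" > /dev/null 2>&1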
Finally, it is highly recommended that you upload your script to https://www.shellcheck.net. You’re not required to follow its advice, but I have found it enlightening.
Please submit your Bash script.
Part 2
In class we looked at three ways to find out which species we do not have nest data for.
Method using NOT IN:
SELECT Code
FROM Species
WHERE Code NOT IN (SELECT DISTINCT Species FROM Bird_nests);
Method using an outer join:
SELECT Code
FROM Bird_nests RIGHT JOIN Species
ON Species = Code
WHERE Nest_ID IS NULL;
Method using a set operation:
SELECT Code FROM Species
EXCEPT
SELECT DISTINCT Species FROM Bird_nests;
Use your test harness to time these three queries. Use descriptive labels to record which query is being run, e.g., subquery, outer_join, and except (see the example invocation below). Submit your CSV file. Report back how many repetitions you had to use to get good timings, and which method is fastest.
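For example, a first timing run using the NOT IN query might be invoked like this (a sketch: the repetition count is arbitrary, and the database and CSV filenames are just the ones from the Part 1 example):
% bash query_timer.sh subquery 100 \
    'SELECT Code FROM Species WHERE Code NOT IN (SELECT DISTINCT Species FROM Bird_nests)' \
    database.db timings.csv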