Week 5 - Create a test harness

Part 1

A common task is to create a script — a “test harness” — that will allow you to test something (some code, an algorithm, a model, etc.) by running it repeatedly, perhaps under varying conditions. Example applications include machine learning using different hyperparameters and Monte Carlo simulation using different random seeds. In this assignment you will create a Bash script that will allow you to time how long it takes to run an SQL query. You will be using this script in the last week of class to explore the performance effects of indexes, so hang on to it.

Your script will be invoked like so:

% bash my_sqlite_tester.sh label num_reps query db_file csv_file

  Arguments:
      label:    explanatory label that will be output
      num_reps: number of repetitions
      query:    SQL query to run
      db_file:  database file
      csv_file: CSV file to create or append to

For example, you might run:

% bash my_sqlite_tester.sh with_index_a 1000 'SELECT COUNT(*) FROM Bird_nests' database.db timings.csv

In this example your script would run the given query (i.e., SELECT COUNT(*) FROM Bird_nests) on the given database (database.db) 1000 times. If the total time that took was 3 seconds, your script will divide that time by the number of repetitions (1000) and compute that each SQLite invocation took 0.003 seconds. Finally, your script will append the following record to the CSV tile timings.csv:

with_index_a,0.003

Conceptually, your script will look something like this (this is pseudocode):

get current time and store it
loop num_reps times
    sqlite3 db_file query
end loop
get current time
compute elapsed time
divide elapsed time by num_reps
write output

I would like you to follow the above pseudocode because this exercise is fundamentally about using certain Bash features.

(BTW, is this a fair way to time queries? Yes and no. Clearly it’s a bit unfair that we are counting the overhead of repeatedly firing up SQLite and opening the database file. And there may be other processing going on on the machine that affects the wall-clock time that we’re measuring here. On the other hand, databases are almost always I/O-bound and not compute-bound, that is, their performance is primarily limited by the time it takes to read data into memory from disk, and because of that, wall-clock time can be a better measure than CPU time.)

We’re doing this in Bash, so every one of these steps is a challenge (it is okay to hate Bash). So here are a lot of tips and hints:

  • Getting the current time: try date +%s or use the magic SECONDS variable (do a man bash to read about it).

  • Looping: see below.

  • In executing sqlite3, be sure to appropriately quote the query.

  • Computing elapsed time: use Bash arithmetic.

  • Division. Bash does not support floating point numbers, so you’ll have to use a helper program. Take yer pick depending on what’s available to you. Let’s say you want to divide 10 by 3 (you will want to reference variables in your computation, but here I’m just illustrating dividing two literal numbers). You might say something like:

    • elapsed=$(echo "scale=7; 10/3" | bc)
    • elapsed=$(echo "10/3" | awk -F / '{print $1/$2}')
    • elapsed=$(python -c "print(10/3)")
    • elapsed=$(perl -l -e "print 10/3")
  • Output: be sure to use the appropriate I/O redirection.

Two approaches to looping. If you want to do something 10 times, you can use a while loop that increments a counter:

i=0
while [ $i -lt 10 ]; do
    echo "this is loop iteration $i"
    i=$((i+1))
done

Or, you can use seq. Generally you can use a for loop to loop through a list of items like so:

for i in 0 1 2 3 4 5 6 7 8 9; do
    echo "this is loop iteration $i"
done

But you can use seq to generate a list of numbers of a desired length:

for i in $(seq 10); do
    echo "this is loop iteration $i"
done

And some more advice. The reason for making the number of repetitions an argument to this script, as opposed to a fixed constant, is that you may have to adjust the number of repetitions depending how fast the query is. The date command and SECONDS variable have a resolution of only 1 second, so if running the query 10 times still fits in under 1 second your elapsed time will show up as 0. You may need to run a query 100 or 1000 times or more to get positive elapsed times, and to get more precision.

Also, we don’t actually care about the output from the query here. I would keep the output while you’re debugging your script (so that you can verify that SQLite is being run repeatedly), but once you feel your script is working you can redirect SQLite’s output and error streams to /dev/null.

Finally, it is highly recommended that you upload your script to https://www.shellcheck.net. You’re not required to follow its advice, but I have found its advice to be enlightening.

Please submit your Bash script.

Part 2

In class we looked at three ways to find out which species we do not have nest data for. Method using NOT IN:

SELECT Code
    FROM Species
    WHERE Code NOT IN (SELECT DISTINCT Species FROM Bird_nests);

Method using an outer join:

SELECT Species.Code
    FROM Bird_nests RIGHT JOIN Species
    ON Bird_nests.Species = Species.Code
    WHERE Bird_nests.Nest_ID IS NULL;

Method using a set operation:

SELECT Code FROM Species
EXCEPT
SELECT DISTINCT Species FROM Bird_nests;

Use your test harness to time these three queries. Report back how many repetitions you had to use to get good timings, the query times, and which method is fastest.