Modeling UCSB Library general workshop attendance as a function of modality and scheduling.
This model card follows the Hugging Face model card template.
This is a model for predicting the attendance level of a workshop based on the workshop’s modality (in-person, online, hybrid) and its scheduling characteristics, such as single session vs. multiple sessions, week in quarter, day of week, etc. The output variable, attendance level, bins the predicted number of students (i.e., attendees) into ranges of 20: 1-20 students, 21-40 students, etc.
The included dataset, our-workshops.csv, on which the model was trained, lists general (i.e., non-course-affiliated) Library workshops, primarily RDS and DREAM Lab workshops, dating back to 2019, for which scheduling and instruction modality were decisions on the Library’s part and not dictated by externalities. The information for a workshop includes modality (in-person vs. online), number of students/attendees, and scheduling characteristics (single session vs. multiple sessions, week within quarter, day(s) within week, and time of day). See data-description.txt for a fuller description of the dataset.
The primary source for this data was the instruction stats page maintained by Teaching & Learning. That page was processed by prep.R and then hand-edited (and hand-corrected in a few places) to fill in additional column values by referring back to old Google Calendar events and the DREAM Lab’s log of past workshops.
Run this model to predict the attendance level given a proposed schedule for a workshop.
Given its narrow training data, this model is unlikely to be relevant to any other institution, or indeed to any group outside the UCSB Library.
The data is public record, as it reflects workshops that were publicly advertised and publicly attended. Furthermore, the data was obtained from a publicly accessible URL. Nevertheless, instructor names and work emails might be viewed as private details. In any case, they are irrelevant to this model’s purpose and could be removed.
The range of workshop characteristics is very wide relative to the number of workshops; that is, for any “type” of workshop one might identify, there is only one workshop, or a handful of workshops, of that type. As a consequence, it is difficult to make any generalizations, which probably accounts for why this model is a failure.
Put another way, the number of independent variables (5) is large in relation to the number of data points (113).
The model ignores workshops’ topical subjects and how the workshops were advertised, both of which may (and likely did) influence attendance.
The data dates back to 2019. The effects of the COVID lockdown (e.g., workshops during that period were required to be online) are ignored.
This model should not be relied upon for anything.
Run decision-tree.qmd. Instructions on how to run the
model on new data are given in that notebook.
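As a hedged illustration only (the authoritative steps are in decision-tree.qmd), prediction on a proposed workshop might look like the sketch below. The column names are hypothetical stand-ins for the dataset’s actual columns, and `model` is assumed to be the fitted random forest.

```r
library(randomForest)

# Hypothetical example: a proposed workshop described with the same
# predictor columns used in training (names are illustrative only).
proposed <- data.frame(
  modality        = "in-person",
  num.sessions    = 1,
  week.in.quarter = 3,
  day.of.week     = "Tuesday",
  time.of.day     = "afternoon"
)

# Factor columns in the new data must use the same levels as the
# training data, or predict() will complain.
predict(model, newdata = proposed)  # returns a predicted attendance bin
```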
The model was trained on the entire dataset.
Random forest.
As described above, the data was preprocessed and then hand-edited and hand-corrected in a few places. The raw number of students was binned into ranges of 20 (or, alternatively, into quartiles).
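As an illustrative sketch only (the actual preprocessing lives in prep.R), the binning might be done with cut(); the column name `num.students` is hypothetical.

```r
# Bin raw attendance counts into ranges of 20 (1-20, 21-40, ...).
# `num.students` is a hypothetical column name.
workshops$attendance.level <- cut(
  workshops$num.students,
  breaks = seq(0, 20 * ceiling(max(workshops$num.students) / 20), by = 20),
  include.lowest = TRUE
)

# The alternative quartile binning mentioned above; unique() guards
# against tied quantile breakpoints.
workshops$attendance.quartile <- cut(
  workshops$num.students,
  breaks = unique(quantile(workshops$num.students, probs = seq(0, 1, 0.25))),
  include.lowest = TRUE
)
```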
Untouched.
The model was trained on the entire dataset and the overall error was estimated from the out-of-bag error rates.
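A minimal sketch of this procedure, assuming the randomForest package and hypothetical column names (the actual code is in decision-tree.qmd):

```r
library(randomForest)

# stringsAsFactors ensures categorical predictors are factors,
# as randomForest() requires.
workshops <- read.csv("our-workshops.csv", stringsAsFactors = TRUE)
workshops$attendance.level <- factor(workshops$attendance.level)

# Fit on the full dataset with no holdout split; bagging yields an
# out-of-bag (OOB) error estimate as a side effect.
model <- randomForest(
  attendance.level ~ modality + num.sessions + week.in.quarter +
    day.of.week + time.of.day,
  data = workshops
)

print(model)  # reports the OOB error rate and a per-class confusion matrix
```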
Performance is measured as classification accuracy.
The accuracy is 60%.
In short, the model performs poorly.
Greg Janée (gjanee@ucsb.edu)