git and GitHub

Version Control with git and GitHub

Aka – Say goodbye to script_JB_03v5b.R !!

The problem with save_as

Every file in the scientific process changes. Manuscripts are edited. Figures get revised. Code gets fixed when problems are discovered. Data files get combined together, then errors are fixed, and then they are split and combined again. In the course of a single analysis, one can expect thousands of changes to files. And yet, all we use to track this are simplistic filenames. You might think there is a better way, and you’d be right: version control.

Version control systems help you track all of the changes to your files, without the spaghetti mess that ensues from simple file renaming. In other words, version control is a system that helps you to manage the different versions of your files in an organized manner. It will help you to never have to duplicate files using save as as a way to keep different versions of a file (see below). Version control helps you to develop a timeline of snapshots containing the different versions of a file. At any point in time, you will be able to roll back to a specific version. Bonus: you can add a short description (commit message) to remember what each specific version is about.

What is the difference between git and GitHub?

  • git: is a version control software used to track files in a folder (a repository)
    • git creates a timeline or history of your files
  • GitHub: is a code repository in the cloud that enables users to store their git repositories and share them with others. Github also adds many features to manage projects and document your work.

git

This section focuses on the code versioning system called Git. Note that there are others, such as Mercurial or svn for example.

Git is a free and open source distributed version control system. It has many functionalities and was originally geared towards software development and production environment. In fact, Git was initially designed and developed in 2005 by Linux kernel developers (including Linus Torvalds) to track the development of the Linux kernel. Here is a fun video of Linus Torvalds touting Git to Google.

How does it work?

Git can be enabled on a specific folder/directory on your file system to version files within that directory (including sub-directories). In git (and other version control systems) terms, this “tracked folder” is called a repository (which formally is a specific data structure storing versioning information).

What git is not:

  • Git is not a backup per se
  • Git is not good at versioning large files (there are workarounds) => not meant for large data

Git was initially designed and developed by Linux kernel developers (including Linus Torvalds) to track the development of the Linux kernel in 2005. Here is a fun video of Linus Torvalds touting Git to Google engineers.

Repository

Git can be enabled on a specific folder/directory on your file system to version files within that directory (including sub-directories). In git (and other version control systems) terms, this “tracked folder” is called a repository (which formally is a specific data structure storing versioning information).

Although there are many ways to start a new repository, GitHub (or any other cloud solutions, such as GitLab) provide among the most convenient way of starting a repository.

GitHub

GitHub is a company that hosts git repositories online and provides several collaboration features (among which forking). GitHub fosters a great user community and has built a nice web interface to git, also adding great visualization/rendering capacities of your data. It is like a coding social platform for nerds ALL

A few highlights of what you can do with GitHub:

  • Publish and share your work (like the website we are using today!!)
  • Visualize your files and modifications (highlight changes in your code; can render files such as csv, png, …)
  • Manage projects and tasks (GitHub issues)
  • Uses the markdown syntax everywhere (like Rmarkdown, quarto, and Jupyter notebooks!)

GitHub Dashboard

This is the default landing page when you log into your account. It provides a mix of the most recent resources and activities of your and your collaborators’ actions, as well as some resources relevant to your work. The dashboard therefore changes on a regular basis. Once logged in, you can access your dashboard at https://github.com

GitHub User page

This page can be reached using the following URL: https://github.com/username. For my user (brunj7) it would be: https://github.com/brunj7. It is a great space for you to provide some information about yourself and the main repositories you are working on. It also lists the GitHub Organizations you are part of. But more importantly, Users own repositories to host and share their code. You can list repositories from a User by clicking on the repositories tab in the main GitHub menu bar at the top.

GitHub Organization page

We will talk more about GitHub Organizations later. In a nutshell, organizations are like groups or teams that users can be members of. Like Users, Organizations can have a landing page and own repositories. However, they add several perks in terms of user management. Similarly to Users, you can access repositories from an Organization by clicking on the repositories tab in the main GitHub menu bar at the top. You can access an organization’s page similarly to a user: https://github.com/organization-name; e.g. https://github.com/UCSB-Library-Research-Data-Services

Let’s look at a repository on GitHub

This screen shows the copy of a repository stored on GitHub, with its list of files, when the files and directories were last modified, and some information on who made the most recent changes.

If we drill into the “commits” for the repository, we can see the history of changes made to all of the files. Looks like kellijohnson and seananderson were fixing things in June and July:

And finally, if we drill into the changes made on June 13, we can see exactly what was changed in each file:

Tracking these changes, and seeing how they relate to released versions of software and files is exactly what Git and GitHub are good for. We will show how they can really be effective for tracking versions of scientific code, figures, and manuscripts to accomplish a reproducible workflow.

Recap

Aknowledgements

Some parts of this content were adapted from NCEAS Reproducible Research Techniques for Synthesis & Collaborative Coding with GitHub. LNO Scientific Computing Team.


UCSB library logo

Creative Commons License
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License