Preserve your code

Making your code readable

https://twitter.com/cjm4189/status/1557346489613094914

It is important to make your code easy to read if you hope that others will reuse it. It starts with using a consistent style witing your scripts (at least within a project).

Here is a good style guide for R: https://style.tidyverse.org/
Style guide for Python: https://www.python.org/dev/peps/pep-0008/

import this

The Zen of Python, by Tim Peters

Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!

There is also the visual aspect of the code that should not be neglected. Like prose, if you receive a long text without any paragraphs, you might be not very excited about reading it. Indentation, spaces, and empty lines should be leveraged to make a script visually inviting and easy to read. The good news is that most Integrated Development Environments (IDEs) will help you to do so by auto formatting your scripts according to conventions. Note that also a lot of IDEs, such as RStudio, rely on some conventions to ease the navigation of scripts and notebooks. For example, try to add four - or # after a line starting with one or several # in an R Script!

Comments

Real Programmers don’t comment their code. If it was hard to write, it should be hard to understand.
Tom Van Vleck, based on people he knew…_ (https://multicians.org/thvv/realprogs.html)

Joke aside, it is really hard to comment too much your code, because even steps that might seem trivial today might not be so anymore in a few weeks or months for now. In addition, well commented code is more likely to be read by others. Note also that comments should complement the code and should not being seen as work around vague naming conventions of variables or functions, but neither should they simple restate what the code is doing.

x <- 9.81  #  gravitational acceleration               POOR
gravity_acc <- 9.81  #  gravitational acceleration     BETTER

name_index <- 0  # set name_index to 0                 POOR
name_index <- 0                                        BETTER

Inline

It does not matter if you are using a script or notebook. It is important to provide comments along your code to complement it by:

Explaining what the code does
Capturing decisions that were made on the analytical side. For example, why a specific value was used for a threshold.
Specifying why and when code was added to handle an edge case such as an unexpected value in the data (so a new user doesn’t have to guess what the code does and might want to delete it assuming it is not necessary)

Other thoughts:

It is OK to state (what seems) the obvious (some might disagree with this statement)
Try to keep comments to the point and short

Functions

Both Python and R have conventions on how to document functions. Adopting those conventions will help you to make your code readable but also to automate part of the documentation development.

Roxygen2

The goal of roxygen2 is to make documenting your code as easy as possible. It can dynamically inspect the objects that it’s documenting, so it can automatically add data that you’d otherwise have to write by hand.

How do we insert it? Make sure you cursor is inside the function you want to document and from RStudio Menu Code -> Insert Roxygen Skeleton

Example:

#' Add together two numbers
#'
#' @param x A number
#' @param y A number
#' @return The sum of \code{x} and \code{y}
#' @examples
#' add(1, 1)
#' add(10, 1)
add2 <- function(x, y) {
  x + y
}

Try it! - Copy the function (without the documentation) in a new script - Add a third parameter to the function such as it sums 3 numbers - Add the Roxygen skeleton - Fill it to best describe your function

Note that when you are developing an R package, the Roxygen skeleton can be leveraged to develop the help pages of your package so you only have one place to update and the help will synchronize automatically.

Python Docstring

A docstring is a string literal that occurs as the first statement in a module, function, class, or method definition. Such a docstring becomes the __doc__ special attribute of that object.

def complex(real=0.0, imag=0.0):
    """Form a complex number.

    Keyword arguments:
    real -- the real part (default 0.0)
    imag -- the imaginary part (default 0.0)
    """
    if imag == 0.0 and real == 0.0:
        return complex_zero

Go here for more information: https://www.python.org/dev/peps/pep-0257/

Leveraging Notebooks

We have focused on and been experimenting with Notebooks during the week because they provide space to further develop content, such as methodology, around the code you are developing in your analysis. Notebooks also enable you to integrate the outputs of your scientific research with the code that was used to produce it. Finally, notebooks can be rendered into various formats that allows them to be shared with a broad audience.

Notebooks are not only used within the scientific community, see here for some thoughts from Airbnb data science team.

Hands-on

Documenting

getPercent <- function( value, pct ) {
    result <- value * ( pct / 100 )
    return( result )
}

Try adding the Roxygen Skeleton to this function and fill all the information you think is necessary to document the function

Commenting

Let’s try to improve the readability and documentation of this repository: https://github.com/brunj7/better-comments. Follow the instructions on the README

For inspiration, you can check out the NASA code for APOLLO 11 dating from 1969: https://github.com/chrislgarry/Apollo-11!!

Code repositories

On-line code repositories are a great way to version and share your code. Here are a few examples of git-based code repositories:

GitHub
GitLab
Bitbucket
SourceForge

Note however that there is no long-term commitment of any of those main code repositories and that archiving the specific snapshot of your code that was used for a specific analysis along your data is a great idea. Several data repositories offer an integration that lets you do that with data repositories. For example, Zenodo has a great integration with GitHub that lets you issue a DOI for a specific release (read snapshot) of your repository and preserve it independently from the code repository. See here for more details.

Note that code repositories and data repositories complement each other: you can see your code repository as the live version of your work and the code snapshot archive as the historical trace that was produced for a specific analysis. In other words, we recommend linking both the code repository and its snapshot to the data archive.