Setup & Installation

Why OpenRefine?

OpenRefine is a free, open-source, cross-platform tool for cleaning and transforming data. It can handle fairly large datasets (on the order of 100,000 rows), with the practical limit depending on how much memory you allocate to it. It helps users explore, clean, and refine data so that it is easier to analyze and prepare for further use, and it requires no coding skills.

OpenRefine treats all data as plain text by default, meaning it won’t automatically interpret or convert data types unless explicitly instructed to do so. This approach avoids common issues seen in tools like Excel, which auto-converts certain strings, such as the gene name “MARCH1”, into dates, leading to data corruption in scientific research [1]. By requiring deliberate data type transformations, OpenRefine offers better control to researchers and helps preserve data integrity.

Another key strength of OpenRefine is its project history feature, which tracks every action you take. All changes are user-initiated and recorded in a “history” or “recipe,” so you can review, undo, or revisit any step in your data cleaning process. This makes the entire workflow reproducible: you can apply the same steps to other datasets or go back to a previous stage without losing any progress.

What really sets OpenRefine apart is how it combines this project history capability with version control, which is especially useful when working in teams. As your project evolves, version control allows you to track changes over time and keeps everyone aligned. If you’re working with multiple collaborators, you can rely on OpenRefine’s version history to ensure that everyone is using the same cleaning process and has access to the latest changes. This is a big advantage over tools that don’t offer such detailed tracking or easy collaboration, where it is harder to maintain consistency or manage multiple revisions.

Additionally, since OpenRefine is open-source, you can create and share custom extensions, further streamlining teamwork. The combination of project history and version control makes OpenRefine particularly powerful for teams, ensuring a smooth, consistent, and transparent workflow, all while reducing the risk of errors. Unlike other data cleaning tools, OpenRefine offers a level of collaboration and organization that helps keep your cleaning process standardized and your team on track.

Download & Requirements

OpenRefine is designed to work with Windows, Mac, and Linux operating systems and can be downloaded from https://openrefine.org/download. To run OpenRefine, Java must be installed. The Mac version includes Java, and OpenRefine 3.4 or higher offers a Windows package with Java included. Alternatively, you can install Java (JRE) from Adoptium.net. On Windows, if Java isn’t installed, OpenRefine will automatically open a browser window with installation instructions.
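
If you are unsure whether Java is already available on your machine, a small script can check before you launch OpenRefine. The following is a minimal sketch in Python, not part of OpenRefine itself; the function name `find_java_version` is made up for this illustration. It only looks for a `java` executable on your PATH, so it will not detect the Java runtime bundled inside the Mac or Windows packages.

```python
import shutil
import subprocess
from typing import Optional


def find_java_version(executable: str = "java") -> Optional[str]:
    """Return the first line of `<executable> -version`, or None if no usable Java is found.

    Hypothetical helper for illustration; it only inspects the PATH.
    """
    path = shutil.which(executable)
    if path is None:
        return None  # nothing called `java` on the PATH

    result = subprocess.run(
        [path, "-version"], capture_output=True, text=True, check=False
    )
    if result.returncode != 0:
        return None  # a launcher stub exists but no working runtime behind it

    # Most JVMs print the version banner to stderr, so fall back to stdout only if needed.
    banner = (result.stderr or result.stdout).strip()
    return banner.splitlines()[0] if banner else None


if __name__ == "__main__":
    version = find_java_version()
    if version is None:
        print("No Java found on PATH; use a package with Java included or install a JRE.")
    else:
        print(f"Found Java: {version}")
```

If the script reports no Java and you are not using a package that includes it, install a JRE as described above and run the check again.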

When you first try to launch OpenRefine on a Mac by double-clicking its icon, you might see a message saying:

“OpenRefine cannot be opened because the developer cannot be verified.”

If this appears, click Cancel to dismiss the message.

To open the application anyway:

  1. Right-click (or Control-click) the OpenRefine icon.

  2. Choose Open from the context menu.

  3. A similar warning will appear, but this time it will include an Open button.

  4. Click Open to launch the application.

macOS will remember this choice, and you won’t need to repeat these steps the next time you open OpenRefine.

Once installed, OpenRefine runs as a small web server on your own computer, and you use your web browser as its graphical interface.

To launch it, double-click the OpenRefine icon or run the executable. Although it runs in the browser, OpenRefine does not require internet access for basic functions. It works best in WebKit-based browsers such as Chrome and Safari. No matter how you start OpenRefine, it will load its interface in your computer’s default browser.

If you would like to use another browser instead, start OpenRefine and then point your chosen browser at the home screen: http://127.0.0.1:3333.
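
If the page at that address does not load, it can help to confirm that something is actually listening there. The sketch below is an illustrative Python check, not an OpenRefine tool; the helper name `is_listening` is hypothetical, and the host and port defaults are taken from the address above. It only tests whether a TCP connection succeeds, so it cannot tell OpenRefine apart from any other program using the same port.

```python
import socket


def is_listening(host: str = "127.0.0.1", port: int = 3333, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        # create_connection raises OSError (e.g. connection refused, timeout) on failure.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    if is_listening():
        print("Something is listening on http://127.0.0.1:3333; try opening it in your browser.")
    else:
        print("Nothing is listening on port 3333; OpenRefine may not be running yet.")
```

Because the connection stays on your own machine, this check works without internet access, in line with how OpenRefine itself runs locally.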

Some Privacy Considerations

Before using any software to handle sensitive or private data, it’s important to review its privacy policies. Below is a summary of key aspects of OpenRefine’s current privacy practices:

  • Local Data Storage: All data, history, and preferences in OpenRefine are stored locally on your computer. No data is sent to OpenRefine servers. The storage location depends on your OS.

  • Developer Access: OpenRefine developers cannot access your data, as it never leaves your device.

  • External Communications: Some features (e.g., reconciliation services, Google Drive/Sheets, SQL databases, Wikibase) do involve sending data to third parties. These are clearly indicated in the UI.

  • Encryption: Data is not encrypted by default. You can protect it by enabling full disk encryption through your OS.

  • Cookies: OpenRefine uses cookies only for OAuth authentication (Google Drive, Wikibase).

Footnotes

  1. Ziemann, M., Eren, Y. & El-Osta, A. Gene name errors are widespread in the scientific literature. Genome Biol 17, 177 (2016). https://doi.org/10.1186/s13059-016-1044-7