This document is a compilation of NCEAS training material on version control to create a 45-60 min crash course on getting started with version control for RStudio users. You will learn:

  • how to set up git on your machine
  • how to create a repository on GitHub
  • the basic git workflow and manipulations using RStudio
  • how to collaborate using git and GitHub



1 Introduction to Version Control Concepts

Version control is a system that helps you manage the versions of your files. It saves you from duplicating files with "save as" as a way to keep different versions of a file. Version control helps you create a timeline of snapshots containing different versions of a file. Bonus: you can add a short description to remember what each specific version is about.

For scientists, version control is a useful tool that helps you track the changes you make to your scripts and enables you to share your code with your collaborators. For example, if you break your code, git can help you revert to an earlier working version. Another example could be that you want one of your collaborators to add a new feature to your code to improve your analysis. Version control can help you do so in a smooth and organized manner, tracking who changed what in the script.



2 git

This training material focuses on the code versioning system called Git. Note that there are others, such as Mercurial or svn.

Git is a free and open source distributed version control system. It has many functionalities and was originally geared towards software development and production environments. In fact, Git was initially designed and developed in 2005 by Linux kernel developers (including Linus Torvalds) to track the development of the Linux kernel. Here is a fun video of Linus Torvalds touting Git to Google.

2.1 How does it work?

Git can be enabled on a specific folder/directory on your file system to version files within that directory (including sub-directories). In git (and other version control systems) terms, this “tracked folder” is called a repository (which formally is a specific data structure storing versioning information).
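For readers who prefer the command line, the idea above can be sketched as follows. This is only a minimal illustration, assuming git is already installed; the folder name my-analysis and the temporary location are hypothetical.

```shell
set -e

# Hypothetical scratch location; any folder on your machine would do.
repo="$(mktemp -d)/my-analysis"
mkdir -p "$repo"
cd "$repo"

# Enable git on this folder. This creates the hidden .git sub-directory,
# the data structure in which git stores the versioning information.
git init --quiet

# List everything, including hidden entries, to see the new .git folder.
ls -a
```

Deleting the .git folder would turn the repository back into an ordinary folder, which is why the versioning information is said to live in that data structure.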

3 What git is not

  • Git is not a backup per se
  • Git is not good at versioning large files (there are workarounds) => not meant for data


4 Setting up git on your computer

Most macOS and Linux computers come with git pre-installed, but it is not always directly usable. The best way to test whether git is ready to use is from the command line:

git --version
## git version 2.17.2 (Apple Git-113)

It should return something like the above. If you get an error, you will have to install git.

Windows users will have to install software called Git Bash (part of Git for Windows) before being able to use git.

4.1 Installing git

You can download a copy of git here: https://git-scm.com/downloads and follow the instructions.

4.1.1 Windows

You can keep the default options during the installation until you reach Configuring the terminal emulator to use with Git Bash; there, be sure Use MinTTY is selected. This will install both git and a set of useful command-line tools using a trimmed down Bash shell.

4.1.2 Mac OSX

Depending on the version, you might have to run a few commands from the terminal. Please refer to the README.txt that comes with the download regarding the exact steps to follow.

4.2 Setting up your git identity

Before you start using git on any computer, you will have to set your identity on your system, as every snapshot of files is associated with the user who made the modifications to the file(s).

Open the Terminal or git bash and then type the following commands.

4.2.1 Set up your profile

Your name and email:

git config --global user.name "your Full Name"
git config --global user.email "your Email"

4.2.2 Optional

Check that everything is correct:

git config --global --list

Modify everything at the same time:

git config --global --edit

Set your text editor:

git config --global core.editor nano

Here nano is used as an example; you can choose almost any text editor you have installed on your computer (atom, sublime, notepad++ …).

Problem with any of those steps? Check out the troubleshooting section of Jenny Bryan's Happy Git and GitHub for the useR.

4.3 Linking git and RStudio

In most cases, RStudio should automatically detect git when it is installed on your computer. The best way to check this is to go to the Tools menu -> Global Options and click on Git/SVN.

If git is properly set up, the window should look like this:

Click OK.

Note: if git was not enabled, you might be asked to restart RStudio to enable it.



5 First Repository

As described earlier, a repository is a folder/directory on your file system (including its sub-directories) in which git tracks the versions of your files.

Although there are many ways to start a new repository, GitHub (or any other cloud solution, such as GitLab) provides one of the most convenient ways of starting one.

5.1 GitHub

Let’s distinguish between git and GitHub:

  • git: version control software used to track files in a folder (a repository)
    • git creates the versioned history of a repository
  • GitHub: web site that allows users to store their git repositories and share them with others

GitHub is a company that hosts git repositories online and provides several collaboration features (such as forking). GitHub fosters a great user community and has built a nice web interface to git, which also adds great visualization/rendering capabilities for your data.

5.2 Let’s look at a GitHub repository

This screen shows the copy of a repository stored on GitHub, with its list of files, when the files and directories were last modified, and some information on who made the most recent changes.

If we drill into the “commits” for the repository, we can see the history of changes made to all of the files. Looks like kellijohnson and seananderson were fixing things in June and July:

And finally, if we drill into the changes made on June 13, we can see exactly what was changed in each file:

Tracking these changes, and seeing how they relate to released versions of software and files is exactly what Git and GitHub are good for. We will show how they can really be effective for tracking versions of scientific code, figures, and manuscripts to accomplish a reproducible workflow.

5.3 Creating a Repository on GitHub

We are going to create a new repository on your GitHub account. If you do not have an account yet, it is free to create one here: https://github.com/join?source=header-home

To create a new repository follow these steps:

  • Click on the New repository button
  • Enter a descriptive name for your new repository, myfirst-repo
    (avoid upper case and use - instead of spaces or _)
  • Write a 1-sentence description about the repository content
  • Choose “Public” (as of January 2019, GitHub now offers unlimited free private repositories with a maximum of 3 collaborators)
  • Check “Initialize this repository with a README”
  • Add a .gitignore file (optional). As the name suggests, the .gitignore file is used to specify the files (by name or by pattern, such as a file extension) that git should not track. GitHub offers pre-written .gitignore files for convenience
  • Add a license file (optional)

Here is a website to look for more pre-written .gitignore files: https://github.com/github/gitignore
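To make the role of the .gitignore file more concrete, here is a small sketch that can be run in a scratch folder. The patterns and the file names analysis.R and results.csv are chosen for illustration only and are not taken from the GitHub templates.

```shell
set -e

# Scratch repository used only for this illustration.
repo="$(mktemp -d)"
cd "$repo"
git init --quiet

# A small hand-written .gitignore (illustrative patterns).
cat > .gitignore <<'EOF'
.Rhistory
.RData
.Rproj.user/
*.csv
EOF

# Two hypothetical files: a script we want to track and a data file we do not.
touch analysis.R results.csv

# git check-ignore prints the paths that match an ignore pattern.
ignored="$(git check-ignore analysis.R results.csv)"
echo "$ignored"
```

Only results.csv is reported, because it matches the *.csv pattern; analysis.R remains available for tracking.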


=> Here it is, you now have a repository in the cloud!!


5.4 Getting a Local Copy of a Repository

The next step is to get a local copy of this repository onto your personal computer. In git jargon, creating an exact copy of a repository on your local computer is called cloning.

RStudio can help us to clone a repository. Since RStudio Projects also work at the folder/directory level, it is the “unit” that is going to be used to link a repository to RStudio.

  1. You can create a new RStudio Project from the upper-right corner of the RStudio IDE window, choosing New Project
  2. Choose Version Control
  3. Select git
  4. Go back to your web browser and from the GitHub repository page click on the green clone or download button and copy the URL to your repository.
    Note: The URL should start with “https://”. If it starts with “git@github.com”, click “Use HTTPS” above the URL.
  5. Paste this URL in the first box and leave the second box empty. Finally select a location on your hard drive where the repository will be cloned to.
  6. Click Create Project


=> Congratulations!! You have cloned the repository to your computer and created an RStudio project out of it.
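Behind the scenes, the RStudio steps above rely on git's clone command. The sketch below shows a command-line equivalent under one important assumption: a local bare repository (remote.git) stands in for the repository hosted on GitHub, so that the example runs without a network connection or an account.

```shell
set -e

work="$(mktemp -d)"
cd "$work"

# Assumption: this local bare repository plays the role of GitHub.
git init --quiet --bare remote.git

# With a real repository, the HTTPS URL copied from GitHub would replace
# "$work/remote.git". Git may warn that the cloned repository is empty.
git clone --quiet "$work/remote.git" myfirst-repo

cd myfirst-repo

# "origin" is the name git gives to the remote that the clone came from.
git remote -v
origin_url="$(git remote get-url origin)"
```

Unlike the RStudio wizard, a plain clone does not create the .Rproj file; that file is specific to RStudio Projects.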


You can also use your computer file browser to look at the files in the repository. You have two files:

  • The my-repo-name.Rproj file for the RStudio Project you just created. Note that because we left the second box empty on step 5, the name of the repository was used to name the RStudio project. This file will be what you open to begin working on your R and RMarkdown scripts. The Rproj file will save your settings and open tabs when you close the project, and will restore these settings the next time you open it.
  • The README.md file that was automatically generated by GitHub when creating the repository

If you look again at your repository page on GitHub you will notice that the .Rproj file is not there. This is because the file was created by RStudio on your local machine and you have not yet synchronized the files between your local copy and the one in the cloud (the remote copy in git jargon). Note also that the .gitignore file does not show up in your computer's file browser (for example the Finder). This is because files with a name starting with a dot are considered “hidden”, and by default most operating systems do not show them. However, if you use the Files panel in RStudio, you can see the .gitignore file.

We are going to edit the README.md file, adding more information about the repository (the purpose of this file). You can edit this file directly in RStudio by clicking on its name in the Files tab in the lower-right panel.



6 Tracking File Changes with git

6.1 Basic Workflow Overview

  1. You modify files in your working directory and save them as usual

  2. You add snapshots of your changed files to your staging area

  3. You do a commit, which takes the files as they are in the staging area and permanently stores them as snapshots to your Git directory.

We can make an analogy with taking a family picture, where each family member would represent a file.

  • Adding files (to the staging area) is like deciding which family member(s) are going to be in your next picture
  • Committing is like taking the picture

This 2-step process enables you to flexibly group files into a specific commit.
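The three steps of the workflow can be sketched on the command line as follows. This is a minimal illustration in a scratch repository; the file names, their contents and the identity values are placeholders.

```shell
set -e

repo="$(mktemp -d)"
cd "$repo"
git init --quiet
# Identity set for this scratch repository only (placeholder values).
git config user.name "Your Full Name"
git config user.email "you@example.com"

# 1. Modify files in the working directory and save them.
echo "# myfirst-repo" > README.md
echo "x <- 1" > analysis.R

# 2. Stage only the file(s) that belong in the next snapshot.
git add README.md

# 3. Commit: store the staged snapshot in the git database.
git commit --quiet -m "Added README describing the repository"

# analysis.R was not staged, so it is still reported as untracked (??).
status="$(git status --short)"
echo "$status"
commits="$(git rev-list --count HEAD)"
```

In terms of the family picture analogy, README.md was invited into the picture while analysis.R stayed out of the frame for now.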

These steps are repeated for every version you want to keep (every time you would otherwise use save as). Every time you commit, you create a new snapshot: you add the new version of the file to the git database while keeping all the previous versions in the database. This creates a history of the content of your repository, which is like a graph that you can navigate.

6.2 Using git from RStudio

6.2.1 Tracking changes

RStudio provides a great interface to git that helps you navigate the git workflow and get information about the state of your repository through nice icons and visualizations.

If you click on the Git tab in your RStudio upper-right panel, you should see the following information:

The RStudio Git pane lists every file that’s been added, modified or deleted. The icon describes the change:

from R packages, H. Wickham

In our case, it means that:

  • the .gitignore file has been modified since the last commit
  • the .Rproj file has never been tracked by git (remember RStudio just created this project file for us)

Note also that the README.md file is not listed, but it exists (see the Files pane). This is because files with no modifications since the last commit are not listed.

GitHub has created the .gitignore file for us and we have not modified it since. So why is it listed as modified? We can check this by clicking on the Diff button (upper-left on the Git pane).

We can see that a new line (in green) has been added at the end of the .gitignore file. In fact, RStudio did that when creating the project to make sure that some temporary files are not tracked by git.

Let us improve the content of the README.md file as below to make it more descriptive.

As soon as you save your changes, you should see the README.md file listed as modified in the git pane.

Let us look at the diff of the README.md file. As you can see, the original lines are in red; in other words, for git those lines have been deleted. The new lines that we just typed are in green, which indicates that for git these lines have been added. Note the line numbers in the left margin that help you track which lines have been removed and added.
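The same information is available from the command line with git diff, where removed lines are prefixed with "-" and added lines with "+" instead of being colored red and green. The sketch below uses a scratch repository and made-up README content.

```shell
set -e

repo="$(mktemp -d)"
cd "$repo"
git init --quiet
git config user.name "Your Full Name"      # placeholder identity
git config user.email "you@example.com"

# First version of the file, committed as the reference snapshot.
echo "# myfirst-repo" > README.md
git add README.md
git commit --quiet -m "Added README"

# Replace the original line with a more descriptive one (made-up text).
echo "# myfirst-repo: training repository for learning git" > README.md

# Show what changed in the working copy compared to the last commit.
git --no-pager diff README.md

# Count the removed and added content lines reported by the diff.
removed="$(git diff README.md | grep -c '^-# ')"
added="$(git diff README.md | grep -c '^+# ')"
```

Editing a line therefore shows up as one deletion plus one addition, which matches what the RStudio diff window displays.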

6.2.2 Keeping Changes as Snapshots

Now we would like to save a snapshot of this version of the README.md file. Here are the steps we will need to do:

  1. Add the file to the next commit by checking the box in front of the file name in the git pane.
    Note that the icon M will move to the left to show you that this file is now staged to be part of the next commit
  2. Commit:
    1. Click the Commit button at the top of the git pane
    2. Write a short but descriptive commit message in the new window
    3. Click on the Commit button to save this version of the file in the git database
    4. Close the window to get back to the main RStudio window

Once done, add both the .gitignore and the myfirst-repo.Rproj files and commit them together.

Note that the icons at the top of the git pane have been organized in sequence from left to right to match the git workflow.

6.2.3 Good Commit Message Tips

Clearly, good documentation of what you’ve done is critical to making the version history of your repository meaningful and helpful. It’s tempting to skip the commit message altogether, or to add some stock blurb like ‘Updates’. It’s better to use messages that will be helpful to your future self in deducing not just what you did, but why you did it. Also, commit messages are best understood if they follow the active verb convention. For example, you can see that my commit messages all started with a past tense verb, and then explained what was changed.

While some of the changes we illustrated here were simple and easily explained in a short phrase, for more complex changes, it’s best to provide a more complete message. The convention, however, is to always have a short, terse first sentence, followed by a more verbose explanation of the details and rationale for the change. This keeps the high level details readable in the version log. I can’t count the number of times I’ve looked at the commit log from 2, 3, or 10 years prior and been so grateful for the diligence of my past self and collaborators.
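As an illustration of the "short first sentence, then details" convention, the sketch below writes such a message from the command line. The script content and the wording of the message are invented for the example; passing -m twice is one way, among others, to separate the subject from the body.

```shell
set -e

repo="$(mktemp -d)"
cd "$repo"
git init --quiet
git config user.name "Your Full Name"      # placeholder identity
git config user.email "you@example.com"

echo "threshold <- 0.05" > analysis.R
git add analysis.R

# The first -m is the short subject line; the second -m becomes the body,
# which git separates from the subject with a blank line.
git commit --quiet \
  -m "Added significance threshold to analysis script" \
  -m "Stored the threshold in a single variable so that it can be changed in one place instead of being repeated in every test."

# %s extracts only the subject line; the full log shows the body as well.
subject="$(git log -1 --format=%s)"
git --no-pager log -1
```

Compact views of the history show only the subject line, which is why keeping it short and informative pays off.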

6.3 Looking at the Repository History

We have done 2 new commits at this point. Let us look at the commit timeline we have created so far. You can click on the Clock icon at the top to visualize the history.

You can see that there have been 3 commits so far: the first one was made by GitHub when we created the repository, followed by the 2 commits we just made. The most recent commit is at the top.
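On the command line, git log gives the same timeline. The following sketch builds a scratch repository with three commits (the file names mirror this tutorial, the contents are placeholders) and prints one line per commit, most recent first.

```shell
set -e

repo="$(mktemp -d)"
cd "$repo"
git init --quiet
git config user.name "Your Full Name"      # placeholder identity
git config user.email "you@example.com"

# Create three commits, one file at a time (placeholder content).
for f in README.md .gitignore analysis.R; do
  echo "# $f" > "$f"
  git add "$f"
  git commit --quiet -m "Added $f"
done

# One line per commit: abbreviated identifier followed by the subject.
git --no-pager log --oneline

newest="$(git log -1 --format=%s)"
count="$(git rev-list --count HEAD)"
```

The abbreviated identifiers at the start of each line differ on every machine, since they are derived from the commit content, author and date.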

6.4 Sending changes back to GitHub

Now that we have created these two commits on our local machine, our local version of the repository is different from the version on GitHub. RStudio communicates this information to you. If you look below the icons on the git pane, you will see the warning message: “Your branch is ahead of ‘origin/master’ by 2 commits”. This means that you have two additional commits on your local machine that you have never shared with the remote repository on GitHub. Open your favorite web browser and look at the content of your repository on GitHub. You will see the old versions of the README.md and .gitignore files and no trace of the .Rproj file.

There are two git commands to exchange between the local and remote versions of a repository:

  • pull: git will get the latest remote version and try to merge it with your local version
  • push: git will send your local version to the remote version of the repository (in our case GitHub)

Before sending your local version to the remote, you should always get the latest remote version first. In other words, you should pull first and push second. This is the way git protects the remote version against incompatibilities with the local version. You always deal with potential problems on your local machine. Therefore your sequence will always be:

  1. pull
  2. push

Of course RStudio has icons for that at the top of the git pane: the blue arrow pointing down is for pull and the green arrow pointing up is for push. Remember the icons are organized in sequence!
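The pull-then-push sequence can also be typed in a terminal. The sketch below assumes, as a stand-in for GitHub, a local bare repository called remote.git so that it runs offline; the branch name is read from git rather than hard-coded because the default name depends on the git configuration.

```shell
set -e

work="$(mktemp -d)"
cd "$work"
git init --quiet --bare remote.git          # assumption: stand-in for GitHub
git clone --quiet "$work/remote.git" local  # git may warn the clone is empty
cd local
git config user.name "Your Full Name"       # placeholder identity
git config user.email "you@example.com"

echo "# myfirst-repo" > README.md
git add README.md
git commit --quiet -m "Added README"
branch="$(git symbolic-ref --short HEAD)"

# First push of a new branch: -u links it to the branch on the remote.
git push --quiet -u origin "$branch"

# A new local commit puts the local branch ahead of the remote by 1 commit.
echo "Purpose: learning git" >> README.md
git commit --quiet -am "Described purpose of the repository"
git status --short --branch
ahead_before="$(git rev-list --count "origin/$branch..HEAD")"

# Always in this order: 1. pull, 2. push.
git pull --quiet
git push --quiet

ahead_after="$(git rev-list --count "origin/$branch..HEAD")"
```

Here the pull finds nothing new because nobody else changed the remote; with collaborators, this is the step where their commits would arrive and be merged locally before your push.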

Let us do the pull and push to synchronize the two repositories. We have now synchronized the local (our computer) and remote (on GitHub) versions of our repository.

If you now look at the page of your repository on GitHub, you should see 3 files with the exact same versions that you have on your local machine!



7 Collaborative Workflows with GitHub

7.1 Collaborating through Forking, aka the GitHub workflow

A fork is a copy of a repository that will be stored under your user account. Forking a repository allows you to freely experiment with changes without affecting the original project. We can create a fork on GitHub by clicking the “fork” button in the top right corner of our repository webpage.

Most commonly, forks are used to either propose changes to someone else’s project or to use someone else’s project as a starting point for your own idea.

When you are satisfied with your work, you can open a Pull Request to start a discussion about your modifications and to request that your changes be integrated into the main repository. Your commit history allows the original repository administrators to see exactly what changes would be merged if they accept your request. Do this by going to the original repository and clicking the “New pull request” button.

Next, click “compare across forks”, and use the dropdown menus to select your fork as the “head fork” and the original repository as the “base fork”.

Then type a title and description for the changes you would like to make. By using GitHub’s @mention syntax in your Pull Request message, you can ask for feedback from specific people or teams.

This workflow is recommended when you do not have push/write access to a repository, such as when contributing to open source software or an R package, or if you are heavily changing a project.

7.2 Collaborating through write / push access

When you collaborate closely and actively with colleagues, you do not necessarily want to have to review all their changes through pull requests. You can then give them write access (git push) to your repository to allow them to directly edit and contribute to its content. This is the workflow we recommend using within your working group.

7.2.0.1 Adding collaborators to a repository

  • Click on the repository
  • On the right panel, click Settings
  • On the left pane, click Collaborators and enter the usernames you want to add as collaborators

Under this collaborative workflow, we recommend using git branches combined with pull requests to avoid conflicts and to track and discuss collaborators' contributions.

8 Branches

What are branches? Well in fact nothing new, as the master is a branch. A branch represents an independent line of development, parallel to the master (branch).

Why should you use branches? For 2 main reasons:

  • We want the master to only keep a version of the code that is working
  • We want to version the code we are developing to add/test new features (for now we mostly talk about feature branches) in our script without altering the version on the master.

8.1 Working with branches

8.1.1 Creating a new branch

In RStudio, you can create a branch using the git tab.

  1. Click on the branch button
  2. Fill the branch name in the new branch window; in this example, we are going to use test for the name; leave the other options as default and click create
  3. This directly creates both a local and a remote branch and switches to the new branch

Congratulations, you just created your first branch!

Let us check on GitHub:

As you can see, now there are two branches on our remote repository:

  • master
  • test

8.1.2 Using a branch

Here there is nothing new. The workflow is exactly the same as before, except that our commits will be created on the test branch instead of the master branch.
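Creating and using a branch can be sketched on the command line as follows. The example runs in a scratch repository without a remote, so unlike the RStudio dialog it creates only the local branch; publishing it would additionally require something like git push -u origin test. The script name and content are hypothetical.

```shell
set -e

repo="$(mktemp -d)"
cd "$repo"
git init --quiet
git config user.name "Your Full Name"      # placeholder identity
git config user.email "you@example.com"

echo "# myfirst-repo" > README.md
git add README.md
git commit --quiet -m "Added README"

# Name of the main line of development (master in this tutorial).
main_branch="$(git symbolic-ref --short HEAD)"

# Create the branch "test" and switch to it in one step.
git checkout --quiet -b test

# Same workflow as before, but the commit lands on the test branch.
echo 'print("hello")' > script.R
git add script.R
git commit --quiet -m "Added first R script"

# Back on the main branch, the new script is not there: it exists only
# in the history of the test branch until the branches are merged.
git checkout --quiet "$main_branch"
test_commits="$(git rev-list --count test)"
```

This is what keeps the main branch in a working state while new features are being developed and tested.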

Let us create our first R script in this repository.



9 Managing Merge Conflicts

The most common cause of merge conflicts happens when another user changes the same file that you just modified. It can happen during pull from a remote repository (or when merging branches).

  1. If you know for sure what file version you want to keep:
  • keep the remote file: git checkout --theirs conflicted_file.txt
  • keep the local file: git checkout --ours conflicted_file.txt

=> You still have to git add and git commit after this (git checkout)

  2. If you do not know why there is a conflict: Dig into the files, looking for:
<<<<<<< HEAD
local version (ours)
=======
remote version (theirs)
>>>>>>> [remote version (commit#)]

Manually make the required changes to merge these two versions, and remove the lines that git created (e.g. “<<<<<<< HEAD”).
=> You still have to git add and git commit after this

During this process, if you want to roll back to the situation before you started the merge: git merge --abort
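To see these commands in action without involving a remote, the sketch below provokes a conflict between two local branches and resolves it by keeping the local version. The branch name other and the file contents are invented for the example; with a pull, the remote branch would play the role of "theirs".

```shell
set -e

repo="$(mktemp -d)"
cd "$repo"
git init --quiet
git config user.name "Your Full Name"      # placeholder identity
git config user.email "you@example.com"

echo "original line" > conflicted_file.txt
git add conflicted_file.txt
git commit --quiet -m "Added file"
main_branch="$(git symbolic-ref --short HEAD)"

# Another line of development changes the line (this will be "theirs").
git checkout --quiet -b other
echo "their version" > conflicted_file.txt
git commit --quiet -am "Changed line on other branch"

# The main branch changes the same line differently (this is "ours").
git checkout --quiet "$main_branch"
echo "our version" > conflicted_file.txt
git commit --quiet -am "Changed line on main branch"

# The merge stops with a conflict; "|| true" lets the script continue.
git merge other || true
cat conflicted_file.txt                    # shows the conflict markers
markers="$(grep -c '^<<<<<<< ' conflicted_file.txt)"

# Keep the local version, then add and commit to conclude the merge.
git checkout --ours conflicted_file.txt
git add conflicted_file.txt
git commit --quiet -m "Resolved conflict keeping local version"
result="$(cat conflicted_file.txt)"
```

Replacing --ours with --theirs in the sketch would keep "their version" instead.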







Creative Commons License