Tutorial on how to work with Git

Motivation

Group work on the same codebase can be challenging if participants don't share a structured workflow. To make this process easy, people have come up with fantastic tools that let teams work together seamlessly: version control systems.

In this tutorial, we are going to explore Git, one of the most popular version control systems, used pretty much everywhere, from projects with friends to industry-scale codebases shared with thousands of other engineers.

This article aims to show the main commands that you will be using in your day-to-day life. You can use it as a recipe to structure your first steps with code versioning. Later on, you will likely remember these commands on your own (great!), or with just a reduced set of notes, such as the one provided by the study guide associated with this tutorial.

Among the plethora of advantages of version control, we can note the following to be particularly helpful:

keeping track of successive states of the code
enabling multiple people to work on the same piece of code

To use Git, we will use a command line interface such as Terminal on macOS/Linux, or its equivalent provided by the Windows Subsystem for Linux (WSL) on Windows. For illustration purposes, we will apply Git concepts to a repository hosted on GitHub, but note that the reasoning remains unchanged with any other service (e.g. GitLab or Bitbucket, among others).

Getting started

Installation

Making your life easy with SSH

Who wants to type their credentials all the time? That's right, no one! To make our life easy, we can set up a pair of public/private SSH keys associated with our account that GitHub will silently use to authenticate us at each interaction with the server.

We can do so in 3 steps:

Generate (or reuse) a key (see GitHub's tutorials: Checking for existing SSH keys and Generating a new SSH key).
Add the key to the system's key manager (see GitHub's tutorial: Adding your SSH key to the ssh-agent).
Add the key to your GitHub account so that it can automatically recognize your computer (see GitHub's tutorial: Adding a new SSH key to your GitHub account).

Important note: this step is optional. If you don't wish to do it, that's fine! Just keep in mind that when following this guide, you will have to replace instances of

git@github.com:[username]/[repo-name].git

with

https://github.com/[username]/[repo-name].git

which serves an equivalent purpose, the only difference being the inconvenience of entering our GitHub credentials at each interaction with the remote server (over HTTPS, GitHub expects a personal access token rather than the account password).

Start

From scratch

In order to get a fresh start on a repository, create a folder with the desired name [repo-name], open a console and go to that location with the command

cd path/to/[repo-name]

Once this is done, we need to tell Git to start tracking changes that will occur at that location. To do so, type

git init

Now, we create a repository on the GitHub website with the same name [repo-name]. We need to link this remote repository to the local folder with the command

git remote add origin git@github.com:[username]/[repo-name].git

We are now ready to make use of it!
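The steps above can be sketched end to end as follows. This is a minimal sketch under a few assumptions: the folder is created inside a temporary directory so nothing on your machine is touched, and `your-username` and `my-repo` are hypothetical placeholders for [username] and [repo-name]. No connection to GitHub is made at this stage; the remote is only registered.

```shell
set -e

# Hypothetical placeholders: swap in your own GitHub username and repository name.
username="your-username"
repo_name="my-repo"

# Create the project folder (inside a temporary directory for this sketch) and enter it.
workdir="$(mktemp -d)"
mkdir "$workdir/$repo_name"
cd "$workdir/$repo_name"

# Tell Git to start tracking changes at this location.
git init -q

# Link the local folder to the remote repository under the name "origin".
git remote add origin "git@github.com:$username/$repo_name.git"

# Check which URL is registered for "origin".
remote_url="$(git remote get-url origin)"
echo "$remote_url"
```

Running `git remote -v` at any time lists the registered remotes, which is a quick way to check that the link was set up as intended.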

From an existing repository

This method is adapted to cases where we want to continue working on a repository already on GitHub. To clone it, type

git clone git@github.com:[username]/[repo-name].git

and then

cd [repo-name]

to switch to the [repo-name] folder. That's it, we are now ready for the real stuff!
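To try the clone step without a GitHub account, one option is to let a local "bare" repository play the role of the server. The sketch below assumes exactly that: `my-repo.git` is a hypothetical stand-in created in a temporary directory, and with GitHub the argument of git clone would be the git@github.com:[username]/[repo-name].git address instead.

```shell
set -e
workdir="$(mktemp -d)"
cd "$workdir"

# Stand-in for the server: a bare repository in a local folder (hypothetical name).
git init -q --bare my-repo.git

# Clone it. Git warns on stderr that the repository is empty, which is fine here.
git clone -q my-repo.git my-repo 2>/dev/null

# Switch to the cloned folder and confirm where "origin" points.
cd my-repo
origin_url="$(git remote get-url origin)"
echo "$origin_url"
```

Note that, unlike the from-scratch route, cloning registers the origin remote automatically.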

Configuration

Local

This is an important step to set up the identity we will be using in our commits. We can specify the full name of our profile associated with this repository with

git config user.name '[Your Name]'

as well as our commit email address with

git config user.email '[your@email.com]'

Tip: specify the email address you listed on your GitHub account so that commits displayed on the web interface show up as coming from your account.

Global

If we want our full name/email settings to apply to all repositories on our computer, we use the same commands as above, with git config replaced by git config --global. When both are set, the local settings of a repository take precedence over the global ones.
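As a quick check, the configured identity can be read back with git config --get. The sketch below works in a throwaway repository, and the name and email are placeholder values; the global variant is left commented out so that running the sketch does not change your own settings.

```shell
set -e
repo="$(mktemp -d)"
cd "$repo"
git init -q

# Local identity: stored in this repository's .git/config only.
git config user.name 'Example User'
git config user.email 'user@example.com'

# Read the values back.
name="$(git config --get user.name)"
email="$(git config --get user.email)"
echo "$name <$email>"

# Global variant (commented out so the sketch leaves your settings alone):
# git config --global user.name 'Example User'
# git config --global user.email 'user@example.com'
```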

Making progress on the code

Retrieve changes made by others

When we develop code with others, we have to pull their changes to our local folder from time to time to make sure our work takes into account their latest changes.

In order to pull the changes of the main branch (called master here; newer repositories often name it main) from the remote repository (called origin), we type

git pull origin master

which can be read as "git, please pull changes from the remote repository origin that are related to the master branch".

When the retrieved code does not touch the lines we changed, the merge is done automatically. When it does, and the modified lines collide with our local changes, "conflicts" will need to be resolved. No need to worry, this can be done in a few structured steps!

These conflicts usually show up in the format

<<<<<<< HEAD
Code currently here
=======
Code conflict brought by commit [commit_number]
>>>>>>> [commit_number]

where each segment of code between the delimiters <<<<<<< (or >>>>>>>) and ======= corresponds to a specific version of the code:

HEAD refers to our local changes
[commit_number] refers to the hash string associated with the conflicting remote change. We can look up the corresponding description of the change from the list given by the command git log --oneline.

At this point, we have to decide which piece of code is to be kept (if applicable) or whether to write something else at this position altogether. To do so, we simply replace the block

<<<<<<< HEAD
[...]
>>>>>>> [commit_number]

with the correct code in order to resolve the conflict. If applicable, repeat the process at all other such locations.
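To see these markers without waiting for a real collision, a conflict can be reproduced locally. In the sketch below, a branch named `theirs` (a hypothetical name) stands in for the remote changes and git merge stands in for the merge that git pull performs; as a result the closing marker is labeled with the branch name rather than a commit hash. The file name and contents are made up for illustration.

```shell
set -e
repo="$(mktemp -d)"
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master   # make sure the first branch is called master
git config user.name 'Example User'
git config user.email 'user@example.com'

# Shared starting point.
echo 'greeting = "hello"' > app.txt
git add app.txt
git commit -q -m 'Add greeting'

# "Their" change, on a branch that plays the role of the remote work.
git checkout -q -b theirs
echo 'greeting = "bonjour"' > app.txt
git commit -q -am 'Translate greeting to French'

# "Our" change to the same line.
git checkout -q master
echo 'greeting = "hola"' > app.txt
git commit -q -am 'Translate greeting to Spanish'

# The merge stops on a conflict, so a non-zero exit status is expected here.
git merge theirs > /dev/null || true
conflicted="$(cat app.txt)"
echo "$conflicted"   # shows the <<<<<<< / ======= / >>>>>>> markers

# Resolve by writing the final content, then stage and commit the result.
echo 'greeting = "hola"' > app.txt
git add app.txt
git commit -q -m 'Resolve greeting conflict'
```

While a conflict is pending, git status lists the affected files under "Unmerged paths", which helps when several files collide at once.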

Staging area

Add files

Say we have changed n files and we want to take the corresponding changes into account. We need to move these changes to the staging area, which we can do with the git add command:

git add [file_1] [file_2] [file_n]

If we want to add all applicable changes, we can spare ourselves the trouble of listing every file explicitly by instead just typing

git add .

Remove files

Now, suppose that we have added n unwanted files to the staging area. To undo these actions, we can type

git reset HEAD [file_1] [file_2] [file_n]

to change the status of these files from "staged" to "unstaged".

Similarly, a succinct way of directly removing all files from the staging area is

git reset HEAD .

Note: contrary to what the naming might suggest, this command will not "reset" the files per se, but only remove them (as they are) from the staging area.
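The round trip in and out of the staging area can be observed with git status --short. This is a minimal sketch in a throwaway repository; the file names are hypothetical, and an initial commit is made first so that HEAD exists when git reset HEAD is called.

```shell
set -e
repo="$(mktemp -d)"
cd "$repo"
git init -q
git config user.name 'Example User'
git config user.email 'user@example.com'

# An initial commit so that HEAD exists.
echo 'v1' > notes.txt
git add notes.txt
git commit -q -m 'Add notes'

# Change one file and create another.
echo 'v2' > notes.txt
echo 'scratch' > draft.txt

# Stage everything, then take draft.txt back out of the staging area.
git add .
git reset -q HEAD draft.txt

# Short status: "M " in the first column marks a staged modification,
# "??" marks a file that Git sees but that is not staged.
status="$(git status --short)"
echo "$status"
```

The content of draft.txt is untouched by the reset; only its "staged" status changes.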

Take a snapshot of the changes

It is good practice to "save" the changes that are in the staging area when we feel they are:

self-sufficient, i.e. the code is expected to keep compiling and make sense without other changes.
targeted, e.g. they work towards bringing a single new/modified/removed feature. This property has two main goals. First, it is useful for us to place a label on the milestones the code is going through. Also, it makes the life of potential code reviewers easy as they are left with a relatively small change to review that has a concrete meaning.
coherent in that we can summarize the corresponding contribution in English terms.

In Git language, we say that it is time to do a "commit", or in other words, take a picture of the changes that we want to bring to the repository.

We do so with the command

git commit -m '[Your description]'

where the descriptive message attached to it is crucial for us to remember what the changes were about. One appropriate way of writing commit messages in a way that is explicit, yet succinct enough, is to place an action verb, followed by a (short) summary, e.g.:

Add data files used for X
Fix issue that used to occur after doing Y
Remove deprecated features Z
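Put together, taking two successive snapshots and listing them could look like the sketch below. It runs in a throwaway repository, and the file `data.csv` and its contents are hypothetical; the commit messages reuse the pattern suggested above.

```shell
set -e
repo="$(mktemp -d)"
cd "$repo"
git init -q
git config user.name 'Example User'
git config user.email 'user@example.com'

# First snapshot.
echo 'id,value' > data.csv
git add data.csv
git commit -q -m 'Add data files used for X'

# Second snapshot.
echo '1,42' >> data.csv
git add data.csv
git commit -q -m 'Fix issue that used to occur after doing Y'

# One line per snapshot, most recent first: short hash followed by the message.
git log --oneline
snapshot_count="$(git rev-list --count HEAD)"
latest_subject="$(git log -1 --format=%s)"
```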

Send changes to the server

Let's send our local changes to the remote server! To do so, simply type

git push origin master

which can be read as "git, please push changes to the remote repository origin that are related to the master branch".

If the remote repository contains changes that we have not pulled from the server already, we follow the steps:

Pull the latest changes from the remote repository
(if applicable) Resolve conflicts and add all files to the staging area
Take a snapshot of the merged changes

Now, we can finally push our changes to the remote repository!
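The pull-then-push sequence can be rehearsed locally. The sketch below makes several assumptions: a bare repository `remote.git` stands in for the server, two clones named `alice` and `bob` stand in for two contributors, and `make_clone` is a hypothetical helper written for this example. The --no-rebase flag is passed to git pull because recent Git versions ask how to reconcile diverging histories, and --no-edit accepts the default merge message.

```shell
set -e
workdir="$(mktemp -d)"
cd "$workdir"

# Stand-in for the server: a local bare repository whose main branch is master.
git init -q --bare remote.git
git --git-dir=remote.git symbolic-ref HEAD refs/heads/master

# Hypothetical helper for this sketch: clone the remote and set an identity.
make_clone() {
  git clone -q remote.git "$1" 2>/dev/null
  git -C "$1" symbolic-ref HEAD refs/heads/master
  git -C "$1" config user.name "$1"
  git -C "$1" config user.email "$1@example.com"
}

# Alice publishes the first commit.
make_clone alice
(cd alice && echo 'line from alice' > alice.txt && git add alice.txt \
  && git commit -q -m 'Add alice.txt' && git push -q origin master)

# Bob clones, then Alice pushes one more commit that Bob does not have.
make_clone bob
(cd alice && echo 'more' >> alice.txt \
  && git commit -q -am 'Extend alice.txt' && git push -q origin master)

# Bob commits his own work.
cd bob
echo 'line from bob' > bob.txt
git add bob.txt
git commit -q -m 'Add bob.txt'

# His push is rejected because the remote has changes he has not pulled.
git push -q origin master 2>/dev/null || echo 'push rejected, pulling first'

# Pull the latest changes; no conflict here, so the merge snapshot is automatic.
git pull -q --no-rebase --no-edit origin master

# Now the push goes through.
git push -q origin master
```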

Working in parallel

Create a new separate branch from existing code

Sometimes, we might want to work on something which we don't want to mix with the main branch (e.g. master) of the code. Such use cases include the development of a new feature, trying out a new idea, or simply kicking off a subproject that we know might break the current code.

Once again, Git has a workflow ready for us to do that! First, check that you are indeed currently on the master branch (or whatever branch you wish to initialize your new branch from). To do so, type

git branch

and you should see an output where one of the lines looks like * master, which indicates that we are currently on the master branch. If that's not the case, we can switch to it by typing the command

git checkout master

Now, we are ready to create our new branch called [branch_name]! To do so, we simply type

git checkout -b [branch_name]

which triggers two successive actions:

it creates a new branch of name [branch_name]
it switches the current working branch to it

That's it! We can now develop whatever we want on that new branch.
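As a sketch, the two checks and the branch creation look like this in a throwaway repository. The branch name `new-feature` is a hypothetical stand-in for [branch_name].

```shell
set -e
repo="$(mktemp -d)"
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master   # make sure the first branch is called master
git config user.name 'Example User'
git config user.email 'user@example.com'

echo 'base' > main.txt
git add main.txt
git commit -q -m 'Add main.txt'

# The line starting with * marks the current branch.
git branch

# Create the new branch and switch to it in one go.
git checkout -q -b new-feature

current="$(git rev-parse --abbrev-ref HEAD)"
echo "$current"
```

The new branch starts from the exact snapshot we were on, so nothing is lost or copied; it is simply a new label from which history can diverge.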

Progress on the code

To develop new code, all instructions detailed previously in this guide remain unchanged, except for actions requiring interactions with the remote branch, where instances involving a specific branch (e.g. master) have to be adapted accordingly.

For example, retrieving updates for this branch becomes

git pull origin [branch_name]

and sending changes to the remote server is now

git push origin [branch_name]

Merge branches together

After some time, say we have developed the feature of our dreams on the [branch_name] branch, and we now wish to merge it into the master branch. How can we do that?

First, switch to the master branch with the command

git checkout master

and then merge the content of the latest snapshot of the new branch with the command

git merge [branch_name]
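Here is a minimal sketch of a merge that goes through without conflicts, in a throwaway repository with a hypothetical branch `new-feature`. Since master received no commits in the meantime, Git simply moves master forward to the latest snapshot of the branch (a "fast-forward").

```shell
set -e
repo="$(mktemp -d)"
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master   # make sure the first branch is called master
git config user.name 'Example User'
git config user.email 'user@example.com'

echo 'base' > main.txt
git add main.txt
git commit -q -m 'Add main.txt'

# Develop the feature on its own branch.
git checkout -q -b new-feature
echo 'feature' > feature.txt
git add feature.txt
git commit -q -m 'Add feature file'

# Switch back to master and bring the feature in.
git checkout -q master
git merge -q new-feature

# feature.txt is now part of master.
ls
```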

Delete a branch that you don't plan to use

Sometimes, a feature we develop might not turn out to be like we wanted. When the time comes to clean up superfluous branches, first go to the master branch with the command git checkout master.

Locally

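To delete an unwanted [branch_name] on your local folder, type

git branch -d [branch_name]

Note: as a safety net, the -d option refuses to delete a branch whose commits have not been merged. To delete such a branch anyway (and give up its unmerged commits), use -D instead of -d.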

Remotely

We can also delete the unwanted [branch_name] on the repository located on the remote server by entering the command

git push origin --delete [branch_name]
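Both deletions can be tried out safely with a local stand-in for the server. In the sketch below, `remote.git` is a bare repository playing the role of the remote and `experiment` is a hypothetical branch that was pushed but never merged, so the local deletion has to be forced with -D once -d has refused.

```shell
set -e
workdir="$(mktemp -d)"
cd "$workdir"

# Stand-in for the server, plus a local repository linked to it.
git init -q --bare remote.git
git init -q local
cd local
git symbolic-ref HEAD refs/heads/master
git config user.name 'Example User'
git config user.email 'user@example.com'
git remote add origin "$workdir/remote.git"

echo 'base' > main.txt
git add main.txt
git commit -q -m 'Add main.txt'
git push -q origin master

# An experiment that is pushed but never merged into master.
git checkout -q -b experiment
echo 'idea' > idea.txt
git add idea.txt
git commit -q -m 'Try an idea'
git push -q origin experiment

# Go back to master before cleaning up.
git checkout -q master

# Locally: -d refuses because the branch is unmerged, so fall back to -D.
git branch -d experiment 2>/dev/null || git branch -q -D experiment

# Remotely: ask the server to drop the branch as well.
git push -q origin --delete experiment

git branch   # only master is left
```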

Frequent situations

Difference between two snapshots

If we wish to visualize the difference between two commits [commit_1] and [commit_2], we can type

git diff [commit_1] [commit_2]

which outputs the added/deleted/modified lines of [commit_2] with respect to [commit_1]. This technique is a quick way to see what changed between two states of the repository.

Note: Here, [commit_1] and [commit_2] represent hash numbers corresponding to two snapshots taken in the past. We can find these numbers on the summary of past commits output by the command git log --oneline.
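As an illustration, the sketch below records two snapshots in a throwaway repository, stores their short hashes in the shell variables `commit_1` and `commit_2`, and compares them. The file name and contents are hypothetical; in practice the hashes would be copied from the output of git log --oneline.

```shell
set -e
repo="$(mktemp -d)"
cd "$repo"
git init -q
git config user.name 'Example User'
git config user.email 'user@example.com'

# First snapshot.
echo 'alpha' > file.txt
git add file.txt
git commit -q -m 'Add file'
commit_1="$(git rev-parse --short HEAD)"

# Second snapshot.
echo 'beta' >> file.txt
git commit -q -am 'Append beta'
commit_2="$(git rev-parse --short HEAD)"

# List the snapshots, then show what changed from the first to the second.
git log --oneline
git diff "$commit_1" "$commit_2"
```

In the output of git diff, lines prefixed with + were added and lines prefixed with - were removed when going from the first commit to the second.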

Tell Git to ignore certain files

Sometimes, there are files that can be useful for us to have in the repository folder, but that we might not want to track with Git: e.g. system files, heavy files copied for local testing... No worries, Git also has a solution for that!

To stop Git from tracking the types of files of our choice, we need to create a file called .gitignore (this precise naming is important) at the root of the repository, where each line excludes files of our choice. What's magic about it is that the file is wildcard-friendly, meaning that we can exclude an entire category of files with a single expression. Comments start with # and must sit on their own line.

Here is a (non-exhaustive) list of the types of items that one might find useful to put in a .gitignore file:

# Files created when running Python code
__pycache__
# Jupyter notebook checkpoints
*.ipynb_checkpoints
# macOS file system-related files
*.DS_Store
# R code history
*.Rhistory
# All files that start with `test`
test*

This functionality can be seen as a productivity booster as it helps coders target and interact only with the files that matter and avoid additional overhead linked to noisy changes caused by files no one cared about in the first place.
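To verify that a pattern behaves as intended, git check-ignore reports whether a given path is excluded. The sketch below writes a two-pattern .gitignore in a throwaway repository and tests a few hypothetical files against it. Each comment sits on its own line, since Git only treats # as a comment marker at the start of a line.

```shell
set -e
repo="$(mktemp -d)"
cd "$repo"
git init -q

# A small .gitignore with two of the patterns discussed above.
cat > .gitignore <<'EOF'
# Files created when running Python code
__pycache__
# All files that start with `test`
test*
EOF

# Hypothetical files: two that should be ignored, one that should not.
mkdir __pycache__
touch __pycache__/module.pyc test_output.log analysis.py

# check-ignore exits with 0 for ignored paths and 1 for paths Git would track.
git check-ignore -q test_output.log && echo 'test_output.log is ignored'
git check-ignore -q analysis.py || echo 'analysis.py is not ignored'

# Ignored files no longer show up as untracked.
visible="$(git status --short)"
echo "$visible"
```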

Reinitialize an unstaged change made since the latest snapshot

Say we develop some code since the last snapshot, but then we realize we want to forget about these new changes (and just return to the latest snapshot). To do so, use the command

git checkout -- [file_1] [file_2] [file_n]

which can be simplified by

git checkout -- .

if the goal is to discard the unstaged changes of all files at once. Keep in mind that changes discarded this way cannot be recovered through Git.
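A minimal sketch of this "undo" in a throwaway repository, with a hypothetical file `config.txt`:

```shell
set -e
repo="$(mktemp -d)"
cd "$repo"
git init -q
git config user.name 'Example User'
git config user.email 'user@example.com'

# Latest snapshot.
echo 'stable' > config.txt
git add config.txt
git commit -q -m 'Add config'

# An edit we end up regretting (not staged).
echo 'experimental edit' > config.txt

# Return the file to its state in the latest snapshot.
git checkout -- config.txt

restored="$(cat config.txt)"
echo "$restored"
```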

Go back in time to a previous snapshot

The great thing about version control is that we have the option of going back in time if things go south. To do so, we have to identify the snapshot we would like to go back to and copy its hash [prev_commit] from the history of changes listed by the git log --oneline command.

Then, type the command

git reset --hard [prev_commit]

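to go back in time to the snapshot associated with [prev_commit]. Be careful: this removes the later commits from the current branch and discards any uncommitted changes to tracked files. That's it! We can now continue adding commits on top of it as if nothing had happened!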

Disclaimer: this mechanism does not exist in real life. :-)
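The time travel above can be sketched as follows in a throwaway repository. The file `log.txt` is hypothetical, and the hash of the target snapshot is stored in the shell variable `prev_commit` instead of being copied by hand from git log --oneline.

```shell
set -e
repo="$(mktemp -d)"
cd "$repo"
git init -q
git config user.name 'Example User'
git config user.email 'user@example.com'

# The snapshot we will want to come back to.
echo 'one' > log.txt
git add log.txt
git commit -q -m 'Add first line'
prev_commit="$(git rev-parse --short HEAD)"

# A later snapshot that turns out to be a mistake.
echo 'two' >> log.txt
git commit -q -am 'Add second line'

# Go back: the branch and the files return to the earlier snapshot.
git reset -q --hard "$prev_commit"

git log --oneline   # only 'Add first line' remains
cat log.txt
```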

Conclusion

That's it, now you know the basics of working with Git! The commands presented above are the ones you will need the vast majority of the time. Git will also likely be your savior in the rare cases where you need to play with more advanced functionalities. If a concept you are looking for is not mentioned here, Stack Overflow will be your next safest bet.