Data science tools
Tutorial on how to work with Git
git
bash
terminal
version control
By Afshine Amidi and Shervine Amidi
Motivation
Group work on the same codebase can be challenging if participants do not share a structured workflow. To make this process easy, people have come up with fantastic tools that let teams work together seamlessly: version control systems.
In this tutorial, we are going to explore Git, one of the most popular version control tools, used pretty much everywhere from projects with friends to industry-scale codebases shared with thousands of other engineers.
This article aims at showing the main commands that you will be using in your day-to-day life. You can use it as a recipe to structure your first steps with code versioning. Later on, you will likely remember these commands on your own (great!), or with just a reduced set of notes, such as the one provided by the study guide associated with this tutorial.
Among the plethora of advantages of version control, we can note the following to be particularly helpful:
- keeping track of successive states of the code
- enabling multiple people to work on the same piece of code
To use Git, we will use a command line interface such as Terminal on macOS/Linux or its equivalent provided by the Windows Subsystem for Linux (WSL) on Windows. For illustration purposes, we will apply Git concepts through a repository hosted on GitHub, but note that the reasoning remains unchanged with any other service (e.g. GitLab or Bitbucket, among others).
Getting started
Installation
Making your life easy with SSH
Who wants to type their credentials all the time? That's right, no one! To make our life easy, we can set up a pair of public/private SSH keys associated with our account that GitHub will silently use to authenticate us at each interaction with the server.
We can do so in 3 steps:
- Generate (or reuse) a key (see GitHub's tutorials: Checking for existing SSH keys and Generating a new SSH key)
- Add the key to the system's key manager (see GitHub's tutorial: Adding your SSH key to the ssh-agent)
- Add the key to your GitHub account so that it can automatically recognize your computer (see GitHub's tutorial: Adding a new SSH key to your GitHub account)
Important note: this step is optional. If you don't wish to do it, that's fine! Just keep in mind that by following this guide, you will have to replace instances of
git@github.com:[username]/[repo-name].git
by
https://github.com/[username]/[repo-name].git
which serves an equivalent purpose, the only difference being the inconvenience of entering our GitHub credentials at each interaction with the remote server (over HTTPS, GitHub expects a personal access token in place of the account password).
Start
From scratch
In order to get a fresh start on a repository, create a folder with the desired name [repo-name], open a console and go to that location with the command
cd path/to/[repo-name]
Once this is done, we need to tell Git to start tracking changes that will occur at that location. To do so, type
git init
Now, we create a repository on the GitHub website with the same name [repo-name]. We need to link this remote repository to the local folder with the command
git remote add origin git@github.com:[username]/[repo-name].git
We are now ready to make use of it!
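As a minimal sketch of the steps above, the following can be run in a throwaway folder. The account name jane-doe and the repository name my-project are hypothetical placeholders for [username] and [repo-name]; adding the remote only records the address and does not contact GitHub yet.

```shell
set -e
cd "$(mktemp -d)"                   # throwaway parent folder for the demo
mkdir my-project                    # plays the role of [repo-name]
cd my-project
git init -q                         # start tracking changes in this folder
git remote add origin git@github.com:jane-doe/my-project.git
git remote -v                       # lists the address registered under the name origin
```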
From an existing repository
This method is suited to cases where we want to continue working on a repository that already exists on GitHub. To clone it, type
git clone git@website-name.com:[username]/[repo-name].git
and then
cd [repo-name]
to switch to the [repo-name] folder. That's it, we are now ready for the real stuff!
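To try cloning without network access, one option is to let a local bare repository play the role of the server. This is only a stand-in chosen for illustration (the names server.git and my-project are hypothetical); with GitHub, the address would be the SSH or HTTPS one shown earlier.

```shell
set -e
work=$(mktemp -d)
cd "$work"
git init -q --bare server.git       # stand-in for the repository hosted on the server
git clone -q server.git my-project  # Git may warn that the cloned repository is empty
cd my-project
git remote -v                       # clone registered the source under the name origin
```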
Configuration
Local
This is an important step to set up the identity we will be using in our commits. We can specify the full name attached to our commits in this repository with
git config user.name '[Your Name]'
as well as our commit email address with
git config user.email '[your@email.com]'
Tip: specify the email address you listed on your GitHub account so that commits displayed on the web interface show up as coming from your account.
Global
If we want our full name/email settings to apply to all repositories on our computer, we use the same commands as above, with every instance of git config replaced by git config --global.
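The sketch below sets a repository-level identity and reads it back, which is a quick way to check what Git will record in commits. The name and email are hypothetical, and the last line only reads the global setting (without changing it), printing a fallback message when none is defined.

```shell
set -e
cd "$(mktemp -d)"
git init -q
git config user.name 'Jane Doe'             # hypothetical identity, stored in .git/config
git config user.email 'jane@example.com'
git config --local --get user.name          # prints the name set for this repository
git config --local --get user.email
# Read-only check of the global value, which applies when no local value is set
git config --global --get user.name || echo '(no global user.name set)'
```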
Making progress on the code
Retrieve changes made by others
When we develop code with others, we have to pull their changes to our local folder from time to time to make sure our work takes into account their latest changes.
In order to pull data from the main branch (master) of the remote repository (called origin), we type
git pull origin master
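which can be read as "git, please pull changes from the remote repository origin that are related to the master branch".
Note: on recently created repositories, the main branch is often named main instead of master. In that case, simply substitute the branch name in the commands of this guide.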
In cases where the retrieved code is orthogonal to our contributions, the merge is done automatically. Otherwise, when the modified lines collide with our local changes, "conflicts" need to be resolved. No worries, this can be done in a few structured steps!
These conflicts usually show up in the format
<<<<<<< HEAD
Code currently here
=======
Code conflict brought by commit [commit_number]
>>>>>>> [commit_number]
where each segment of code between the delimiters <<<<<<< (or >>>>>>>) and ======= corresponds to a specific version of the code:
- HEAD refers to our local changes
- [commit_number] refers to the hash string associated with the conflicting remote change. We can look up the corresponding description of the change in the list given by the command git log --oneline.
At this point, we have to decide which piece of code is to be kept (if applicable) or whether to write something else at this position altogether. To do so, we simply replace the block
<<<<<<< HEAD
[...]
>>>>>>> [commit_number]
by the correct replacement in order to solve the conflict. If applicable, repeat the process at all other such locations.
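To see a conflict from start to finish, the following sketch reproduces one locally. Everything in it is an assumption made for illustration: a bare repository in a temporary folder stands in for GitHub, the folders ours and theirs play the two collaborators, and settings.txt is a hypothetical file. The --no-rebase option is added to make the merge behavior explicit, since recent Git versions may otherwise ask how divergent branches should be reconciled.

```shell
set -e
work=$(mktemp -d)
git init -q --bare "$work/remote.git"              # stand-in for the GitHub repository
git --git-dir="$work/remote.git" symbolic-ref HEAD refs/heads/master

# Our repository: first version of the file, pushed to the remote
git init -q "$work/ours"
cd "$work/ours"
git symbolic-ref HEAD refs/heads/master            # name the branch as in this guide
git config user.name 'Our Name'
git config user.email 'us@example.com'
git remote add origin "$work/remote.git"
echo 'color = blue' > settings.txt
git add settings.txt
git commit -q -m 'Add settings file'
git push -q origin master

# A teammate clones the repository, edits the line and pushes
git clone -q "$work/remote.git" "$work/theirs"
cd "$work/theirs"
git config user.name 'Teammate'
git config user.email 'mate@example.com'
echo 'color = green' > settings.txt
git commit -q -am 'Change color to green'
git push -q origin master

# Meanwhile, we edit the same line and commit locally
cd "$work/ours"
echo 'color = red' > settings.txt
git commit -q -am 'Change color to red'

# Pulling now stops with a conflict (non-zero exit status, hence the "|| true")
git pull --no-rebase origin master || true
cat settings.txt                                   # shows the <<<<<<< ======= >>>>>>> block

# Resolve: write the content we want to keep, stage it and commit the merge
echo 'color = red' > settings.txt
git add settings.txt
git commit -q -m 'Merge remote changes, keep red'
```

Here the local version is kept, but the resolved file could just as well contain the teammate's line or something new altogether.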
Staging area
Add files
Say we have changed n files and we want to take the corresponding changes into account. We need to move these changes to the staging area, which we can do with the git add command:
git add [file_1] [file_2] [file_n]
If we want to add all applicable changes, we can spare ourselves the trouble of writing down all files explicitly by instead just typing
git add .
Remove files
Now, suppose that we have added n unwanted files to the staging area. To undo these actions, we can type
git reset HEAD [file_1] [file_2] [file_n]
to change the status of these files from "staged" to "unstaged".
Similarly, a succinct way of directly removing all files from the staging area is
git reset HEAD .
Note: contrary to what the naming might suggest, this command will not "reset" the files per se, but only remove them (as they are) from the staging area.
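The round trip between "staged" and "unstaged" can be observed with git status. The sketch below uses hypothetical file names in a temporary repository; an initial commit is made first because git reset HEAD needs an existing commit to refer to.

```shell
set -e
cd "$(mktemp -d)"
git init -q
git config user.name 'Demo User'
git config user.email 'demo@example.com'
echo 'first draft' > notes.txt
git add notes.txt
git commit -q -m 'Add notes'

echo 'second draft' > notes.txt     # change a tracked file
echo 'scratch' > scratch.txt        # create a new file
git add .                           # stage both changes
git status --short                  # both files are listed as staged
git reset -q HEAD scratch.txt       # unstage scratch.txt only
git status --short                  # scratch.txt is untracked again, its content untouched
```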
Take a snapshot of the changes
It is good practice to "save" the changes that are in the staging area when we feel they are:
- self-sufficient, i.e. the code is expected to keep compiling and make sense without other changes.
- targeted, e.g. they work towards bringing a single new/modified/removed feature. This property has two main goals. First, it is useful for us to place a label on the milestones the code is going through. Also, it makes the life of potential code reviewers easy as they are left with a relatively small change to review that has a concrete meaning.
- coherent, in that we can summarize the corresponding contribution in plain English.
In Git language, we say that it is time to do a "commit", or in other words, take a picture of the changes that we want to bring to the repository.
We do so with the command
git commit -m '[Your description]'
where the descriptive message attached to it is crucial for us to remember what the changes were about. One appropriate way of writing commit messages in a way that is explicit, yet succinct enough, is to place an action verb, followed by a (short) summary, e.g.:
Add data files used for X
Fix issue that used to occur after doing Y
Remove deprecated features Z
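Putting staging and committing together, a first snapshot could look like the sketch below. The folder and file names are hypothetical; the commit message reuses one of the examples above.

```shell
set -e
cd "$(mktemp -d)"
git init -q
git config user.name 'Demo User'
git config user.email 'demo@example.com'
mkdir data
echo 'id,value' > data/sample.csv   # hypothetical data file
git add data                        # stage the new folder
git commit -q -m 'Add data files used for X'
git log --oneline                   # one line per snapshot: short hash followed by the message
```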
Send changes to the server
Let's send our local changes to the remote server! To do so, simply type
git push origin master
which can be read as "git, please push changes to the remote repository origin that are related to the master branch".
If the remote repository contains changes that we have not pulled yet, the push is rejected and we first follow these steps:
- Pull the latest changes from the remote repository
- (if applicable) Resolve conflicts and add all files to the staging area
- Take a snapshot of the merged changes
Now, we can finally push our changes to the remote repository!
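The sequence above can be rehearsed locally. In this sketch, a bare repository in a temporary folder stands in for GitHub and all file names are hypothetical. The two collaborators touch different files, so the pull merges without conflict; --no-edit accepts the default merge message and --no-rebase makes the merge behavior explicit.

```shell
set -e
work=$(mktemp -d)
git init -q --bare "$work/remote.git"              # stand-in for the GitHub repository
git --git-dir="$work/remote.git" symbolic-ref HEAD refs/heads/master

git init -q "$work/ours"
cd "$work/ours"
git symbolic-ref HEAD refs/heads/master
git config user.name 'Our Name'
git config user.email 'us@example.com'
git remote add origin "$work/remote.git"
echo 'intro' > README.txt
git add README.txt
git commit -q -m 'Add README'
git push -q origin master

# A teammate pushes a new commit in the meantime
git clone -q "$work/remote.git" "$work/theirs"
cd "$work/theirs"
git config user.name 'Teammate'
git config user.email 'mate@example.com'
echo 'notes' > notes.txt
git add notes.txt
git commit -q -m 'Add notes'
git push -q origin master

# Our own commit, made without knowing about the teammate's push
cd "$work/ours"
echo 'todo' > todo.txt
git add todo.txt
git commit -q -m 'Add todo list'
git push -q origin master 2>/dev/null || echo 'push rejected: the remote has new commits'

git pull -q --no-rebase --no-edit origin master    # pull and merge the remote changes
git push -q origin master                          # the push now goes through
```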
Working in parallel
Create a new separate branch from existing code
Sometimes, we might want to work on something which we don't want to mix with the main branch (e.g. master) of the code. Such use cases include the development of a new feature, trying out a new idea, or simply kicking off a subproject that we know might break the current code.
Once again, Git has a workflow ready for us to do that! First, check that you are indeed currently on the master branch (or whatever branch you wish to initialize your new branch from). To do so, type
git branch
and you should see an output where one of the lines looks like * master, which indicates that we are currently on the master branch. If that's not the case, we can switch to it by typing the command
git checkout master
Now, we are ready to create our new branch called [branch_name]! To do so, we simply type
git checkout -b [branch_name]
which triggers two successive actions:
- it creates a new branch named [branch_name]
- it switches the current working branch to it
That's it! We can now develop whatever we want on that new branch.
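A short sketch of the branch creation, run in a temporary repository. The branch name new-feature and the file app.txt are hypothetical; a first commit is made because branches only show up in git branch once they point to a snapshot.

```shell
set -e
cd "$(mktemp -d)"
git init -q
git symbolic-ref HEAD refs/heads/master   # name the initial branch as in this guide
git config user.name 'Demo User'
git config user.email 'demo@example.com'
echo 'v1' > app.txt
git add app.txt
git commit -q -m 'Add app file'

git branch                                # the asterisk marks the current branch
git checkout -q -b new-feature            # create the branch and switch to it
git branch                                # the asterisk has moved to new-feature
```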
Progress on the code
To develop new code, all instructions detailed previously in this guide remain unchanged, except for actions that interact with the remote repository, where commands naming a specific branch (e.g. master) have to be adapted accordingly.
For example, retrieving updates for this branch becomes
git pull origin [branch_name]
and sending changes to the remote server is now
git push origin [branch_name]
Merge branches together
After some time, say we have developed the feature of our dreams on the [branch_name] branch, and we now wish to merge it into the master branch. How can we do that?
First, switch to the master branch with the command
git checkout master
and then merge the content of the latest snapshot of the new branch with the command
git merge [branch_name]
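The whole cycle (branch off, commit, come back, merge) fits in a few lines. The sketch below uses a temporary repository with hypothetical names; since master receives no new commit while the feature is being developed, the merge simply moves master forward to the latest snapshot of the feature branch.

```shell
set -e
cd "$(mktemp -d)"
git init -q
git symbolic-ref HEAD refs/heads/master
git config user.name 'Demo User'
git config user.email 'demo@example.com'
echo 'v1' > app.txt
git add app.txt
git commit -q -m 'Add app file'

git checkout -q -b new-feature            # develop on a separate branch
echo 'shiny feature' > feature.txt
git add feature.txt
git commit -q -m 'Add shiny feature'

git checkout -q master                    # go back to the main branch
git merge -q new-feature                  # bring the feature into master
git log --oneline                         # both commits are now part of master
```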
Delete a branch that you don't plan to use
Sometimes, a feature we develop might not turn out to be like we wanted. When the time comes to clean up superfluous branches, first go to the master branch with the command git checkout master.
Locally
To delete an unwanted [branch_name] in your local folder, type
git branch -d [branch_name]
Remotely
We can also delete the unwanted [branch_name] on the repository located on the remote server by entering the command
git push origin --delete [branch_name]
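One detail worth knowing for local deletion: git branch -d refuses to delete a branch whose commits have not been merged, which is typically the case for an abandoned experiment. The sketch below (temporary repository, hypothetical names) shows the refusal and the forced variant git branch -D, which deletes the branch regardless.

```shell
set -e
cd "$(mktemp -d)"
git init -q
git symbolic-ref HEAD refs/heads/master
git config user.name 'Demo User'
git config user.email 'demo@example.com'
echo 'v1' > app.txt
git add app.txt
git commit -q -m 'Add app file'

git checkout -q -b experiment             # an idea that we end up abandoning
echo 'idea' > idea.txt
git add idea.txt
git commit -q -m 'Try out an idea'
git checkout -q master

# -d protects unmerged work and fails here, so we report it and carry on
git branch -d experiment 2>/dev/null || echo 'refused: experiment has unmerged commits'
git branch -D experiment                  # force the deletion
git branch                                # only master is left
```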
Frequent situations
Difference between two snapshots
If we wish to visualize the difference between two commits [commit_1] and [commit_2], we can type
git diff [commit_1] [commit_2]
which outputs the added/deleted/modified lines of [commit_2] with respect to [commit_1]. This technique can be used as a quick way to visualize what changed between two states of the repository.
Note: Here, [commit_1] and [commit_2] represent the hashes of two snapshots taken in the past. We can find these hashes in the summary of past commits output by the command git log --oneline.
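As an illustration, the sketch below creates two snapshots of a hypothetical notes.txt in a temporary repository and compares them. Instead of copying the hashes by hand from git log --oneline, it retrieves them with git rev-parse, where HEAD designates the latest commit and HEAD~1 the one before it.

```shell
set -e
cd "$(mktemp -d)"
git init -q
git config user.name 'Demo User'
git config user.email 'demo@example.com'
printf 'first line\n' > notes.txt
git add notes.txt
git commit -q -m 'Add notes'
printf 'first line\nsecond line\n' > notes.txt
git commit -q -am 'Extend notes'

git log --oneline                            # the hashes appear in the first column
commit_1=$(git rev-parse --short HEAD~1)     # older snapshot
commit_2=$(git rev-parse --short HEAD)       # newer snapshot
git diff "$commit_1" "$commit_2"             # added lines are prefixed with +
```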
Tell Git to ignore certain files
Sometimes, there are files that can be useful for us to have in the repository folder, but that we might not want to track with Git: e.g. system files, heavy files copied for local testing... No worries, Git also has a solution for that!
To stop Git from tracking the types of files of our choice, we need to create a file called .gitignore (this precise naming is important) at the root of the repository, where each line excludes the files matching a given pattern. What's magic about it is that the file is wildcard-friendly, meaning that we can exclude an entire category of files with a single expression.
Here is a (non-exhaustive) list of the types of items that one might find useful to put in a .gitignore file:
# Files created when running Python code
__pycache__
# Jupyter notebook checkpoints
*.ipynb_checkpoints
# macOS file system-related files
*.DS_Store
# R code history
*.Rhistory
# All files that start with `test`
test*
This functionality can be seen as a productivity booster as it helps coders target and interact only with the files that matter and avoid additional overhead linked to noisy changes caused by files no one cared about in the first place.
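To check that the patterns behave as intended, git status and git check-ignore are handy. The sketch below writes a two-pattern .gitignore in a temporary repository and creates a few hypothetical files; comments are placed on their own lines, since Git only treats a line as a comment when it starts with #.

```shell
set -e
cd "$(mktemp -d)"
git init -q
printf '%s\n' \
  '# Files created when running Python code' \
  '__pycache__' \
  '# All files that start with test' \
  'test*' > .gitignore

mkdir __pycache__
touch __pycache__/module.pyc test_scratch.txt analysis.py
git status --short                        # ignored files are left out of the listing
git check-ignore -v test_scratch.txt      # shows which pattern excludes the file
```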
Reinitialize an unstaged change made since the latest snapshot
Say we develop some code since the last snapshot, but then we realize we want to forget about these new changes (and just return to the latest snapshot). To do so, use the command
git checkout -- [file_1] [file_2] [file_n]
which can be simplified to
git checkout -- .
if the goal is to discard the unstaged changes of all files at once.
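A quick way to convince ourselves of what this command does is the sketch below, which uses a hypothetical notes.txt in a temporary repository. Keep in mind that the discarded edits cannot be recovered through Git afterwards, since they were never part of a snapshot.

```shell
set -e
cd "$(mktemp -d)"
git init -q
git config user.name 'Demo User'
git config user.email 'demo@example.com'
echo 'stable version' > notes.txt
git add notes.txt
git commit -q -m 'Add notes'

echo 'half-baked idea' > notes.txt        # unstaged change since the latest snapshot
git status --short                        # notes.txt is reported as modified
git checkout -- notes.txt                 # go back to the content of the latest snapshot
cat notes.txt
```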
Go back in time to a previous snapshot
The great thing about version control is that we have the option of going back in time if things go south. To do so, we have to identify the target snapshot we would like to go back to and copy its associated hash [prev_commit] from the history of changes printed by the git log --oneline command.
Then, type the command
git reset --hard [prev_commit]
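to go back in time to the snapshot associated with [prev_commit]. Be aware that this removes the commits made after [prev_commit] from the history of the current branch and discards any uncommitted changes to tracked files, so use it with care. That's it! We can now continue adding commits on top of it as if nothing had happened!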
Disclaimer: this mechanism does not exist in real life. :-)
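The sketch below walks through such a trip back in time in a temporary repository with a hypothetical app.txt. For brevity, the hash of the previous snapshot is retrieved with git rev-parse rather than copied by hand from git log --oneline.

```shell
set -e
cd "$(mktemp -d)"
git init -q
git config user.name 'Demo User'
git config user.email 'demo@example.com'
echo 'v1' > app.txt
git add app.txt
git commit -q -m 'Add app file'
echo 'v2' > app.txt
git commit -q -am 'Change that turned out to be a mistake'

git log --oneline                              # two snapshots so far
prev_commit=$(git rev-parse --short HEAD~1)    # hash of the snapshot to return to
git reset -q --hard "$prev_commit"
git log --oneline                              # the later commit has left the branch history
cat app.txt                                    # the file is back to its earlier content
```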
Conclusion
That's it, now you know the basics of working with Git! The commands presented above will be the ones you need the vast majority of the time. Git will also likely be your savior in the remaining cases, where you need to play with more advanced functionalities. In case a concept you are looking for is not mentioned here, Stack Overflow will be your next safest bet.