Data science tools

Engineering productivity tips

By Afshine Amidi and Shervine Amidi

Working in groups with Git

Overview Git is a version control system (VCS) that tracks changes to the files of a given repository.

Getting started The table below summarizes the commands to start a new project, depending on whether or not the repository already exists:

Case                       Action                                   Command
No existing repository     Initialize repository from local folder  git init
Repository already exists  Copy repository from remote to local     git clone path/to/address.git

File check-in Modifications to the repository, whether a file is modified, added or deleted, are recorded through the following steps:

Step                                                   Command
1. Add modified, new, or deleted file to staging area  git add file
2. Save snapshot along with descriptive message        git commit -m 'description'

Remark 1: git add . will add all new and modified files under the current folder to the staging area.

Remark 2: files that we do not want to track can be listed in the .gitignore file.
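
The check-in steps above can be sketched end to end as follows. This is a minimal illustration, not a prescribed workflow: the temporary folder, the file names (script.py, run.log) and the demo identity are all assumptions made for the example.

```shell
# Hypothetical demo repository in a temporary folder, so nothing real is touched
repo="$(mktemp -d)"
cd "$repo"
git init --quiet

# Identity used for this demo only (assumed values)
git config user.name "Demo User"
git config user.email "demo@example.com"

# List log files in .gitignore, then create one ignored and one tracked file
echo "*.log" > .gitignore
echo "debug output" > run.log
echo "print('hello')" > script.py

git add .                                            # stage every new or modified file
git commit --quiet -m 'Add script and ignore rules'  # save the snapshot

git log --oneline                # one line per commit
git status --short --ignored     # run.log is reported as ignored (!!)
```

Only .gitignore and script.py end up in the commit; run.log stays on disk but is never tracked.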

Sync with remote The following commands enable changes to be synchronized between remote and local machines:

Action                                        Command
Fetch most recent changes from remote branch  git pull origin name_of_branch
Push latest local changes to remote branch    git push origin name_of_branch

Parallel workstreams In order to make changes that do not interfere with the current branch, we can create another branch name_of_new_branch as follows:

git checkout -b name_of_new_branch   # Create the branch and switch to it

Depending on whether we want to incorporate or discard the branch, we have the following commands:

Action                                    Command
Merge name_of_branch with current branch  git merge name_of_branch
Remove name_of_branch                     git branch -D name_of_branch
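
As a rough sketch of a full branch life cycle, the commands above can be combined as follows. The temporary repository, the file notes.txt and the demo identity are assumptions for illustration only.

```shell
# Hypothetical demo repository with one initial commit
repo="$(mktemp -d)"
cd "$repo"
git init --quiet
git config user.name "Demo User"          # assumed demo identity
git config user.email "demo@example.com"
echo "v1" > notes.txt
git add notes.txt
git commit --quiet -m 'Initial commit'
base_branch="$(git symbolic-ref --short HEAD)"   # master or main, depending on the Git setup

git checkout --quiet -b name_of_new_branch   # create the branch and switch to it
echo "v2" >> notes.txt
git commit --quiet -am 'Extend notes'        # this commit only exists on the new branch

git checkout --quiet "$base_branch"          # go back to the original branch
git merge --quiet name_of_new_branch         # bring the new commit into it
git branch -d name_of_new_branch             # -d only deletes branches that are already merged
git branch                                   # only the base branch remains
```

The sketch uses -d rather than -D: the lowercase flag refuses to delete a branch with unmerged work, whereas -D deletes it regardless.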

Tracking status We can check previous changes made to the repository with the following commands:

Action                                    Command
Check status of modified file(s)          git status
View last commits                         git log --oneline
Compare changes made between two commits  git diff commit_1 commit_2
View list of local branches               git branch

Canceling changes Canceling changes is done differently depending on the situation that we are in. The table below sums up the most common cases:

Case       Action                               Command
Unstaged   Revert file to state in last commit  git checkout -- file
Staged     Remove file from staging area        git reset HEAD file
Committed  Go back to a previous commit         git reset --hard prev_commit
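
The three cases can be tried safely in a throwaway repository. The sketch below assumes a temporary folder, a single file file.txt and a demo identity, none of which come from the table itself.

```shell
repo="$(mktemp -d)"
cd "$repo"
git init --quiet
git config user.name "Demo User"          # assumed demo identity
git config user.email "demo@example.com"
echo "good" > file.txt
git add file.txt
git commit --quiet -m 'First version'
first_commit="$(git rev-parse HEAD)"

# Unstaged: discard an edit that has not been added yet
echo "unwanted edit" > file.txt
git checkout -- file.txt

# Staged: take the file out of the staging area (the edit stays on disk),
# then discard the edit as in the unstaged case
echo "staged edit" > file.txt
git add file.txt
git reset --quiet HEAD file.txt
git checkout -- file.txt

# Committed: move the branch back to an earlier commit; later commits are dropped
echo "bad" > file.txt
git commit --quiet -am 'Second version'
git reset --quiet --hard "$first_commit"

cat file.txt    # prints: good
```

Note that git reset --hard also discards uncommitted changes in the working folder, so it is worth running git status first.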

Project structure It is important to keep the structure of a project consistent and logical. One possible structure for a data science project is as follows:

  ├── analysis/
  │   ├── graphs/
  │   └── notebooks/
  ├── data/
  │   ├── query/
  │   ├── raw/
  │   └── processed/
  └── modeling/
      ├── methods/
      ├── results/
      └── tests/

Working with Bash

Basic terminal commands The table below sums up the most useful terminal commands:

Category       Action                                            Command
Exploration    Display list of files (including hidden ones)     ls (-a)
               Show path to current directory                    pwd
               Show content of file                              cat path_to_file
               Show statistics of file (lines/words/characters)  wc path_to_file
               Create new folder                                 mkdir folder_name
               Change directory to folder                        cd path_to_folder
               Create new empty file                             touch filename
               Copy file (folder) from origin to destination     cp (-R) origin destination
               Move file/folder from origin to destination       mv origin destination
               Remove file (folder)                              rm (-R) path
Compression    Compress folder into file                         tar -czvf compressed.tar.gz folder
               Uncompress file                                   tar -xzvf compressed.tar.gz
Miscellaneous  Display message                                   echo "message"
               Overwrite / append file with output               output > file.txt / output >> file.txt
               Execute a given command with elevated privileges  sudo command
               Connect to a remote machine                       ssh remote_machine_address
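
A short session combining several of these commands might look like the sketch below. The folder and file names (project, notes.txt, project.tar.gz) are made up for the example, and everything happens inside a temporary folder.

```shell
workdir="$(mktemp -d)"     # scratch folder, so no real files are touched
cd "$workdir"

mkdir project
touch project/notes.txt
echo "first line" > project/notes.txt      # > overwrites the file
echo "second line" >> project/notes.txt    # >> appends to it
wc -l project/notes.txt                    # the line count is 2

cp -R project project_copy                 # copy the whole folder
tar -czf project.tar.gz project            # compress the folder into one file
rm -R project                              # remove the original folder
tar -xzf project.tar.gz                    # restore it from the archive
cat project/notes.txt
```

The v flag from the table is left out here only to keep the output short; it makes tar list each file it processes.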

Chaining The pipe | operator passes the output of one command as the input of the next, which keeps multi-step operations on a single readable line. A few common examples are summed up in the table below:

Action                             Command
Count number of files in a folder  ls path_to_folder | wc -l
Count number of lines in file      cat path_to_file | wc -l
Show last n commands executed      history | tail -n n
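
To make the idea concrete, the sketch below builds a small temporary folder and runs a few pipelines on it. The file names are assumptions, and the tr -d ' ' stage is only there to strip the padding that some versions of wc add.

```shell
folder="$(mktemp -d)"
touch "$folder/a.csv" "$folder/b.csv"
printf 'x\ny\nz\n' > "$folder/c.txt"

# Each | feeds the output of the command on its left to the one on its right
file_count="$(ls "$folder" | wc -l | tr -d ' ')"
line_count="$(cat "$folder/c.txt" | wc -l | tr -d ' ')"
csv_count="$(ls "$folder" | grep '\.csv$' | wc -l | tr -d ' ')"

echo "$file_count files, $csv_count of them CSV, $line_count lines in c.txt"
# prints: 3 files, 2 of them CSV, 3 lines in c.txt
```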

Advanced search The find command searches for files matching given conditions and can act on the matches if necessary. The general structure of the command is as follows:

find path_to_folder/. [conditions] [actions]

The possible conditions and actions are summarized in the table below:

Category  Action                                       Command
Filters   Certain names, wildcards such as * accepted  -name 'certain_name'
          Certain file types (d/f for directory/file)  -type certain_type
          Certain file sizes (c/k/M/G for B/kB/MB/GB)  -size file_size
          Opposite of a given condition                -not [condition]
Actions   Delete selected files                        -delete
          Print selected files                         -print

Remark: the flags above can be combined to make a multi-condition search.
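
As an illustration of combining conditions, the sketch below lists the CSV files of a temporary folder and then deletes everything else. The folder layout and file names are assumptions; -not and -delete are available in GNU and BSD find, which covers most Linux and macOS systems.

```shell
root="$(mktemp -d)"
mkdir "$root/raw" "$root/processed"
touch "$root/raw/a.csv" "$root/raw/b.tmp" "$root/processed/c.csv" "$root/processed/d.tmp"

# Two conditions: regular files only, and a name ending in .csv
find "$root" -type f -name '*.csv' -print | sort

# Negated condition plus an action: delete every file that is not a CSV
find "$root" -type f -not -name '*.csv' -delete

remaining="$(find "$root" -type f | wc -l | tr -d ' ')"
echo "$remaining files left"    # prints: 2 files left
```

Running the command with -print first, and only then swapping in -delete, is a cautious way to check what would be removed.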

Changing permissions The following command changes the permissions of a given file (or folder):

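chmod (-R) three_digits file

with three_digits being a combination of three digits, where the first digit sets the permissions of the owner, the second those of the group, and the third those of all other users.

The most common values of each digit are 0, 4, 5, 6 and 7, with the following meaning: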

Representation  Binary  Digit  Explanation
---             000     0      No permission
r--             100     4      Only read permission
r-x             101     5      Both read and execution permissions
rw-             110     6      Both read and write permissions
rwx             111     7      Read, write and execution permissions

For instance, giving read, write, execution permissions to everyone for a given_file is done by running the following command:

chmod 777 given_file

Remark: in order to change ownership of a file to a given user and group, we use the command chown user:group file.
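
Each digit is the sum of read (4), write (2) and execute (1). The sketch below applies two example modes to a temporary file and prints the resulting permission string; the file and the chosen modes are assumptions for illustration.

```shell
workdir="$(mktemp -d)"
target="$workdir/given_file"
echo 'echo "it runs"' > "$target"

echo "rwx = $((4 + 2 + 1))"      # prints: rwx = 7

chmod 640 "$target"              # owner rw- (6), group r-- (4), others --- (0)
mode_640="$(ls -l "$target" | cut -c1-10)"
echo "$mode_640"                 # prints: -rw-r-----

chmod 754 "$target"              # owner rwx (7), group r-x (5), others r-- (4)
mode_754="$(ls -l "$target" | cut -c1-10)"
echo "$mode_754"                 # prints: -rwxr-xr--
```

The first character of the string is the file type (- for a regular file), followed by the three permission triplets in owner, group, others order.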

Terminal shortcuts The table below summarizes the main shortcuts when working with the terminal:

Action                               Shortcut
Search previous commands             Ctrl + R
Go to beginning / end of line        Ctrl + A / Ctrl + E
Remove everything after the cursor   Ctrl + K
Remove everything before the cursor  Ctrl + U
Clear terminal window                Ctrl + L

Automating tasks

Create aliases Shortcuts for frequently used commands, called aliases, can be defined by adding a line of the form alias shortcut="command" to the ~/.bash_profile file.
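
As a minimal sketch, the lines below append one alias to a profile file. The alias name ll and its command are hypothetical, and a temporary file stands in for ~/.bash_profile so that the example has no side effects.

```shell
profile="$(mktemp)"      # stand-in for ~/.bash_profile (assumption for this sketch)

# General form: alias shortcut='command'
echo "alias ll='ls -la'" >> "$profile"

cat "$profile"           # prints: alias ll='ls -la'
```

With the real ~/.bash_profile, the alias becomes available in new terminal sessions, or right away after running source ~/.bash_profile.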


Bash scripts Bash scripts are files whose name conventionally ends with .sh and whose content is structured as follows:

... [bash script] ...
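
A small script following that layout could look like the sketch below, which writes a hypothetical count_lines.sh to a temporary folder, makes it executable and runs it. The script name, its logic and the #!/bin/bash first line (which assumes Bash is installed at that path) are illustrative choices.

```shell
script="$(mktemp -d)/count_lines.sh"           # hypothetical script name
input="$(dirname "$script")/input.txt"

cat > "$script" <<'EOF'
#!/bin/bash
# Print the number of lines of each file passed as an argument
for f in "$@"; do
  echo "$f: $(wc -l < "$f" | tr -d ' ') lines"
done
EOF

chmod +x "$script"                 # allow the file to be executed
printf 'a\nb\n' > "$input"
result="$("$script" "$input")"     # run the script on the sample file
echo "$result"
```

The first line of the script tells the system which interpreter to use, and "$@" expands to the arguments given on the command line.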

Crontabs A crontab entry schedules a command through five time fields, followed by the command to run. With the day of the month ranging over 1-31 and the day of the week over 0-6 (Sunday-Saturday), the fields are laid out as follows:

  *         *         *         *         *
minute    hour       day      month      day
                   of month            of week
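
To show how an entry maps onto these fields, the sketch below splits a hypothetical entry with awk. The schedule and the path /path/to/backup.sh are invented for the example.

```shell
# Hypothetical entry: run /path/to/backup.sh at 02:30 every Monday
entry='30 2 * * 1 /path/to/backup.sh'

# Fields 1 to 5 are the schedule, field 6 onwards is the command
parsed="$(echo "$entry" | awk '{
  printf "minute=%s hour=%s day_of_month=%s month=%s day_of_week=%s command=%s",
         $1, $2, $3, $4, $5, $6
}')"
echo "$parsed"
# prints: minute=30 hour=2 day_of_month=* month=* day_of_week=1 command=/path/to/backup.sh
```

A * means "every value" for that field. Entries are typically edited with crontab -e and listed with crontab -l.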

tmux tmux is a terminal multiplexer: it runs several terminal sessions in parallel and keeps them alive in the background, even after the terminal is closed. The table below summarizes the main commands:

Category            Action                              Command
Session management  Open a new / last existing session  tmux / tmux attach
                    Leave current session               tmux detach
                    List all open sessions              tmux ls
                    Remove session_name                 tmux kill-session -t session_name
Window management   Open / close a window               Ctrl + B, then C / Ctrl + B, then X
                    Move to window number n             Ctrl + B, then the digit n

Mastering editors

Vim Vim is a popular terminal editor enabling quick and easy file editing, which is particularly useful when connected to a server. The main commands to have in mind are summarized in the table below:

Category       Action                                                              Command
File handling  Go to beginning / end of line                                       0 / $
               Go to first / last line / i-th line                                 gg / G / iG
               Go to previous / next word                                          b / w
               Exit file with / without saving changes                             Esc + :wq / :q!
Text editing   Copy n line(s), where n is a positive integer                       nyy
               Insert n line(s) previously copied                                  p
Searching      Search for expression containing name_of_pattern                    /name_of_pattern
               Next / previous occurrence of name_of_pattern                       n / N
Replacing      Replace old with new expressions with confirmation for each change  Esc + :%s/old/new/gc

Jupyter notebook Jupyter notebooks make it easy to edit and run code interactively. The main commands to have in mind are summarized in the table below:

Category             Action                                    Shortcut
Cell transformation  Transform selected cell to text / code    Click cell + m / y
                     Delete selected cell                      Click cell + dd
                     Add new cell below / above selected cell  Click cell + b / a
                     Revert changes to cell                    Click cell + z