Data science tools
Data manipulation with R
R
dplyr
tidyr
lubridate
By Afshine Amidi and Shervine Amidi
Main concepts
File management The table below summarizes useful commands to make sure the working directory is correctly set:
Category | Action | Command |
Paths | Change directory to another path | setwd(path) |
Get current working directory | getwd() |
|
Join paths | file.path(path_1, ..., path_n) |
|
Files | List files and folders in a given directory | list.files(path, include.dirs = TRUE) |
Check if path is a file / folder | file_test('-f', path) |
|
file_test('-d', path) |
||
Read / write csv file | read.csv(path_to_csv_file) |
|
write.csv(df, path_to_csv_file) |
Chaining The symbol %>%
, also called "pipe", enables to have chained operations and provides better legibility. Here are its different interpretations:
-
f(arg_1, arg_2, ..., arg_n)
is equivalent toarg_1 %>% f(arg_2, arg_3, ..., arg_n)
, and also to:arg_1 %>% f(., arg_2, ..., arg_n)
arg_2 %>% f(arg_1, ., arg_3, ..., arg_n)
arg_n %>% f(arg_1, ..., arg_n-1, .)
- A common use of pipe is when a dataframe
df
gets first modified bysome_operation_1
, thensome_operation_2
, untilsome_operation_n
in a sequential way. It is done as follows:
# df gets some_operation_1, then some_operation_2, ..., then some_operation_n df %>% some_operation_1(params_1) %>% some_operation_2(params_1) %>% ... %>% some_operation_n(params_1)
Exploring the data The table below summarizes the main functions used to get a complete overview of the data:
Category | Action | Command |
Look at data | Select columns of interest | df %>% select(col_list) |
Remove unwanted columns | df %>% select(-col_list) |
|
Look at $n$ first rows / last rows | df %>% head(n) / df %>% tail(n) |
|
Summary statistics of columns | df %>% summary() |
|
Data types | Data types of columns | df %>% str() |
Number of rows / columns | df %>% NROW() / df %>% NCOL() |
Data types The table below sums up the main data types that can be contained in columns:
Data type | Description | Example |
character |
String-related data | 'teddy bear' |
factor |
String-related data that can be put in bucket, or ordered | 'high' |
numeric |
Numerical data | 24.0 |
int |
Numeric data that are integer | 24 |
Date |
Dates | '2020-01-01' |
POSIXct |
Timestamps | '2020-01-01 00:01:00' |
Data preprocessing
Filtering We can filter rows according to some conditions as follows:
df %>%
filter(some_col some_operation some_value_or_list_or_col)
where some_operation
is one of the following:
Category | Operation | Command |
Basic | Equality / non-equality | == / != |
Inequalities | < , <= , >= , > |
|
And / or | & / | |
|
Advanced | Check for missing value | is.na() |
Belonging | %in% c(val_1, ..., val_n) |
|
Pattern matching | %like% 'val' |
Remark: we can filter columns with the select_if()
function.
Changing columns The table below summarizes the main column operations:
Action | Command |
Add new columns on top of old ones | df %>% mutate(new_col = operation(other_cols)) |
Add new columns and discard old ones | df %>% transmute(new_col = operation(other_cols)) |
Modify several columns in-place | df %>% mutate_at(vars, funs) |
Modify all columns in-place | df %>% mutate_all(funs) |
Modify columns fitting a specific condition | df %>% mutate_if(condition, funs) |
Unite columns | df %>% unite(new_merged_col, old_cols_list) |
Separate columns | df %>% separate(col_to_separate, new_cols_list) |
Conditional column A column can take different values with respect to a particular set of conditions with the case_when()
command as follows:
case_when(condition_1 ~ value_1, # If condition_1 then value_1
condition_2 ~ value_2, # If condition_2 then value_2
...
TRUE ~ value_n) # Otherwise, value_n
Remark: the ifelse(condition_if_true, value_true, value_other)
function can be used and is easier to manipulate if there is only one condition.
Mathematical operations The table below sums up the main mathematical operations that can be performed on columns:
Operation | Command |
$\sqrt{x}$ | sqrt(x) |
$\lfloor x\rfloor$ | floor(x) |
$\lceil x\rceil$ | ceiling(x) |
Datetime conversion Fields containing datetime values can be stored in two different POSIXt
data types:
Action | Command |
Converts to datetime with seconds since origin | as.POSIXct(col, format) |
Converts to datetime with attributes (e.g. time zone) | as.POSIXlt(col, format) |
where format
is a string describing the structure of the field and using the commands summarized in the table below:
Category | Command | Description | Example |
Year | '%Y' / '%y' |
With / without century | 2020 / 20 |
Month | '%B' / '%b' / '%m' |
Full / abbreviated / numerical | August / Aug / 8 |
Weekday | '%A' / '%a' |
Full / abbreviated | Sunday / Sun |
'%u' / '%w' |
Number (1-7) / Number (0-6) | 7 / 0 |
|
Day | '%d' / '%j' |
Of the month / of the year | 09 / 222 |
Time | '%H' / '%M' |
Hour / minute | 09 / 40 |
Timezone | '%Z' / '%z' |
String / Number of hours from UTC | EST / -0400 |
Remark: data frames only accept datetime in POSIXct
format.
Date properties In order to extract a date-related property from a datetime object, the following command is used:
format(datetime_object, format)
format
follows the same convention as in the table above.
Data frame transformation
Merging data frames We can merge two data frames by a given field as follows:
merge(df_1, df_2, join_field, join_type)
where join_field
indicates fields where the join needs to happen:
Case | Same field names | Different field names |
Option | by = 'field' |
by.x = 'field_name_1', by.y = 'field_name_2' |
and where join_type
indicates the join type, and is one of the following:
Join type | Option | Illustration |
Inner join | default | ![]() |
Left join | all.x = TRUE |
![]() |
Right join | all.y = TRUE |
![]() |
Full join | all = TRUE |
![]() |
Remark: if the by
parameter is not specified, the merge will be a cross join.
Concatenation The table below summarizes the different ways data frames can be concatenated:
Type | Command | Illustration |
Rows | rbind(df_1, ..., df_n) |
![]() |
Columns | cbind(df_1, ..., df_n) |
![]() |
Common transformations The common data frame transformations are summarized in the table below:
Type | Command | Illustration | |
Before | After | ||
Long to wide | spread(
df, key = 'key',
value = 'value'
) |
![]() |
![]() |
Wide to long | gather(
df, key = 'key',
value = 'value',
c(key_1, ..., key_n)
) |
![]() |
![]() |
Row operations The following actions are used to make operations on rows of the data frame:
Action | Command | Illustration | |
Before | After | ||
Sort with respect to columns | df %>%
arrange(col_1, ..., col_n) |
![]() |
![]() |
Dropping duplicates | df %>% unique() |
![]() |
![]() |
Drop rows with at least a null value | df %>% na.omit() |
![]() |
![]() |
Remark: by default, the arrange()
function sorts in ascending order. If we want to sort it in descending order, the -
command needs to be used before a column.
Aggregations
Grouping data Aggregate metrics are computed across groups as follows:

The R command is as follows:
df %>% # Original ungrouped data frame
group_by(col_1, ..., col_n) %>% # Group by some columns
summarize(agg_metric = some_aggregation(some_cols)) # Aggregation step
Aggregation functions The table below summarizes the main aggregate functions that can be used in an aggregation query:
Category | Action | Command |
Properties | Count of observations | n() |
Values | Sum of values of observations | sum() |
Max / min of values of observations | max() / min() |
|
Mean / median of values of observations | mean() / median() |
|
Standard deviation / variance across observations | sd() / var() |
Window functions
Definition A window function computes a metric over groups and has the following structure:

The R command is as follows:
df %>% # Original ungrouped data frame
group_by(col_1, ..., col_n) %>% # Group by some columns
mutate(win_metric = window_function(col)) # Window function
Remark: applying a window function will not change the initial number of rows of the data frame.
Row numbering The table below summarizes the main commands that rank each row across specified groups, ordered by a specific field:
Command | Description | Example |
row_number(x) |
Ties are given different ranks | 1, 2, 3, 4 |
rank(x) |
Ties are given same rank and skip numbers | 1, 2.5, 2.5, 4 |
dense_rank(x) |
Ties are given same rank and do not skip numbers | 1, 2, 2, 3 |
Values The following window functions allow to keep track of specific types of values with respect to the group:
Command | Description |
first(x) |
Takes the first value of the column |
last(x) |
Takes the last value of the column |
lag(x, n) |
Takes the $n^{\textrm{th}}$ previous value of the column |
lead(x, n) |
Takes the $n^{\textrm{th}}$ following value of the column |
nth(x, n) |
Takes the $n^{\textrm{th}}$ value of the column |