Data manipulation with R study guide

Star

Main concepts

File management The table below summarizes useful commands to make sure the working directory is correctly set:

CategoryActionCommand
PathsChange directory to another pathsetwd(path)
Get current working directorygetwd()
Join pathsfile.path(path_1, ..., path_n)
FilesList files and folders in a given directorylist.files(path, include.dirs = TRUE)
Check if path is a file / folderfile_test('-f', path)
file_test('-d', path)
Read / write csv fileread.csv(path_to_csv_file)
write.csv(df, path_to_csv_file)


Chaining The symbol %>%, also called "pipe", enables to have chained operations and provides better legibility. Here are its different interpretations:


Exploring the data The table below summarizes the main functions used to get a complete overview of the data:

CategoryActionCommand
Look at dataSelect columns of interestdf %>% select(col_list)
Remove unwanted columnsdf %>% select(-col_list)
Look at $n$ first rows / last rowsdf %>% head(n) / df %>% tail(n)
Summary statistics of columnsdf %>% summary()
Data typesData types of columnsdf %>% str()
Number of rows / columnsdf %>% NROW() / df %>% NCOL()


Data types The table below sums up the main data types that can be contained in columns:

Data typeDescriptionExample
characterString-related data'teddy bear'
factorString-related data that can be put in bucket, or ordered'high'
numericNumerical data24.0
intNumeric data that are integer24
DateDates'2020-01-01'
POSIXctTimestamps'2020-01-01 00:01:00'


Data preprocessing

Filtering We can filter rows according to some conditions as follows:

df %>%
  filter(some_col some_operation some_value_or_list_or_col)

where some_operation is one of the following:

CategoryOperationCommand
BasicEquality / non-equality== / !=
Inequalities<, <=, >=, >
And / or& / |
AdvancedCheck for missing valueis.na()
Belonging%in% c(val_1, ..., val_n)
Pattern matching%like% 'val'

Remark: we can filter columns with the select_if() function.


Changing columns The table below summarizes the main column operations:

ActionCommand
Add new columns on top of old onesdf %>% mutate(new_col = operation(other_cols))
Add new columns and discard old onesdf %>% transmute(new_col = operation(other_cols))
Modify several columns in-placedf %>% mutate_at(vars, funs)
Modify all columns in-placedf %>% mutate_all(funs)
Modify columns fitting a specific conditiondf %>% mutate_if(condition, funs)
Unite columnsdf %>% unite(new_merged_col, old_cols_list)
Separate columnsdf %>% separate(col_to_separate, new_cols_list)


Conditioning A column can take different values with respect to a particular set of conditions with the case_when() command as follows:

case_when(condition_1 ~ value_1,  # If condition_1 then value_1
          condition_2 ~ value_2,  # If condition_2 then value_2
                ...
          TRUE ~ value_n)         # Otherwise, value_n

Remark: the ifelse(condition_if_true, value_true, value_other) function can be used and is easier to manipulate if there is only one condition.


Mathematical operations The table below sums up the main mathematical operations that can be performed on columns:

OperationCommand
$\sqrt{x}$sqrt(x)
$\lfloor x\rfloor$floor(x)
$\lceil x\rceil$ceiling(x)


Datetime conversion Fields containing datetime values can be stored in two different POSIXt data types:

ActionCommand
Converts to datetime with seconds since originas.POSIXct(col, format)
Converts to datetime with attributes (e.g. time zone)as.POSIXlt(col, format)

where format is a string describing the structure of the field and using the commands summarized in the table below:

CategoryCommandDescriptionExample
Year'%Y' / '%y'With / without century2020 / 20
Month'%B' / '%b' / '%m'Full / abbreviated / numericalAugust / Aug / 8
Weekday'%A' / '%a'Full / abbreviatedSunday / Sun
'%u' / '%w'Number (1-7) / Number (0-6)7 / 0
Day'%d' / '%j'Of the month / of the year09 / 222
Time'%H' / '%M'Hour / minute09 / 40
Timezone'%Z' / '%z'String / Number of hours from UTCEST / -0400

Remark: data frames only accept datetime in POSIXct format.


Date properties In order to extract a date-related property from a datetime object, the following command is used:

format(datetime_object, format)

where format follows the same convention as in the table above.


Data frame transformation

Merging data frames We can merge two data frames by a given field as follows:

merge(df_1, df_2, join_field, join_type)

where join_field indicates fields where the join needs to happen:

CaseSame field namesDifferent field names
Optionby = 'field'by.x = 'field_name_1', by.y = 'field_name_2'

and where join_type indicates the join type, and is one of the following:

Join typeOptionIllustration
Inner joindefaultInner join
Left joinall.x = TRUELeft join
Right joinall.y = TRUERight join
Full joinall = TRUEFull outer join

Remark: if the by parameter is not specified, the merge will be a cross join.


Concatenation The table below summarizes the different ways data frames can be concatenated:

TypeCommandIllustration
Rowsrbind(df_1, ..., df_n)Row concatenation
Columnscbind(df_1, ..., df_n)Column concatenation


Common transformations The common data frame transformations are summarized in the table below:

TypeCommandIllustration
BeforeAfter
Long to widespread(
  df, key = 'key',
  value = 'value'
)
WideLong
Wide to longgather(
  df, key = 'key',
  value = 'value',
  c(key_1, ..., key_n)
)
LongWide


Row operations The following actions are used to make operations on rows of the data frame:

ActionCommandIllustration
BeforeAfter
Sort with respect to columnsdf %>%
  arrange(col_1, ..., col_n)
UnsortedSorted
Dropping duplicatesdf %>% unique()With duplicatesWithout duplicates
Drop rows with at least a null valuedf %>% na.omit()With NULL valuesWithout NULL values

Remark: by default, the arrange() function sorts in ascending order. If we want to sort it in descending order, the - command needs to be used before a column.


Aggregations

Grouping data Aggregate metrics are computed across groups as follows:


The R command is as follows:

df %>%                                                 # Original ungrouped data frame
 group_by(col_1, ..., col_n) %>%                       # Group by some columns
 summarize(agg_metric = some_aggregation(some_cols))   # Aggregation step


Aggregation functions The table below summarizes the main aggregate functions that can be used in an aggregation query:

CategoryActionCommand
PropertiesCount of observationsn()
ValuesSum of values of observationssum()
Max / min of values of observationsmax() / min()
Mean / median of values of observationsmean() / median()
Standard deviation / variance across observationssd() / var()


Window functions

Definition A window function computes a metric over groups and has the following structure:


The R command is as follows:

df %>%                                       # Original ungrouped data frame
 group_by(col_1, ..., col_n) %>%             # Group by some columns
 mutate(win_metric = window_function(col))   # Window function

Remark: applying a window function will not change the initial number of rows of the data frame.


Row numbering The table below summarizes the main commands that rank each row across specified groups, ordered by a specific field:

CommandDescriptionExample
row_number(x)Ties are given different ranks1, 2, 3, 4
rank(x)Ties are given same rank and skip numbers1, 2.5, 2.5, 4
dense_rank(x)Ties are given same rank and do not skip numbers1, 2, 2, 3


Values The following window functions allow to keep track of specific types of values with respect to the group:

CommandDescription
first(x)Takes the first value of the column
last(x)Takes the last value of the column
lag(x, n)Takes the $n^{\textrm{th}}$ previous value of the column
lead(x, n)Takes the $n^{\textrm{th}}$ following value of the column
nth(x, n)Takes the $n^{\textrm{th}}$ value of the column