Data science tools


Data manipulation R-Python conversion guide
Star

dplyr tidyr lubridate pandas numpy datetime

By Afshine Amidi and Shervine Amidi

Main concepts

File management The table below summarizes useful commands to make sure the working directory is correctly set:

Category R command Python command
Paths setwd(path) os.chdir(path)
getwd() os.getcwd()
file.path(path_1, ..., path_n) os.path.join(path_1, ..., path_n)
Files list.files(path, include.dirs = TRUE) os.listdir(path)
file_test('-f', path) os.path.isfile(path)
file_test('-d', path) os.path.isdir(path)
read.csv(path_to_csv_file) pd.read_csv(path_to_csv_file)
write.csv(df, path_to_csv_file) df.to_csv(path_to_csv_file)

Chaining Successive operations can be chained in R using the %>% symbol in order to provide better legibility. It is done as follows:

# df gets some_operation_1, then some_operation_2, ..., then some_operation_n
df %>%
  some_operation_1(params_1) %>%
  some_operation_2(params_1) %>%
             ...             %>%
  some_operation_n(params_1)

The corresponding Python command is as follows:

# df gets some_operation_1, then some_operation_2, ..., then some_operation_n
(df
 .some_operation_1(params_1)
 .some_operation_2(params_2)
 .      ...
 .some_operation_n(params_n))


Exploring the data The table below summarizes the main functions used to get a complete overview of the data:

Category R command Python command
Look at data df %>% select(col_list) df[col_list]
df %>% select(-col_list) df.drop(col_list, axis=1)
df %>% head(n) / df %>% tail(n) df.head(n) / df.tail(n)
df %>% summary() df.describe()
Data types df %>% str() df.dtypes / df.info()
df %>% NROW() / df %>% NCOL() df.shape

Data types The table below sums up the main data types that can be contained in columns:

R Data type Python Data type Description
character object String-related data
factor String-related data that can be put in bucket, or ordered
numeric float64 Numerical data
int int64 Numeric data that are integer
POSIXct datetime64 Timestamps

Data preprocessing

Filtering We can filter rows according to some conditions as follows:

Category R command Python command
Overview df %>% filter(col operation val_or_col) df[df['col'] operation val_or_col]
Operation == / != == / !=
<, <=, >=, > <, <=, >=, >
& / | & / |
is.na() pd.isnull()
%in% (val_1, ..., val_n) .isin([val_1, ..., val_n])
%like% 'val' .str.contains('val')

Changing columns The table below summarizes the main column operations:

Action R command Python command
Add new columns on top of old ones df %>%
  mutate(new_col = operation(other_cols))
df.assign(
  new_col=lambda x: some_operation(x)
)
Unite columns df  %>%
  unite(new_merged_col, old_cols_list)
df['new_merged_col'= (
  df[old_cols_list].agg('-'.joinaxis=1)
)

Conditional column A column can take different values with respect to a particular set of conditions with the following R command:

case_when(condition_1 ~ value_1,  # If condition_1 then value_1
          condition_2 ~ value_2,  # If condition_2 then value_2
                ...
          TRUE ~ value_n)         # Otherwise, value_n

The corresponding Python command is as follows:

np.select(
  [condition_1, ..., condition_n],  # If condition_1, ..., condition_n
  [value_1, ..., value_n],          # Then value_1, ..., value_n respectively
  default=default_value             # Otherwise, default_value
)

Mathematical operations The table below sums up the main mathematical operations that can be performed on columns:

Operation R command Python command
$\sqrt{x}$ sqrt(x) np.sqrt(x)
$\lfloor x\rfloor$ floor(x) np.floor(x)
$\lceil x\rceil$ ceiling(x) np.ceil(x)

Datetime conversion Fields containing datetime values are converted from string to datetime as follows:

Action R Command Python Command
Converts string to datetime as.POSIXct(col, format) pd.to_datetime(col, format)

where format is a string describing the structure of the field and using the commands summarized in the table below:

Category Command Description Example
Year '%Y' / '%y' With / without century 2020 / 20
Month '%B' / '%b' / '%m' Full / abbreviated / numerical August / Aug / 8
Weekday '%A' / '%a' Full / abbreviated Sunday / Sun
'%u' / '%w' Number (1-7) / Number (0-6) 7 / 0
Day '%d' / '%j' Of the month / of the year 09 / 222
Time '%H' / '%M' Hour / minute 09 / 40
Timezone '%Z' / '%z' String / Number of hours from UTC EST / -0400

Data frame transformation

Merging data frames We can merge two data frames by a given field as follows:

Category R command Python command
Overview merge(df_1, df_2, join_field, join_type) df_1.merge(df_2, join_field, join_type)
Join field by = 'field' on='field'
by.x = 'field_1', by.y = 'field_2' left_on='field_1'right_on='field_2'
Join type all.x = TRUE how='left'
all.y = TRUE how='right'
default how='inner'
all = TRUE how='outer'

Concatenation The table below summarizes the different ways data frames can be concatenated:

Type R command Python command
Rows rbind(df_1, ..., df_n) pd.concat([df_1, ..., df_n], axis=0)
Columns cbind(df_1, ..., df_n) pd.concat([df_1, ..., df_n], axis=1)

Common transformations The common data frame transformations are summarized in the table below:

Type R command Python command
Long to wide spread(
  df, key = 'key_name'
  value = 'value_name'
)
pd.pivot_table(
  df, columns='key_name',
  values='value_name',
  index=other_cols,
  aggfunc=np.sum
)
Wide to long gather(
  df, key = 'key_name',
  value = 'value_name',
  c(key_1, ..., key_n)
)
pd.melt(
  df, var_name='key',
  value_name='value',
  value_vars=[
    'key_1', ..., 'key_n'
  ], id_vars=some_cols
)

Row operations The following actions are used to make operations on rows of the data frame:

Action R command Python command
Sort data frame df %>% arrange(col_list) df.sort_values(by=col_list, ascending=True)
Drop duplicates df %>% unique() df.drop_duplicates()
Drop rows with null values df %>% na.omit() df.dropna()

Aggregations

Grouping data A data frame can be aggregated with respect to given columns as follows:


The R command is as follows:

df %>%                                                 # Original ungrouped data frame
 group_by(col_1, ..., col_n) %>%                       # Group by some columns
 summarize(agg_metric = some_aggregation(some_cols))   # Aggregation step

and the Python command is as follows:

(df                                                    # Original ungrouped data frame
 .groupby(['col_1', ..., 'col_n'])                     # Group by some columns
 .agg({'col': builtin_agg})                            # Aggregation step

where builtin_agg is among the following:

Action R Command Python Command
Count across observations n() 'count'
Sum across observations sum() 'sum'
Max / min of values of observations max() / min() 'max' / 'min'
Mean / median of values of observations mean() / median() 'mean' / 'median'
Standard deviation / variance across observations sd() / var() 'std' / 'var'

Window functions

Definition A window function computes a metric over groups and has the following structure:


The R command is as follows:

df %>%                                       # Original ungrouped data frame
 group_by(col_1, ..., col_n) %>%             # Group by some columns
 mutate(win_metric = window_function(col))   # Window function

and the Python command is as follows:

(df
 .assign(win_metric = lambda x:
           x.groupby(['col_1', ..., 'col_n'])['col'].window_function(params))

Remark: applying a window function will not change the initial number of rows of the data frame.


Row numbering The table below summarizes the main commands that rank each row across specified groups, ordered by a specific field:

R Command Python Command Description
row_number(x) x.rank(method='first') Ties are given different ranks
rank(x) x.rank(method='min') Ties are given same rank and skip numbers
dense_rank(x) x.rank(method='dense') Ties are given same rank and do not skip numbers

Values The following window functions allow to keep track of specific types of values with respect to the group:

R Command Python Command Description
lag(x, n) x.shift(n) Takes the $n^{\textrm{th}}$ previous value of the column
lead(x, n) x.shift(-n) Takes the $n^{\textrm{th}}$ following value of the column



You may also like...

Data manipulation with R
  • • Filtering
  • • Types of joins
  • • Aggregations, window functions
  • • Data frame transformation
Data manipulation with Python
  • • Filtering
  • • Types of joins
  • • Aggregations, window functions
  • • Data frame transformation
R-Python conversion for data visualization
  • Conversion made easy between R (ggplot2) and Python (matplotlib, seaborn)