Conversion for data manipulation R-Python conversion guide

Main concepts

File management The table below summarizes useful commands to make sure the working directory is correctly set:

CategoryR commandPython command
Pathssetwd(path)os.chdir(path)
getwd()os.getcwd()
file.path(path_1, ..., path_n)os.path.join(path_1, ..., path_n)
Fileslist.files(path, include.dirs = TRUE)os.listdir(path)
file_test('-f', path)os.path.isfile(path)
file_test('-d', path)os.path.isdir(path)
read.csv(path_to_csv_file)pd.read_csv(path_to_csv_file)
write.csv(df, path_to_csv_file)df.to_csv(path_to_csv_file)


Chaining Successive operations can be chained in R using the %>% symbol in order to provide better legibility. It is done as follows:

# df gets some_operation_1, then some_operation_2, ..., then some_operation_n
df %>%
  some_operation_1(params_1) %>%
  some_operation_2(params_1) %>%
             ...             %>%
  some_operation_n(params_1)

The corresponding Python command is as follows:

# df gets some_operation_1, then some_operation_2, ..., then some_operation_n
(df
 .some_operation_1(params_1)
 .some_operation_2(params_2)
 .      ...
 .some_operation_n(params_n))


Exploring the data The table below summarizes the main functions used to get a complete overview of the data:

CategoryR commandPython command
Look at datadf %>% select(col_list)df[col_list]
df %>% select(-col_list)df.drop(col_list, axis=1)
df %>% head(n) / df %>% tail(n)df.head(n) / df.tail(n)
df %>% summary()df.describe()
Data typesdf %>% str()df.dtypes / df.info()
df %>% NROW() / df %>% NCOL()df.shape


Data types The table below sums up the main data types that can be contained in columns:

R Data typePython Data typeDescription
characterobjectString-related data
factorString-related data that can be put in bucket, or ordered
numericfloat64Numerical data
intint64Numeric data that are integer
POSIXctdatetime64Timestamps


Data preprocessing

Filtering We can filter rows according to some conditions as follows:

CategoryR commandPython command
Overviewdf %>% filter(col operation val_or_col)df[df['col'] operation val_or_col]
Operation== / !=== / !=
<, <=, >=, ><, <=, >=, >
& / |& / |
is.na()pd.isnull()
%in% (val_1, ..., val_n).isin([val_1, ..., val_n])
%like% 'val'.str.contains('val')


Changing columns The table below summarizes the main column operations:

ActionR commandPython command
Add new columns on top of old onesdf %>%
  mutate(new_col = operation(other_cols))
df.assign(
  new_col=lambda x: some_operation(x)
)
Unite columnsdf %>%
  unite(new_merged_col, old_cols_list)
df['new_merged_col'] = (
  df[old_cols_list].agg('-'.join, axis=1)
)


Conditional column A column can take different values with respect to a particular set of conditions with the following R command:

case_when(condition_1 ~ value_1,  # If condition_1 then value_1
          condition_2 ~ value_2,  # If condition_2 then value_2
                ...
          TRUE ~ value_n)         # Otherwise, value_n

The corresponding Python command is as follows:

np.select(
  [condition_1, ..., condition_n],  # If condition_1, ..., condition_n
  [value_1, ..., value_n],          # Then value_1, ..., value_n respectively
  default=default_value             # Otherwise, default_value
)


Mathematical operations The table below sums up the main mathematical operations that can be performed on columns:

OperationR commandPython command
$\sqrt{x}$sqrt(x)np.sqrt(x)
$\lfloor x\rfloor$floor(x)np.floor(x)
$\lceil x\rceil$ceiling(x)np.ceil(x)


Datetime conversion Fields containing datetime values are converted from string to datetime as follows:

ActionR CommandPython Command
Converts string to datetimeas.POSIXct(col, format)pd.to_datetime(col, format)

where format is a string describing the structure of the field and using the commands summarized in the table below:

CategoryCommandDescriptionExample
Year'%Y' / '%y'With / without century2020 / 20
Month'%B' / '%b' / '%m'Full / abbreviated / numericalAugust / Aug / 8
Weekday'%A' / '%a'Full / abbreviatedSunday / Sun
'%u' / '%w'Number (1-7) / Number (0-6)7 / 0
Day'%d' / '%j'Of the month / of the year09 / 222
Time'%H' / '%M'Hour / minute09 / 40
Timezone'%Z' / '%z'String / Number of hours from UTCEST / -0400


Data frame transformation

Merging data frames We can merge two data frames by a given field as follows:

CategoryR commandPython command
Overviewmerge(df_1, df_2, join_field, join_type)df_1.merge(df_2, join_field, join_type)
Join fieldby = 'field'on='field'
by.x = 'field_1', by.y = 'field_2'left_on='field_1', right_on='field_2'
Join typeall.x = TRUEhow='left'
all.y = TRUEhow='right'
defaulthow='inner'
all = TRUEhow='outer'


Concatenation The table below summarizes the different ways data frames can be concatenated:

TypeR commandPython command
Rowsrbind(df_1, ..., df_n)pd.concat([df_1, ..., df_n], axis=0)
Columnscbind(df_1, ..., df_n)pd.concat([df_1, ..., df_n], axis=1)


Common transformations The common data frame transformations are summarized in the table below:

TypeR commandPython command
Long to widespread(
  df, key = 'key_name'
  value = 'value_name'
)
pd.pivot_table(
  df, columns='key_name',
  values='value_name',
  index=other_cols,
  aggfunc=np.sum
)
Wide to longgather(
  df, key = 'key_name',
  value = 'value_name',
  c(key_1, ..., key_n)
)
pd.melt(
  df, var_name='key',
  value_name='value',
  value_vars=[
    'key_1', ..., 'key_n'
  ], id_vars=some_cols
)


Row operations The following actions are used to make operations on rows of the data frame:

ActionR commandPython command
Sort data framedf %>% arrange(col_list)df.sort_values(by=col_list, ascending=True)
Drop duplicatesdf %>% unique()df.drop_duplicates()
Drop rows with null valuesdf %>% na.omit()df.dropna()


Aggregations

Grouping data A data frame can be aggregated with respect to given columns as follows:


The R command is as follows:

df %>%                                                 # Original ungrouped data frame
 group_by(col_1, ..., col_n) %>%                       # Group by some columns
 summarize(agg_metric = some_aggregation(some_cols))   # Aggregation step

and the Python command is as follows:

(df                                                    # Original ungrouped data frame
 .groupby(['col_1', ..., 'col_n'])                     # Group by some columns
 .agg({'col': builtin_agg}))                           # Aggregation step

where builtin_agg is among the following:

ActionR CommandPython Command
Count across observationsn()'count'
Sum across observationssum()'sum'
Max / min of values of observationsmax() / min()'max' / 'min'
Mean / median of values of observationsmean() / median()'mean' / 'median'
Standard deviation / variance across observationssd() / var()'std' / 'var'


Window functions

Definition A window function computes a metric over groups and has the following structure:


The R command is as follows:

df %>%                                       # Original ungrouped data frame
 group_by(col_1, ..., col_n) %>%             # Group by some columns
 mutate(win_metric = window_function(col))   # Window function

and the Python command is as follows:

(df
 .assign(win_metric = lambda x:
           x.groupby(['col_1', ..., 'col_n'])['col'].window_function(params)))

Remark: applying a window function will not change the initial number of rows of the data frame.


Row numbering The table below summarizes the main commands that rank each row across specified groups, ordered by a specific field:

R CommandPython CommandDescription
row_number(x)x.rank(method='first')Ties are given different ranks
rank(x)x.rank(method='min')Ties are given same rank and skip numbers
dense_rank(x)x.rank(method='dense')Ties are given same rank and do not skip numbers


Values The following window functions allow to keep track of specific types of values with respect to the group:

R CommandPython CommandDescription
lag(x, n)x.shift(n)Takes the $n^{\textrm{th}}$ previous value of the column
lead(x, n)x.shift(-n)Takes the $n^{\textrm{th}}$ following value of the column