DataRaccoon

TLDR

My tips and experience for writing better pandas code, which I have found useful from a readability and operational standpoint.

Problem Intro

Problem Set One

A significant number of data scientists / analysts write their prototyping code in pandas and, when required to ship it to production, encounter one or more of the following problems:

  1. Multiple re-assignments and hard-coded references.
  2. Pandas code that is hard to understand, making the author's intention difficult to decipher.
  3. Code that is especially hard to change (partly due to the absence of testing).

For example,

import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3, 4]})
df["B"] = df.A * 2

def divide_by_ten(input_series):
    return input_series / 10

df["C"] = divide_by_ten(df.B)

Problem Set Two

It also happens that quite a number of data professionals love R's dplyr or Spark's DataFrame API.

In my opinion, this is largely because of "piping". In R you could do something like this:

df %>% 
    mutate(...) %>% 
    group_by(...) %>%
    summarize(...) %>% 
    left_join(...) %>%
    filter(...) %>%
    ... 

Or if you are using PySpark, you would do something like this:

(df
.withColumn("new_col", sf.col(...))
.groupBy(...)
.agg(...)
.join(..., how='left')
.where(...)
)

Hopefully, the problem is clear to you, and the content here will help you write faster and better code!

Setup/Pre-req

Python Environment with the following libraries:

pandas==1.2.4
numpy==1.20.2

If you are familiar with Docker or have gone through my Docker post, this is my Dockerfile:

FROM continuumio/miniconda3:4.8.2
WORKDIR $HOME/src
COPY requirements.txt $HOME/src
RUN pip install -r requirements.txt

And these are the imports you will need:

import pandas as pd
import numpy as np
import random

Creating DataFrames

First, some small revision on creating dataframes.

DF from dict

The most common way to create a dataframe is from a dictionary, where each key is a column name and the value is a list containing that column's data.

d = {"category": ["a", "b", "c", "d", "e"], "value": [1, 2, 3, 4, 5]}
df = pd.DataFrame(data=d)
# this is also the equivalent
df = pd.DataFrame.from_dict(data=d)

"""
    category    value
0   a           1
1   b           2
2   c           3
3   d           4
4   e           5
"""

DF row wise

If you happen to want each key to be a row in the dataframe:

df_rows = pd.DataFrame.from_dict(data=d, orient="index")
df_rows

"""
            0   1   2   3   4
category    a   b   c   d   e
value       1   2   3   4   5
"""

DF from records

If you are working with the cloud, chances are you are dealing with generators that send you one data point at a time.

records = [
    dict(category="a", value=1),
    dict(category="b", value=2),
    dict(category="c", value=3),
]

df = pd.DataFrame.from_records(data=records)
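
If your records actually arrive from a generator (for example, pages yielded by a cloud SDK), one simple approach is to materialize them first; record_stream and df_stream below are just illustrative names:

def record_stream():
    # stand-in for a generator yielding one record (dict) at a time
    for i, cat in enumerate(["a", "b", "c"], start=1):
        yield dict(category=cat, value=i)

df_stream = pd.DataFrame.from_records(data=list(record_stream()))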

Using Loc

Selecting with Loc

In most pandas code, this is how selection is usually done:

# These return a series
df.category
df["category"]

# This returns a dataframe
df[["category"]]

# This returns every other row from positions 0 to 5 (step slicing)
df[0:5:2]

Instead of specifying it this way, you can make use of the loc indexer. Let's use the row-indexed dataframe as an example:

df_rows.loc["category"] # this returns a series
df_rows.loc[["category"]] # this returns a data frame 

You can also use iloc, which uses integer positions instead:

df_rows.iloc[0, 0:4] # still returns a series
df_rows.iloc[[0], 2:5] # double square brackets returns dataframe

This brings forth interesting ways to make use of loc. Suppose you want the category row, restricted to the columns where value is greater than 2, i.e.

df_rows.loc[[True, False], [False, False, True, True, True]]
"""
            2   3   4
category    c   d   e
"""

The following snippets are all equivalent:

df_rows.loc[[True, False], [False, False, True, True, True]]

df_rows.loc[["category"]].iloc[0:1, 2:5]

df_rows.loc[["category"]].loc[:, [False, False, True, True, True]]

(df_rows
.loc[:, lambda df: df.loc["value"] > 2]
.loc[["category"]]
)

Subsetting with loc

Typical Pandas code when it comes to filtering by some column values would look like this:

df[df["value"]>3]

By making use of loc:

df.loc[lambda df: df["value"] > 3]

This, in general, allows you (and future contributors) to read your code better; after all, explicit is better than implicit.
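
One concrete benefit: because the callable receives the dataframe at that point in the chain, you can filter on columns created earlier in the same chain without naming an intermediate variable. A small sketch (double is an illustrative column name):

(
    df.assign(double=lambda d: d["value"] * 2)
    .loc[lambda d: d["double"] > 3]
)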

There are many other reasons why using loc is better; more information can be found in the references.


Chaining w Groupby

As a data scientist, you will often need to calculate aggregate statistics, and if you are not using pandas on a regular basis, the resulting indexes can catch you off guard.

In order to illustrate, we create a bigger dataframe for the following examples:

random.seed(10)
num_values = 1000
category_one = random.choices(["a", "b", "c", "d", "e"], k=num_values)
category_two = random.choices(["i", "j", "k"], k=num_values)
values_one = random.sample(range(0, 1000), num_values)
values_two = random.sample(range(2000, 3000), num_values)
# Another way of defining it using list of lists
# Instead of using dict of arrays
df = pd.DataFrame(
    zip(category_one, category_two, values_one, values_two),
    columns=["c1", "c2", "v1", "v2"],
)

df.columns
"""
Index(['c1', 'c2', 'v1', 'v2'], dtype='object')
"""

This dataframe contains two categorical columns and two numerical columns. Typical group-by aggregation code looks like this:

df_agg = df.groupby(["c1", "c2"]).agg(["count"])
df_agg.index
"""
MultiIndex([('a', 'i'),
            ('a', 'j'),
            ('a', 'k'),
            ('b', 'i'),
            ('b', 'j'),
            ('b', 'k'),
            ('c', 'i'),
            ('c', 'j'),
            ('c', 'k'),
            ('d', 'i'),
            ('d', 'j'),
            ('d', 'k'),
            ('e', 'i'),
            ('e', 'j'),
            ('e', 'k')],
           names=['c1', 'c2'])
"""           

df_agg.columns

"""
MultiIndex([('v1', 'count'),
            ('v2', 'count')],
           )
"""

Dealing with MultiIndexes can be tricky and often unwelcome. One way to fix it:

df_agg = df_agg.reset_index()
df_agg.columns = ["group1", "group2", "count_v1", "count_v2"]

This poses a problem: what if you have multiple aggregations? This is how you might do it:

df_agg = df.groupby(["c1", "c2"]).agg(["count", "min", "max", "mean"])

df_agg.index = df_agg.index.map("_".join) # flattening index
df_agg.loc["a_i", :]  # statistics that belong to group a, i

"""
v1  count      62.000000
    min        39.000000
    max       991.000000
    mean      513.129032
v2  count      62.000000
    min      2024.000000
    max      2968.000000
    mean     2519.016129
Name: a_i, dtype: float64
"""

# flatten the columns 
df_agg = df_agg.reset_index()
df_agg.columns = df_agg.columns.map("_".join) # flattening columns
df_agg.loc[lambda df: df['index_']=="a_i"]

"""
index_  v1_count  v1_min  v1_max     v1_mean  v2_count  v2_min  v2_max  v2_mean   
a_i        62      39     991  513.129032        62    2024    2968      2519.016129  
"""

Aside: you can use numpy aggregation functions too.

df_agg = df.groupby(["c1", "c2"]).agg([np.size, np.min, np.max, np.mean])

NamedAgg

Instead of worrying about flattening the index(es), and since (as previously mentioned) explicit is better than implicit, this is where NamedAgg is very helpful.

df_agg = df.groupby(["c1", "c2"]).agg(
    v1_size=pd.NamedAgg(column="v1", aggfunc=np.size),
    v1_min=pd.NamedAgg(column="v1", aggfunc=np.min),
    v2_mean=pd.NamedAgg(column="v2", aggfunc=np.mean),
    v2_max=pd.NamedAgg(column="v2", aggfunc=np.max),
    v2_mean_max_diff=pd.NamedAgg(column="v2", aggfunc=lambda x: np.max(x) - np.mean(x)),
)

df_agg.reset_index().head(5)

"""
c1  c2  v1_size v1_min  v2_mean v2_max  v2_mean_max_diff
0   a   i   62  39  2519.016129 2968    448.983871
1   a   j   66  16  2500.984848 2984    483.015152
2   a   k   72  5   2553.694444 2998    444.305556
3   b   i   59  12  2486.033898 2997    510.966102
4   b   j   61  7   2538.459016 2996    457.540984
"""

Simplifying NamedAgg

In fact, pandas.NamedAgg is just a namedtuple, and plain tuples are allowed as well. Because of this, you can store the aggregations in a dictionary and unpack it into the aggregation call.

dict_of_agg = {
    "v1_size": ("v1", np.size),
    "v1_mean": ("v1", np.min),
    "v2_mean": ("v2", np.mean),
    "v2_max": ("v2", np.max),
    "v2_mean_max_diff": ("v2", lambda x: np.max(x) - np.mean(x)),
}
df_agg = df.groupby(["c1", "c2"]).agg(**dict_of_agg)

Transform

When preparing data for modeling, a common data-manipulation / feature-engineering step is to add group-level summary statistics to categorical features. You might approach this the following way:

  • Calculate summary statistics based on the categorical feature
  • Perform a join after, like so:
df_eg1 = df.copy()
df_group_stats = (
    df_eg1.groupby(["c1", "c2"]).agg(c1_c2_v1_sum=("v1", np.sum)).reset_index()
)
df_eg1 = df_eg1.merge(df_group_stats, on=["c1", "c2"])

Turns out there is a better way to do this, with the transform method:

df_eg2["c1_c2_v1_sum"] = df_eg2.groupby(["c1", "c2"])["v1"].transform(np.sum)

To test that these two results are the same:

# Sort just for comparison
df_eg1 = df_eg1.sort_values(["c1", "c2"]).reset_index(drop=True)
df_eg2 = df_eg2.sort_values(["c1", "c2"]).reset_index(drop=True)
df_eg1.equals(df_eg2)

"""
True
"""

For users coming from SQL-land, think of transform as a window function. You can perform ranking on groups like so:

df.groupby(["c1", "c2"])["v1"].transform("rank")

Another common use case where transform is very useful is imputing missing values based on group-level statistics:

df_eg3 = pd.DataFrame({"c1": ["a", "a", "b", "b"], "v1": [1, np.nan, 2, np.nan]})
df_eg3["v1"] = df_eg3["v1"].fillna(df_eg3.groupby("c1")["v1"].transform("mean"))

"""
    c1  v1
0   a   1.0
1   a   1.0
2   b   2.0
3   b   2.0
"""

Similarly, more information can be found in the references section.

Filter

When doing exploratory data analysis, sometimes you only want to analyze the subset of data fulfilling a certain set of conditions. One way of approaching the problem could be:

  • Calculate summary statistics into a new dataframe.
  • Filter that dataframe down to the keys fulfilling a certain condition.
  • Join back to the original dataframe to keep only those keys (sketched below).
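
A sketch of that manual approach, using the df and the size-greater-than-78 condition from the filter example further down (group_size, keys_to_keep and df_subset are illustrative names):

group_sizes = (
    df.groupby(["c1", "c2"])
    .agg(group_size=("v1", "size"))
    .reset_index()
)
# keep only the group keys whose size exceeds 78
keys_to_keep = group_sizes.loc[lambda d: d["group_size"] > 78, ["c1", "c2"]]
# inner merge acts as a semi-join back to the original rows
df_subset = df.merge(keys_to_keep, on=["c1", "c2"])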

Cutting to the chase, the filter method does all of this in one step:

df.groupby(["c1", "c2"]).filter(lambda x: len(x) > 78)

To illustrate:

df_d_j = df_agg.loc[[("d", "j")]].reset_index()
df_d_j

"""
    c1  c2  v1_size v1_min v2_mean v2_max  v2_mean_max_diff
0   d   j   79  2   2505.759494 2981    475.240506
"""

# to verify the results
# Also note, we can reuse the `dict_of_agg`
df_d_j_2 = (
    df.groupby(["c1", "c2"])
    .filter(lambda x: len(x) > 78)
    .reset_index()
    .groupby(["c1", "c2"])
    .agg(**dict_of_agg)
    .reset_index()
)

df_d_j.equals(df_d_j_2)  # True

For users coming from SQL-land, think of filter as a HAVING clause.


Creating New Columns

In the problem intro, we mentioned the re-assignment / hard-coded referencing problem. To reiterate, suppose we want to calculate the ratio v1 / (v1 + v2):

df['sum'] = df['v1'] + df['v2']
df['fraction_v1'] = df['v1'] / df['sum']

If you need to rename sum to something else, say total_v1_v2, you would need to change multiple lines of code.

Another problem with this approach: when prototyping functions, you will find yourself re-running multiple lines of Python code above, or repeatedly doing df.copy().

To overcome this problem, let's first understand the assign function.

Assign

One huge advantage of assign is that it returns a new object containing a copy of the original data with the requested changes. The original dataframe remains unchanged!

df.assign(sum=lambda x: x["v1"] + x["v2"], fraction_v1=lambda x: x["v1"] / x["sum"])
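
A quick way to convince yourself that the original is untouched (df_demo and total are illustrative names; a fresh copy is used so earlier cell re-runs do not interfere):

df_demo = df[["v1", "v2"]].copy()
df_new = df_demo.assign(total=lambda x: x["v1"] + x["v2"])

print("total" in df_demo.columns)  # False -- df_demo is unchanged
print("total" in df_new.columns)   # True  -- only the returned copy has it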

There is still some hard-coding involved with the sum column name. To address this, the column names can be pulled out into variables, and the feature transformations stored in a dictionary:

value1 = "v1"
value2 = "v2"
feat1_name = "sum"
feat2_name = "fraction_v1"
dict_of_features = {
    feat1_name: lambda x: x[value1] + x[value2],
    feat2_name: lambda x: x[value1] / x[feat1_name],
}

df.assign(**dict_of_features)

Assign w Apply

You can also use apply with custom functions; here is a trivial example replicating the above:

def simple_ratio(x, y):
    return x / (x + y)


df.assign(
    **{
        feat2_name: lambda x: x.apply(
            lambda y: simple_ratio(y[value1], y[value2]), axis=1
        )
    }
)

Pipe

There still remain two unsolved problems:

  • What if we want to make such steps unit-testable?
  • What if you love the PySpark API or R's dplyr API?

Introducing the pipe method.

To replicate the above example of calculating the ratio of v1/(v1+v2), first create a function as follows:

def add_ratio(df_input, col1, col2, new_col="fraction_v1"):
    return df_input.assign(
        **{
            new_col: lambda x: x.apply(
                lambda y: simple_ratio(y[col1], y[col2]), axis=1
            )
        }
    )
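
Because add_ratio depends only on its inputs, it is straightforward to unit-test; a minimal pytest-style sketch (the test name and fixture values are illustrative):

import pandas as pd

def test_add_ratio():
    df_in = pd.DataFrame({"v1": [1.0, 3.0], "v2": [1.0, 1.0]})
    result = add_ratio(df_in, col1="v1", col2="v2")
    # v1 / (v1 + v2) -> 1/2 and 3/4
    assert list(result["fraction_v1"]) == [0.5, 0.75]
    # the input dataframe is not mutated
    assert "fraction_v1" not in df_in.columns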

To use it for data transformation in your main code / data processing:

# recall that df_input is the first input variable in the add_ratio function
df.pipe(add_ratio, col1=value1, col2=value2)

However, if the dataframe is not the first parameter, you can pass pipe a tuple specifying which keyword argument should receive the dataframe, like so:

df.pipe((add_ratio, "df_input"), col1=value1, col2=value2).head(5)
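
Putting the pieces together, a sketch of how pipe, loc and groupby compose into one chained pipeline, much like the dplyr / PySpark examples in the intro (the 0.2 threshold and the df_report name are arbitrary):

df_report = (
    df
    .pipe(add_ratio, col1="v1", col2="v2")
    .loc[lambda d: d["fraction_v1"] > 0.2]
    .groupby(["c1", "c2"])
    .agg(mean_fraction=("fraction_v1", "mean"))
    .reset_index()
)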

Summary

To summarize,

  • Using loc
    • for selecting columns
    • for slicing of dataframes
  • GroupBy
    • Using NamedAgg
    • Transform
    • Filter
  • Assign
    • Assign with dictionaries
    • Assign with apply
  • Pipe

Hope this helps you in creating better & faster pandas code!

References