My tips and experience for writing better pandas code, which I have found useful from a readability and operational standpoint.

A significant number of data scientists and analysts write their prototyping code in pandas, and when required to ship it to production, encounter one or more of the following problems:

- Repeated re-assignment and hard-coded column references.
- Hard to understand pandas code and decipher the author's intention.
- Especially hard to make changes (partly due to the absence of testing).

For example,

```
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3, 4]})
df["B"] = df.A * 2

def divide_by_ten(input_series):
    return input_series / 10

df["C"] = divide_by_ten(df.B)
```

It also happens that quite a number of data professionals love R's dplyr or Spark's DataFrame API.

In my opinion, this is largely because of "piping". In R you could do something like this:

```
df %>%
  mutate(...) %>%
  group_by(...) %>%
  summarize(...) %>%
  left_join(...) %>%
  filter(...) %>%
  ...
```

Or if you are using PySpark, you would do something like this:

```
(df
    .withColumn("new_col", sf.col(...))
    .groupBy(...)
    .agg(...)
    .join(..., how='left')
    .where(...)
)
```

Hopefully, the problem is clear to you and the content here will help you in writing faster and better code!

A Python environment with the following libraries:

```
pandas==1.2.4
numpy==1.20.2
```

If you are familiar with Docker or have gone through my Docker post, this is my Dockerfile:

```
FROM continuumio/miniconda3:4.8.2
WORKDIR $HOME/src
COPY requirements.txt $HOME/src
RUN pip install -r requirements.txt
```

And these are the imports you will need:

```
import pandas as pd
import numpy as np
import random
```

First, a quick refresher on creating dataframes.

The most common way to create a dataframe is from a dictionary, where each key is a column name and each value is a list containing that column's data.

```
d = {"category": ["a", "b", "c", "d", "e"], "value": [1, 2, 3, 4, 5]}
df = pd.DataFrame(data=d)
# this is also the equivalent
df = pd.DataFrame.from_dict(data=d)
"""
category value
0 a 1
1 b 2
2 c 3
3 d 4
4 e 5
"""
```

If you happen to want each key to be a row in the dataframe:

```
df_rows = pd.DataFrame.from_dict(data=d, orient="index")
df_rows
"""
          0  1  2  3  4
category  a  b  c  d  e
value     1  2  3  4  5
"""
```

If you are working with the cloud, chances are you are dealing with generators that send you one data point at a time.

```
records = [
    dict(category="a", value=1),
    dict(category="b", value=2),
    dict(category="c", value=3),
]
df = pd.DataFrame.from_records(data=records)
```
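Since `from_records` accepts any iterable of dicts, you can also feed it a generator directly; here is a minimal sketch, where `record_stream` is a hypothetical stand-in for, say, a paginated cloud API:

```
def record_stream():
    # hypothetical generator yielding one record at a time
    for i, cat in enumerate(["a", "b", "c"], start=1):
        yield dict(category=cat, value=i)

df = pd.DataFrame.from_records(record_stream())
```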

In most pandas code, this is how selection is usually done:

```
# This returns a series
df.category
df["category"]
# This returns a dataframe
df[["category"]]
# This returns every other row among the first five (a slice with a step)
df[0:5:2]
```

Instead of specifying it this way, you can make use of the `loc` function. Let's use the rows-indexed dataframe as an example:

```
df_rows.loc["category"] # this returns a series
df_rows.loc[["category"]] # this returns a data frame
```

You can also use `iloc`, which uses positional (integer) indexing instead:

```
df_rows.iloc[0, 0:4] # still returns a series
df_rows.iloc[[0], 2:5] # double square brackets returns dataframe
```

This brings forth interesting ways to make use of `loc`. Suppose you want the category row, restricted to the columns whose value is greater than 2, i.e.

```
df_rows.loc[[True, False], [False, False, True, True, True]]
"""
          2  3  4
category  c  d  e
"""
```

The following expressions are all equivalent:

```
df_rows.loc[[True, False], [False, False, True, True, True]]
df_rows.loc[["category"]].iloc[0:1, 2:5]
df_rows.loc[["category"]].loc[:, [False, False, True, True, True]]
# select the columns first, while the "value" row is still present
(df_rows
    .loc[:, lambda df: df.loc["value"] > 2]
    .loc[["category"]]
)
```

Typical pandas code for filtering by some column value looks like this:

`df[df["value"] > 3]`

By making use of `loc`:

`df.loc[lambda df: df["value"] > 3]`

This, in general, allows you (and future contributors) to read your code more easily; after all, **explicit is better than implicit**.
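One payoff is chainability: each callable receives the dataframe as it stands at that point in the chain, so successive filters can be stacked without intermediate variables. A minimal sketch on the small frame above:

```
(df
    .loc[lambda x: x["value"] > 1]
    .loc[lambda x: x["category"] != "a"]
)
```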

There are many other reasons why using `loc` is better; more information can be found in the references.

As a data scientist, you normally need to calculate aggregate statistics, and if you are not using pandas on a regular basis, the indexes will usually catch you off guard.

In order to illustrate, we create a bigger dataframe for the following examples:

```
random.seed(10)
num_values = 1000
category_one = random.choices(["a", "b", "c", "d", "e"], k=num_values)
category_two = random.choices(["i", "j", "k"], k=num_values)
values_one = random.sample(range(0, 1000), num_values)
values_two = random.sample(range(2000, 3000), num_values)
# Another way of defining a dataframe: a list of rows (via zip)
# instead of a dict of columns
df = pd.DataFrame(
    zip(category_one, category_two, values_one, values_two),
    columns=["c1", "c2", "v1", "v2"],
)
df.columns
"""
Index(['c1', 'c2', 'v1', 'v2'], dtype='object')
"""
```

This dataframe contains two categorical columns, and two numerical columns. Typical group-by transform code will look like this:

```
df_agg = df.groupby(["c1", "c2"]).agg(["count"])
df_agg.index
"""
MultiIndex([('a', 'i'),
            ('a', 'j'),
            ('a', 'k'),
            ('b', 'i'),
            ('b', 'j'),
            ('b', 'k'),
            ('c', 'i'),
            ('c', 'j'),
            ('c', 'k'),
            ('d', 'i'),
            ('d', 'j'),
            ('d', 'k'),
            ('e', 'i'),
            ('e', 'j'),
            ('e', 'k')],
           names=['c1', 'c2'])
"""
df_agg.columns
"""
MultiIndex([('v1', 'count'),
            ('v2', 'count')],
           )
"""
```

Dealing with MultiIndexes can be tricky and often unwelcome. One can fix it like so:

```
df_agg = df_agg.reset_index()
df_agg.columns = ["group1", "group2", "count_v1", "count_v2"]
```

This poses a problem: what if you have multiple aggregations? This is how you might do it:

```
df_agg = df.groupby(["c1", "c2"]).agg(["count", "min", "max", "mean"])
df_agg.index = df_agg.index.map("_".join)  # flattening the index
df_agg.loc["a_i", :]  # statistics that belong to group a, i
"""
v1  count      62.000000
    min        39.000000
    max       991.000000
    mean      513.129032
v2  count      62.000000
    min      2024.000000
    max      2968.000000
    mean     2519.016129
Name: a_i, dtype: float64
"""
# flatten the columns
df_agg = df_agg.reset_index()
df_agg.columns = df_agg.columns.map("_".join)  # flattening the columns
df_agg.loc[lambda df: df["index_"] == "a_i"]
"""
  index_  v1_count  v1_min  v1_max     v1_mean  v2_count  v2_min  v2_max      v2_mean
0    a_i        62      39     991  513.129032        62    2024    2968  2519.016129
"""
```

Aside: you can use numpy aggregation functions too.

`df_agg = df.groupby(["c1", "c2"]).agg([np.size, np.min, np.max, np.mean])`

Instead of worrying about flattening the index(es), and since, as previously mentioned, **explicit is always better than implicit**, this is where `NamedAgg` is very helpful.

```
df_agg = df.groupby(["c1", "c2"]).agg(
    v1_size=pd.NamedAgg(column="v1", aggfunc=np.size),
    v1_min=pd.NamedAgg(column="v1", aggfunc=np.min),
    v2_mean=pd.NamedAgg(column="v2", aggfunc=np.mean),
    v2_max=pd.NamedAgg(column="v2", aggfunc=np.max),
    v2_mean_max_diff=pd.NamedAgg(column="v2", aggfunc=lambda x: np.max(x) - np.mean(x)),
)
df_agg.reset_index().head(5)
"""
  c1 c2  v1_size  v1_min      v2_mean  v2_max  v2_mean_max_diff
0  a  i       62      39  2519.016129    2968        448.983871
1  a  j       66      16  2500.984848    2984        483.015152
2  a  k       72       5  2553.694444    2998        444.305556
3  b  i       59      12  2486.033898    2997        510.966102
4  b  j       61       7  2538.459016    2996        457.540984
"""
```

In fact, `pandas.NamedAgg` is just a `namedtuple`, and plain tuples are allowed as well. Because of this, you can actually store the aggregations in a dictionary and unpack it into the aggregation call.

```
dict_of_agg = {
    "v1_size": ("v1", np.size),
    "v1_min": ("v1", np.min),
    "v2_mean": ("v2", np.mean),
    "v2_max": ("v2", np.max),
    "v2_mean_max_diff": ("v2", lambda x: np.max(x) - np.mean(x)),
}
df_agg = df.groupby(["c1", "c2"]).agg(**dict_of_agg)
```

When preparing data for modeling, a common data manipulation (feature engineering) step is to add group summary statistics to categorical features. You might approach this the following way:

- Calculate summary statistics based on the categorical feature.
- Perform a join afterwards, like so:

```
df_eg1 = df.copy()
df_group_stats = (
    df_eg1.groupby(["c1", "c2"]).agg(c1_c2_v1_sum=("v1", np.sum)).reset_index()
)
df_eg1 = df_eg1.merge(df_group_stats, on=["c1", "c2"])
```

Turns out there is a better way to do this, with the `transform` method (starting from a fresh copy, `df_eg2 = df.copy()`):

`df_eg2["c1_c2_v1_sum"] = df_eg2.groupby(["c1", "c2"])["v1"].transform(np.sum)`

To test that these two results are the same:

```
# Sort just for comparison
df_eg1 = df_eg1.sort_values(["c1", "c2"]).reset_index(drop=True)
df_eg2 = df_eg2.sort_values(["c1", "c2"]).reset_index(drop=True)
df_eg1.equals(df_eg2)
"""
True
"""
```

For users coming from SQL-land, think of `transform` as a `window` function. You can perform ranking within groups like so:

`df.groupby(["c1", "c2"])["v1"].transform("rank")`

Another common use case where `transform` can be very useful is imputing missing values based on group membership:

```
df_eg3 = pd.DataFrame({"c1": ["a", "a", "b", "b"], "v1": [1, np.nan, 2, np.nan]})
df_eg3["v1"] = df_eg3["v1"].fillna(df_eg3.groupby("c1")["v1"].transform("mean"))
"""
  c1   v1
0  a  1.0
1  a  1.0
2  b  2.0
3  b  2.0
"""
```

Similarly, more information can be found in the references section.

When doing exploratory data analysis, sometimes you only want to analyze a subset of the data fulfilling a certain set of conditions. One way of approaching the problem could be:

- Calculate summary statistics into a new dataframe.
- Filter out keys whose summary statistics fulfil a certain condition.
- Perform an anti join.

Cutting to the chase instead:

`df.groupby(["c1", "c2"]).filter(lambda x: len(x) > 78)`

To illustrate:

```
df_d_j = df_agg.loc[[("d", "j")]].reset_index()
df_d_j
"""
  c1 c2  v1_size  v1_min      v2_mean  v2_max  v2_mean_max_diff
0  d  j       79       2  2505.759494    2981        475.240506
"""
# to verify the results
# Also note, we can reuse the `dict_of_agg`
df_d_j_2 = (
    df.groupby(["c1", "c2"])
    .filter(lambda x: len(x) > 78)
    .reset_index()
    .groupby(["c1", "c2"])
    .agg(**dict_of_agg)
    .reset_index()
)
df_d_j.equals(df_d_j_2)  # True
```

For users coming from SQL-land, think of `filter` as a `having` clause.

At the problem intro, we mentioned the re-assignment / hard-coded referencing problem. To reiterate, suppose we want to calculate the ratio `v1 / (v1 + v2)`:

```
df['sum'] = df['v1'] + df['v2']
df['fraction_v1'] = df['v1'] / df['sum']
```

If you need to change the `sum` column into something else, say `total_v1_v2`, you would need to change multiple lines of code.

Another problem with this approach: when you want to prototype functions, you would find yourself re-running multiple lines of Python code above, or resorting to `df.copy()`.

To overcome this problem, let's first understand the `assign` function.

One huge advantage of `assign` is that it returns a new object holding a copy of the original data with the requested changes; the original dataframe remains unchanged!

`df.assign(sum=lambda x: x["v1"] + x["v2"], fraction_v1=lambda x: x["v1"] / x["sum"])`
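A minimal sketch to demonstrate (`out` is just a throwaway name):

```
out = df.assign(sum=lambda x: x["v1"] + x["v2"])
"sum" in out.columns  # True
"sum" in df.columns   # False: the original dataframe is untouched
```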

There is still some hard-coding involved with the `sum` column name; also, the feature transformations can be stored in a dictionary!

```
value1 = "v1"
value2 = "v2"
feat1_name = "sum"
feat2_name = "fraction_v1"
dict_of_features = {
feat1_name: lambda x: x[value1] + x[value2],
feat2_name: lambda x: x[value1] / x[feat1_name],
}
df.assign(**dict_of_features)
```

You can also use `apply` with custom functions; here is a trivial example to replicate the above:

```
def simple_ratio(x, y):
    return x / (x + y)

df.assign(
    **{
        feat2_name: lambda x: x.apply(
            lambda y: simple_ratio(y[value1], y[value2]), axis=1
        )
    }
)
```

There still remain two problems unsolved:

- What if we want to make such steps unit-testable?
- What if we love the PySpark API or R's dplyr API?

Introducing the `pipe` method.

To replicate the above example of calculating the ratio `v1 / (v1 + v2)`, first create a function as follows:

```
def add_ratio(df_input, col1, col2):
    return df_input.assign(
        **{
            feat2_name: lambda x: x.apply(
                lambda y: simple_ratio(y[col1], y[col2]), axis=1
            )
        }
    )
```

This function is easily unit-testable (with mocks where needed). To use it for data transformation in your main code / data processing:

```
# recall that df_input is the first input variable in the add_ratio function
df.pipe(add_ratio, col1=value1, col2=value2)
```
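As a sketch of what such a test might look like (assuming pytest-style discovery, with `add_ratio`, `simple_ratio`, and `feat2_name` importable, and using the `pandas.testing` helpers):

```
import pandas.testing as pdt

def test_add_ratio():
    df_in = pd.DataFrame({"v1": [1, 1], "v2": [1, 3]})
    result = add_ratio(df_in, col1="v1", col2="v2")
    # fraction_v1 = v1 / (v1 + v2): 1/2 and 1/4
    expected = df_in.assign(fraction_v1=[0.5, 0.25])
    pdt.assert_frame_equal(result, expected)
```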

However, if the first parameter is not the dataframe, you can tell `pipe` which argument the dataframe should be passed in as, like so:

`df.pipe((add_ratio, "df_input"), col1=value1, col2=value2).head(5)`
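Putting it all together, here is a sketch of the dplyr-style chain promised at the start, reusing the dictionaries from the running example (the `0.2` threshold is arbitrary, purely for illustration):

```
result = (
    df
    .assign(**dict_of_features)
    .loc[lambda x: x["fraction_v1"] > 0.2]  # arbitrary cut-off for illustration
    .groupby(["c1", "c2"])
    .agg(**dict_of_agg)
    .reset_index()
)
```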

To summarize:

- Using `loc`
  - for selecting columns
  - for slicing dataframes
- GroupBy
  - Using `NamedAgg`
  - Transform
  - Filter
- Assign
  - Assign with dictionaries
  - Assign with apply
- Pipe

Hope this helps you in creating better & faster pandas code!

References:

- Using Pandas loc
- NamedAgg
- Transform and Apply with groupby
- Assign