A Brief Tour of Grouping and Aggregating in Pandas

By Andre Perunicic | October 13, 2017

If you work with data in Python, chances are that you’ve heard of the pandas data manipulation library. You can think of pandas as a way to programmatically interact with spreadsheets. It works well with huge datasets, unlike its desktop counterparts like Google Sheets and Microsoft Excel, and implements a number of common database operations like merging, pivoting, and grouping. Moreover, being backed by numpy and efficient algorithm implementations makes it fast and easily integrated with other tools in the vast Python data science landscape.

This article is a brief introduction to pandas with a focus on one of its most useful features when it comes to quickly understanding a dataset: grouping. Data in pandas is stored in dataframes, its analog of spreadsheets. Here are the first few rows of a dataframe that will be described in a bit more detail further down.

	sepal length	sepal width	petal length	petal width	species	location
0	5.1	3.5	1.4	0.2	setosa	south
1	4.9	3.0	1.4	0.2	setosa	south
2	4.7	3.2	1.3	0.2	setosa	north
3	4.6	3.1	1.5	0.2	setosa	south
4	5.0	3.6	1.4	0.2	setosa	north
...	...	...	...	...	...	...

As you can see, this dataframe features 4 numerical variables and 2 categorical variables. Grouping lets you slice up the rows of a dataframe into, well, groups that have the same values in one or more categorical variables. These are useful because you can then easily calculate statistics for each group and aggregate the results into a new dataframe. For instance, we could split the dataframe whose first few rows are shown above into groups with the same species and location, and then calculate the minimum and maximum petal widths and lengths for each group. Aggregating these results then yields a new dataframe summarizing the groups:

	species	location	petal width min	petal width max	petal length min	petal length max
0	setosa	north	0.1	0.6	1.2	1.7
1	setosa	south	0.1	0.5	1.0	1.9
2	versicolor	north	1.0	1.8	3.3	5.0
3	versicolor	south	1.0	1.6	3.0	5.1
4	virginica	north	1.4	2.4	4.8	6.9
5	virginica	south	1.5	2.5	4.5	6.7

Pandas makes grouping and aggregation pretty easy, but there are still a few sticking points and syntax challenges to understand before being able to take full advantage of these features. By the end of the article you should have a great understanding of what pandas’ grouping and aggregation capabilities are and how to use them.

Initial Setup and Dataframe Basics

If you want to play along, installing pandas and some supporting packages is simple.

# Create a new directory to work in.
mkdir pandas-aggregation 
cd pandas-aggregation

# Setup a virtualenv so we can install packages locally.
virtualenv env
. env/bin/activate

# Install data analysis related packages.
pip install pandas numpy scikit-learn
# Install an interactive python notebook environment.
pip install jupyter

For this article I’ll assume that commands are executed within a Jupyter notebook, an interactive environment that lets you write code and immediately see nicely formatted outputs. Start Jupyter with jupyter notebook and use the menu to create a new notebook file. I will use the Iris dataset to illustrate the code throughout the article. This well known dataset consists of 150 measurements of sepals and petals from three different kinds of the Iris flower. You may remember the “location” feature from the table above, but I just added that feature myself for demonstrative purposes; it isn’t actually included in the Iris dataset.

In any case, the dataset can be easily loaded via an existing method from scikit-learn, a Python machine learning library. First, load all the requirements

import pandas as pd
import numpy as np
from sklearn import datasets

and load the data via

iris = datasets.load_iris()

While loading was simple, the data is kind of difficult to work with: the features and labels are stored in separate numpy arrays within the iris object and feature names are treated separately. We can extract them with

X = iris.data
y = iris.target

where measurements are stored in the numpy array X

array([[ 5.1,  3.5,  1.4,  0.2],
       [ 4.9,  3. ,  1.4,  0.2],
       [ 4.7,  3.2,  1.3,  0.2],
       # ...
      ])

whose column names are defined in iris.feature_names:

['sepal length (cm)',
 'sepal width (cm)',
 'petal length (cm)',
 'petal width (cm)']

Each entry of the array y corresponds to a row of X and is either 0, 1 or 2, depending on which species the measurement belongs to. We can combine this data into a single array with np.c_[X, y] and place it into a pandas dataframe using

data = np.c_[X, y]
columns = [column.replace(' (cm)', '') for column in iris.feature_names] + ['species']
df = pd.DataFrame(data, columns=columns)

The first few rows of the dataframe can be shown with df.head() (see the first table in the article for example output).

We’ll want to change the “species” column from being integer-valued to containing the actual names of the three subspecies stored in iris.target_names. The dictionary

replacements = {0: 'setosa', 1: 'versicolor', 2: 'virginica'}

contains a mapping between integers and the corresponding name and utilized for this purpose via

df['species'] = df['species'].map(replacements)

We see that columns in pandas are accessed and modified using syntax of the form df['<column name>'']. Moreover, columns of a dataframe are instances of type pandas.core.series.Series and have useful methods attached to them. For example, map() can be used to replace entries of a series just as we’ve done above.

You can access multiple columns with, e.g., df[['petal width', 'petal length']], or restrict your selection to rows meeting certain criteria with something like

df.loc[df['species'] == 'setosa', 'petal length'].head()

which gives

0    1.4
1    1.4
2    1.3
3    1.5
4    1.4
Name: petal length, dtype: float64

Internally, pandas maintains row and column indexes which are used with custom __getitem__() and __eq__() methods to make selections with [] like this possible. This is just another way of saying that pandas uses built-in methods to achieve “pythonic” object behavior. There’s obviously a lot more that you can do, but these few things will already get you pretty far.

Grouping and Aggregation Basics

Suppose that you want to calculate the mean petal length of each of the three species. One approach is to manually select the rows corresponding to each species and calculate the mean on the corresponding slice.

aggregated_rows = []
metrics = ['petal length', 'petal width', 'sepal length', 'sepal width']
for species in df['species'].unique():
    # axis=0 calculates the mean for each column.
    result = df.loc[df['species'] == species, metrics].mean(axis=0)
    aggregated_rows.append([species] + list(result))

mean_metrics = [metric + ' mean' for metric in metrics]
gdf = pd.DataFrame(aggregated_rows, columns=['species'] + mean_metrics)

The final “grouped” dataframe gdf then looks something like the following.

	species	petal length mean	petal width mean	sepal length mean	sepal width mean
0	setosa	1.464	0.244	5.006	3.418
1	versicolor	4.260	1.326	5.936	2.770
2	virginica	5.552	2.026	6.588	2.974

There is, however, a much easier and more efficient way to accomplish this that is built into pandas. First, you can form the groups easily with the groupby() method:

groups = df.groupby('species')
for species, species_df in groups:
    print("There are {} rows corresponding to species {}."
          .format(species_df.shape[0], species))

which prints

There are 50 rows corresponding to species setosa.
There are 50 rows corresponding to species versicolor.
There are 50 rows corresponding to species virginica.

You could compute the mean within the for loop as before with species_df.mean(axis=0), but an easier way is to just follow up with the agg() method and the desired operation:

groups.agg('mean')

If groupby() is the bread, then agg() the butter. It takes each group produced by a call to groupby() and applies calculations specified in its arguments to each group before collapsing the results into a new dataframe. In this particular case, agg calculates the mean of each column in each group and produces

	sepal length	sepal width	petal length	petal width
species
setosa	5.006	3.418	1.464	0.244
versicolor	5.936	2.770	4.260	1.326
virginica	6.588	2.974	5.552	2.026

The remainder of the article will explore the different ways to use groupby() and agg() to quickly and efficiently extract per-level statistics from one (or more!) categorical variables.

Single Grouping Column, Single Aggregation Function

Consider (again) the simplest grouping/aggregation case:

gdf = df.groupby('species').agg('mean')

The grouped dataframe gdf looks exactly like the one shown in the table right above this section. I usually take the additional step of “flattening” the grouped dataframe so that the columns go from a staggered form like

	sepal length	sepal width	petal length	petal width
species

to a “flat” form like

	species	sepal length	sepal width	petal length	petal width

What’s going on here is that pandas stores information about each group in a row index gdf.index which is printed as above and described by

Index([u'setosa', u'versicolor', u'virginica'], dtype='object', name=u'species')

To flatten this particular dataframe we have to move this index data into a column, which is accomplished by calling

gdf = gdf.reset_index()

thereby producing

	species	sepal length	sepal width	petal length	petal width
0	setosa	5.006	3.418	1.464	0.244
1	versicolor	5.936	2.770	4.260	1.326
2	virginica	6.588	2.974	5.552	2.026

Whether or not you do this is of course up to you, but I find it useful when following up multiple aggregations with a merge: a common scenario when wanting to combine statistics about the same entities from multiple sources.

It’s also bit annoying to have to remember that (in this case) the displayed value is the mean of each column in the given group. You could rename the columns manually with gdf = gdf.rename(columns={'<old name>': '<new name>'}), but an easier way is to pass ['mean'] instead of simply 'mean' into the agg() method. We’ll cover this case next, while also adding multiple aggregation functions into the mix.

Single Grouping Column, Multiple Aggregation Functions

To use multiple aggregation functions you can simply pass them in as a list to agg()

gdf = df.groupby('species').agg(['min', 'max'])

which produces

	sepal length		sepal width		petal length		petal width
	min	max	min	max	min	max	min	max
species
setosa	4.3	5.8	2.3	4.4	1.0	1.9	0.1	0.6
versicolor	4.9	7.0	2.0	3.4	3.0	5.1	1.0	1.8
virginica	4.9	7.9	2.2	3.8	4.5	6.9	1.4	2.5

Flattening the dataframe is accomplished a bit differently this time, since now type(gdf.columns) equal to pandas.core.index.MultiIndex instead of just pandas.core.index.Index as was the case in the previous example. For practical purposes this means that reset_index() won’t produce a fully flattened dataframe. However, we can manually set the columns (which will reset the type to pandas.core.index.Index) with

gdf.columns = [' '.join(col) for col in gdf.columns]

This will concatenate the original column name (i.e., ‘sepal width’) with the name of the aggregation function (i.e., ‘max’). After that, gdf.reset_index() works as before, producing

	sepal length min	sepal length max	sepal width min	sepal width max	petal length min	petal length max	petal width min	petal width max
species
setosa	4.3	5.8	2.3	4.4	1.0	1.9	0.1	0.6
versicolor	4.9	7.0	2.0	3.4	3.0	5.1	1.0	1.8
virginica	4.9	7.9	2.2	3.8	4.5	6.9	1.4	2.5

Single Grouping Column, Custom Aggregation

In addition to specifying a list of aggregation functions, pandas allows the user to separately customize the aggregation functions and column names for each column. For instance,

def percentile(n):
    def _percentile(x):
        return np.percentile(x, n)
    _percentile.__name__ = '{}-th percentile'.format(n)
    return _percentile

gdf = df.groupby('species').agg({
    'sepal width': {
        'width min': 'min',
        'width max': 'max'
    },
    'sepal length': ['max', 'mean', percentile(20)]
})

will only aggregate the groups for the ‘sepal width’ and ‘sepal length’ columns, and will apply different functions in each case, resulting in the following.

	sepal width		sepal length
	width min	width max	max	mean	20-th percentile
species
setosa	2.3	4.4	5.8	5.006	4.7
versicolor	2.0	3.4	7.0	5.936	5.5
virginica	2.2	3.8	7.9	6.588	6.1

For “sepal width”, we are applying the 'min' and 'max' built-in functions with custom names, and for “petal width” we are applying the 'max' and 'mean' built-in functions as well as our own function for computing percentiles. As you can see from the definition of percentile(n), the custom name “20-th percentile” is achieved by setting the __name__ attribute of the returned function. Flattening this dataframe is again achieved by resetting the columns and the index. This time we’ll retain only the function name of each aggregated column with

gdf.columns = [col[1] for col in gdf.columns]
gdf.reset_index()

which results in a flattened dataframe

	species	width min	width max	max	mean	20-th percentile
0	setosa	2.3	4.4	5.8	5.006	4.7
1	versicolor	2.0	3.4	7.0	5.936	5.5
2	virginica	2.2	3.8	7.9	6.588	6.1

Multiple Grouping Columns

Pandas can also group based on multiple columns, simply by passing a list into the groupby() method. To demonstrate this, we’ll add a fake data column to the dataframe

# Add a second categorical column to form groups on.
df['location'] = np.random.choice(['north', 'south'], df.shape[0])

and proceed as usual

gdf = df.groupby(['species', 'location']).agg(['min', 'max'])

thereby producing

		sepal length		sepal width		petal length		petal width
		min	max	min	max	min	max	min	max
species	location
setosa	north	4.7	5.8	3.0	4.2	1.2	1.7	0.1	0.6
setosa	south	4.3	5.7	2.3	4.4	1.0	1.9	0.1	0.5
versicolor	north	4.9	6.9	2.0	3.2	3.3	5.0	1.0	1.8
versicolor	south	5.0	7.0	2.2	3.4	3.0	5.1	1.0	1.6
virginica	north	5.8	7.7	2.2	3.3	4.8	6.9	1.4	2.4
virginica	south	4.9	7.9	2.5	3.8	4.5	6.7	1.5	2.5

The only real difference between this case and the single grouping column case is both gdf.columns and gdf.index are of type pd.core.index.MultiIndex. This allows for multi-index selectors via .loc such as gdf.loc[('virginica', 'south')] giving

sepal length  min    4.9
              max    7.9
sepal width   min    2.5
              max    3.8
petal length  min    4.5
              max    6.7
petal width   min    1.5
              max    2.5
Name: (virginica, south), dtype: float64

and gdf.loc[('virginica', 'south')].loc[('petal width', 'min')] giving 1.5.

In terms of flattening, there is no real difference between this case and the single grouping column case, since reset_index() transfers the index levels to columns. The result looks like

	species	location	sepal length min	sepal length max	sepal width min	sepal width max	petal length min	petal length max	petal width min	petal width max
0	setosa	north	4.4	5.7	2.3	4.4	1.0	1.9	0.1	0.6
1	setosa	south	4.3	5.8	3.0	4.2	1.1	1.9	0.1	0.5
2	versicolor	north	4.9	7.0	2.0	3.2	3.0	4.9	1.0	1.8
3	versicolor	south	5.2	6.9	2.3	3.4	3.5	5.1	1.0	1.7
4	virginica	north	4.9	7.7	2.2	3.8	4.5	6.9	1.5	2.5
5	virginica	south	5.6	7.9	2.5	3.8	4.8	6.6	1.4	2.5

Flattening Utility

I’ve been doing this on a case-by-case basis throughout the article, but putting together a simple utility for flattening aggregated dataframes is pretty easy. All you have to do is reset any multi-indexes that appear. For columns, which correspond to (source column, aggregation function) pairs you can simply use a list comprehension, and for rows reset_index() is sufficient.

def flatten_aggregated_dataframe(
    gdf, concat_name=True, concat_separator=' ', name_level=1, inplace=False):
    """
    Flatten aggregated DataFrame.

    Args:
        gdf: DataFrame obtained through aggregation.
        concat_name: Whether to concatenate original column name and
            aggregation function name in the case of MultiIndex columns.
        concat_separator: Which string to place between original column name
            and aggregation function name if concat_name is True.
        name_level: Which element of a column tuple to use in the case of 
            MultiIndex columns and concat_name == False. Should be 0 for 
            original column name and 1 for aggregation function name.
        inplace: Whether to modify the aggregated DataFrame directly 
            (or return a copy).
    """
    if not inplace:
        gdf = gdf.copy()
    if type(gdf.columns) == pd.core.index.MultiIndex:
        if concat_name:
            columns = [concat_separator.join(col) for col in gdf.columns]
        else:
            columns = [col[name_level % 2] for col in gdf.columns]
        gdf.columns = columns
    return gdf.reset_index()

Using this method is then as simple as

gdf = df.groupby(['species', 'location']).agg(['min', 'max'])
gdf = flatten_aggregated_dataframe(gdf)

and gdf looks exactly like the ones created manually above. You can play around with the parameters for different naming choices and effects.

Conclusion

We started with the basics of pandas dataframes, and have gone through a tour of grouping and aggregating in multiple ways. We saw that you can group rows in a dataframe by one or more categorical columns, and that you have a tremendous amount of flexibility in how to compute statistics on these groups: from iterating through groups manually to using custom aggregation functions. I hope that you’ll find some of these snippets helpful in your own data analysis projects.

As always, if you have a data analysis or aggregation task that you’re having trouble with, you may wish to know that we specialize in all kinds of data gathering and processing: don’t hesitate to get in touch.