Consider the following table of the yields of three compounds, A, B and C, attained in a synthesis experiment by three students, Anu, Jenny and Tom.
In [x]: data = [['Anu', 'A', 5.4], ['Anu', 'B', 6.7], ['Anu', 'C', 10.1],
...: ['Jenny', 'A', 6.5], ['Jenny', 'B', 5.9], ['Jenny', 'C', 12.2],
...: ['Tom', 'A', 4.0], ['Tom', 'B', None], ['Tom', 'C', 9.5]
...: ]
In [x]: df = pd.DataFrame(data, columns=['Student', 'Compound', 'Yield /g'])
In [x]: print(df)
Out[x]:
Student Compound Yield /g
0 Anu A 5.4
1 Anu B 6.7
2 Anu C 10.1
3 Jenny A 6.5
4 Jenny B 5.9
5 Jenny C 12.2
6 Tom A 4.0
7 Tom B NaN
8 Tom C 9.5
One way of analyzing these data is to group them by compound ("split" into separate data structures, each with a common value of 'Compound'
) and then apply some operation (say, finding the mean) to each group, before recombining into a single DataFrame
.
In [x]: grouped = df.groupby('Compound')
In [x]: grouped.mean()
Out[x]:
Yield /g
Compound
A 5.3
B 6.3
C 10.6
Here, the 'Student'
column has been ignored as a so-called "nuisance" column: there is no helpful way to take the mean of a string. The max()
and min()
functions, however, consider the strings' lexigraphical ordering:
In [x]: grouped.max()
Out[x]:
Student Yield /g
Compound
A Tom 6.5
B Tom 6.7
C Tom 12.2
Note that max()
has returned 'Tom'
for every row, since this name is lexigraphically last ("greatest") in the 'Student'
column. The column 'Yield /g'
consists of the maximum yields for each compound, across all students.
To apply the function to a subset of the columns only (which may be necessary for a very large DataFrame
), select them before the function call, for example:
In [x]: grouped['Yield /g'].min()
Out[x]:
Compound
A 4.0
B 5.9
C 9.5
Name: Yield /g, dtype: float64
The object returned by groupby()
can be iterated over:
In [x]: for compound, group in grouped:
...: print('Compound:', compound)
...: print(group)
...:
Compound: A
Student Compound Yield /g
0 Anu A 5.4
3 Jenny A 6.5
6 Tom A 4.0
Compound: B
Student Compound Yield /g
1 Anu B 6.7
4 Jenny B 5.9
7 Tom B NaN
Compound: C
Student Compound Yield /g
2 Anu C 10.1
5 Jenny C 12.2
8 Tom C 9.5
We can also group by the 'Student'
column:
In [x]: grouped = df.groupby('Student')
In [x]: grouped.mean()
Out[x]:
Yield /g
Student
Anu 7.40
Jenny 8.20
Tom 6.75
Another powerful feature is the ability to group on the basis of a specified mapping, provided, for example, by a dictionary. Suppose each student is undertaking a different degree programme:
In [x]: degree_programmes = {'Anu': 'Chemistry',
'Jenny': 'Chemistry',
'Tom': 'Pharmacology'}
First, turn the 'Student'
column into an Index
and then group, not by the Index
itself but using the provided mapping:
In [x]: df.set_index('Student', inplace=True)
In [x]: df.groupby(degree_programmes).mean()
Out[x]:
Yield /g
Chemistry 7.80
Pharmacology 6.75