Split–Apply–Combine

Consider the following table of the yields of three compounds, A, B and C, attained in a synthesis experiment by three students, Anu, Jenny and Tom.

In [x]: data = [['Anu', 'A', 5.4], ['Anu', 'B', 6.7], ['Anu', 'C', 10.1], 
   ...:         ['Jenny', 'A', 6.5], ['Jenny', 'B', 5.9], ['Jenny', 'C', 12.2],
   ...:         ['Tom', 'A', 4.0], ['Tom', 'B', None], ['Tom', 'C', 9.5] 
   ...:        ]

In [x]: df = pd.DataFrame(data, columns=['Student', 'Compound', 'Yield /g'])

In [x]: print(df)
Out[x]:
  Student Compound  Yield /g
0     Anu        A       5.4
1     Anu        B       6.7
2     Anu        C      10.1
3   Jenny        A       6.5
4   Jenny        B       5.9
5   Jenny        C      12.2
6     Tom        A       4.0
7     Tom        B       NaN
8     Tom        C       9.5

One way of analyzing these data is to group them by compound ("split" into separate data structures, each with a common value of 'Compound') and then apply some operation (say, finding the mean) to each group, before recombining into a single DataFrame.

In [x]: grouped = df.groupby('Compound')

In [x]: grouped.mean()
Out[x]:
          Yield /g
Compound
A              5.3
B              6.3
C             10.6

Here, the 'Student' column has been ignored as a so-called "nuisance" column: there is no helpful way to take the mean of a string. The max() and min() functions, however, consider the strings' lexigraphical ordering:

In [x]: grouped.max()
Out[x]:
         Student  Yield /g
Compound
A            Tom       6.5
B            Tom       6.7
C            Tom      12.2

Note that max() has returned 'Tom' for every row, since this name is lexigraphically last ("greatest") in the 'Student' column. The column 'Yield /g' consists of the maximum yields for each compound, across all students. To apply the function to a subset of the columns only (which may be necessary for a very large DataFrame), select them before the function call, for example:

In [x]: grouped['Yield /g'].min()
Out[x]: 
Compound
A    4.0
B    5.9
C    9.5
Name: Yield /g, dtype: float64

The object returned by groupby() can be iterated over:

In [x]: for compound, group in grouped:
   ...:     print('Compound:', compound)
   ...:     print(group)
   ...:
Compound: A
  Student Compound  Yield /g
0     Anu        A       5.4
3   Jenny        A       6.5
6     Tom        A       4.0
Compound: B
  Student Compound  Yield /g
1     Anu        B       6.7
4   Jenny        B       5.9
7     Tom        B       NaN
Compound: C
  Student Compound  Yield /g
2     Anu        C      10.1
5   Jenny        C      12.2
8     Tom        C       9.5

We can also group by the 'Student' column:

In [x]: grouped = df.groupby('Student')

In [x]: grouped.mean()
Out[x]:
         Yield /g
Student
Anu          7.40
Jenny        8.20
Tom          6.75

Another powerful feature is the ability to group on the basis of a specified mapping, provided, for example, by a dictionary. Suppose each student is undertaking a different degree programme:

In [x]: degree_programmes = {'Anu': 'Chemistry',
                             'Jenny': 'Chemistry',
                             'Tom': 'Pharmacology'}

First, turn the 'Student' column into an Index and then group, not by the Index itself but using the provided mapping:

In [x]: df.set_index('Student', inplace=True)
In [x]: df.groupby(degree_programmes).mean()
Out[x]:
              Yield /g
Chemistry         7.80
Pharmacology      6.75