A simple class for linear regression

Question P4.6.4

Design and implement a class, Experiment, to read in and store a simple series of $(x,y)$ data as pylab (i.e. NumPy) arrays from a text file. Include in your class methods for transforming the data series by some simple function (for example, $x' = \ln x$, $y' = 1/y$) and to perform a linear least-squares regression on the transformed data (returning the gradient and intercept of the best-fit line, $y_\mathrm{fit}' = mx'+c$). NumPy provides methods for performing linear regression (see Section 6.5.3 of the book), but for this exercise the following equations can be implemented directly: \begin{align*} m &= \frac{\overline{xy} - \bar{x}\bar{y}}{\overline{x^2} - \bar{x}^2}\\ c &= \bar{y} - m\bar{x} \end{align*} where the bar notation, $\bar{\cdot}$, denotes the arithmetic mean of the quantity under it. (Hint: use pylab.mean(arr) to return the mean of array arr).

Chloroacetic acid is an important compound in the synthetic production of phamaceuticals, pesticides and fuels. At high concentration under strong alkaline conditions its hydrolysis may be considered as the following reaction: $$ \mathrm{ClCH_2COO^-} + \mathrm{OH^-} \rightleftharpoons \mathrm{HOCH_2COO^-} + \mathrm{Cl^-}. $$ Data giving the concentration of $\mathrm{ClCH_2COO^-}$, $c$ (in M), as a function of time, $t$ (in s), are provided for this reaction carried out in excess alkalai at five different temperatures in the data files caa-T.txt (T = 40, 50, 60, 70, 80 in $\mathrm{^\circ C}$): these may be obtained from this zip archive. The reaction is known to be second order and so obeys the integrated rate law $$ \frac{1}{c} = \frac{1}{c_0} + kt $$ where $k$ is the effective rate constant and $c_0$ the initial ($t=0$) concentration of chloroacetic acid.

Use your Experiment class to interpret these data by linear regression of $1/c$ against $t$, determining $m(\equiv k)$ for each temperature. Then, for each value of $k$ determine the activation energy of the reaction through a second linear regression of $\ln k$ against $1/T$ in accordance with the Arrhenius law: $$ k = Ae^{-E_\mathrm{a}/RT} \quad \Rightarrow \quad \ln k = \ln A - \frac{E_\mathrm{a}}{RT}, $$ where $R = 8.314\;\mathrm{J\,K^{-1}\,mol^{-1}}$ is the gas constant.

Solution P4.6.4

The Experiment class is defined in the following program, experiment.py:

import sys
import pylab

class Experiment:
    def __init__(self, description):
        self.description = description
        # The data
        self.x, self.y = pylab.array([]), pylab.array([])
        # The coefficients of the best-fit straight line, y = mx + c
        self.m, self.c = None, None

    def load_data(self, filename):
        """ Load x, y data from two-column text file with header, filename. """
        try:
            with open(filename, 'r') as fi:
                fi.readline()       # Skip header
                x, y = [], []
                for line in fi.readlines():
                    fields = line.split()
                    try:
                        x.append(float(fields[0]))
                        y.append(float(fields[1]))
                    except (IndexError, ValueError):
                        # e.g. blank line, or field does not evaluate to float
                        print('Corrupt or missing data line:', line)
                        sys.exit(1)
        except FileNotFoundError:
            print('File not found:', filename)
            sys.exit(1)

        # Store the data as arrays to allow vectorization
        self.x = pylab.array(x)
        self.y = pylab.array(y)

    def set_data(self, x, y):
        self.x = pylab.array(x)
        self.y = pylab.array(y)

    def best_fit_line(self, funcx=lambda x: x, funcy=lambda y: y):
        """
        Determine the equation of the line of best fit (which minimises the
        root mean squared residual) and return it as m and c where y = mx + c.
        If x or y are to be transformed in some way to create a straight line
        plot, functions to effect these transformations are provided as funcx
        and funcy.

        """

        # Apply the transformation functions
        self.tx, self.ty = funcx(self.x), funcy(self.y)

        # Quantities needed for the least squares regression
        txbar, tybar = pylab.mean(self.tx), pylab.mean(self.ty)
        tx2bar, txtybar = pylab.mean(self.tx**2), pylab.mean(self.tx *self.ty)

        # gradient and intercept of best fit line
        self.m = (txtybar - txbar*tybar) / (tx2bar - txbar**2)
        self.c = tybar - self.m * txbar

        # Also calculate the coefficient of determination, r2
        ty2bar = pylab.mean(self.ty**2)
        r2 = (txtybar - txbar*tybar)**2/(tx2bar - txbar**2)/(ty2bar - tybar**2)

        return self.m, self.c, r2

    def plot_data(self, best_fit=False, fig_filename=None):
        """
        Plot the data on a scatter plot, along with the best-fit straight
        line, if requested. Save the figure to fig_filename if not None,
        otherwise show the plot interactively.

        """

        if best_fit:
            # Plot the transformed data and best fit line
            if self.c is None or self.m is None:
                self.best_fit_line()
            pylab.scatter(self.tx, self.ty)
            xfit = pylab.array([min(self.tx), max(self.tx)])
            yfit = self.m * xfit + self.c
            pylab.plot(xfit, yfit, 'k', lw=2)
        else:
            # Just plot the untransformed data
            pylab.scatter(self.x, self.y)

        if fig_filename:
            pylab.savefig(fig_filename)
        else:
            pylab.show()

To test it with the provided data, run experiment_test.py:

import pylab
from experiment import Experiment

R = 8.314   # Gas constant, J.K-1.mol-1

def description(T):
    return 'Rate of Chloroacetic acid hydrolysis at {} degC'.format(T)

temperatures = (40, 50, 60, 70, 80)
expts = []
for T in temperatures:
    filename = 'caa-{}.txt'.format(T)
    expt = Experiment(description(T))
    expt.load_data(filename)
    expt.best_fit_line(funcy=lambda c: 1/c)    
    expts.append(expt)
    print('k = {:9.3e} M-1.s-1'.format(expt.m))
#expts[4].plot_data(True)

TK = pylab.array(temperatures) + 273.15
k = pylab.array([expt.m for expt in expts])
arrhenius = Experiment('Arrhenius plot for Chloroacetic acid hydrolysis')
arrhenius.set_data(TK, k)
print(arrhenius.x)
print(arrhenius.y)
m, c, r2 = arrhenius.best_fit_line(funcx=lambda T: 1/T,
                                   funcy=lambda k: pylab.log(k))
A, Ea = pylab.exp(c), -m * R
print('A = {:10.3e} M-1.s-1, Ea = {:5.1f} kJ.mol-1'.format(A, Ea/1000))
arrhenius.plot_data(True)