In ordinary least squares fitting, a model function which depends linearly on a set of \(p\) parameters, \(\beta_1, \beta_2, \ldots, \beta_p\), is fit to \(n\) observations of the dependent variable, \(y_1, y_2, \ldots, y_n\). The parameters are the coefficients of the independent variables, \(x_{i1}, x_{i2}, \ldots, x_{ip}\), so that the model prediction, \(\hat{y}_i\), for each observation \(i=1,2,\ldots,n\) is:
$$ \hat{y}_i = \sum_j \beta_j x_{ij}. $$
For example, for the model function \(f(x) = a + b x + c \ln x\) there are \(p=3\) parameters: \(\beta_1 \equiv a\), \(\beta_2 \equiv b\), \(\beta_3 \equiv c\), and the independent variables for the \(i\)th observation may be represented as \(x_{i1} = 1\), \(x_{i2} = x_i\) and \(x_{i3} = \ln x_i\).
The least squares approach seeks to obtain the values of the \(\beta_j\) (\(j=1,2,\ldots, p\)) which minimise the sum of the squared residuals (across the \(i = 1,2,\ldots, n\) observations):
$$ \chi^2 = \sum_i \left[ y_i - \sum_j \beta_j x_{ij} \right]^2. $$
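To make the indexing concrete, here is a minimal sketch (assuming NumPy, with made-up numbers for a straight-line model and an arbitrary trial \(\boldsymbol{\beta}\)) that evaluates \(\chi^2\) exactly as the double sum above:

```python
import numpy as np

# Made-up example: n = 4 observations of a straight-line model, p = 2 parameters.
X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])          # columns: x_{i1} = 1, x_{i2} = x_i
y = np.array([0.1, 1.9, 4.2, 5.8])  # observed y_i
beta = np.array([0.0, 2.0])         # trial parameter values (not yet the best fit)

# chi^2 as the explicit double sum over observations i and parameters j.
chi2 = sum((y[i] - sum(beta[j] * X[i, j] for j in range(2)))**2 for i in range(4))
print(chi2)   # ≈ 0.1
```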
This is most compactly written in matrix form, with the \(\chi^2\) quantity expressed as a cost function to be minimised:
$$ \boldsymbol{J}(\boldsymbol{\beta}) = \lVert\boldsymbol{y} - \boldsymbol{X}\boldsymbol{\beta}\rVert^2, $$
where
$$ \boldsymbol{y} = \begin{pmatrix} y_1 \\ y_2 \\ y_3 \\ \vdots \\ y_n \end{pmatrix}, \quad \boldsymbol{\beta} = \begin{pmatrix} \beta_1 \\ \beta_2 \\ \vdots \\ \beta_p \end{pmatrix} $$
and the so-called design matrix is
$$ \boldsymbol{X} = \begin{pmatrix} x_{11} & x_{12} & \ldots & x_{1p} \\ x_{21} & x_{22} & \ldots & x_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n1} & x_{n2} & \ldots & x_{np} \\ \end{pmatrix}. $$
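For the example model \(f(x) = a + b x + c \ln x\) above, the design matrix might be assembled as follows (a minimal sketch assuming NumPy, with made-up sample values of \(x\)):

```python
import numpy as np

# Hypothetical sample of the independent variable (must be positive for ln x).
x = np.array([0.5, 1.0, 2.0, 3.5, 5.0])

# Design matrix for f(x) = a + b*x + c*ln(x): one row per observation, with
# columns x_{i1} = 1, x_{i2} = x_i and x_{i3} = ln(x_i).
X = np.column_stack((np.ones_like(x), x, np.log(x)))
print(X.shape)   # (n, p) = (5, 3)
```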
We have
\begin{align*} \boldsymbol{J}(\boldsymbol{\beta}) &= \lVert\boldsymbol{y} - \boldsymbol{X}\boldsymbol{\beta}\rVert^2 = (\boldsymbol{y} - \boldsymbol{X}\boldsymbol{\beta})^T(\boldsymbol{y} - \boldsymbol{X}\boldsymbol{\beta}) \\ &= \left( \boldsymbol{y}^T - \left(\boldsymbol{X}\boldsymbol{\beta}\right)^T \right)(\boldsymbol{y} - \boldsymbol{X}\boldsymbol{\beta}) \\ &= \boldsymbol{y}^T\boldsymbol{y} - \left( \boldsymbol{X}\boldsymbol{\beta} \right)^T\boldsymbol{y} - \boldsymbol{y}^T\left( \boldsymbol{X}\boldsymbol{\beta} \right) + \left( \boldsymbol{X}\boldsymbol{\beta} \right)^T\left( \boldsymbol{X}\boldsymbol{\beta} \right) \\ &= \boldsymbol{y}^T\boldsymbol{y} - 2\left( \boldsymbol{X}\boldsymbol{\beta} \right)^T \boldsymbol{y} + \boldsymbol{\beta}^T \boldsymbol{X}^T \boldsymbol{X} \boldsymbol{\beta}. \end{align*}
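A quick numerical check of this expansion (a sketch with arbitrary, randomly generated \(\boldsymbol{X}\), \(\boldsymbol{y}\) and \(\boldsymbol{\beta}\), assuming NumPy):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 6, 3
X = rng.normal(size=(n, p))     # arbitrary design matrix
y = rng.normal(size=n)          # arbitrary observations
beta = rng.normal(size=p)       # arbitrary parameter vector

# J(beta) as the squared norm of the residual vector...
J_norm = np.linalg.norm(y - X @ beta)**2
# ...and as the expanded form y^T.y - 2(X.beta)^T.y + beta^T.X^T.X.beta.
J_expanded = y @ y - 2 * (X @ beta) @ y + beta @ X.T @ X @ beta

print(np.isclose(J_norm, J_expanded))   # True
```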
Now, at the minimum
\begin{align*} & \frac{\partial \boldsymbol{J}}{\partial \boldsymbol{\beta}} = -2\boldsymbol{X}^T\boldsymbol{y} + 2\boldsymbol{X}^T\boldsymbol{X}\boldsymbol{\beta} = 0 \\ & \Rightarrow \quad \boldsymbol{\beta} - \left( \boldsymbol{X}^T\boldsymbol{X}\right)^{-1}\boldsymbol{X}^T\boldsymbol{y} = 0 \\ & \Rightarrow \quad \boldsymbol{\beta} = \left( \boldsymbol{X}^T\boldsymbol{X}\right)^{-1}\boldsymbol{X}^T\boldsymbol{y}. \end{align*}
For the differentiation, note that:
$$ \frac{\partial}{\partial \boldsymbol{\beta}} \left[ \left( \boldsymbol{X}\boldsymbol{\beta} \right)^T \boldsymbol{y} \right] = \frac{\partial}{\partial \boldsymbol{\beta}} \left[ \boldsymbol{\beta}^T\boldsymbol{X}^T \boldsymbol{y} \right] = \frac{\partial}{\partial \boldsymbol{\beta}} \left[ \left( \boldsymbol{X}^T\boldsymbol{y} \right)^T \boldsymbol{\beta} \right] = \frac{\partial}{\partial \boldsymbol{\beta}} \left[ \left( \boldsymbol{y}^T\boldsymbol{X} \right) \boldsymbol{\beta} \right] = \left( \boldsymbol{y}^T\boldsymbol{X} \right)^T = \boldsymbol{X}^T\boldsymbol{y} $$ in the denominator-layout convention, and:
$$ \frac{\partial}{\partial \boldsymbol{\beta}} \left[ \boldsymbol{\beta}^T\boldsymbol{X}^T\boldsymbol{X}\boldsymbol{\beta} \right] = 2\boldsymbol{X}^T\boldsymbol{X}\boldsymbol{\beta} $$ since \( \boldsymbol{X}^T\boldsymbol{X} \) is symmetric. See Wikipedia's page on matrix calculus for more useful identities.
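As a sanity check on the gradient \(\partial\boldsymbol{J}/\partial\boldsymbol{\beta} = -2\boldsymbol{X}^T\boldsymbol{y} + 2\boldsymbol{X}^T\boldsymbol{X}\boldsymbol{\beta}\), it can be compared against a finite-difference approximation (a sketch with arbitrary random data, assuming NumPy):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 8, 3
X = rng.normal(size=(n, p))
y = rng.normal(size=n)
beta = rng.normal(size=p)

def J(b):
    """Cost function J(beta) = ||y - X.beta||^2."""
    return np.linalg.norm(y - X @ b)**2

# Analytical gradient: dJ/dbeta = -2 X^T.y + 2 X^T.X.beta.
grad_analytic = -2 * X.T @ y + 2 * X.T @ X @ beta

# Central finite-difference approximation, one component at a time.
h = 1e-6
grad_numeric = np.array([(J(beta + h * e) - J(beta - h * e)) / (2 * h)
                         for e in np.eye(p)])

print(np.allclose(grad_analytic, grad_numeric))   # True
```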
Putting this together, the best-fit parameters of the model function are given (provided \(\boldsymbol{X}^T\boldsymbol{X}\) is invertible) by the vector \(\hat{\boldsymbol{\beta}}\) in the matrix form of the "normal equations":
$$ \hat{\boldsymbol{\beta}} = \left( \boldsymbol{X}^T\boldsymbol{X}\right)^{-1}\boldsymbol{X}^T\boldsymbol{y}. $$
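As a concrete illustration, the normal equations can be applied to synthetic data generated from the example model \(f(x) = a + b x + c \ln x\) (a minimal sketch assuming NumPy; the "true" parameter values and noise level are made up):

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic data from f(x) = a + b*x + c*ln(x) with a little Gaussian noise;
# the "true" parameters are made up for illustration.
a_true, b_true, c_true = 2.0, -0.5, 1.5
x = np.linspace(0.5, 10, 50)
y = a_true + b_true * x + c_true * np.log(x) + rng.normal(scale=0.1, size=x.size)

# Design matrix with columns 1, x, ln(x).
X = np.column_stack((np.ones_like(x), x, np.log(x)))

# Normal equations: beta_hat = (X^T.X)^{-1}.X^T.y.
beta_hat = np.linalg.inv(X.T @ X) @ X.T @ y
print(beta_hat)   # should be close to (2.0, -0.5, 1.5)

# In practice, np.linalg.lstsq (or solving X^T.X.beta = X^T.y with np.linalg.solve)
# is preferred to forming the inverse explicitly, for numerical stability.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_lstsq)
```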