This section features three videos on least squares. In the first video, we explore least squares for matrix problems and see that it boils down to the problem of projections that we saw in the last section. In the second video, we tackle the most important application of least squares, namely linear regression. Setting up the equations reduces this to a matrix least-squares problem. In the third video, we show how to do multiple regression and other sorts of curve fitting, such as fitting higher-order polynomials, exponentials, or power laws to data.
There is an exact solution to the matrix problem $$A {\bf x} = {\bf b}$$ if and only if ${\bf b}$ is in the column space of $A$. If ${\bf b}$ isn't in the column space, we can still ask for the value of ${\bf x}$ that brings $A {\bf x}$ as close as possible to ${\bf b}$. This value of ${\bf x}$ is called a least-squares solution to $A{\bf x} = {\bf b}$.

Key Theorem: Every least-squares solution to $A {\bf x} = {\bf b}$ is an exact solution to $$A^T A {\bf x} = A^T {\bf b}.$$ Likewise, every exact solution to $A^T A {\bf x} = A^T {\bf b}$ is a least-squares solution to $A{\bf x} = {\bf b}$.
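To see the theorem in action, here is a minimal NumPy sketch (not part of the videos) that forms the normal equations $A^T A {\bf x} = A^T {\bf b}$ and solves them; the particular $A$ and ${\bf b}$ are made-up examples with ${\bf b}$ chosen outside the column space of $A$.

```python
import numpy as np

# Made-up example: b is not in the column space of A, so Ax = b has no exact solution.
A = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
b = np.array([6.0, 0.0, 0.0])

# Normal equations: A^T A x = A^T b
x = np.linalg.solve(A.T @ A, A.T @ b)

# Cross-check with NumPy's built-in least-squares solver
x_check, *_ = np.linalg.lstsq(A, b, rcond=None)
print(x, x_check)   # both give the same least-squares solution
```

The residual $A{\bf x} - {\bf b}$ of the solution found this way is orthogonal to the columns of $A$, which is exactly the projection picture from the last section.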
Given a bunch of data points $(x_1,y_1), \ldots, (x_N,y_N)$, we want to find a "best fit" line $y = c_0 + c_1 x$. This is a least-squares solution to the equations \begin{eqnarray*} c_0 + c_1 x_1 &=& y_1 \cr c_0 + c_1 x_2 &=& y_2 \cr &\vdots & \cr c_0 + c_1 x_N &=& y_N \end{eqnarray*} In other words, $$A = \begin{pmatrix} 1 & x_1 \cr \vdots & \vdots \cr 1 & x_N \end{pmatrix},$$ so $$A^T A = \begin{pmatrix} N & \sum x_i \cr \sum x_i & \sum x_i^2 \end{pmatrix}; \qquad A^T {\bf y} = \begin{pmatrix} \sum y_i \cr \sum x_iy_i \end{pmatrix}.$$ The solution to the $2 \times 2$ system of equations $A^T A {\bf c} = A^T {\bf y}$ can be written in closed form: \begin{eqnarray*} c_0 &=& \frac{(\sum x_i^2)(\sum y_i) - (\sum x_i)(\sum x_iy_i)}{N \sum{x_i^2} - (\sum x_i)^2} = \frac{\hbox{Avg}(x^2)\hbox{Avg}(y) - \hbox{Avg}(x)\hbox{Avg}(xy)}{\hbox{Avg}(x^2)- (\hbox{Avg}(x))^2}\cr c_1 &=& \frac{N(\sum x_iy_i) - (\sum x_i)(\sum y_i)}{N \sum x_i^2 - (\sum x_i)^2} = \frac{\hbox{Avg}(xy)-\hbox{Avg}(x)\hbox{Avg}(y)}{\hbox{Avg}(x^2) - (\hbox{Avg}(x))^2}, \end{eqnarray*} where "Avg" means the average value over the sample of $N$ points. It's often easier to think in terms of averages than sums. [Note: In probability and statistics, the average of a quantity $x$ is often denoted $E(x)$ or $\langle x \rangle$ or $\bar x$, but we're already using bars for complex conjugates and angle brackets for inner products, so we'll stick with "Avg".]
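To make the averaged formulas concrete, here is a minimal Python/NumPy sketch that computes $c_0$ and $c_1$ directly from sample averages; the data arrays are made up for illustration and are not from the videos.

```python
import numpy as np

# Hypothetical data points (x_i, y_i); any real data set would do.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 7.8])

# Common denominator: Avg(x^2) - (Avg(x))^2
denom = np.mean(x**2) - np.mean(x)**2

c1 = (np.mean(x*y) - np.mean(x)*np.mean(y)) / denom
c0 = (np.mean(x**2)*np.mean(y) - np.mean(x)*np.mean(x*y)) / denom

# Equivalent shortcut: c0 = Avg(y) - c1*Avg(x), since the best-fit line
# always passes through the point (Avg(x), Avg(y)).
print(c0, c1)
```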
The quantity $\hbox{Var}(x)=\hbox{Avg}(x^2) - (\hbox{Avg}(x))^2$ is called the variance of $x$ and comes up a lot in probability and statistics. The quantity $\hbox{Cov}(x,y)= \hbox{Avg}(xy)-\hbox{Avg}(x)\hbox{Avg}(y)$ is called the covariance of $x$ and $y$. The dimensionless quantity $$r^2 = \frac{(\hbox{Cov}(x,y))^2}{\hbox{Var}(x) \hbox{Var}(y)}$$ measures how good a fit our best line is. It gives the fraction of the variation in $y$ that is "explained" by $x$. If you hear about correlations with $r$ values of $0.2$ or $0.3$ (or $-0.2$ or $-0.3$), they don't mean much: $r^2$ is only $0.04$ to $0.09$, so $x$ explains less than $10\%$ of the variation in $y$. Correlations with $r = 0.7$ or $0.8$ (or $-0.7$ or $-0.8$) are much more meaningful, since then $x$ explains roughly half to two thirds of the variation.
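As a sanity check, here is a short sketch (again with made-up, nearly linear data) that computes $\hbox{Var}$, $\hbox{Cov}$, and $r^2$ straight from the definitions above.

```python
import numpy as np

# Hypothetical, nearly linear data.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.2, 1.9, 3.2, 3.8, 5.1])

# Var and Cov as defined above, using sample averages.
var_x  = np.mean(x**2) - np.mean(x)**2
var_y  = np.mean(y**2) - np.mean(y)**2
cov_xy = np.mean(x*y) - np.mean(x)*np.mean(y)

r_squared = cov_xy**2 / (var_x * var_y)
print(r_squared)   # close to 1 here, since the data is nearly linear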