Regression is a widely-used method in many fields like economics, engineering, biology, and, of course, data science. Therefore, countless regression books and tutorials have been written. So why read this one?
Well, sometimes a fresh take on an old subject can make all the difference for anyone who's spent hours, days, or even months searching for an elusive source that strikes the balance between theory and application. This post also aims to highlight the fundamental elements of data science and promote best practices like statistically sound analytics, a deeper understanding of the underlying mathematical and computational considerations, and engaging use of dynamic data visualization. Code snippets of matrix-based regression implementations written in Java, Python, R, Scala, and MATLAB are provided. Finally, we can think of regression as a gateway method for understanding many of the other methods and algorithms in machine learning and data science.
I'll be covering a lot of ground in this blog tutorial, but will only scratch the surface of a vast subject. Detailed explanations and definitions of terms that are commonly thrown around (but not always understood) will be provided. After a brief introduction to regression, the major sections in this post are:
- Relevant Concepts from Probability and Statistics
- Back to Regression
- Introductory Matrix Concepts and Notation
- Computational Considerations and Implementation
- Code
- References
At the end, I'll provide a description of references used to help with the search for additional information. Supporting code for this post can be found here.
So what is regression? It is a term, typically used in an encompassing way, to describe a relation between a set of interest containing a dependent (endogenous or target or response or predicted or explained or regressand) variable and another set of interest containing an arbitrary number of independent (exogenous or predictor or explanatory or control or regressor) variables. A set is a collection of objects and is typically denoted with brackets like \(\{*\}\). A relation means that these sets are connected in some way. A regression analysis is primarily used to make predictions about the dependent variable or test hypotheses about a variable or a group of variables in the set of independent variables.
The term "regression" originated from 19th century English polymath Sir Francis Galton who, while studying the relationship between the heights of parents and children, found that very short parents (father and mother averaged) tend to have taller children and very tall parents tend to have shorter children[2]. Galton called this finding "regression towards mediocrity."
Let's make things more concrete. Many people associate regression with a process for drawing a line through a collection of data points in a way that best "fits" the observed points. An interactive graph like the one below might tell the story better. Click anywhere in the graph area to create data points, press the Draw Regression Line button, and then press the Draw Residuals button to see the residuals associated with the given regression line.
The points that we click into the graph can be thought of as observations, and each observation is a coordinate pair. So, with \(n\) denoting the number of observations, we can write the observations as \((x_1,y_1),\ldots,(x_n,y_n).\) By clicking the Draw Regression Line button, we draw a line through our observation points that makes the sum of the squared lengths of all the vertical red lines in the graph smallest. The vertical lines are called residuals. We have just described the least squares method. More formally, given a line
\( y = a + bx \),
the least squares method minimizes the residual sum of squares (RSS) or
\(\text{min RSS} \equiv \min\limits_{a, b}\sum_{i=1}^{n} (y_i - (a + bx_i ))^2.\)
Said another way, the least squares method finds the values of \(a\) and \(b\) that minimize the RSS. The \(\equiv\) symbol is just a way of saying "defined as." This is a purely mathematical optimization problem that can be solved with calculus, algebra, or geometry. I'll hold off on showing the derivations for finding \(a\) and \(b\) until later but will now provide the equations and definitions that can be used to calculate \(a\) and \(b\) by hand. Following the descriptions and notation in [1],
\( \bar{x} = \frac{1}{n}\sum_{i=1}^{n}x_i \) and \(\bar{y} = \frac{1}{n}\sum_{i=1}^{n}y_i\)
are the sample means.
\( S_{xx} = \sum_{i=1}^{n}(x_i - \bar{x})^2 \) and \(S_{yy} = \sum_{i=1}^{n}(y_i - \bar{y})^2\)
are the sums of squares.
\( S_{xy} = \sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})\)
is the sum of cross-products.
Putting all of those definitions together, we can define \(a\) and \(b\) as
\( b \equiv \frac{S_{xy}}{S_{xx}}\) and \(a \equiv \bar{y} - b\bar{x}.\)
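To make these formulas concrete, here is a minimal sketch in Python (NumPy assumed; the data arrays are hypothetical stand-ins for points clicked into the graph):

import numpy as np

# Hypothetical observations standing in for clicked points
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

x_bar, y_bar = x.mean(), y.mean()         # sample means
S_xx = np.sum((x - x_bar) ** 2)           # sum of squares
S_xy = np.sum((x - x_bar) * (y - y_bar))  # sum of cross-products

b = S_xy / S_xx        # slope
a = y_bar - b * x_bar  # intercept
print(f"a = {a:.4f}, b = {b:.4f}")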
Relevant Concepts from Probability and Statistics
The quantities just defined provide a natural transition from the subject of math to statistics. I'll use this section to introduce a few relevant concepts from probability and statistics to ensure that we're all on the same page. Readers should keep in mind that this is after all a blog post so some details are left out for the sake of brevity. Please see the listed references for more detailed explanations. With that, let's start with the objective of the field of statistics[8]:
The objective of statistics is to make an inference about a population based on information contained in a sample and to provide an associated measure of goodness for the inference.
So a population can be considered a universe of some kind of data and a sample is a subset of that universe. A sample space (often denoted \(S\)) is the set of all possible outcomes of a random experiment, which can be considered a trial that can be repeated many times under identical conditions and whose possible outcomes are all known. Experiments (trials) are considered random if the outcome of a particular trial is not known in advance and there is some underlying random process governing those outcomes.
A random variable is a function that associates a real number with each element (outcome) in \(S\). Random variables are typically denoted with capital letters like \(X\), while a specific instance of a random variable is typically denoted with lowercase letters like \(x\). For example, a random variable \(X\) can take any value in the set \(\Xi = \{5, 10, 15, 20\}\). So a particular instance of \(X\) would be written \(x = 5\), for example. We often represent random variables as a sequence, which is a collection of objects indexed (enumerated) by the set of natural numbers. For example, four random variables with possible outcomes from the set \(\Xi\) can be written as the sequence \(\{X_1, X_2, X_3, X_4\}.\)
A probability function (a probability mass function (pmf) for discrete random variables or a probability density function (pdf) for continuous random variables) provides a way to capture the probabilities that the different values of a random variable can have. A cumulative distribution function (cdf) is a function used to specify the probability of a random variable being less than or equal to the real number given as the input of the cdf. The formal definition of a cdf makes things clearer, I think, and is
\(F_X (x) = P_X (X \leq x), \text{ for all }x.\)
This formalism should hopefully make more sense given the earlier definition and descriptions of random variables. In the discrete case, the cdf is a step function and the size of a "jump" is equal to \(P(X = x)\). This step condition allows us to calculate the probability of a specific value of \(x\) and leads to the formal definition of a probability mass function:
\(f_X (x) = P_X (X = x), \text{ for all }x.\)
Because pmfs are discrete, we can sum them over a given interval to calculate the cdf. In the continuous case, we can think of a cdf, \(F_X (x)\), as the area under the pdf, \(f_X (x)\), over some given interval. The formal definition of a pdf makes more sense if we note that it is typically given in terms of an expanded definition of a cdf, and if we remember some important notions from calculus like the Fundamental Theorem of Calculus and the connection between summation and integration. With all of that said, the expanded cdf definition that includes the pdf as the integrand (the function being integrated) is
\(P_X (X \leq x) = F_{X}(x) =\int\limits_{-\infty}^{x}f_X (t)dt\text{ for all }x.\)
With the notion of a probability distribution now introduced, we next consider the concept of expectation. It is not unusual to hear this concept quickly and simply described as an average or a mean even though the formulas that we see associated with expectation often appear to be much more complicated than those familiar concepts. Expectation provides a useful way to summarize information about the distribution of random variables. From our earlier example about a random variable \(X\) with a set of possible values \({ \Xi}\), the expected value of \(X\) or \(E(X)\) is a weighted average of the elements in \( \Xi\). The weights are given by the probability functions that we just introduced. More formally we write the expected value of \(X\) as
\(E(X) = \left\{\begin{array}{ll}\sum_{x}xf(x) & \quad \text{discrete case} \\ \int_{-\infty}^{\infty} xf(x)dx & \quad \text{continuous case.} \end{array}\right.\)
So for example, if we assume that each element in \( \Xi \) has equal probability of occurring, we can then assert that each element occurs with probability 0.25 since there are four elements in the set. Given that this example is the discrete case, we have
\(E(X) = x_1 f(x_1) + x_2 f(x_2)+x_3 f(x_3) + x_4 f(x_4) = \sum\limits_{i=1}^{4}x_i f(x_i) \\ = 5(0.25)+10(0.25) + 15(0.25) + 20(0.25) = 12.5. \)
The pmf used in this example is a well-known one, and with a little reflection, the reader who doesn't recognize it should be able to guess the name of that distribution. We also see that the expected value matches the value that we get from the standard average formula that most people are familiar with.
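As a quick sanity check, here is a sketch of that calculation in Python (NumPy assumed):

import numpy as np

xi = np.array([5, 10, 15, 20])  # the set of possible values
p = np.full(4, 0.25)            # equal weights: a discrete uniform pmf
print(np.sum(xi * p))           # expected value: 12.5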
A set of independent, identically distributed random variables, \(\{X_1, X_2,\ldots,X_n\}\), can be defined as a random sample of size \(n\). If a parameter is a numeric descriptive measure of a population or probability distribution, then an estimator is a rule or function of a random sample that approximates an unknown parameter. If we were to repeatedly draw samples from a population or probability distribution and then use those samples to evaluate an estimator, we should expect different outcomes across the random samples. We then use the concept of a sampling distribution to capture the probabilities of the various outcomes of the estimator. Our goal then is to determine the quality of the estimator given its properties.
Let's use a simulation study to describe these properties. For the sake of simplicity, and because least squares is often called a "mean" estimator, we'll use the sample mean estimator or
\(\bar{X} = \frac{1}{n} \sum\limits_{i=1}^{n} X_i \)
for this study.
First, we draw 1000 samples from a normal distribution with mean, \(\mu = 0\), variance, \(\sigma^2 = 1\), and sample size \(n = 50\). We then use the observations in each one of the 1000 samples to evaluate the mean estimator given above. So we'll calculate 1000 sample mean estimates, which will have a sampling distribution. We'll call the results from this first simulation run samples.A. We run two additional simulations with \(n = 100\) and \(n = 200\) while keeping the other parameters the same. We'll call these next two runs samples.B and samples.C, respectively. Finally, we change the variance to \(\sigma^2 = 3\) with mean, \(\mu = 0\) and sample size \(n = 200\). We'll call this run samples.D. The results from these simulation runs are presented in the graph below.
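For readers who want to reproduce the simulation, here is a sketch in Python (NumPy and Matplotlib assumed; the labels mirror samples.A through samples.D):

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
runs = {"samples.A": (0.0, 1.0, 50),
        "samples.B": (0.0, 1.0, 100),
        "samples.C": (0.0, 1.0, 200),
        "samples.D": (0.0, 3.0, 200)}  # (mu, sigma^2, n)

for label, (mu, var, n) in runs.items():
    # 1000 samples of size n; evaluate the sample mean estimator on each
    means = rng.normal(mu, np.sqrt(var), size=(1000, n)).mean(axis=1)
    plt.hist(means, bins=40, density=True, histtype="step", label=label)

plt.legend()
plt.show()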
The results and graph demonstrate an important property of an estimator--unbiasedness. We see this because all four sampling distributions of the sample mean are centered on the given \(\mu\), 0. Said another way, an estimator is unbiased if the expected value of its probability distribution equals the parameter being estimated. More formally, we usually say
\(E(\hat{\theta}) = \theta,\)
where the estimator is denoted \(\hat{\theta}\). We use the hat symbol to denote an estimator.
Given this definition, we can easily see that bias is defined as
\(\text{Bias}(\hat{\theta}) = E(\hat{\theta}) - \theta.\)
The sampling variance is another estimator property that we can observe in the graph. All samples except the last one had their variance parameter set to the value 1. When we increase the variance value to 3, we see that the sampling distribution becomes "flatter" and "wider" in a sense. If we consider this change in variance values as creating two distinct estimators and denote them as \(\hat{\theta}_1\) and \(\hat{\theta}_2\), the estimator with a smaller sampling variance is said to be relatively efficient. More formally, relative efficiency can be written as
\(Var(\hat{\theta}_1) < Var(\hat{\theta}_2).\)
We often want to make absolute rather than relative statements about the "optimality" of an estimator. That is, we want to be able to make a claim like "this or that unbiased estimator has the smallest variance in a class of potentially many other unbiased estimators." This most efficient unbiased estimator, if found, would then be considered the "best" or optimal estimator. One of the most famous theorems in mathematical statistics--the Cramér-Rao Lower Bound--allows us to make the optimality statement that we seek.
Assuming the conditions stated in [7], let \(Y_1, Y_2,\ldots,Y_n\) be a random sample from a continuous pdf \(f_Y (y;\theta)\) and \(\hat{\theta}\) be any unbiased estimator. Then
\(Var(\hat{\theta}) \geq {\left\{nE\left[{\left(\frac{\partial \text{ ln } f_Y (y;\theta)}{\partial \theta}\right)}^2 \right]\right\}}^{-1} = {\left\{-nE\left[\frac{\partial^2 \text{ ln } f_Y (y;\theta)}{\partial \theta^2} \right]\right\}}^{-1}.\)
This can be an intimidating looking result so let's break it down a bit. Starting from the left, we note or determine the variance of the given estimator. We then want to evaluate either of the two inverted terms to the right. The middle expression, which includes a well-known quantity called the Fisher Information, is the inverse of the expectation of the squared derivative of the natural logarithm of the assumed pdf with respect to the parameter being estimated--multiplied by the size of the sample. The second contains similar components but includes a negation and assumes, if what are called regularity conditions are satisfied, that the natural logarithm of the assumed pdf is twice differentiable. If the result of either of those evaluations equals \(Var(\hat{\theta})\), the theorem states that you have found the "best" estimator because the two inverted terms represent lower bounds. See [7] for a worked out example.
Again, this result about absolute efficiency only pertains to unbiased estimators. If we want to compare biased and unbiased estimators we'll need a different metric. The mean squared error (MSE) helps with this problem. We define it as
\(MSE(\hat{\theta}) = E[ (\hat{\theta} - \theta)^2],\)
which we can use to make statements about a minimum MSE estimator in a class of candidate estimators. It also brings the concepts of bias and spread together because we can show that
\(MSE(\hat{\theta}) = Var(\hat{\theta}) + [\text{Bias}(\hat{\theta})]^2.\)
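A quick simulation sketch (NumPy assumed) illustrates the decomposition using a deliberately biased estimator, \(\hat{\theta} = 0.9\bar{X}\):

import numpy as np

rng = np.random.default_rng(0)
theta = 2.0                             # true mean
samples = rng.normal(theta, 1.0, size=(100000, 50))
theta_hat = 0.9 * samples.mean(axis=1)  # a biased estimator

mse = np.mean((theta_hat - theta) ** 2)
var = theta_hat.var()
bias = theta_hat.mean() - theta
print(mse, var + bias ** 2)             # approximately equal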
The graph above also shows what happens to the sampling distribution of this estimator when we increase the sample size, \(n\). The sampling variance becomes smaller and the distribution "collapses" on the parameter that we're trying to estimate. This property is called consistency.
We call the behavior and characteristics of estimators, as sample size grows, asymptotic or large sample properties as opposed to finite or small sample properties. These designations can be a little confusing because finite sample properties hold for samples of any given size[12].
The last property that I'll discuss in this section is asymptotic normality. Let \(\{Y_1, Y_2,\ldots, Y_n\}\) be a sequence of random variables. This property says that the cumulative distribution function of a suitably standardized \(Y_n\) converges to the cdf of the standard normal distribution as the sample size \(n\) grows large.
I'll conclude this brief review of relevant concepts from probability and statistics by returning to the simulation study of the sample mean. Given the preceding descriptions of estimator properties, it should hopefully be clear that a "dynamic" version of the graph that accompanied the simulations would greatly enhance the quality of that presentation. The dynamic graph could be an interactive one like the regression graph presented earlier or perhaps an animation. As an exercise, the reader is invited to implement this dynamic graph using the software tool of his or her choosing (there are many available options) as a way of describing these fundamental statistical concepts in a more engaging way.
Back to Regression
With relevant probability and statistics concepts introduced, we return to regression. Let's assume that the data points that you clicked into the interactive graph represent a random sample from a population. Next, we assume that the relationship between the variables \(x\) and \(y\) is given by
\(y = \beta_0 + \beta_1x + \varepsilon.\)
That is, we are hypothesizing that the "true" relationship between the variables \(x\) and \(y\) is approximated by the linear model that we have just proposed. Going back to the graph, this means that there exists another line, which we cannot observe, that is estimated by the visible regression line drawn onto the graph. Since we only have two variables in our current example, we consider it an example of bivariate regression or simple linear regression. This model has two unknown parameters, \(\beta_0\) and \(\beta_1\). There is also a random error term \(\varepsilon\). In "regressing y on x," we estimate the unknown parameters. The random error term \(\varepsilon\) accounts for everything other than \(x\) that affects \(y\) in this simple linear regression.
We'll now get a little more precise. The model is primarily called linear because the parameters, appearing as separate terms raised to the first power and multiplied by each element in the set of independent variables, are summed on the right-hand side of the model. Many sources call this condition "linearity in the parameters." That description assumes that the first product term has an independent variable that is set to one--creating the intercept term. Without this assumption, the simple linear model defined above is technically not a linear function of \(x\). Rather, it is an affine function, which is a linear function plus a constant. See [9] for formal and mathematically precise definitions and descriptions of linearity.
Here are two well-known examples from economics of functions that are "linear in the parameters."
\(Q_D(P) = \alpha - \beta P \\ U(x, y) = x^\alpha y^\beta\)
The first is a demand equation and the second is the Cobb-Douglas function. On the surface, the Cobb-Douglas function does not appear to meet the stated requirements. A log transformation, however, produces this function:
\(U(x, y)^* = \alpha\text{log}(x) + \beta\text{log}(y).\)
The transformed function is now linear in the parameters like the demand equation above it. Here is another example, a model that might appear nonlinear but is considered linear in the parameters:
\(y = \beta_0 + \beta_1x + \beta_2x^2 + \varepsilon.\)
Any function that cannot be transformed into a linear combination as shown above is considered a nonlinear function. Here is an example of a nonlinear model:
\(y = \beta_0 + e^{ \beta_1x} + \varepsilon.\)
For each value of the independent variable \(x\) in our simple linear model, there is a possible range of \(y\) values. A probability function (as defined and explained in the previous section), \(f_Y (y)\), provides a means to capture the distribution of \(y\). Next, we need a convenient way to summarize the random variable \(Y\) given the data we have about \(x\). The concept of conditional expectation--a conditional mean--provides that way. In plain English, conditioning two variables \(Y\) and \(x\) would be said: "Y given x." We now write the simple linear model as
\(E(Y| x) = \beta_0 + \beta_1x,\)
which assumes (using the assumptions outlined in [12]):
- Linearity as described above
- Linearly independent columns of predictor variables
- Zero conditional mean: The error term has zero mean conditional on the independent variables, or \(E(\varepsilon|x) = 0 \).
- Homoscedasticity: Constant variance of errors or \(Var(\varepsilon|x) = \sigma^2\).
- Nonautocorrelation of errors: An individual element in the vector of errors(residuals) cannot be used to predict the value of another element in the vector of errors.
- Normal distribution: The errors are normally distributed.
With that, we have just described the Classical Linear Regression Model and its assumptions. The first five assumptions support the Gauss-Markov Theorem, which asserts that, for finite samples, the least squares estimator from this Classical Linear Regression Model has the smallest variance in the entire class of unbiased linear estimators. The last assumption on normality allows us to make a distributional assumption about the parameter vector \(\boldsymbol{\beta}\) that is helpful for creating confidence intervals and test statistics for hypothesis testing. The normality assumption is also useful for the method of maximum likelihood, which is another method for estimating the unknown parameters in a model.
The method of maximum likelihood calculates estimate values of the unknown parameter that maximize the "likelihood" of the observed sample. Given the definition of a random sample provided earlier, we define a likelihood function as
\(L = \prod_{i=1}^{n} f_{Y|x_i} (y_i)\),
which is just a way of describing the product of probability density functions evaluated at each data point. If we assume that the pdf is the normal distribution, we have
\(L = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\sigma}\exp{(-\frac{1}{2}(\frac{y_i - \beta_0 - \beta_1 x_i}{\sigma})^2)}\),
where the mean is given as \(\beta_0 + \beta_1 x_i \) and variance is \(\sigma^2 \). Because a likelihood function depends on the parameter, you typically see it written as \(L(\theta;y)\).
Next we want to take the natural log of the likelihood function, which gives
\(\text{ ln } L(\theta;y) = -\frac{n}{2}\text{ ln } 2\pi - \frac{n}{2}\text{ ln }\sigma^2 - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i - \beta_0 - \beta_1x_i)^2. \)
Then we take partial derivatives with respect to the desired parameters and set them equal to 0, which gives
\(\begin{align*}
\frac{\partial \text{ ln } L}{\partial \beta_0} &= -\frac{1}{2\sigma^2}(-2)\sum_{i=1}^{n}(y_i - \beta_0 - \beta_1x_i)=0, \\
\frac{\partial \text{ ln } L}{\partial \beta_1}&= -\frac{1}{2\sigma^2}(-2)\sum_{i=1}^{n}(y_i - \beta_0 - \beta_1x_i)=0, \\
\frac{\partial \text{ ln } L}{\partial \sigma^2} &= -\frac{n}{2\sigma^2}+\frac{1}{2\sigma^4}\sum_{i=1}^{n}(y_i - \beta_0 - \beta_1x_i)^2=0.
\end{align*}\)
Solving the equations simultaneously to obtain the MLE estimates, we get
\(\begin{align*}
\hat{\beta_0} &= \bar{y} - \hat{\beta_1}\bar{x}, \\
\hat{\beta_1} &= \frac{\sum_{i=1}^{n}(y_i - \bar{y})(x_i - \bar{x})}{\sum_{i=1}^{n}(x_i - \bar{x})^2} ,
\end{align*}\)
where
\(\bar{y}= \frac{1}{n}\sum_{i=1}^{n}y_i,{ }\bar{x}= \frac{1}{n}\sum_{i=1}^{n}x_i\text{ and }\hat{\sigma}^2=\frac{1}{n}\sum_{i=1}^{n}\hat{\varepsilon}_{i}^{2}.\)
We have just shown that, given the assumption of a normal pdf, the MLE (a statistical method) estimates are equivalent to the least squares estimates presented earlier that were derived from purely mathematical methods. This MLE result provides the asymptotic counterpart to the finite sample findings due to the Gauss-Markov Theorem[11]. It allows us to conclude that the least squares estimator is:
- Consistent
- Asymptotically normal
- Asymptotically efficient
Taken all together, these properties provide a strong theoretical justification for the widespread use of least squares regression in statistics and applied science. The important caveat here is that these properties only hold true if the various assumptions used to derive them are valid for a given regression analysis. Many of these properties have been known for hundreds of years, so much of the actual work for the modern analyst involves testing assumptions.
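Before moving on, a short numerical check (a sketch assuming NumPy and SciPy, with simulated data) confirms that maximizing the normal log-likelihood reproduces the closed-form least squares estimates:

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 100)
y = 1.5 + 2.0 * x + rng.normal(0, 1, 100)  # simulated linear model

def neg_log_lik(params):
    b0, b1, log_sig = params
    resid = y - b0 - b1 * x
    # negative of the log-likelihood derived above (constants dropped)
    return len(y) * log_sig + np.sum(resid ** 2) / (2 * np.exp(2 * log_sig))

mle = minimize(neg_log_lik, x0=[0.0, 0.0, 0.0]).x
b1_ls = np.sum((y - y.mean()) * (x - x.mean())) / np.sum((x - x.mean()) ** 2)
b0_ls = y.mean() - b1_ls * x.mean()
print(mle[:2], (b0_ls, b1_ls))             # nearly identical estimates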
Very often, we will want to estimate parameters or make predictions about a dependent variable in models that have many independent or predictor variables. With that in mind, the following equation is the standard way many people are introduced to multiple linear regression.
\(y =\beta _{0}+\beta _{1}x_{1} + \beta _{2}x_{2}+\cdots +\beta _{p-1}x_{p-1}+\varepsilon.\)
Let's make this model more concrete by introducing a well-known data set that will be used to support examples in the next few sections. The Boston Housing data set was first used in [5] and has since become something of a "Hello World" data set in the greater statistical education and research communities. As such, it often comes as a "built-in" data set in statistical software packages. The dependent variable, the left-hand side of our multiple linear regression model, is the median home value in selected Boston suburbs. There are 13 predictor variables (ranging from area crime to rooms per house), which we associate with the parameters on the right-hand side of the model. Look here for additional details on the data set.
Introductory Matrix Concepts and Notation
We often find the material presented in the preceding sections represented in matrix form in more advanced academic literature and in the documentation that accompanies statistical software. And while matrix representation provides a convenient and compact way to present arbitrarily large tables or arrays of numbers, it often makes that literature inaccessible to readers without previous exposure to linear algebra and matrix operations. This section will introduce some basic concepts and notation that are relevant to linear regression. Readers without previous knowledge of this material should view this section only as a starting point. I recommend reading [9] and the appendices in [3, 12], but there are many available resources to help people learn linear and matrix algebra.
First, we introduce some matrix notation. It is standard convention to denote the number of rows in a matrix with the letter \(m\) and the number of columns with the letter \(n\). These are the dimensions of a generic \(m \times n\) matrix \(\boldsymbol{A}\). In spoken words, we say "A is an m by n matrix." Matrices are often capitalized and bolded to distinguish them from scalars (numbers). More formally, we represent this generic matrix \(\boldsymbol{A}\) like this:
\(\boldsymbol{A} = \begin{bmatrix}
a_{11}&a_{12}&\cdots &a_{1n} \\
a_{21}&a_{22}&\cdots &a_{2n} \\
\vdots & \vdots & \ddots & \vdots\\
a_{m1}&a_{m2}&\cdots &a_{mn}
\end{bmatrix} = [a_{ij}],\)
where \(a_{ij}\) denotes the entry of the matrix at the \(i\)th row and the \(j\)th column.
For example, given the matrix
\(\boldsymbol{A}=\begin{bmatrix}
1&2&3\\
4&5&6\\
\end{bmatrix},\)
\(a_{12} = 2. \)
A square matrix is a matrix with the same number of rows and columns. We would call that an \(n \times n \) matrix.
A handy property of scalars is the ability to multiply a number by 1 and get the same number back. With matrices, we achieve this property with a square matrix called the identity matrix \(\boldsymbol{I}_{n}\).
\(\boldsymbol{I} \equiv \begin{bmatrix}
1 &0&\cdots &0 \\
0&1&\cdots &0 \\
\vdots & \vdots & \ddots & \vdots\\
0&0&\cdots &1
\end{bmatrix} \equiv \boldsymbol{I}_{n}.\)
I'll leave it to the reader, if necessary, to consult the sources mentioned earlier about matrix operations like addition and multiplication. For the most part, these operations are similar for scalars and matrices. There are notable differences, though, like matrix multiplication not generally being commutative, or \(\boldsymbol{AB} \neq \boldsymbol{BA}.\)
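A two-line check in Python (NumPy assumed, with hypothetical matrices) makes the noncommutativity concrete:

import numpy as np

A = np.array([[1, 2], [3, 4]])
B = np.array([[0, 1], [1, 0]])
print(A @ B)  # [[2 1] [4 3]]
print(B @ A)  # [[3 4] [1 2]] -- so AB != BA here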
Another important operation is matrix inversion. This operation allows us to achieve division, in some sense, for matrices. Given a square matrix \(\boldsymbol{A}_{n \times n}\), if there exists a square matrix \(\boldsymbol{B}_{n \times n}\) where \(\boldsymbol{AB}= \boldsymbol{I}_n\) and \(\boldsymbol{BA}= \boldsymbol{I}_n\), then \(\boldsymbol{B}\) is the inverse of \(\boldsymbol{A}\) or \(\boldsymbol{B} = \boldsymbol{A}^{-1}\). An invertible matrix is called nonsingular, while a square matrix with no inverse is singular. These are two conditions that frequent users of regression software packages can become painfully familiar with in the form of error messages.
Transposition is a simple but vital concept that involves interchanging the rows and columns of a matrix. So given a matrix \(\boldsymbol{A} = [a_{ij}]\), its transpose is \(\boldsymbol{A^{\prime} } = [a_{ji}]\). Rather than using the prime symbol, some texts will write the transpose as \(\boldsymbol{A}^T .\) So the transpose of our earlier matrix
\(\boldsymbol{A}=\begin{bmatrix}
1&2&3\\
4&5&6\\
\end{bmatrix}\)
is
\(\boldsymbol{A^{\prime}}=\begin{bmatrix}
1&4\\
2&5\\
3&6\\
\end{bmatrix}.\)
Two relevant properties (for later derivations in this section) of transposition are:
- \(\boldsymbol{(A+B)^{\prime}} = \boldsymbol{A^{\prime}+B^{\prime}}\)
- \(\boldsymbol{(AB)^{\prime}} = \boldsymbol{B^{\prime}A^{\prime}}\)
Given an \(m \times n \) matrix \(\boldsymbol{A}\), the rank of \(\boldsymbol{A}\) equals the number of linearly independent columns in \(\boldsymbol{A}\). A column is considered linearly independent if it cannot be expressed as a linear combination of any other columns in \(\boldsymbol{A}\).
In math and engineering, the equation for solving a system of equations is often given as \(\boldsymbol{Ax = b}\). The goal is often to find the value(s) of \(\boldsymbol{x}\) that make the left-hand side and the right-hand side equal. Let's explicitly see how this system is formed. Given a system of \(m\) equations and \(n\) unknowns, the familiar representation of a system of equations is:
\(\begin{align} a_{11}x_1+a_{12}x_{2}+\cdots+a_{1n}x_{n}& = b_1 \\ a_{21}x_1+a_{22}x_{2}+\cdots+a_{2n}x_{n}& = b_2 \\ \vdots \\ a_{m1}x_1+a_{m2}x_{2}+\cdots+a_{mn}x_{n}& = b_m,\end{align}\)
where
\(\boldsymbol{A} = \begin{bmatrix}
a_{11}&a_{12}&\cdots &a_{1n} \\
a_{21}&a_{22}&\cdots &a_{2n} \\
\vdots & \vdots & \ddots & \vdots\\
a_{m1}&a_{m2}&\cdots &a_{mn}
\end{bmatrix} , \boldsymbol{x} =
\begin{bmatrix}
x_1\\
x_2\\
\vdots\\
x_n\\
\end{bmatrix},\boldsymbol{b} =
\begin{bmatrix}
b_1\\
b_2\\
\vdots\\
b_m\\
\end{bmatrix}.\)
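Here is a sketch of solving such a system with NumPy (a hypothetical 2-by-2 example):

import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
b = np.array([3.0, 5.0])
x = np.linalg.solve(A, b)  # solves Ax = b without explicitly forming the inverse
print(x)                   # [0.8 1.4]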
We use a similar process to transform real-world data like the Boston data set into a mathematical object that can be manipulated in matrix form. That is, we create a system of \(n\) observations, \(p \) parameters, and a random, unknown error term per the multiple linear regression model. The representation below makes the distinction between predictor variables and the intercept term hence the \(p-1\) subscript.
\(\begin{align} y_1 & =\beta_0+\beta_1x_{1,1}+\beta_2x_{1,2}+\cdots+\beta_{p-1}x_{1,p-1}+\varepsilon_1 \\ y_2 & =\beta_0+\beta_1x_{2,1}+\beta_2x_{2,2}+\cdots+\beta_{p-1}x_{2,p-1}+\varepsilon_2 \\ \vdots \\ y_n & = \beta_0+\beta_1x_{n,1}+\beta_2x_{n,2}+\cdots+\beta_{p-1}x_{n,p-1}+\varepsilon_n \end{align}\)
To preclude confusion, I'll briefly point out some notational conventions. Since it is standard convention to denote the number of observations in a data set with subscript \(n\), I use \(n\) rather than \(m\) to denote the number of rows in a column vector, which is an \(n \times 1 \) matrix. Following the conventions in [3, 12], I use lowercase rather than uppercase for column and row vectors. Also, the distinction is made between unobservable errors \(\varepsilon\) and observable residuals \(e\). We use the representation below to extract the matrix form of the true model.
\(\begin{bmatrix} y_1\\ y_2\\ \vdots\\ y_n \end{bmatrix}=\begin{bmatrix} 1 & x_{1,1} & x_{1,2} & \cdots & x_{1,p-1}\\ 1 & x_{2,1} & x_{2,2} & \cdots & x_{2,p-1}\\\vdots&\vdots&\vdots&\ddots&\vdots\\1 & x_{n,1} & x_{n,2} & \cdots & x_{n,p-1} \end{bmatrix}\begin{bmatrix} \beta_0\\ \beta_1\\ \vdots\\ \beta_{p-1} \end{bmatrix}+\begin{bmatrix} \varepsilon_1\\ \varepsilon_2\\ \vdots\\ \varepsilon_{n} \end{bmatrix}\\\hspace{5cm}\Downarrow\\\hspace{4cm}\boldsymbol{y = X\beta+\varepsilon}\)
The matrix \(\boldsymbol{X}\) above is called a design matrix. In this case, it includes an intercept term, represented by a column of ones, and all the data for the elements in the set of predictor variables.
Remembering that matrices have to be conformable to carry out various operations, note the subscripts in the matrix-based model below. Because the dimensions are conformable, we can multiply and add the matrix components in an intuitive way. In this case, we count the intercept term as a parameter.
\(\boldsymbol{Y}_{n\times1} = \boldsymbol{X}_{n\times p}\boldsymbol{\beta}_{p\times1}+\boldsymbol{\varepsilon}_{n\times1}\)
Now let's solve the least squares problem in matrix form. Before proceeding, I'll briefly discuss the concept of a norm, which is the length or magnitude of a vector, in terms of least squares. Per the Pythagorean theorem, we can define the Euclidean norm or 2-norm of the residual vector \(\boldsymbol{e}\) as
\(\boldsymbol{||e|| = \sqrt{e^{\prime}e}}.\)
The least squares problem can now be restated as finding \(\boldsymbol{\hat{\beta}}\) such that
\(\boldsymbol{||e||^2 = ||y - X\hat{\beta}||^2}\)
is minimized. Given the respective comments in [13] about the transpose of a scalar being a scalar and about matrix differentiation, we get the following derivation of the matrix-based least squares method.
Since \(\boldsymbol{e=y-X\hat{\beta}}\), we have
\(\begin{align} \boldsymbol{e^{\prime}e} & =\boldsymbol{(y -X\hat{\beta})^{\prime}(y-X\hat{\beta})}\\ & =\boldsymbol{y^{\prime}y-\hat{\beta}^{\prime}X^{\prime}y - y^{\prime}X\hat{\beta} + \hat{\beta}^{\prime}X^{\prime}X\hat{\beta}} \\ & = \boldsymbol{y^{\prime}y-}2\boldsymbol{\hat{\beta}^{\prime}X^{\prime}y+\hat{\beta}^{\prime}X^{\prime}X\hat{\beta}}. \end{align}\)
Taking the partial derivative with respect to \(\boldsymbol{\hat{\beta}}\) and setting the result equal to zero, we have
\( \dfrac{\partial \boldsymbol{e^{\prime}e}}{\partial \hat{\boldsymbol{\beta}}} =-2\boldsymbol{X^{\prime}y}+2\boldsymbol{X^{\prime}X\hat{\beta}=0.}\)
Rearranging and multiplying both sides by \(\frac{1}{2}\), we get
\(\boldsymbol{(X^{\prime}X)\hat{\beta}=X^{\prime}y},\)
which are called the normal equations. Assuming that \(\boldsymbol{X^{\prime}X}\) is nonsingular, we conclude that
\(\boldsymbol{\hat{\beta}=(X^{\prime}X)^{-1}X^{\prime}y}.\)
This normal equations approach to matrix-based least squares is the standard derivation shown in most textbooks and tutorials. One of the main goals of this blog is to describe how software packages actually calculate least squares estimates. Those packages generally do not use the normal equations. The problem with the normal equations approach is the requirement to form \(\boldsymbol{X^{\prime}X}\) and compute its inverse to solve for \(\boldsymbol{\hat{\beta}}\). In forming this matrix product, we have to consider the rounding errors that come with potentially many floating point operations. It might also be the case that the matrix \(\boldsymbol{X}\) has measurement error. More technically, we are concerned with the stability of \(\boldsymbol{X}\) or the conditioning of the associated system. Without getting into a deep discussion of perturbation analysis, these two concepts pertain to the accuracy of a computed result when the input data contains "impurities" like roundoff or measurement error.
A condition number provides a relatively simple way to test the stability of a matrix and is written as
\(\kappa = ||\boldsymbol{A}^{-1}||\ ||\boldsymbol{A}||\),
where \(\boldsymbol{A}\) is a square matrix and \(||*||\) is the matrix 2-norm, which is different from the vector 2-norm defined earlier. See [9] for an explanation of the difference. You might ask how practical a concept this is given that many of the matrices that we deal with are not square. The design matrix for the Boston data set is definitely not square. The "trick" here is that \(\boldsymbol{X^{\prime}X}\) is always a square matrix. So we can find the condition number of that matrix and then take its square root to compute the number that we want. A condition number close to 1 suggests a stable matrix. As an exercise to test understanding, readers are invited to calculate the condition number of the Boston design matrix using the software package of their choice; a sketch of one approach follows.
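For readers taking up that exercise, here is a sketch in Python (NumPy assumed, and assuming boston2.csv is laid out as in the code section below, with the intercept column first and the response in the last column):

import numpy as np

data = np.loadtxt("boston2.csv", delimiter=",")
X = data[:, 0:14]  # design matrix: intercept column plus 13 predictors
kappa = np.sqrt(np.linalg.cond(X.T @ X, p=2))
print(kappa)       # far from 1 here, suggesting an ill-conditioned design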
A condition number of exactly 1 implies an orthogonal matrix. An orthogonal matrix \(\boldsymbol{O}\) has the convenient properties:
\(\boldsymbol{ O^{ -1} = O^{\ \prime}\Leftrightarrow O^{\ \prime}O = OO^{\ \prime}=I}.\)
The concepts of the condition number and orthogonality are introduced here together to reinforce the notion that if we multiply sequences of orthogonal matrices, the condition number never increases. This is an intuitive rationale for the stable orthogonal reduction algorithms that are used in software packages instead of the textbook normal equations approach.
I'll conclude this section by introducing QR factorization, a method commonly used in software packages for solving the least squares problem. Restating the normal equations, we have
\(\boldsymbol{(X^{\prime}X)\hat{\beta}=X^{\prime}y}\).
We want to solve for \(\boldsymbol{\hat{\beta}}\) without calculating \(\boldsymbol{X^{\prime}X}\) and its inverse.
QR Factorization states that every matrix \(\boldsymbol{X}_{m \times n}\) with linearly independent columns can be uniquely factored as \(\boldsymbol{X=QR}\), where the columns of \(\boldsymbol{Q}_{m \times n}\) are orthonormal and \(\boldsymbol{R}_{n \times n}\) is an upper triangular matrix with positive entries along its diagonal. Given this definition and remembering the properties of transposition, we can do the following manipulations to the left-hand side of the normal equations:
\(\begin{align} \boldsymbol{(X^{\prime}X)\hat{\beta}} & =\boldsymbol{(QR)^{\ \prime}(QR)\hat{\beta}}\\ & = \boldsymbol{R^{\ \prime}Q^{\ \prime}QR\hat{\beta}} \\ & = \boldsymbol{R^{\ \prime}R\hat{\beta}}.\end{align}\)
We can rewrite the right-hand side as
\(\boldsymbol{X^{\prime}}y =\boldsymbol{R^{\ \prime}Q^{\ \prime}y}.\)
Because triangular matrices with positive diagonal entries are nonsingular [9], \(\boldsymbol{R^{\ \prime}}\) can be cancelled from both sides (in general, you can't just cancel out matrices). Putting both sides together, we conclude
\(\begin{align}\boldsymbol{R^{\ \prime}R\hat{\beta}} & =\boldsymbol{R^{\ \prime}Q^{\ \prime}y}\\\boldsymbol{R\hat{\beta}} & = \boldsymbol{Q^{\ \prime}y}.\end{align}\)
This is an upper triangular system that can be efficiently solved with a back substitution algorithm like the one presented in [9]. With that, the least squares problem is solved without calculating \(\boldsymbol{X^{\prime}X}\) or its inverse. See a reference like [9] for more details on the orthogonalization algorithms like Gram-Schmidt, Householder, and Givens that actually generate the matrices \(\boldsymbol{Q}\) and \(\boldsymbol{R}\).
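To tie the pieces together before the full implementations below, here is a sketch of the QR approach with an explicit back substitution, written in Python (NumPy assumed; the data are hypothetical):

import numpy as np

def back_substitute(R, z):
    # Solve the upper triangular system R beta = z from the bottom up
    n = R.shape[0]
    beta = np.zeros(n)
    for i in range(n - 1, -1, -1):
        beta[i] = (z[i] - R[i, i + 1:] @ beta[i + 1:]) / R[i, i]
    return beta

def qr_least_squares(X, y):
    Q, R = np.linalg.qr(X)  # reduced QR factorization
    return back_substitute(R, Q.T @ y)

X = np.column_stack([np.ones(5), np.arange(5.0)])  # intercept plus one predictor
y = np.array([1.0, 2.1, 2.9, 4.2, 5.0])
print(qr_least_squares(X, y))
print(np.linalg.lstsq(X, y, rcond=None)[0])        # same result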
Computational Considerations and Implementation
We're now ready to talk implementation and code. It's hard to imagine a general purpose statistical package that does not offer linear regression. I'm therefore not suggesting that the code below is an alternative to available regression options. I do want to give insight into how linear regression is implemented in those software packages. I'll start by introducing the concept of abstraction. Software tends to be implemented in layers. That is, the level that the user interacts with is generally built on top of many underlying levels of functionality. For example, statistical software implementations of regression like R's lm function rely on underlying Fortran libraries to do matrix computations.
I've provided matrix-based regression implementations in Java, Python, Scala, R, and MATLAB to show direct application of the QR decomposition approach discussed above. The jar file required to run the Java and Scala versions can be found here (data-wrangler-1.0.1.jar). Getting these versions to run is left as an exercise for new programmers in these languages. Fair warning, doing so won't exactly be a copy, paste, and run drill.
I feel that the level of abstraction of the code matches the level of the descriptions in this blog. I've used the Boston Housing data set introduced earlier as the example data set. All five code snippets generate identical output, as shown in the Output tab below. For a variety of reasons, I don't typically offer training in MATLAB, but it's a language and framework that I've enjoyed using over the years. I consider it something of a benchmark language in computational math, and so I've included it.
The code snippets also provide some uses beyond a regression tutorial. I've made a point to use an object-oriented approach for all five implementations. I've found this approach to be particularly useful for organizing and managing large software projects. Many users of MATLAB have probably never seen an object-oriented implementation in that language, and like many users of R, those users probably never think of using the language in an all-purpose way. The code snippets also provide basic examples of how to format string output in these different languages. I believe that facility with strings is a good indicator of a programmer's general ability.
Regression Code
/*****************************************************************
* Author: Ed Ekpoudom
* Last Updated: 18 January 2017
*
*****************************************************************/
package adnt.java;
import org.apache.commons.math3.linear.QRDecomposition;
import org.apache.commons.math3.linear.RealMatrix;
public class JavaRegression {

    private final RealMatrix beta;

    public JavaRegression(RealMatrix exog, RealMatrix endog) {
        QRDecomposition qr = new QRDecomposition(exog);
        beta = qr.getSolver().solve(endog);
    }

    public double getBeta(int p) {
        return beta.getEntry(p, 0);
    }

    public static void main(String[] args) {
        String filePath = "boston2.csv";
        RealMatrix dataMatrix = DataWrangler.csv2Matrix(filePath);
        RealMatrix exog = dataMatrix.getSubMatrix(0, 505, 0, 13);
        RealMatrix endog = dataMatrix.getColumnMatrix(14);
        JavaRegression fit = new JavaRegression(exog, endog);
        System.out.format("Coefficients: \n\nIntcpt \t= %.4f "
                + "\ncrim \t= %.4f \nzn \t= %.4f \n"
                + "indus \t= %.4f \nchas \t= %.4f \nnox \t= %.4f \n"
                + "rm \t= %.4f \nage \t= %.4f \ndis \t= %.4f \n"
                + "rad \t= %.4f \ntax \t= %.4f \nptrtio \t= %.4f \n"
                + "black \t= %.4f \nlstat \t= %.4f\n",
                fit.getBeta(0), fit.getBeta(1),
                fit.getBeta(2), fit.getBeta(3),
                fit.getBeta(4), fit.getBeta(5),
                fit.getBeta(6), fit.getBeta(7),
                fit.getBeta(8), fit.getBeta(9),
                fit.getBeta(10), fit.getBeta(11),
                fit.getBeta(12), fit.getBeta(13));
    }
}
###################################################################
# Author: Ed Ekpoudom
# Last Updated: 18 January 2017
#
##################################################################
import numpy as np
class PythonRegression:

    def __init__(self, endog, exog):
        self.endog = endog
        self.exog = exog
        self.betahat = np.zeros((exog.shape[1], 1))

    def fit(self):
        # Solve the least squares problem via QR: R betahat = Q'y
        Q, R = np.linalg.qr(self.exog)
        Qb = np.dot(Q.T, self.endog)
        self.betahat = np.linalg.solve(R, Qb)
        return self.betahat

if __name__ == '__main__':
    data = np.loadtxt(open("boston2.csv", "rb"), delimiter=",")
    endog = data[:, [14]]
    exog = data[:, 0:14]
    lm = PythonRegression(endog, exog)
    bhat = lm.fit()
    results = [['Intcpt', bhat[0, 0]], ['crim', bhat[1, 0]], ['zn', bhat[2, 0]],
               ['indus', bhat[3, 0]], ['chas', bhat[4, 0]], ['nox', bhat[5, 0]],
               ['rm', bhat[6, 0]], ['age', bhat[7, 0]], ['dis', bhat[8, 0]],
               ['rad', bhat[9, 0]], ['tax', bhat[10, 0]], ['ptrtio', bhat[11, 0]],
               ['black', bhat[12, 0]], ['lstat', bhat[13, 0]]]
    print('Coefficients:\n')
    for item in results:
        print("{:8}= {:06.4f}".format(item[0], item[1]))
###################################################################
# Author: Ed Ekpoudom
# Last Updated: 18 January 2017
#
##################################################################
library(R6)
R_Regression <- R6Class("R_Regression",
  public = list(
    endog = NULL,
    exog = NULL,
    betahat = NULL,
    initialize = function(endog = NA, exog = NA) {
      self$endog <- endog
      self$exog <- exog
      self$fit()
      self$print_output()
    },
    fit = function() {
      # Solve the least squares problem via QR: R betahat = Q'y
      Q.R <- qr(self$exog)
      Q <- qr.Q(Q.R)
      R <- qr.R(Q.R)
      Qy <- crossprod(Q, self$endog)
      self$betahat <- solve(R, Qy)
    },
    print_output = function() {
      cat("Coefficients:\n\n")
      names <- rownames(self$betahat)
      for (i in seq(1:nrow(self$betahat))) {
        cat(paste0(names[i], '\t= ',
                   round(self$betahat[i], digits = 4), '\n'))
      }
    }
  )
)

# Load the built-in Boston data set from the MASS package
library(MASS)
data(Boston)
data <- Boston
Intcpt <- rep(1, nrow(data))
exog <- data[, 1:13]
exog <- as.matrix(cbind(Intcpt, exog))
endog <- as.matrix(data[, 14])
lm <- R_Regression$new(endog, exog)
/*****************************************************************
* Author: Ed Ekpoudom
* Last Updated: 18 January 2017
*
*****************************************************************/
package adnt.scala
import org.apache.commons.math3.linear._
import org.apache.commons.math3.linear.QRDecomposition._
import org.apache.commons.math3.linear.RealMatrix._
import Array._
import adnt.java.DataWrangler
class ScalaRegression(exog: RealMatrix, endog: RealMatrix) {
  val qr = new QRDecomposition(exog)
  val beta = qr.getSolver().solve(endog)

  def getBeta(p: Int) = beta.getEntry(p, 0)
}

object ScalaRegression {
  def main(args: Array[String]): Unit = {
    val filePath = "boston2.csv"
    val dataMatrix = DataWrangler.csv2Matrix(filePath)
    val exog = dataMatrix.getSubMatrix(0, 505, 0, 13)
    val endog = dataMatrix.getColumnMatrix(14)
    val fit = new ScalaRegression(exog, endog)
    println("Coefficients: \n\nIntcpt \t= %.4f ".format(fit.getBeta(0)))
    println("crim \t= %.4f \nzn \t= %.4f".format(fit.getBeta(1), fit.getBeta(2)))
    println("indus \t= %.4f \nchas \t= %.4f".format(fit.getBeta(3), fit.getBeta(4)))
    println("nox \t= %.4f \nrm \t= %.4f".format(fit.getBeta(5), fit.getBeta(6)))
    println("age \t= %.4f \ndis \t= %.4f".format(fit.getBeta(7), fit.getBeta(8)))
    println("rad \t= %.4f \ntax \t= %.4f".format(fit.getBeta(9), fit.getBeta(10)))
    println("ptrtio \t= %.4f \nblack \t= %.4f".format(fit.getBeta(11), fit.getBeta(12)))
    println("lstat \t= %.4f".format(fit.getBeta(13)))
  }
}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Author: Ed Ekpoudom
% Last Updated: 27 January 2017
% MATLAB is a registered trademark of The MathWorks, Inc.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
classdef MATLABRegression
    properties
        endog;
        exog;
        betahat;
    end
    methods
        function self = MATLABRegression(filename)
            if (nargin > 0)
                M = csvread(filename);
                self.exog = M(:, 1:14);
                self.endog = M(:, 15);
                self.betahat = [];
                self.betahat = self.fit();
                self.print_output(self);
            end
        end
    end
    methods (Access = 'private')
        function beta = fit(self)
            % Extract the triangular factor R from the QR factorization
            if issparse(self.exog)
                R = qr(self.exog);
            else
                R = triu(qr(self.exog));
            end
            % Solve using R, then apply one step of iterative
            % refinement to improve accuracy
            beta = R\(R'\(self.exog' * self.endog));
            r = self.endog - self.exog * beta;
            err = R\(R'\(self.exog' * r));
            beta = beta + err;
        end
    end
    methods (Static = true)
        function print_output(self)
            fprintf('Coefficients:\n\n')
            fprintf('Intcpt \t= %6.4f\n', self.betahat(1, 1))
            fprintf('crim \t= %6.4f\n', self.betahat(2, 1))
            fprintf('zn \t= %6.4f\n', self.betahat(3, 1))
            fprintf('indus \t= %6.4f\n', self.betahat(4, 1))
            fprintf('chas \t= %6.4f\n', self.betahat(5, 1))
            fprintf('nox \t= %6.4f\n', self.betahat(6, 1))
            fprintf('rm \t= %6.4f\n', self.betahat(7, 1))
            fprintf('age \t= %6.4f\n', self.betahat(8, 1))
            fprintf('dis \t= %6.4f\n', self.betahat(9, 1))
            fprintf('rad \t= %6.4f\n', self.betahat(10, 1))
            fprintf('tax \t= %6.4f\n', self.betahat(11, 1))
            fprintf('ptrtio \t= %6.4f\n', self.betahat(12, 1))
            fprintf('black \t= %6.4f\n', self.betahat(13, 1))
            fprintf('lstat \t= %6.4f\n', self.betahat(14, 1))
        end
    end
end
I'll wrap up this blog post with a short discussion of my references for this post. It's always tempting to promise a laundry list of future blog posts about this or that topic. It's particularly tempting in this case because regression is a big topic with many interesting aspects. Life happens and promised posts get delayed or never appear. In lieu of those promises, I'll provide some hopefully useful descriptions of my references, which almost certainly cover in greater detail anything that I may write about in future posts on regression.
I'll start with references on topics that I didn't really cover in much detail in this post. Testing the assumptions of linear regression is vitally important to the credibility of any serious analysis. See the Spanos books [10,11] for intensive coverage of assumptions testing for linear regression. I also didn't cover inference, hypothesis testing, and prediction. All the statistical texts listed in the references section provide more than adequate details on those topics. Most of those texts also provide basic introductions to Bayesian statistics, which I didn't introduce in this blog. That said, I believe a good grasp of the concepts covered in this blog makes the process of wading into Bayesian statistics easier.
My background is in applied economics and so I'll admit that I have a "bias" towards econometrics texts. The Wooldridge book is a classic text in applied econometrics. The Greene book is as complete a text on regression and, really, statistics as you're going to find. The appendices in these two books are excellent starting points into the world of mathematical statistics. A classic text in mathematical statistics is Casella-Berger[1] while [7] provides a complete introduction to the subject with numerous applications that span many fields.
To see regression applied beyond economics, see [6, 4]. The former is one of the most accessible and practical statistical texts that I've come across. The latter is a classic text in machine and statistical learning. The matrix theory behind regression is covered in great detail in Meyer's book [9]. It is widely read in the engineering and applied math communities.
Finally, I implemented the interactive regression graph from this D3.js project by Sohail Khan. You can learn more about building Data-Driven Documents (D3) here.
1. Casella, George and Roger L. Berger. Statistical Inference. 2nd ed., Duxbury/Thomson Learning, 2001.
2. Galton, Francis. "Regression Towards Mediocrity in Hereditary Stature." Journal of the Anthropological Institute of Great Britain and Ireland, vol. 15, 1886, pp. 246-263.
3. Greene, William H. Econometric Analysis. 7th ed., Prentice Hall, 2012.
4. Hastie, Trevor, Robert Tibshirani, and Jerome H. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, 2001.
5. Harrison, David and Daniel L. Rubinfeld. "Hedonic Prices and the Demand for Clean Air." Journal of Environmental Economics and Management, vol. 5, 1978, pp. 81-102.
6. James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. An Introduction to Statistical Learning: With Applications in R. Springer, 2013.
7. Larsen, Richard and Morris L. Marx. An Introduction to Mathematical Statistics and Its Applications. 5th ed., Prentice-Hall, 2012.
8. Mendenhall, William, Richard Scheaffer, and Dennis Wackerly. Mathematical Statistics with Applications. 2nd ed., PWS Publishers, 1981.
9. Meyer, Carl D. Matrix Analysis and Applied Linear Algebra. Society for Industrial and Applied Mathematics, 2001.
10. Spanos, Aris. Probability Theory and Statistical Inference: Econometric Modeling with Observational Data. Cambridge University Press, 1999.
11. Spanos, Aris. Statistical Foundations of Econometric Modelling. Cambridge University Press, 1986.
12. Wooldridge, Jeffrey M. Introductory Econometrics: A Modern Approach. 4th ed., Thomson/South-Western, 2009.
13. "OLS in Matrix Form." New York University, http://web.stanford.edu/~mrosenfe/soc_meth_proj3/matrix_OLS_NYU_notes.pdf.
This post can be cited as:
Ekpoudom, Ed. "Regression: From the Basics to Coding Matrix-Based Least Squares." AD&T Consulting, LLC, 2 June 2017, https://www.adntconsulting.com/blog/2017/06/02/regression-basics-coding-matrix-based-least-squares.