Someone recently asked on the statistics Stack Exchange why the squared error is used in statistics. This is something I’d been wondering about myself recently, so I decided to take a crack at answering it. The post below is adapted from that answer.
Why squared error?
It’s true that one could choose to use, say, the absolute error instead of the squared error. In fact, the absolute error is often closer to what you “care about” when making predictions from your model. For instance, if you buy a stock expecting its future price to be \(P_{predicted}\) and its future price is \(P_{actual}\) instead, you lose money proportional to \((P_{predicted} - P_{actual})\), not its square! The same is true in many other contexts.
However, the squared error has much nicer mathematical properties. For instance:
If \(X\) is a random variable, then the estimator of \(X\) that minimizes the squared error is the mean, \(E(X)\). On the other hand, the estimator that minimizes the absolute error is the median, \(m(X)\). The mean has much nicer properties than the median; for instance, \(E(X + Y) = E(X) + E(Y)\), but there is no general expression for \(m(X + Y)\).
If you have a vector \(\vec X = (X_1, X_2)\) estimated by \(\vec x = x_1, x_2\), then for the squared error it doesn’t matter whether you consider the components separately or together: \(\left|\left|\vec X - \vec x\right|\right|^2 = (X_1 - x_1)^2 + (X_2 - x_2)^2\), so the squared error of the components just adds. You can’t do that with absolute error. This means that the squared error is independent of re-parameterizations: for instance, if you define \(\vec Y_1 = (X_1 + X_2, X_1 - X_2)\), then the minimum-squared-deviance estimators for \(Y\) and \(X\) are the same, but the minimum-absolute-deviance estimators are not.
For independent random variables, variances (expected squared errors) add: \(Var(X + Y) = Var(X) + Var(Y)\). The same is not true for expected absolute error.
For a sample from a multivariate Gaussian distribution (where probability density is exponential in the squared distance from the mean), all of its coordinates are Gaussian, no matter what coordinate system you use. For a multivariate Laplace distribution (like a Gaussian but with absolute, not squared, distance), this isn’t true.
The squared error of a probabilistic classifier is a proper scoring rule. If you had an oracle telling you the actual probability of each class for each item, and you were being scored based on your Brier score, your best bet would be to predict what the oracle told you for each class. This is not true for absolute error. (For instance, if the oracle tells you that \(P(Y=1) = 0.9\), then predicting that \(P(Y=1) = 0.9\) yields an expected score of \(0.9\cdot 0.1 + 0.1 \cdot 0.9 = 0.18\); you should instead predict that \(P(Y=1) = 1\), for an expected score of \(0.9\cdot 0 + 0.1 \cdot 1 = 0.1\).)
I would say that these nice properties are merely “convenient”—we might choose to use the absolute error instead if it didn’t pose technical issues when solving problems. But some mathematical coincidences involving the squared error are more important. They don’t just pose technical problem-solving issues; rather, they give us intrinsic reasons why minimizing the square error might be a good idea:
When fitting a Gaussian distribution to a set of data, the maximum-likelihood fit is that which minimizes the squared error, not the absolute error.
When doing dimensionality reduction, finding the basis that minimizes the squared reconstruction error yields principal component analysis, which is nice to compute, coordinate-independent, and has a natural interpretation for multivariate Gaussian distributions (finding the axes of the ellipse that the distribution makes). There’s a variant called “robust PCA” that is sometimes applied to minimizing absolute reconstruction error, but it seems to be less well-studied and harder to understand and compute.
Looking deeper
One might well ask whether there is some deep mathematical truth underlying the many different conveniences of the squared error. As far as I know, there are a few (which are related in some sense, but not, I would say, the same):
Differentiability
The squared error is everywhere differentiable, while the absolute error is not (its derivative is undefined at 0). This makes the squared error more amenable to the techniques of mathematical optimization. To optimize the squared error, you can just set its derivative equal to 0 and solve; to optimize the absolute error often requires more complex techniques.
Inner products
The squared error is induced by an inner product on the underlying space. An inner product is basically a way of “projecting vector \(x\) along vector \(y\),” or figuring out “how much does \(x\) point in the same direction as \(y\).” In finite dimensions this is the standard (Euclidean) inner product \(\langle a, b\rangle = \sum_i a_ib_i\). Inner products are what allow us to think geometrically about a space, because they give a notion of:
- a right angle (\(x\) and \(y\) are right angles if \(\langle x, y\rangle = 0\));
- and a length (the length of \(x\) is \(\left|\left|x\right|\right| = \sqrt{\langle x, x\rangle}\)).
By “the squared error is induced by the Euclidean inner product” I mean that the squared error between \(x\) and \(y\) is \(\left|\left|x-y\right|\right|^2\), the (squared) Euclidean distance between them. In fact the Euclidean inner product is in some sense the “only possible” axis-independent inner product in a finite-dimensional vector space, which means that the squared error has uniquely nice geometric properties.
For random variables, in fact, you can define is a similar inner product: \(\langle X, Y\rangle = E(XY)\). This means that we can think of a “geometry” of random variables, in which two variables make a “right angle” if \(E(XY) = 0\). Not coincidentally, the “length” of \(X\) is \(E(X^2)\), which is related to its variance. In fact, in this framework, “independent variances add” is just a consequence of the Pythagorean Theorem:
$$Var(X + Y) = \left|\left|(X - \mu_X) + (Y - \mu_Y)\right|\right|^2 = \left|\left|X - \mu_X\right|\right|^2 + \left|\left|Y - \mu_Y\right|\right|^2 = Var(X) + Var(Y).$$
Beyond squared error
Given these nice mathematical properties, would we ever not want to use squared error? Well, as I mentioned at the very beginning, sometimes absolute error is closer to what we “care about” in practice. For instance, if your data has tails that are fatter than Gaussian, then minimizing the squared error can cause your model to spend too much effort getting close to outliers, because it “cares too much” about the one large error component on the outlier relative to the many moderate errors on the rest of the data.
The absolute error is less sensitive to such outliers. (For instance, if you observe an outlier in your sample, it changes the squared-error-minimizing mean proportionally to the magnitude of the outlier, but hardly changes the absolute-error-minimizing median at all!) And although the absolute error doesn’t enjoy the same nice mathematical properties as the squared error, that just means absolute-error problems are harder to solve, not that they’re objectively worse in some sense. The upshot is that as computational methods have advanced, we’ve become able to solve absolute-error problems numerically, leading to the rise of the subfield of robust statistical methods.
In fact, there’s a fairly nice correspondence between some squared-error and absolute-error methods:
Squared error | Absolute error |
---|---|
Mean | Median |
Variance | Expected absolute deviation |
Gaussian distribution | Laplace distribution |
Linear regression | Quantile regression |
PCA | Robust PCA |
Ridge regression | LASSO |
As we get better at modern numerical methods, no doubt we’ll find other useful absolute-error-based techniques, and the gap between squared-error and absolute-error methods will narrow. But because of the connection between the squared error and the Gaussian distribution, I don’t think it will ever go away entirely.
—December 2014