Regression analysis is any statistical method where the mean of one or more random variables is predicted conditioned on other (measured) random variables. In particular, there is linear regression, logistic regression, Poisson regression, supervised learning, and unit-weighted regression. Regression analysis is more than curve fitting (choosing a curve that best fits given data points); it involves fitting a model with both deterministic and stochastic components. The deterministic component is called the predictor and the stochastic component is called the error term.
The simplest form of a regression model contains a dependent variable (also called "outcome variable," "endogenous variable," or "Y-variable") and a single independent variable (also called "factor," "exogenous variable," or "X-variable").
Typical examples are the dependence of the blood pressure Y on the age X of a person, or the dependence of the weight Y of certain animals on their daily ration of food X. This dependence is called the regression of Y on X.
See also: multivariate normal distribution, important publications in regression analysis.
Regression is usually posed as an optimization problem as we are attempting to find a solution where the error is at a minimum. The most common error measure that is used is the least squares: this corresponds to a Gaussian likelihood of generating observed data given the (hidden) random variable. In a certain sense, least squares is an optimal estimator: see the Gauss-Markov theorem.
The optimization problem in regression is typically solved by algorithms such as the gradient descent algorithm, the Gauss-Newton algorithm, and the Levenberg-Marquardt algorithm. Probabilistic algorithms such as RANSAC can be used to find a good fit for a sample set, given a parametrized model of the curve function. For more complex, non-linear regression, artificial neural networks are commonly used.
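As an illustration of regression posed as an optimization problem, here is a minimal sketch of gradient descent on the mean squared error of a linear model. The function name, step size, iteration count and synthetic data are all assumptions chosen for the example, not part of any particular library or of the original text.

```python
import numpy as np

def gradient_descent_least_squares(X, y, lr=0.1, n_steps=5000):
    """Minimize the mean squared error ||X theta - y||^2 / n by gradient descent.

    lr and n_steps are illustrative values, not tuned recommendations.
    """
    n, p = X.shape
    theta = np.zeros(p)
    for _ in range(n_steps):
        residual = X @ theta - y
        grad = (2.0 / n) * (X.T @ residual)  # gradient of the mean squared error
        theta -= lr * grad
    return theta

# Hypothetical data: a noisy straight line y = 1 + 2x
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=50)
X = np.column_stack([np.ones_like(x), x])
y = 1.0 + 2.0 * x + 0.1 * rng.normal(size=50)

theta_gd = gradient_descent_least_squares(X, y)
theta_exact, *_ = np.linalg.lstsq(X, y, rcond=None)  # direct least-squares solution
```

On a small, well-conditioned problem like this one, the iterative and the direct solutions should agree closely; gradient descent becomes more relevant when the model is non-linear in its parameters or the data set is very large.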
Regression can be expressed as a maximum likelihood method of estimating the parameters of a model. However, for small amounts of data, this estimate can have high variance. Bayesian methods can also be used to estimate regression models. A prior is placed over the parameters, which incorporates everything known about the parameters. (For example, if one parameter is known to be non-negative a non-negative distribution can be assigned to it.) A posterior distribution is then obtained for the parameter vector. Bayesian methods have the advantages that they use all the information that is available and they are exact, not asymptotic, and thus work well for small data sets. Some practitioners use maximum a posteriori (MAP) methods, a simpler method than full Bayesian analysis, in which the parameters are chosen that maximize the posterior. MAP methods are related to Occam's Razor: there is a preference for simplicity among a family of regression models (curves) just as there is a preference for simplicity among competing theories.
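The relation between maximum likelihood and MAP estimation can be sketched for one specific, assumed setting: a linear model with Gaussian noise of known variance and an independent zero-mean Gaussian prior on each coefficient. In that case the MAP estimate has a closed form that shrinks the maximum likelihood estimate toward zero. The function name, the variances and the data below are hypothetical.

```python
import numpy as np

def map_estimate(X, y, noise_var, prior_var):
    """MAP estimate for a linear model, assuming Gaussian noise with variance
    noise_var and an independent N(0, prior_var) prior on each coefficient."""
    p = X.shape[1]
    lam = noise_var / prior_var  # strength of the pull toward the prior mean (zero)
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Hypothetical small data set: 5 observations, 3 coefficients
rng = np.random.default_rng(1)
X = rng.normal(size=(5, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.5 * rng.normal(size=5)

theta_ml = np.linalg.solve(X.T @ X, X.T @ y)                   # maximum likelihood
theta_map = map_estimate(X, y, noise_var=0.25, prior_var=1.0)  # informative prior
theta_map_flat = map_estimate(X, y, noise_var=0.25, prior_var=1e12)  # nearly flat prior
```

With a nearly flat prior the MAP estimate approaches the maximum likelihood estimate, while an informative prior yields coefficients of smaller norm.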
Purpose and formulation
The goal of regression is to describe a set of data as accurately as possible. To do this, we set the following mathematical context:
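$(\Omega, \mathcal{A}, P)$ will denote a probability space and $(\Gamma, S)$ will be a measurable space.
$\Theta \subseteq \mathbb{R}^p$ is a set of coefficients.
Very often (but not always), $\Gamma = \mathbb{R}$ and $S = \mathcal{B}(\mathbb{R})$, the Borel σ-algebra on the real numbers.
The response variable (or vector of observations) Y is a random variable, i.e. a measurable function: $Y : (\Omega, \mathcal{A}) \to (\Gamma, S)$.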
This variable will be "explained" using other random variables called factors. Some people say Y is a dependent variable (because it depends on the factors) and call the factors independent variables. However, the factors can very well be statistically dependent (for example if one takes $X$ and $X^2$) and the response variables can be statistically independent. Therefore, the terminology "dependent" and "independent" can be confusing and should be avoided.
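Let $p \in \mathbb{N}^*$; p is called the number of factors. For each $i \in \{1, \dots, p\}$, the factor $X_i$ is a random variable $X_i : (\Omega, \mathcal{A}) \to (\Gamma, S)$.
Let $\eta : \Gamma^p \times \Theta \to \Gamma$ be the regression function.
We finally define the error $\varepsilon := Y - \eta(X_1, \dots, X_p; \theta)$, which means that $Y = \eta(X_1, \dots, X_p; \theta) + \varepsilon$, or more concisely:
$$Y = \eta(X; \theta) + \varepsilon \qquad (E)$$
where $X := (X_1, \dots, X_p)$.
We suppose that there exists a true parameter $\overline{\theta} \in \Theta$ such that $E[\varepsilon \mid X] = 0$, which means we suppose we have chosen the model accurately because the best prediction we can make of Y given X is $\eta(X; \overline{\theta})$. This true parameter $\overline{\theta}$ is unknown and it is the aim of regression to estimate it with the data at hand.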
The necessity of this abstract formalism arises because the response variable Y and the factors $X_1, \dots, X_p$ can be of very different natures and therefore take values in very different sets. For example, Y could be the number of correct answers to a test and $X = X_1$ could be the age of the person taking the test. But the factors don't even have to be numbers: for example, they can take values in a finite set such as {low, medium, high}.
The last term, $\varepsilon$, is a random variable called the error, which is supposed to model the variability in the experiment (i.e., in exactly the same conditions, the output Y of the experiment might differ slightly from experiment to experiment). This term actually represents the part of Y not explained by the model η.
The general form of the function η is known. In fact, the only element we don't know in the equation (E) is θ. The aim of regression is, given a set of data, to find an estimator $\widehat{\theta}$ of θ satisfying some criterion.
The first step is to decide on the form of the model. The second is to choose an estimator for θ and compute it.
Choice of the regression function
Linear regression
Linear regression is the most common case in practice because it is the easiest to compute and often gives good results. Indeed, by restricting the variations of the factors to a "small enough" domain, the response variable can be approximated locally by a linear function. Note that by "linear", we mean "linear in θ", not "linear in X". When we do a linear regression, we are implicitly supposing that given a set of factors $X_1, \dots, X_p$, the best approximation of the response variable Y we can find is a linear combination of these factors, $\theta_1 X_1 + \dots + \theta_p X_p$. The aim of linear regression is to find a good estimator of the right coefficients $\theta_1, \dots, \theta_p$ of this linear combination.
We choose η the following way:
$$\eta(X; \theta) = \sum_{j=1}^{p} \theta_j X_j$$
Logistic regression
If the variable Y takes only discrete values (for example, a Yes/No variable), logistic regression is preferred. It is equivalent to performing a linear regression on the logarithm of the odds. The outcome of this type of regression is a function which describes how the probability of a given event (e.g. the probability of getting "yes") varies with the factors.
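The statement that logistic regression is linear on the log of the odds can be illustrated with a short sketch. The coefficient values and factor values below are hypothetical; only the relationship between the linear predictor and the probability is the point of the example.

```python
import numpy as np

def predicted_probability(x, theta):
    """Probability of the event ("yes") when the log-odds are linear in theta."""
    log_odds = x @ theta
    return 1.0 / (1.0 + np.exp(-log_odds))

# Hypothetical coefficients (constant term, slope) and three settings of one factor
theta = np.array([-1.0, 0.5])
x = np.array([[1.0, 0.0],
              [1.0, 2.0],
              [1.0, 4.0]])

prob = predicted_probability(x, theta)
# Taking the log of the odds recovers the linear predictor x @ theta
recovered_log_odds = np.log(prob / (1.0 - prob))
```

The probabilities always lie strictly between 0 and 1, whereas the log-odds can take any real value, which is why the linear model is placed on the log-odds scale.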
In order to solve this problem efficiently, several methods exist. The most common one is the Gauss-Markov method, but it requires extra hypotheses.
Choice of an estimator
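We now suppose that for each factor $X_i$, we have a sample of size $n$, $(X_{1i}, \dots, X_{ni})$, and that we have the corresponding sample of Y: $\vec{Y} = (Y_1, \dots, Y_n)^T$. Then we can build a matrix $\mathbf{X}$ where each line represents a sample of the p factors, i.e. an experiment:
$$\mathbf{X} = \begin{bmatrix} X_{11} & \cdots & X_{1p} \\ \vdots & \ddots & \vdots \\ X_{n1} & \cdots & X_{np} \end{bmatrix}$$
This is a matrix of random variables often called the design matrix (for experimental designs). Each line represents an experiment (or trial) and each column represents a factor. As we have n trials and p factors, it is an $n \times p$ matrix. We also have a corresponding error vector (of size n): $\vec{\varepsilon} = (\varepsilon_1, \dots, \varepsilon_n)^T$.
Based on the sample $\vec{Y}$ and on the design matrix $\mathbf{X}$, we would like to estimate the unknown parameters $\theta_1, \dots, \theta_p$ (one per factor).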
Under assumptions which are met relatively often, there exists an optimal solution to the linear regression problem. These assumptions are called Gauss-Markov assumptions. See also Gauss-Markov theorem.
The Gauss-Markov assumptions
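We suppose that $E[\vec{\varepsilon}] = \vec{0}$ and that $V[\vec{\varepsilon}] = \sigma^2 I_n$ (uncorrelated, but not necessarily independent) where $\sigma^2 < +\infty$ and $I_n$ is the $n \times n$ identity matrix.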
Least-squares estimation of the coefficients
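The linear regression problem is equivalent to an orthogonal projection: we project the response variable Y on a subspace of linear functions generated by $X_1, \dots, X_p$. Supposing the matrix $\mathbf{X}$ is of full rank, it can be shown (for a proof of this, see least-squares estimation of linear regression coefficients) that a good estimator of the parameters θ is the least-squares estimator $\widehat{\theta}_{LS}$:
$$\widehat{\theta}_{LS} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\vec{Y}$$
where $\vec{Y}$ is the vector of the n observations of Y, and the corresponding orthogonal projection of $\vec{Y}$ is
$$\widehat{\vec{Y}} = \mathbf{X}\widehat{\theta}_{LS} = \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\vec{Y}$$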
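The least-squares estimator and its interpretation as an orthogonal projection can be checked numerically. The following is a minimal sketch on synthetic data (the sample size, coefficients and noise level are assumptions made for the example); it solves the normal equations directly, which is reasonable only when the design matrix is well conditioned.

```python
import numpy as np

# Hypothetical data: n = 20 trials, p = 3 factors
rng = np.random.default_rng(0)
n, p = 20, 3
X = rng.normal(size=(n, p))
y = X @ np.array([1.0, -0.5, 2.0]) + 0.1 * rng.normal(size=n)

# Least-squares estimator from the normal equations (X^T X) theta = X^T y
theta_ls = np.linalg.solve(X.T @ X, X.T @ y)

# Orthogonal projection of y on the column space of X, and the residual
y_hat = X @ theta_ls
residual = y - y_hat

# Reference solution computed by a more numerically robust routine
theta_ref, *_ = np.linalg.lstsq(X, y, rcond=None)
```

Because the fitted vector is an orthogonal projection, the residual is orthogonal to every column of the design matrix, i.e. `X.T @ residual` is zero up to rounding error.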
Limitations and alternatives to least-squares
The least-squares estimator is extremely efficient: in fact, the Gauss-Markov theorem states that under the Gauss-Markov assumptions, of all unbiased estimators of the linear regression coefficients that depend linearly on the vector of observations $\vec{Y}$, the least-squares ones are the most efficient (best linear unbiased estimator or BLUE). Unfortunately, the Gauss-Markov assumptions are often not met in practical cases (for example, in the study of time series), and departure from these assumptions can corrupt the results quite significantly. A rather naïve illustration of this is a data set in which all points lie on a straight line, except one.
In that situation, just one observation flaws the entire regression line, which is pulled away from the line on which all the other points lie: this method is said to be non-robust.
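The non-robustness described above can be reproduced with a few lines of code. This is a sketch on made-up data: ten points on the line y = 1 + 2x, the last of which is then shifted upward by an arbitrary amount to play the role of the flawed observation.

```python
import numpy as np

# Hypothetical data: ten points lying exactly on y = 1 + 2x
x = np.arange(10.0)
X = np.column_stack([np.ones_like(x), x])
y_clean = 1.0 + 2.0 * x

# Same data with a single flawed observation (last point shifted by 50)
y_outlier = y_clean.copy()
y_outlier[-1] += 50.0

theta_clean, *_ = np.linalg.lstsq(X, y_clean, rcond=None)
theta_outlier, *_ = np.linalg.lstsq(X, y_outlier, rcond=None)

print("intercept, slope without outlier:", theta_clean)
print("intercept, slope with one outlier:", theta_outlier)
```

A single shifted point changes both the intercept and the slope of the fitted line, even though the nine remaining points are perfectly aligned.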
Several methods exist to solve this problem, the simplest of which is to assign weights to each observation (see weighted least-squares). Indeed, if we know that the i-th sample is likely to be unreliable, we will downweight it. This supposes that we know which observations are flawed, which is often a bit optimistic. Another approach is to use iteratively reweighted least-squares, where we compute the weights iteratively. The disadvantage of this method is that this kind of estimator cannot be computed explicitly (only iteratively) and that it is much more difficult to ensure convergence, let alone accuracy. The study of such estimators has led to a branch of statistics now called robust statistics.
Because robust estimators are a bit fiddly, people tend to overlook the Gauss-Markov assumptions and use least-squares even in situations where it can be ill-suited.
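The idea of computing the weights iteratively can be sketched as follows. The text does not specify a weighting rule, so this example assumes Huber-type weights (weight 1 for small residuals, inversely proportional to the residual size otherwise); the function name, the threshold `delta` and the data are hypothetical.

```python
import numpy as np

def irls(X, y, n_iter=100, delta=1.0, eps=1e-8):
    """Iteratively reweighted least squares with Huber-type weights (an assumed
    choice of weighting rule). Returns the coefficients and the final weights."""
    w = np.ones(len(y))
    theta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        sw = np.sqrt(w)
        # Weighted least squares: minimize sum_i w_i * (y_i - x_i . theta)^2
        theta, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
        r = np.abs(y - X @ theta)
        # Observations with large residuals are down-weighted
        w = np.where(r <= delta, 1.0, delta / np.maximum(r, eps))
    return theta, w

# Hypothetical data: points on y = 1 + 2x, with the last one shifted by 50
x = np.arange(10.0)
X = np.column_stack([np.ones_like(x), x])
y = 1.0 + 2.0 * x
y[-1] += 50.0

theta_irls, weights = irls(X, y)
```

In this sketch the flawed observation ends up with a small weight without having been identified in advance, and the fitted slope stays close to that of the other nine points. Convergence is not guaranteed for every weighting rule, in line with the caveat above.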
Confidence interval for estimation assuming normality, homoscedasticity, and uncorrelatedness
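How much confidence can we have in the values of θ we estimated from the data? To answer, we need to suppose that:
$$\vec{\varepsilon} \sim \mathcal{N}(\vec{0}, \sigma^2 I_n)$$
Then we can get the distribution of the least-squares estimate of the parameters.
If $\widehat{\theta} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\vec{Y}$ and $\widehat{\sigma}^2 = \frac{1}{n-p}\|\widehat{\vec{\varepsilon}}\|^2$ (with $\widehat{\vec{\varepsilon}} = \vec{Y} - \mathbf{X}\widehat{\theta}$), then
$$\widehat{\theta} \sim \mathcal{N}\left(\theta, \sigma^2 (\mathbf{X}^T\mathbf{X})^{-1}\right)$$
$$(n-p)\frac{\widehat{\sigma}^2}{\sigma^2} \sim \chi^2_{n-p}$$
and $\widehat{\theta}$ and $\widehat{\sigma}^2$ are independent.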
For $j \in \{1, \dots, p\}$, if we name $s_j$ the j-th diagonal element of the matrix $(\mathbf{X}^T\mathbf{X})^{-1}$, a $1 - \alpha$ confidence interval for each $\theta_j$ is therefore:
$$\left[\widehat{\theta_j}-\widehat{\sigma}\sqrt{s_j}\,t_{n-p;1-\frac{\alpha}{2}};\ \widehat{\theta_j}+\widehat{\sigma}\sqrt{s_j}\,t_{n-p;1-\frac{\alpha}{2}}\right].$$
Examples
First example
The following data set gives the average heights and weights for American women aged 30-39 (source: The World Almanac and Book of Facts, 1975).
| Height (in) | Weight (lb) |
| --- | --- |
| 58 | 115 |
| 59 | 117 |
| 60 | 120 |
| 61 | 123 |
| 62 | 126 |
| 63 | 129 |
| 64 | 132 |
| 65 | 135 |
| 66 | 139 |
| 67 | 142 |
| 68 | 146 |
| 69 | 150 |
| 70 | 154 |
| 71 | 159 |
| 72 | 164 |
We would like to see how the weight of these women depends on their height. We are therefore looking for a function η such that $Y = \eta(X) + \varepsilon$, where Y is the weight of the women and X their height. Intuitively, we can guess that if the women's proportions are constant and their density too, then the weight of the women must depend on the cube of their height. A plot of the data set confirms this supposition.
We can suppose that the errors are uncorrelated with each other and have constant variance, which means the Gauss-Markov assumptions hold. We can therefore use the least-squares estimator, i.e. we are looking for coefficients $\theta_0$, $\theta_1$ and $\theta_2$ satisfying as well as possible (in the sense of the least-squares estimator) the equation:
$$\vec{Y} = \theta_0 + \theta_1 \vec{X} + \theta_2 \vec{X}^3 + \vec{\varepsilon}$$
Geometrically, what we will be doing is an orthogonal projection of Y on the subspace generated by the variables $1$, $X$ and $X^3$. The matrix $\mathbf{X}$ is constructed simply by putting a first column of 1's (the constant term in the model), a column with the original values (the $X$ in the model) and a third column with these values cubed ($X^3$). The realization of this matrix (i.e. for the data at hand) can be written:
| 1 | $x$ | $x^3$ |
| --- | --- | --- |
| 1 | 58 | 195112 |
| 1 | 59 | 205379 |
| 1 | 60 | 216000 |
| 1 | 61 | 226981 |
| 1 | 62 | 238328 |
| 1 | 63 | 250047 |
| 1 | 64 | 262144 |
| 1 | 65 | 274625 |
| 1 | 66 | 287496 |
| 1 | 67 | 300763 |
| 1 | 68 | 314432 |
| 1 | 69 | 328509 |
| 1 | 70 | 343000 |
| 1 | 71 | 357911 |
| 1 | 72 | 373248 |
The matrix $(\mathbf{X}^T\mathbf{X})^{-1}$ (sometimes called "information matrix" or "dispersion matrix") is:
$$\begin{bmatrix} 1927.3 & -44.6 & 3.5\times10^{-3} \\ -44.6 & 1.03 & -8.1\times10^{-5} \\ 3.5\times10^{-3} & -8.1\times10^{-5} & 6.4\times10^{-9} \end{bmatrix}$$
Vector $\widehat{\theta}_{LS}$ is therefore:
$$\widehat{\theta}_{LS} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\vec{y} \approx (147,\ -1.98,\ 4.27\times10^{-4})$$
hence $\eta(X) = 147 - 1.98X + 4.27\times10^{-4}X^3$.
A plot of this function shows that it lies quite close to the data set.
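The confidence intervals are computed using:
$$\left[\widehat{\theta_j}-\widehat{\sigma}\sqrt{s_j}\,t_{n-p;1-\frac{\alpha}{2}};\ \widehat{\theta_j}+\widehat{\sigma}\sqrt{s_j}\,t_{n-p;1-\frac{\alpha}{2}}\right]$$
with $\alpha = 0.05$, $n - p = 15 - 3 = 12$, $t_{12;0.975} \approx 2.18$, $s_j$ the diagonal entries $1927.3$, $1.03$ and $6.4\times10^{-9}$ of the matrix above, and $\widehat{\sigma}$ the residual standard deviation estimated from the fit.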
Therefore, we can say that at the 95% confidence level,
$$\theta_0 \in [112.0,\ 181.2]$$
$$\theta_1 \in [-2.8,\ -1.2]$$
$$\theta_2 \in [3.6\times10^{-4},\ 4.9\times10^{-4}]$$
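The computations of this example can be retraced with the following sketch. It uses the height and weight data from the table above and the model with columns 1, X and X cubed. The Student quantile is hard-coded as an approximate table value rather than computed, and the matrix inverse is obtained through a pseudo-inverse because the columns of the design matrix have very different scales; both are choices made for this illustration.

```python
import numpy as np

# Data from the table: average heights (in) and weights (lb)
height = np.arange(58, 73, dtype=float)
weight = np.array([115, 117, 120, 123, 126, 129, 132, 135,
                   139, 142, 146, 150, 154, 159, 164], dtype=float)

# Design matrix with columns 1, X and X^3
X = np.column_stack([np.ones_like(height), height, height**3])
n, p = X.shape

# Least-squares estimate of (theta_0, theta_1, theta_2)
theta_hat, *_ = np.linalg.lstsq(X, weight, rcond=None)

# Estimate of sigma from the residuals, with n - p degrees of freedom
residual = weight - X @ theta_hat
sigma_hat = np.sqrt(residual @ residual / (n - p))

# Diagonal of (X^T X)^{-1}, computed as pinv(X) @ pinv(X).T to avoid
# inverting the badly conditioned matrix X^T X directly
X_pinv = np.linalg.pinv(X)
s = np.diag(X_pinv @ X_pinv.T)

# Approximate table value of the Student quantile t_{12; 0.975}
T_QUANTILE = 2.179

half_width = sigma_hat * np.sqrt(s) * T_QUANTILE
ci_lower = theta_hat - half_width
ci_upper = theta_hat + half_width

for j in range(p):
    print(f"theta_{j}: estimate {theta_hat[j]:.4g}, "
          f"95% interval [{ci_lower[j]:.4g}, {ci_upper[j]:.4g}]")
```

Small differences from the figures quoted above are to be expected from rounding in the printed values.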
Second example
We are given a vector of x values and another vector of y values and we are attempting to find a function f such that $f(x_i) \approx y_i$.
- let

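Let's assume that our solution is in the family of functions defined by a 3rd-degree Fourier expansion written in the form:
$$f(x) = \frac{a_0}{2} + a_1\cos(x) + b_1\sin(x) + a_2\cos(2x) + b_2\sin(2x) + a_3\cos(3x) + b_3\sin(3x)$$
where $a_i, b_i$ are real numbers. This problem can be represented in matrix notation as:
$$\begin{bmatrix} \tfrac{1}{2} & \cos(x) & \sin(x) & \cos(2x) & \sin(2x) & \cos(3x) & \sin(3x) \end{bmatrix} \begin{bmatrix} a_0 \\ a_1 \\ b_1 \\ a_2 \\ b_2 \\ a_3 \\ b_3 \end{bmatrix} = y$$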
filling this form in with our given values yields a problem in the form Xw = y

This problem can now be posed as an optimization problem to find the minimum sum of squared errors.
[Figure: 3rd degree Fourier function]
Solving this with least squares yields the coefficient vector $w$, and thus the 3rd-degree Fourier function that fits the data best is given by:
$$f(x) = 4.25\cos(x) - 6.13\cos(2x) + 2.88\cos(3x).$$
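The construction of the problem Xw = y for a 3rd-degree Fourier expansion can be sketched as follows. The data vectors of this example are not reproduced here, so the sketch uses hypothetical, noise-free data generated from the fitted function quoted above; the helper name `fourier_design_matrix` and the grid of x values are likewise assumptions.

```python
import numpy as np

def fourier_design_matrix(x):
    """One row per x value; columns 1/2, cos(x), sin(x), ..., cos(3x), sin(3x)."""
    return np.column_stack([
        0.5 * np.ones_like(x),
        np.cos(x), np.sin(x),
        np.cos(2 * x), np.sin(2 * x),
        np.cos(3 * x), np.sin(3 * x),
    ])

# Hypothetical data generated from the fitted function (not the original data set)
x = np.linspace(-3.0, 3.0, 25)
y = 4.25 * np.cos(x) - 6.13 * np.cos(2 * x) + 2.88 * np.cos(3 * x)

# Least-squares solution of X w = y, with w = (a0, a1, b1, a2, b2, a3, b3)
X = fourier_design_matrix(x)
w, *_ = np.linalg.lstsq(X, y, rcond=None)
```

Although the basis functions are non-linear in x, the model is linear in the coefficients, so the same least-squares machinery as in the first example applies.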