html regstats hypertext

Regression Diagnostics

The literature suggests many diagnostic statistics for evaluating multiple linear regression. The usual regression model is:

y = Xb + e

where y is an n by 1 vector of responses,

X is an n by p matrix of predictors,

b is a p by 1 vector of parameters,

e is an n by 1 vector of random disturbances.

Let X = Q*R where Q and R come from a QR Decomposition of X. Q has orthonormal columns and R is upper triangular. Both of these matrices are useful for calculating many regression diagnostics.

The least squares estimator for b is:

b = R\(Q'*y);


Q from QR Decomposition.
R from QR Decomposition.
Regression Coefficients.
Covariance of Regression Coefficients.
Fitted Values of the Response Data.
Residuals.
Mean Squared Error.
Leverage.
Hat Matrix.
Delete-1 Variance.
Delete-1 Coefficients.
Standardized Residuals.
Studentized Residuals.
Change in Regression Coefficients.
Change in Fitted Values.
Scaled Change in Fitted Values.
Change in Covariance.
Cook's Distance.

Q from QR Decomposition

MATLAB code:

[Q,R] = qr(X,0);

This is the so-called economy-size QR decomposition. Q is n by p and has orthonormal columns. That is:

Q'*Q = I (the p by p identity matrix)
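
The orthonormality property can be checked numerically. The following is a minimal sketch on hypothetical sample data (the values of x are not from the original text); it also confirms that Q*R reproduces X.

```matlab
% Hypothetical sample data: an intercept column plus one predictor.
x = (1:8)';
X = [ones(8,1) x];
[n,p] = size(X);

[Q,R] = qr(X,0);                  % economy-size decomposition

sizeQ    = size(Q);               % n by p
orthoErr = norm(Q'*Q - eye(p));   % round-off level if columns are orthonormal
reconErr = norm(Q*R - X);         % round-off level if Q*R reproduces X
```

Note that with the economy-size form Q*Q' is n by n and is not the identity in general; only Q'*Q is.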

R from QR Decomposition

MATLAB code:

[Q,R] = qr(X,0);

This is the so-called economy-size QR decomposition. R is p by p and upper triangular. This makes solving linear systems simple, because a triangular system needs only back substitution.

Regression Coefficients

MATLAB code:

b = R\(Q'*y);

If you only want the coefficients and do not need Q and R later, then:

b = X\y;

is the simplest code.
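
As an illustration, the two ways of computing the coefficients can be compared directly. This is a sketch on hypothetical sample data; the variable names bQR, bBackslash and coefDiff are introduced here only for the comparison.

```matlab
% Hypothetical sample data (not from the original text).
x = (1:8)';
y = [2.1 3.9 6.2 7.8 10.1 12.2 13.8 16.5]';
X = [ones(8,1) x];

[Q,R] = qr(X,0);
bQR        = R\(Q'*y);    % triangular solve using the stored factors
bBackslash = X\y;         % direct least squares solve

coefDiff = norm(bQR - bBackslash);   % expected to be at round-off level
```

Keeping Q and R is worthwhile when the diagnostics described in the other topics are needed, since most of them reuse these factors.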

Covariance of Regression Coefficients

MATLAB code:

% Inverse of R
ri = R\eye(p);

% inv(X'*X)
xtxi = ri*ri';

% Covariance of the coefficients
covb = xtxi*mse;

where mse is the mean squared error and p is the number of columns of X.

covb is a p by p matrix. The diagonal elements are the variances of the individual coefficients.
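
A common use of covb is to obtain standard errors for the coefficients. The sketch below assumes hypothetical sample data; the names se and tstat are introduced here for illustration and do not appear in the original text.

```matlab
% Hypothetical sample data (not from the original text).
x = (1:8)';
y = [2.1 3.9 6.2 7.8 10.1 12.2 13.8 16.5]';
X = [ones(8,1) x];
[n,p] = size(X);

[Q,R] = qr(X,0);
b   = R\(Q'*y);
r   = y - X*b;
mse = r'*r./(n-p);

ri   = R\eye(p);          % inverse of R
xtxi = ri*ri';            % inv(X'*X) without forming X'*X
covb = xtxi*mse;

se    = sqrt(diag(covb)); % standard errors of the coefficients
tstat = b./se;            % t statistics with n-p degrees of freedom

xtxiErr = norm(xtxi - inv(X'*X));   % check against the direct inverse
```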

Fitted Values of the Response Data

As in the introduction, the regression model is y = Xb + e, with X = Q*R from the QR Decomposition of X. The least squares estimator for b is:

b = R\(Q'*y);

Plugging this estimate for b into the model equation (leaving out e) we have:

yhat = X*b = X*(R\(Q'*y))

where yhat is an n by 1 vector of fitted (or predicted) values of y.
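
Because X = Q*R, the expression X*(R\(Q'*y)) simplifies to Q*(Q'*y), so the fitted values can be formed from Q alone. A minimal sketch on hypothetical sample data (yhatQ and fitDiff are names introduced here):

```matlab
% Hypothetical sample data (not from the original text).
x = (1:8)';
y = [2.1 3.9 6.2 7.8 10.1 12.2 13.8 16.5]';
X = [ones(8,1) x];

[Q,R] = qr(X,0);
b     = R\(Q'*y);

yhat  = X*b;              % fitted values from the coefficients
yhatQ = Q*(Q'*y);         % same values using only Q

fitDiff = norm(yhat - yhatQ);   % expected to be at round-off level
```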

Residuals

Let y be an n by 1 vector of responses, and

yhat be an n by 1 vector of fitted (or predicted) values of y.

Then the residuals, also n by 1, are:

r = y - yhat

In words, the residuals are the observed values minus the predicted values.

Mean Squared Error

The mean squared error is an estimator of the variance of the random disturbances, which is assumed constant.

The formula is:

mse = r'*r./(n-p)

where r is the n by 1 vector of residuals, n is the number of observations, and p is the number of unknown parameters.
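
The residuals and the mean squared error can be computed together. The sketch below uses hypothetical sample data and also checks a standard property of least squares residuals: they are orthogonal to every column of X (normalEq is a name introduced here).

```matlab
% Hypothetical sample data (not from the original text).
x = (1:8)';
y = [2.1 3.9 6.2 7.8 10.1 12.2 13.8 16.5]';
X = [ones(8,1) x];
[n,p] = size(X);

[Q,R] = qr(X,0);
b     = R\(Q'*y);
yhat  = X*b;

r   = y - yhat;           % observed minus predicted
mse = r'*r./(n-p);        % estimate of the disturbance variance

normalEq = norm(X'*r);    % round-off level for a least squares fit
```

The divisor is n-p rather than n because p degrees of freedom are used up in estimating the coefficients.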

Leverage

The leverage is a measure of the effect of a particular observation on the regression predictions due to the position of that observation in the space of the inputs.

In general, the farther a point is from the center of the input space, the more leverage it has.

The leverage vector is an n by 1 vector of the leverages of all the observations. It is the diagonal of the hat matrix.

MATLAB code:

h = diag(Q*Q');
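
For large n, forming the n by n product Q*Q' only to read its diagonal is wasteful. One alternative, sketched here on hypothetical sample data, takes the row sums of squares of Q, which gives the same vector. The names hFull and hDiff are introduced for the comparison.

```matlab
% Hypothetical sample data (not from the original text).
x = (1:8)';
X = [ones(8,1) x];
[n,p] = size(X);

[Q,R] = qr(X,0);

h     = sum(Q.*Q,2);      % leverage without forming an n by n matrix
hFull = diag(Q*Q');       % leverage as the diagonal of the hat matrix

hDiff = norm(h - hFull);  % expected to be at round-off level
hSum  = sum(h);           % the leverages sum to p
```

In this sample the end points x = 1 and x = 8 lie farthest from the center of the input space, so h(1) and h(8) exceed h(4) and h(5).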

Hat (Projection) Matrix

The hat matrix is an n by n matrix that maps the vector of observations, y, to the vector of predictions, yhat, thus putting the "hat" on y. Geometrically, it projects y onto the column space of X.

MATLAB code:

H = Q*Q';

yhat = H*y;
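
A projection matrix of this kind is symmetric and idempotent (H*H = H). The sketch below checks both properties on hypothetical sample data; symErr, idemErr and fitErr are names introduced here.

```matlab
% Hypothetical sample data (not from the original text).
x = (1:8)';
y = [2.1 3.9 6.2 7.8 10.1 12.2 13.8 16.5]';
X = [ones(8,1) x];

[Q,R] = qr(X,0);
H     = Q*Q';
yhat  = H*y;

symErr  = norm(H - H');          % symmetric
idemErr = norm(H*H - H);         % projecting twice changes nothing
fitErr  = norm(yhat - X*(X\y));  % agrees with the least squares fit
```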

Delete-1 Variance

The delete-1 variance is an n by 1 vector. Each element of the vector is the mean squared error of the regression obtained by deleting that observation.
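
The delete-1 variance does not require n separate regressions. A standard identity gives the residual sum of squares without observation i as the full sum of squares minus r(i)^2/(1-h(i)). The sketch below applies it to hypothetical sample data and compares with a brute-force refit; s2check and s2Diff are names introduced here.

```matlab
% Hypothetical sample data (not from the original text).
x = (1:8)';
y = [2.1 3.9 6.2 7.8 10.1 12.2 13.8 16.5]';
X = [ones(8,1) x];
[n,p] = size(X);

[Q,R] = qr(X,0);
r   = y - Q*(Q'*y);
h   = sum(Q.*Q,2);
mse = r'*r./(n-p);

% Closed form: one value per deleted observation.
s2i = ((n-p)*mse - r.*r./(1-h))./(n-p-1);

% Brute-force check: refit with each observation removed in turn.
s2check = zeros(n,1);
for i = 1:n
    keep = [1:i-1, i+1:n];
    res  = y(keep) - X(keep,:)*(X(keep,:)\y(keep));
    s2check(i) = (res'*res)/(n-1-p);
end
s2Diff = norm(s2i - s2check);
```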

Delete-1 Coefficients

The delete-1 coefficients are a p by n matrix. Each column of the matrix contains the coefficients of the regression obtained by deleting the corresponding observation.
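
These coefficients can also be obtained without refitting. A standard updating identity says that deleting observation i changes b by inv(X'*X)*X(i,:)'*r(i)/(1-h(i)), and inv(X'*X)*X' equals ri*Q' where ri is the inverse of R. The following is a sketch on hypothetical sample data; stde, bCheck and maxDiff are names introduced here.

```matlab
% Hypothetical sample data (not from the original text).
x = (1:8)';
y = [2.1 3.9 6.2 7.8 10.1 12.2 13.8 16.5]';
X = [ones(8,1) x];
[n,p] = size(X);

[Q,R] = qr(X,0);
b  = R\(Q'*y);
r  = y - X*b;
h  = sum(Q.*Q,2);
ri = R\eye(p);

stde = r./(1-h);                       % deleted residuals
b_i  = b(:,ones(n,1)) - ri*(Q.*stde(:,ones(p,1)))';   % p by n

% Brute-force check: refit with each observation removed in turn.
bCheck = zeros(p,n);
for i = 1:n
    keep = [1:i-1, i+1:n];
    bCheck(:,i) = X(keep,:)\y(keep);
end
maxDiff = max(abs(b_i(:) - bCheck(:)));
```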

Standardized Residuals

The standardized residuals are the residuals normalized by a measure of their standard deviation.

MATLAB code:

stanres = residuals./(sqrt(mse*(1-h)));

where mse is the mean squared error and h is the leverage vector.

See also: Studentized Residuals.

Studentized Residuals

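The studentized residuals are the residuals normalized by an estimate of their standard deviation that excludes the observation in question. They differ from the standardized residuals in that the delete-1 variance replaces the mean squared error.

MATLAB code:

studres = residuals./(sqrt(s2i.*(1-h)));

where s2i is the delete-1 variance and h is the leverage vector. Because s2i is a vector, the elementwise product .* is required.

See also: Standardized Residuals.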
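
The two kinds of residual can be computed side by side. The sketch below uses hypothetical sample data and also checks a standard identity that converts one into the other without the delete-1 variance; studFromStan and convDiff are names introduced here.

```matlab
% Hypothetical sample data (not from the original text).
x = (1:8)';
y = [2.1 3.9 6.2 7.8 10.1 12.2 13.8 16.5]';
X = [ones(8,1) x];
[n,p] = size(X);

[Q,R] = qr(X,0);
residuals = y - Q*(Q'*y);
h   = sum(Q.*Q,2);
mse = residuals'*residuals./(n-p);
s2i = ((n-p)*mse - residuals.^2./(1-h))./(n-p-1);   % delete-1 variance

stanres = residuals./sqrt(mse*(1-h));     % scaled with the overall mse
studres = residuals./sqrt(s2i.*(1-h));    % scaled with the delete-1 variance

% Identity linking the two forms.
studFromStan = stanres.*sqrt((n-p-1)./(n-p-stanres.^2));
convDiff = norm(studres - studFromStan);
```

An outlying observation inflates mse but not its own delete-1 variance, so its studentized residual is larger in magnitude than its standardized residual.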

Change in Regression Coefficients

DFBETA is a p by n matrix. Each column of DFBETA is the scaled change in the vector of coefficients caused by removing the corresponding observation.
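
The text does not state the scaling. The sketch below assumes the conventional one, in which the change in each coefficient is divided by its standard error estimated with the delete-1 variance; this choice, the hypothetical sample data, and the names c and s are assumptions made for illustration.

```matlab
% Hypothetical sample data (not from the original text).
x = (1:8)';
y = [2.1 3.9 6.2 7.8 10.1 12.2 13.8 16.5]';
X = [ones(8,1) x];
[n,p] = size(X);

[Q,R] = qr(X,0);
b   = R\(Q'*y);
r   = y - X*b;
h   = sum(Q.*Q,2);
mse = r'*r./(n-p);
ri  = R\eye(p);
xtxi = ri*ri';
s2i = ((n-p)*mse - r.*r./(1-h))./(n-p-1);             % delete-1 variance

stde = r./(1-h);
b_i  = b(:,ones(n,1)) - ri*(Q.*stde(:,ones(p,1)))';   % delete-1 coefficients

% Assumed scaling: sqrt(s2i(i) * diag(inv(X'*X))) for column i.
c = sqrt(diag(xtxi));     % p by 1
s = sqrt(s2i)';           % 1 by n
dfbeta = (b(:,ones(n,1)) - b_i)./(c*s);               % p by n
```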

Change in Fitted Values

DFFIT is an n by 1 vector. Each element is the change in the fitted value caused by deleting the corresponding observation.

MATLAB code:

dffit = h.*residuals./(1-h)

where h is the leverage vector.
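
The closed form can be compared with the definition by refitting without each observation and predicting at the deleted point. This is a sketch on hypothetical sample data; dffitCheck and dffitDiff are names introduced here.

```matlab
% Hypothetical sample data (not from the original text).
x = (1:8)';
y = [2.1 3.9 6.2 7.8 10.1 12.2 13.8 16.5]';
X = [ones(8,1) x];
n = size(X,1);

[Q,R] = qr(X,0);
yhat      = Q*(Q'*y);
residuals = y - yhat;
h         = sum(Q.*Q,2);

dffit = h.*residuals./(1-h);

% Brute-force check: fitted value with all data minus the prediction
% from the regression that omits observation i.
dffitCheck = zeros(n,1);
for i = 1:n
    keep = [1:i-1, i+1:n];
    bi   = X(keep,:)\y(keep);
    dffitCheck(i) = yhat(i) - X(i,:)*bi;
end
dffitDiff = norm(dffit - dffitCheck);
```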

Scaled Change in Fitted Values

Scaled DFFIT is an n by 1 vector. Each element is the change in the fitted value caused by deleting the corresponding observation, scaled by an estimate of the standard error of that fitted value.

MATLAB code:

sdffit = sqrt(h./(1-h)).*e_i;

where e_i is the vector of studentized residuals and h is the leverage vector.
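
The relation to the unscaled change in fitted values can be made explicit: dividing DFFIT by sqrt(s2i.*h), the standard error of the fitted value based on the delete-1 variance, gives the same result. A sketch on hypothetical sample data (sdffitCheck and sdDiff are names introduced here):

```matlab
% Hypothetical sample data (not from the original text).
x = (1:8)';
y = [2.1 3.9 6.2 7.8 10.1 12.2 13.8 16.5]';
X = [ones(8,1) x];
[n,p] = size(X);

[Q,R] = qr(X,0);
r   = y - Q*(Q'*y);
h   = sum(Q.*Q,2);
mse = r'*r./(n-p);
s2i = ((n-p)*mse - r.*r./(1-h))./(n-p-1);   % delete-1 variance
e_i = r./sqrt(s2i.*(1-h));                  % studentized residuals

sdffit = sqrt(h./(1-h)).*e_i;

% Same quantity from the unscaled change in fitted values.
dffit       = h.*r./(1-h);
sdffitCheck = dffit./sqrt(s2i.*h);
sdDiff      = norm(sdffit - sdffitCheck);
```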

Change in Covariance

DFCOV is an n by 1 vector. Each element is the ratio of the generalized variance of the estimated coefficients when the corresponding observation is deleted to the generalized variance of the coefficients using all the data. The generalized variance is the determinant of the coefficient covariance matrix.
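
No code accompanies this topic. One way to compute the ratio, sketched here on hypothetical sample data, uses the identity that deleting observation i multiplies det(X'*X) by 1-h(i), so the ratio reduces to (s2i/mse)^p/(1-h). The names covr and covrCheck are assumptions introduced for illustration.

```matlab
% Hypothetical sample data (not from the original text).
x = (1:8)';
y = [2.1 3.9 6.2 7.8 10.1 12.2 13.8 16.5]';
X = [ones(8,1) x];
[n,p] = size(X);

[Q,R] = qr(X,0);
r   = y - Q*(Q'*y);
h   = sum(Q.*Q,2);
mse = r'*r./(n-p);
s2i = ((n-p)*mse - r.*r./(1-h))./(n-p-1);   % delete-1 variance

covr = (s2i./mse).^p./(1-h);                % closed form

% Brute-force check from the definition.
covrCheck = zeros(n,1);
detAll = det(mse*inv(X'*X));
for i = 1:n
    keep = [1:i-1, i+1:n];
    Xi   = X(keep,:);
    res  = y(keep) - Xi*(Xi\y(keep));
    s2   = (res'*res)/(n-1-p);
    covrCheck(i) = det(s2*inv(Xi'*Xi))/detAll;
end
covrDiff = norm(covr - covrCheck);
```

Values well above or below 1 indicate observations whose removal noticeably changes the precision of the estimated coefficients.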

Cook's Distance

Cook's distance is an n by 1 vector. Each element is the normalized change in the vector of coefficients due to the deletion of an observation.

MATLAB code:

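d = residuals.*residuals.*(h./(1-h).^2)./(p*mse);

where d is Cook's distance, residuals is the vector of raw residuals, h is the leverage vector, mse is the mean squared error, and p is the number of unknown parameters. Equivalently, d is the squared standardized residual times h./(1-h), divided by p.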
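
Cook's distance can be checked against its definition as a normalized change in the coefficient vector, (b - b_i)'*(X'*X)*(b - b_i)/(p*mse). The sketch below uses hypothetical sample data and writes the closed form in terms of the standardized residuals; dCheck and dDiff are names introduced here.

```matlab
% Hypothetical sample data (not from the original text).
x = (1:8)';
y = [2.1 3.9 6.2 7.8 10.1 12.2 13.8 16.5]';
X = [ones(8,1) x];
[n,p] = size(X);

[Q,R] = qr(X,0);
b   = R\(Q'*y);
r   = y - X*b;
h   = sum(Q.*Q,2);
mse = r'*r./(n-p);

stanres = r./sqrt(mse*(1-h));             % standardized residuals
d = stanres.^2.*(h./(1-h))./p;            % Cook's distance

% Brute-force check from the definition.
dCheck = zeros(n,1);
for i = 1:n
    keep = [1:i-1, i+1:n];
    db   = b - X(keep,:)\y(keep);         % change in the coefficients
    dCheck(i) = db'*(X'*X)*db/(p*mse);
end
dDiff = norm(d - dCheck);
```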