Fundamental statistical principles
Vectors and matrices
Vectors describe locations in multidimensional space. Each element of a vector describes a dimension in space. If a vector can take on any set of values in n-dimensional space, we say that it lies in $\mathbb{R}^n$. By convention, we'll use column vectors in our notation, and row vectors will be indicated by their transpose, a'. Matrices, in the way we use them here in statistical models, generally describe a set of m observations in n-dimensional space. Rows usually represent observations, and columns dimensions. So an [m x n] matrix may describe a dataset of m subjects measured on n independent variables. If these measures are predictors in a linear model, they constitute a design matrix, X (or sometimes Z).
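As a minimal sketch in MATLAB (the values and the name Xexample are arbitrary, just to illustrate the [m x n] layout):
Xexample = [1 0 2; 0 1 1; 3 2 0; 1 1 1];  % m = 4 observations (rows) on n = 3 variables (columns)
size(Xexample)                            % [4 3], i.e., [m n]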
A vector transposed times itself, $x'x$, is the sum of the squared elements (the element-wise product of the vector with itself, summed). This is the dot product, or scalar product, of the vector with itself in linear algebra. It is fundamental to measures of similarity, variance, and other aspects of the linear model equations. For a vector with three elements, which represents a point in three-dimensional space, this is:

$x'x = x_1^2 + x_2^2 + x_3^2$

In matrix notation, this works out to:

$x'x = \begin{bmatrix} x_1 & x_2 & x_3 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix} = x_1 x_1 + x_2 x_2 + x_3 x_3$
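In MATLAB, this equivalence is easy to verify; here is a small sketch with an arbitrary example vector x:
x = [2 1 3]';    % arbitrary example vector (a point in 3-D space)
x' * x           % vector transposed times itself
sum(x .^ 2)      % sum of squared elements: same value
dot(x, x)        % built-in dot product: same value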
Vector similarity and distance
The dot product of two vectors x and y, $x'y$, is related to both their magnitudes and the angle between them. We can think of it as a scale-dependent measure of similarity.
The cosine of the angle between the vectors is called the cosine similarity, and it provides a "scale-free" measure of similarity:

$\cos(\theta) = \frac{x'y}{\|x\| \, \|y\|}$
Cosine similarity is a commonly used measure of the similarity of two vectors in multidimensional space; one minus the cosine similarity is often used as a distance metric.
% Plot two vectors in 3-D space: x in blue, y in green
% (x and y are assumed to be 3-element column vectors defined earlier)
draw_vector(x, 'b');
draw_vector(y, [.2 1 .2]);
% define an anonymous function that calculates cosine sim
cos_sim = @(x, y) x'*y ./ (norm(x) * norm(y));
% calculate the cosine sim
cossim_xy = cos_sim(x, y)
% calculate theta, the angle (in radians)
theta_hat = acos(cossim_xy)
% convert radians to degrees. radians = (pi/180) * degrees
theta_hat * 180/pi % in degrees
Euclidean distances are calculated as the straight-line distances between points. By the Pythagorean Theorem, c^2 = a^2 + b^2, so c = sqrt(a^2 + b^2). Here, a and b are dimensions, defined by elements of x and y. So the dot product of the difference vector with itself, $(x - y)'(x - y)$, is a measure of the squared geometric distance between x and y. The Euclidean distance is its square root.
eucliddist = @(x, y) ((x - y)' * (x - y)) .^ 0.5
eucliddist =
@(x,y)((x-y)'*(x-y)).^0.5
eucliddist2 = @(x, y) sqrt(sum((x - y).^2))
eucliddist2 =
@(x,y)sqrt(sum((x-y).^2))
% The stats toolbox function pdist calculates distances between any set of
% points (vectors), where rows define the points and columns the dimensions
d = pdist([x'; y'], 'Euclidean')
% Plot two vectors in 3-D space: x in blue, y in green.
% Show the distance between their endpoints as a dashed black line.
draw_vector(x, 'b');
draw_vector(y, [.2 1 .2]);
han = plot3([x(1) y(1)]', [x(2) y(2)]', [x(3) y(3)]', '--','Color', 'k', 'LineWidth', 1);
Vector magnitude, norm, and sums of squares
The dot product of a vector with itself, $x'x$, is the sum of squared values. In geometric terms, its square root, $\sqrt{x'x} = \|x\|$, is the distance from zero. This is the vector's magnitude, also known as its 2-norm (L2-norm) or Euclidean norm. A unit vector is a vector with a magnitude of 1, often used to represent a direction without regard to magnitude. So $x'x$ is a measure of the squared geometric distance from 0, and the Euclidean distance from zero is its square root. In a linear model, the 3 dimensions here can represent observations or model predictors, depending on the equation in question.
If zero is a meaningful value, then $x'x$ takes on particular meaning. For example, if x is an [m x 1] vector of residuals, e, from a linear model fit to m observations, then $e'e$ is the sum of squared errors (SSE). So the SSE is (1) the squared norm of the error vector and (2) the squared distance of the error vector from 0 in a space defined by m independent observations. The mean squared error (MSE) is the average squared error per observation (element of e), or $\text{MSE} = e'e / m$.
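A minimal numeric sketch (the residual values below are made up for illustration):
e = [0.5 -1.2 0.3 0.8 -0.4]';   % hypothetical residual vector, m = 5
SSE = e' * e                    % sum of squared errors
norm(e) ^ 2                     % the same value: squared 2-norm of e
MSE = SSE / length(e)           % average squared error per observation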
Covariance and correlation
Often, the mean of a vector is uninteresting, and we want to understand how observations vary around the mean. This is generally true in statistics, where we are usually interested in analyzing variation among observations.
Variance is a measure of deviations around a sample mean. It is defined as the sum of squared deviations from the sample mean, divided by the degrees of freedom (df). The df is the number of independent dimensions along which scores can vary; here, that is the number of independent observations minus 1, because we have estimated the sample mean. Using m - 1 instead of m accounts for this lost degree of freedom:

$\text{var}(x) = \frac{(x - \bar{x})'(x - \bar{x})}{m - 1}$
The standard deviation is the square root of the variance. If e is a vector of errors from a model:

$\text{std}(e) = \sqrt{\frac{(e - \bar{e})'(e - \bar{e})}{m - 1}}$
Covariance is a measure of how two variables' deviations from their respective means vary together. The covariance of two m-length vectors x and y is:

$\text{cov}(x, y) = \frac{(x - \bar{x})'(y - \bar{y})}{m - 1}$
Covariance is often defined in terms of the expected values of random variables, which do not have fixed observed values but instead are described by probability distributions:

$\text{cov}(X, Y) = E[(X - E[X])(Y - E[Y])]$
% Generate instances of two normally distributed random variables a and b,
% with m = 10 observations each (the exact generating code is not shown here;
% independent randn draws are one possibility)
m = 10;
a = randn(m, 1);
b = randn(m, 1);
cv = cov(a, b) % covariance
0.7390 0.4001
0.4001 2.0194
cv2 = (a-mean(a))' * (b-mean(b)) / (length(a) - 1); % covariance
Like the dot product, the units of covariance depend on the scale of the variables. For example, we can measure reaction times in seconds or minutes, and changing the units affects the scale of the response (and thus the mean, variance, and covariance), but not the similarity in reaction times across two tasks. Correlation additionally scales each variable by its standard deviation.
For a z-scored variable, standardized to mean 0 and standard deviation 1:

$z_x = \frac{x - \bar{x}}{\text{std}(x)}$
Pearson's correlation (r) is the covariance of the standardized variables:

$r = \text{cov}(z_x, z_y) = \frac{\text{cov}(x, y)}{\text{std}(x)\,\text{std}(y)}$
cov(zscore(a), zscore(b))
1.0000 0.3275
0.3275 1.0000
r = cv(1, 2) ./ prod(diag(cv) .^ .5)
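As a check, this should match MATLAB's built-in corrcoef:
rc = corrcoef(a, b);   % 2 x 2 correlation matrix
rc(1, 2)               % same value as r above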
Orthogonality, correlation, and independence
Orthogonality refers to the concept of two vectors being perpendicular to each other in a given vector space. Mathematically, two vectors a and b are orthogonal if and only if their dot product is zero. This means that they are at right angles to one another.
Below, we'll plot two vectors in 3-D space: a in blue, b in green.
% (a and b here are assumed to be 3-element column vectors defined above)
draw_vector(a, 'b');
draw_vector(b, [.2 1 .2]);
Does this mean that they are uncorrelated?
Does it mean that they are linearly independent?
These are different concepts:
- X and Y are linearly independent iff there is no constant a such that aX - Y = 0. Unless the vectors are perfectly aligned, so that $aX = Y$ for some scalar a, they are linearly independent.
- X and Y are orthogonal iff X'Y = 0. All orthogonal vectors are linearly independent, so orthogonality is a special case of independence.
- X and Y are uncorrelated iff (X - mean(X))'(Y - mean(Y)) = 0. Vectors can be orthogonal but still correlated (see the numeric sketch after this list). If both variables are mean-zero and uncorrelated, they must be orthogonal. But if they are not mean-zero, they may be uncorrelated but nonorthogonal.
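Here is a numeric sketch of the orthogonal-but-correlated case (u and v are made-up example vectors):
u = [1 0 0]';          % arbitrary example vector with a nonzero mean
v = [0 1 1]';          % arbitrary example vector with a nonzero mean
u' * v                 % 0: u and v are orthogonal
rc = corrcoef(u, v);
rc(1, 2)               % -1: perfectly (anti)correlated after mean-centering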
Consider the vectors a and b above.
% Consider the correlation
draw_vector(a, 'b');
draw_vector(b, [.2 1 .2]);
han = draw_vector(a-mean(a), 'b'); set(han, 'LineStyle','--')
han = draw_vector(b-mean(b), [.2 1 .2]); set(han, 'LineStyle','--')
title('Mean-centered vectors (dashed) not orthogonal')
(Figure: a Venn diagram of the relationships among orthogonality, uncorrelatedness, and linear independence, reproduced from a published paper.)
Projection
The projection of vector a onto vector b is the length of vector a along the direction defined by vector b. We can think of it as the "shadow" of a in the direction defined by b. The scalar projection, $\frac{a'b}{\|b\|}$, is a magnitude, to be multiplied by the unit vector in the direction of b, $\frac{b}{\|b\|}$:

$\text{proj}_b(a) = \left(\frac{a'b}{\|b\|}\right) \frac{b}{\|b\|}$
In statistical modeling, b will be a predictor (or a set of them defining a model subspace), and a will be a data vector that we are projecting onto the model subspace.
a_onto_b = (a'*b / norm(b)) * (b ./ norm(b));
draw_vector(b, [.2 1 .2]);
han = draw_vector(a, 'b'); set(han, 'LineStyle','--')
han = draw_vector(b, [.2 1 .2]); set(han, 'LineStyle','--')
han = draw_vector(a_onto_b, [.7 .7 1]);
title('Light blue: Projection of a onto b')
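A useful property to verify here: the residual a - a_onto_b is orthogonal to b (this anticipates the orthogonal projection matrices below).
residual = a - a_onto_b;   % the part of a not explained by b
residual' * b              % ~0 (up to floating-point error): orthogonal to b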
Extension to matrices
Linear combinations
A linear combination is an expression constructed from a set of terms (variables) by multiplying each term by a scalar (positive or negative) and adding the results. In the context of matrices, it involves multiplying vectors by scalars and adding or subtracting the resulting products.
A linear combination of p variables $x_1, x_2, \ldots, x_p$ can be written as $b_1 x_1 + b_2 x_2 + \cdots + b_p x_p$. The coefficients $b_1, \ldots, b_p$ define the linear combination. The set of all possible linear combinations of p independent variables includes all points within a p-dimensional subspace. In 2-D, this would be a plane, and in 3-D a volume.
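A minimal sketch of a linear combination of matrix columns (Xlc and coefs are arbitrary example values):
Xlc = [1 0; 0 1; 1 1];     % three observations on two variables (arbitrary values)
coefs = [2 -0.5]';         % coefficients defining the linear combination
lincomb = Xlc * coefs      % same as 2*Xlc(:,1) - 0.5*Xlc(:,2)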
Matrix variance/covariance
If X has multiple columns, X'X is a square matrix with the sum of squares on the diagonal, and dot products of its columns on the off-diagonal. This is the normal matrix or Gram matrix. It is related to the variance/covariance matrix of the columns of X. As above, if columns are mean-zero, X'X is the covariance matrix * df. The df is the number of independent dimensions along which scores can vary, which is the number of independent observations minus 1 (as we have estimated the mean).
Thus, if the columns of X are mean-centered:

$\text{cov}(X) = \frac{X'X}{m - 1}$
((X-mean(X))' * (X-mean(X))) / (size(X, 1) - 1)
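As a check, this should match MATLAB's built-in cov function (assuming X is the matrix defined above):
cov(X)   % same result as the mean-centered X'X / (m - 1) above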
Independence of columns
The number of independent column vectors defines the dimensionality of the space that a matrix spans. This is called the rank of the matrix. If X is a design matrix, this is the number of independent predictors (i.e., regressors).
If the column vectors are linearly independent, so that the observations fill all n dimensions, the matrix is full rank.
Only if a matrix is full-rank is its normal matrix, X'X, invertible. This inversion is critical to solving the normal equations and estimating the regression coefficients, $\hat{\beta} = (X'X)^{-1}X'y$. If the matrix is not full rank, there is no unique solution for $\hat{\beta}$, but rather an infinite family of solutions. An intuition for this is that two perfectly redundant predictors should each explain the data (y) equally well, so it is unclear which should be assigned a higher coefficient. Consider Y, which is a copy of X with an additional column. The additional column is a perfect linear combination of the other columns, so it "lives within" the existing 4-dimensional space and does not add a new dimension.
Y = X; Y(:, 5) = X * [.5 .5 -.4 0]'; % The new column of Y is a linear combination of other variables
inv(Y' * Y) % this fails (warning: matrix is singular) because Y'*Y is rank deficient
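A quick way to confirm the rank deficiency (here X is assumed to have 4 independent columns, as described above):
rank(Y)   % 4: fewer independent dimensions than the 5 columns of Y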
Basis sets
In linear algebra, a basis set is a set of vectors that span a vector space V. This defines a coordinate system for vectors in the space. A basis set consists of p vectors that satisfy two main properties:
- The vectors are linearly independent, meaning that no vector is a linear combination of the others.
- The vectors span the space V, meaning that there is a linear combination of the basis vectors that can reach any point in the space V.
For example, $e_1 = [1\ 0\ 0]'$, $e_2 = [0\ 1\ 0]'$, $e_3 = [0\ 0\ 1]'$ is the standard basis for $\mathbb{R}^3$, 3-D geometric space. These can be collected into a matrix, $E = [e_1\ e_2\ e_3]$. If the basis vectors are orthogonal and each has norm 1, this is an orthonormal basis set, and the matrix E is an orthogonal matrix. Here, E is the identity matrix, the simplest kind of orthogonal matrix.
In the GLM, columns of the design matrix (X) are vectors that provide a basis set. These are usually not orthogonal, though they can be if the design matrix reflects an experimental design with orthogonal factors. The data vector is projected onto the basis set using an optimal linear combination, $X\hat{\beta}$, to produce fitted values.
Rotations
Rotation matrices are square matrices with real entries. More specifically, they can be characterized as orthogonal matrices with determinant 1. (The determinant can be thought of as related to the volume described by the columns of a matrix; the determinant of the identity matrix is 1.) The columns of a rotation matrix R form a basis set for the space in which the rotation occurs. Orthogonal matrices representing rotations (as opposed to reflections) have a determinant of +1; if the determinant is -1, the matrix represents a reflection. A few properties of rotation matrices:
- They are square matrices.
- They are orthogonal matrices, meaning their rows and columns are orthonormal unit vectors. Consequently, the inverse of a rotation matrix is simply its transpose ($R^{-1} = R'$).
- Their determinant is 1 (or -1 if they include a reflection).
A standard example is the matrix for a rotation about the z-axis in 3-D:

$R_z(\theta) = \begin{bmatrix} \cos\theta & -\sin\theta & 0 \\ \sin\theta & \cos\theta & 0 \\ 0 & 0 & 1 \end{bmatrix}$

If the 3 dimensions of R are the x, y, and z spatial dimensions, respectively, this matrix represents a counterclockwise rotation of θ radians around the z-axis. The matrix can be extended to other axes and more dimensions. The zeros in the third row and column indicate that the z coordinate is not mixed with x and y, and is unchanged by this rotation.
An orthogonal rotation is a linear transformation that preserves the lengths of vectors and the angles between them. The transformation is carried out through multiplication by an orthogonal matrix.
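A short sketch of these properties for a rotation about the z-axis (theta and the test vector are chosen arbitrarily):
theta = pi / 6;                               % 30-degree rotation (arbitrary choice)
R = [cos(theta) -sin(theta) 0;
     sin(theta)  cos(theta) 0;
     0           0          1];
v = [1 2 3]';                                 % arbitrary test vector
norm(R * v) - norm(v)                         % ~0: vector lengths are preserved
R' * R                                        % identity matrix: inv(R) equals R'
det(R)                                        % 1: a proper rotation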
Projections
A projection matrix P is a square matrix that implements a projection onto the subspace spanned by its columns. This generalizes the idea of a projection onto a 1-D line above. Unlike rotation matrices, projection matrices can rescale the data to which they are applied. Projection matrices:
- Do not necessarily preserve the length of vectors.
- Are idempotent, which means applying the matrix multiple times has the same effect as applying it once: $P^2 = P$.
- Can have a determinant of 0, in the case where they project onto a subspace of lower dimensionality than the original space.
A symmetric projection matrix is an orthogonal projection matrix (P), sometimes also called the influence matrix or hat matrix (H). This is also the matrix that produces the model fits in the GLM when applied to data.
An orthogonal projection matrix is a special kind of projection matrix used to project vectors orthogonally onto a subspace. In this context, "orthogonally" means that the vectors are projected onto the closest point in the subspace, resulting in the residual vector being perpendicular to the subspace.
An orthogonal projection matrix P satisfies two main properties:
- It is symmetric: $P' = P$
- It is idempotent, which means applying the matrix multiple times has the same effect as applying it once: $P^2 = P$
The formula for constructing an orthogonal projection matrix onto the column space of a given matrix X (with linearly independent columns) is:

$P = X(X'X)^{-1}X'$
We can think of this as being broken down into the outer product $XX'$ and the (inverted) inner product $(X'X)^{-1}$. If X were an orthonormal basis set, $X'X$ would be the identity matrix, and its inverse would not change the product.
% Consider orthogonal predictors
X = [1 1 1 0 0 0; 0 0 0 1 1 1]'
% generate an arbitrary data vector (the exact values are not shown here;
% a random draw is one possibility)
y = randn(6, 1);
X'*X % normal matrix; off-diagonals are 0 because the columns are orthogonal
disp('Inverse normal matrix')
inv(X'*X) % for orthogonal columns, a diagonal matrix of 1/norm(column)^2
disp('Projection matrix')
H = X * inv((X'*X)) * X' % Hat matrix, projection matrix
0.3333 0.3333 0.3333 0 0 0
0.3333 0.3333 0.3333 0 0 0
0.3333 0.3333 0.3333 0 0 0
0 0 0 0.3333 0.3333 0.3333
0 0 0 0.3333 0.3333 0.3333
0 0 0 0.3333 0.3333 0.3333
disp('Three ways of calculating the coefficients (betas)')
Three ways of calculating the coefficients (betas)
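The computation itself is not shown above; here is a sketch of three equivalent ways to obtain the coefficients, given X and y as defined above:
b1 = inv(X' * X) * X' * y   % normal equations
b2 = pinv(X) * y            % Moore-Penrose pseudoinverse
b3 = X \ y                  % backslash: least-squares solve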
disp('Because predictors are orthogonal, we can apply the 1-D projection separately and get the same thing:')
Because predictors are orthogonal, we can apply the 1-D projection separately and get the same thing:
y_onto_x1 = (y'*X(:, 1) / norm(X(:, 1))^2)
y_onto_x = (y'*X / norm(X)^2) % works here because the columns are orthogonal and have equal norms
disp('This is the same as the vector of sample means:')
This is the same as the vector of sample means:
[mean(y(X(:, 1) == 1)) mean(y(X(:, 2) == 1))]
disp('The resulting projection:')
yhat = H * y % fitted values: the projection of y onto the column space of X
Questions:
- What is the dimensionality of the space spanned by X?
- Is X an orthonormal basis set? Why/why not?
- How would you test the idempotency property?
If X is not orthonormal, $(X'X)^{-1}$ includes several "normalizing" elements:
- It includes a normalizing factor related to 1 / norm(X(:, j))^2 for each column j (and is exactly this when the columns are orthogonal).
- It rotates the axes of the space into an orthogonal space defined by the columns of X, so that the betas reflect the unique projection onto the corresponding column of X.
By using the inverse, we are rotating the space, in this case to an orthonormal basis that spans the model subspace defined by X.
Multivariate distance (Mahalanobis)
There is a close relationship between multivariate (Mahalanobis) distance and the solution for $\hat{\beta}$: both involve an inverse covariance, which we can think of as rotating the axes of the space to the axes defined by the covariance of X.
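As a sketch of the idea, the squared Mahalanobis distance weights deviations from the mean by the inverse covariance matrix (Xm below is an arbitrary example dataset; compare with the Statistics Toolbox function mahal):
Xm = randn(20, 3);              % arbitrary example data: 20 observations, 3 variables
Xc = Xm - mean(Xm);             % mean-center each column
S = cov(Xm);                    % sample covariance matrix
D2 = sum((Xc / S) .* Xc, 2);    % squared Mahalanobis distance of each observation from the mean
D2(1:5)                         % first few values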
function han = draw_vector(x, color)
% Draw a 3-element vector x as a line from the origin, with a circle marker at its endpoint
han = plot3([0 x(1)]', [0 x(2)]', [0 x(3)]', 'Color', color, 'LineWidth', 5);
han2 = plot3([x(1) x(1)]', [x(2) x(2)]', [x(3) x(3)]', 'o', 'Color', color, 'LineWidth', 8);
function draw_3d_axes(axlim)
% Create a figure and draw x, y, and z axis lines spanning [-axlim, axlim]
create_figure('geometry');
axislines = plot3(axlim * [-1 1; 0 0; 0 0]', axlim * [0 0; -1 1; 0 0]', axlim * [0 0; 0 0; -1 1]', 'k', 'LineWidth', 2);
% han = fill3(axlim * [-1 1 1 -1]', axlim * [-1 -1 1 1]', axlim * [0 0 0 0], [.8 .8 .8], 'FaceAlpha', .8);
% han = fill3(axlim * [-1 1 1 -1]', axlim * [0 0 0 0], axlim * [-1 -1 1 1]', [.8 .8 .8], 'FaceAlpha', .8);
% han = fill3(axlim * [0 0 0 0], axlim * [-1 -1 1 1]', axlim * [-1 1 1 -1]', [.8 .8 .8], 'FaceAlpha', .8);
% arrowhan = arrow(zeros(size(x)), x);
% set(arrowhan, 'EdgeColor', 'b', 'LineWidth', 6);