Subspaces, rank, and projection
Column spaces, null spaces, and what a linear map preserves and destroys.
What gets destroyed
In the previous post, we had a decoder matrix $D$: a 2-by-3 matrix mapping 3-dimensional neural activity to 2-dimensional hand velocity.
Three numbers go in. Two come out. That means something is being lost. Not just "some precision" or "some detail" — an entire dimension of information is being annihilated. There must be directions in the 3-dimensional input space that this matrix sends to zero, directions where the neural activity changes but the predicted velocity does not.
Can we find one? We need a nonzero vector $\mathbf{v}$ such that $D\mathbf{v} = \mathbf{0}$. That is two equations, one per row of $D$, in the three unknowns $v_1, v_2, v_3$. Solve the first equation for one component in terms of the others, substitute into the second, and a single free parameter remains. Fixing that free parameter to a convenient value (say 1) picks out a particular nonzero solution $\mathbf{v}$.
Let's check. Does $D\mathbf{v} = \mathbf{0}$? Multiply row by row: each row of $D$ dots with $\mathbf{v}$ to give zero. Yes. This vector is invisible to the decoder. The population can change its firing rates along this direction and the predicted velocity will not budge.
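Numerically, the null space falls out of the singular value decomposition. A minimal sketch with NumPy; the 2-by-3 matrix here is a hypothetical stand-in, not the decoder from the post:

```python
import numpy as np

# A hypothetical 2-by-3 decoder; the entries are illustrative only.
D = np.array([[1.0, 0.0, 2.0],
              [0.0, 1.0, -1.0]])

# Right singular vectors with (numerically) zero singular values
# span exactly the directions the matrix sends to zero.
U, s, Vt = np.linalg.svd(D)
tol = max(D.shape) * np.finfo(float).eps * s.max()
rank = int(np.sum(s > tol))
null_basis = Vt[rank:].T          # columns span the null space

v = null_basis[:, 0]
print(D @ v)                      # numerically zero: D is blind to v
```

For this particular matrix the null space is the line of multiples of $(-2, 1, 1)$, which you can confirm by substituting into both rows.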
This post is about what a matrix preserves and what it destroys. Every matrix carries this structure: a set of directions that survive the map and a set that get annihilated. Understanding this split is what makes dimensionality reduction, least-squares fitting, and subspace identification precise.
The column space
Start with the output side. What are all the possible outputs of the decoder $D$?
From the column picture in the previous post, we know that $D\mathbf{x}$ is always a linear combination of the columns of $D$. So the set of all possible outputs is the span of those columns. This set is the column space of $D$.
Our decoder has three columns $\mathbf{c}_1, \mathbf{c}_2, \mathbf{c}_3$, each a vector in $\mathbb{R}^2$. Can three 2-dimensional vectors be independent? No. In two dimensions, at most two vectors can be independent. The third must be a combination of the first two.
Let's check. Is $\mathbf{c}_3$ a combination of $\mathbf{c}_1$ and $\mathbf{c}_2$? We need coefficients $\alpha$ and $\beta$ with $\mathbf{c}_3 = \alpha\,\mathbf{c}_1 + \beta\,\mathbf{c}_2$: two equations, one per component, in two unknowns. Solving the first component equation for $\alpha$ and substituting into the second pins down both coefficients. So yes, $\mathbf{c}_3$ is in the span of $\mathbf{c}_1$ and $\mathbf{c}_2$.
The column space is all of : this decoder can produce any velocity vector. Two independent columns in a 2-dimensional output space is enough to fill it. The third column adds no new reachable outputs.
Now consider a different situation. Suppose every column of the decoder is a scalar multiple of one fixed vector. The column space is a single line through the origin. This decoder can only produce velocities along one direction, regardless of what the neural population does. The output is trapped on that line.
The rank of a matrix is the dimension of its column space: the number of independent columns. Our original decoder has rank 2 (two independent columns in a 2-dimensional output). The degenerate decoder has rank 1 (all columns lie along one line). Rank tells you the effective dimensionality of the map's output, not the nominal size of the matrix.

For neural data, the useful notion is "effective rank." A 500-by-100 data matrix is nearly always full rank because of noise, but most of that rank is noise-driven. The useful rank is the number of dimensions carrying substantial variance, which is what PCA eigenvalues quantify. Cunningham and Yu [4] review methods for estimating effective dimensionality in neural populations.
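The gap between nominal rank and effective rank is easy to see numerically. A sketch with synthetic data; the 10-dimensional latent structure and the 95%-variance threshold are illustrative choices, not a standard:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "neural data": 500 time points, 100 neurons, but the signal
# occupies only 10 latent dimensions; the rest is small isotropic noise.
latents = rng.standard_normal((500, 10))
mixing = rng.standard_normal((10, 100))
X = latents @ mixing + 0.05 * rng.standard_normal((500, 100))

# Nominal rank: the noise makes the matrix full rank.
print(np.linalg.matrix_rank(X))          # 100

# Effective rank: dimensions needed to capture most of the variance.
s = np.linalg.svd(X - X.mean(axis=0), compute_uv=False)
var = s**2 / np.sum(s**2)
eff_rank = int(np.searchsorted(np.cumsum(var), 0.95) + 1)
print(eff_rank)                          # close to 10, not 100
```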
The null space
We already found one vector $\mathbf{v}$ that the decoder sends to zero. Any scalar multiple of this vector also gets sent to zero, because the map is linear: if $D\mathbf{v} = \mathbf{0}$, then $D(c\mathbf{v}) = c\,D\mathbf{v} = \mathbf{0}$. The set of all inputs that the matrix kills is the null space.
For our decoder, the null space is a line in 3-dimensional space: all scalar multiples of the vector $\mathbf{v}$ we found. Neural activity can vary freely along this line without changing the decoded velocity at all. The decoder is blind to it.
This is not just a mathematical curiosity. In neuroscience, the null space of a decoder has a specific interpretation: it is the set of neural activity patterns that produce no behavioral output. Kaufman et al. [8] showed that preparatory activity in motor cortex is largely confined to the null space of the neural-to-muscle map: directions that do not drive muscles. The language of null spaces makes this finding precise. The population state changes significantly during planning, but those changes are confined to directions the downstream readout annihilates. Large neural changes, zero behavioral effect.
The dimension of the null space is the nullity. For our 2-by-3 decoder, the null space is 1-dimensional (a line). If the matrix were 2-by-100 (a realistic decoder), the null space would be 98-dimensional. Out of 100 possible directions of neural activity, only 2 would affect the output. The other 98 would be invisible.
Rank and nullity
Look at the numbers. Our decoder is 2-by-3. It has rank 2 (two independent columns) and nullity 1 (a 1-dimensional null space). Notice: 2 + 1 = 3, the number of columns.
This is not a coincidence. It is the rank–nullity theorem:

$$\operatorname{rank}(A) + \operatorname{nullity}(A) = n,$$

where $n$ is the number of columns (the dimension of the input space). Directions preserved plus directions destroyed equals the total number of input dimensions. Nothing is unaccounted for. Every input direction either survives the map or gets annihilated; none falls through the cracks.
For a 2-by-100 neural decoder: rank at most 2, so nullity at least 98. The vast majority of neural activity patterns are invisible to the decoder. This is not a deficiency of the decoder. It is arithmetic: a 2-dimensional output simply cannot distinguish among 100-dimensional inputs. At most 2 directions survive.
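The rank–nullity accounting can be checked numerically for any matrix. A quick sketch for the 2-by-100 case:

```python
import numpy as np

rng = np.random.default_rng(1)

# A generic 2-by-100 "decoder": rank at most 2, so nullity at least 98.
D = rng.standard_normal((2, 100))

s = np.linalg.svd(D, compute_uv=False)
tol = max(D.shape) * np.finfo(float).eps * s.max()
rank = int(np.sum(s > tol))
nullity = D.shape[1] - rank       # rank + nullity = number of columns

print(rank, nullity)              # 2 98
```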
For a 500-by-100 data matrix (500 time points, 100 neurons): if the rank is 10, then the nullity is 90. The population's activity is confined near a 10-dimensional subspace. The other 90 dimensions are the directions along which the data does not move (or moves only due to noise). Finding those 10 directions is what dimensionality reduction does.
The four subspaces
The column space lives in the output. The null space lives in the input. Is there a way to see both sides at once?
An $m \times n$ matrix $A$ connects two spaces: the input $\mathbb{R}^n$ and the output $\mathbb{R}^m$. Each of these spaces splits into two perpendicular pieces, giving four subspaces in all [6].
On the input side: the row space (the span of the rows of ) and the null space. These are perpendicular to each other and together they fill the entire input space. Every input vector splits uniquely into a row-space part and a null-space part.
On the output side: the column space and the left null space (vectors $\mathbf{y}$ satisfying $A^\top \mathbf{y} = \mathbf{0}$). These are also perpendicular, and together they fill the output space.
In symbols, $\operatorname{row}(A) \oplus \operatorname{null}(A) = \mathbb{R}^n$ and $\operatorname{col}(A) \oplus \operatorname{null}(A^\top) = \mathbb{R}^m$. The $\oplus$ means "direct sum": every vector in the space can be written as one piece from each subspace, the decomposition is unique, and the two pieces are perpendicular.
Now watch what the matrix does. Take an input vector and split it into its row-space part and its null-space part. The matrix maps the row-space part to the column space and sends the null-space part to zero. That is the entire story. The row space is what the matrix reads; the null space is what it ignores; the column space is what it can produce; the left null space is what it cannot.
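This split can be computed explicitly: decompose an input into its row-space and null-space components and watch that only the row-space part survives the map. A sketch with an arbitrary illustrative matrix:

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((2, 3))   # arbitrary 2-by-3 matrix, generically rank 2
x = rng.standard_normal(3)        # an arbitrary input vector

# Orthonormal bases for the row space and null space, from the SVD.
U, s, Vt = np.linalg.svd(A)
r = int(np.sum(s > 1e-10))
row_basis = Vt[:r].T              # spans the row space of A
null_basis = Vt[r:].T             # spans the null space of A

x_row = row_basis @ (row_basis.T @ x)     # row-space component
x_null = null_basis @ (null_basis.T @ x)  # null-space component

# x splits exactly, the pieces are perpendicular, and A only sees x_row.
print(np.allclose(x, x_row + x_null))     # True
print(np.allclose(x_row @ x_null, 0))     # True
print(np.allclose(A @ x, A @ x_row))      # True
```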
The rank appears everywhere in this picture. The row space and column space always have the same dimension: both equal the rank. This must be true because the matrix maps the row space onto the column space without crushing anything (the crushing happens only in the null space).

The singular value decomposition, which we develop in a later post, finds orthonormal bases for the row space and column space such that the matrix maps one to the other by pure scaling. Each scale factor is a singular value. The number of nonzero singular values equals the rank.
For PSID, the four-subspace picture shows up directly. The cross-covariance between neural activity and behavior has a row space that identifies the behaviorally relevant directions in neural space. The null space of the same matrix captures everything the brain does that behavior does not reflect. PSID's two-stage algorithm is, at heart, a procedure for finding these two perpendicular pieces.
Projection and least squares
You record from 100 neurons over 500 time points and want to predict hand velocity from the neural state. You have a system $X\mathbf{w} = \mathbf{y}$, where $X$ is the 500-by-100 data matrix (each row is a time point, each column is a neuron), $\mathbf{w}$ is the 100-dimensional weight vector you want, and $\mathbf{y}$ is the 500-dimensional vector of velocity measurements.
You have 500 equations and 100 unknowns. In general, no exact solution exists: there is no $\mathbf{w}$ that satisfies all 500 equations. The target $\mathbf{y}$ does not lie in the column space of $X$. You are asking the matrix to produce something outside its repertoire.
What do you do? You find the closest point in the column space. You project $\mathbf{y}$ onto the column space of $X$, and solve for the $\mathbf{w}$ that produces that projection.
Think about what projection means geometrically. You have a target point $\mathbf{y}$ floating somewhere in the 500-dimensional output space. The column space of $X$ is a 100-dimensional subspace within that space (assuming the columns are independent). The projection $\hat{\mathbf{y}}$ is the point in the column space that is closest to $\mathbf{y}$. The residual, the vector from the projection to $\mathbf{y}$, is perpendicular to the column space.
If the subspace is spanned by the orthonormal columns of a matrix $Q$, the projection formula is one we have seen before, from the basis post:

$$\hat{\mathbf{y}} = Q Q^\top \mathbf{y}.$$
Two steps. $Q^\top \mathbf{y}$ computes dot products, giving the coordinates of the projection in the subspace. Then multiplying by $Q$ reconstructs the projected vector in the full space. The matrix $P = QQ^\top$ is a projection matrix. Apply it twice and nothing changes: $P^2 = QQ^\top QQ^\top = QQ^\top = P$, because $Q^\top Q = I$. Once you are in the subspace, projecting again leaves you there.
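Both properties, reconstruction by $QQ^\top$ and idempotence, take only a few lines to verify. A sketch, building an orthonormal basis for a random 100-dimensional subspace of $\mathbb{R}^{500}$ via QR factorization:

```python
import numpy as np

rng = np.random.default_rng(3)

# Orthonormal basis for a random 100-dim subspace of R^500.
Q, _ = np.linalg.qr(rng.standard_normal((500, 100)))
y = rng.standard_normal(500)

P = Q @ Q.T                       # projection matrix onto the subspace
y_hat = P @ y                     # projection of y

print(np.allclose(P @ y_hat, y_hat))       # projecting twice changes nothing
print(np.allclose(Q.T @ (y - y_hat), 0))   # residual is perpendicular
```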
When the columns of $X$ are not orthonormal (they almost never are), the formula adjusts to account for the internal geometry:

$$\hat{\mathbf{w}} = (X^\top X)^{-1} X^\top \mathbf{y}.$$
This is the least-squares solution. Where does the formula come from? One geometric fact: the residual of the best approximation is perpendicular to the column space. Perpendicular to every column means $X^\top(\mathbf{y} - X\hat{\mathbf{w}}) = \mathbf{0}$. Expand: $X^\top X\,\hat{\mathbf{w}} = X^\top \mathbf{y}$. These are the normal equations. If $X^\top X$ is invertible, solve for $\hat{\mathbf{w}}$ and the formula drops out. The entire derivation was perpendicularity.
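The normal-equations solution can be checked against NumPy's least-squares routine. A sketch with synthetic data (the noise level and dimensions are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)

# Synthetic regression: 500 time points, 100 neurons, noisy 1-D velocity.
X = rng.standard_normal((500, 100))
w_true = rng.standard_normal(100)
y = X @ w_true + 0.1 * rng.standard_normal(500)

# Normal equations: solve X^T X w = X^T y.
w_normal = np.linalg.solve(X.T @ X, X.T @ y)

# NumPy's least-squares solver gives the same answer.
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(w_normal, w_lstsq))      # True

# The residual is perpendicular to every column of X.
residual = y - X @ w_normal
print(np.allclose(X.T @ residual, 0))      # True
```

In practice `np.linalg.lstsq` (or a QR-based solve) is preferred over forming $X^\top X$ explicitly, which squares the condition number.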
And now something connects. Look at the structure: $(X^\top X)^{-1}X^\top$ is the pseudoinverse of $X$. The projection of $\mathbf{y}$ onto the column space is $X(X^\top X)^{-1}X^\top \mathbf{y}$.

Compare with the orthonormal case: $QQ^\top \mathbf{y}$. When $X$ has orthonormal columns, $X^\top X = I$, so the formula collapses to $XX^\top \mathbf{y}$. Same formula, no correction needed. This is the payoff of orthonormality, again.
Two ideas from different posts turned out to be the same operation. Least-squares fitting, which seemed like an optimization problem (minimize squared error), is also projection onto the column space, which is a geometric operation (find the nearest point). The formula is the same. This is not a coincidence. The residual of the least-squares fit, $\mathbf{y} - X\hat{\mathbf{w}}$, is perpendicular to the column space. Perpendicularity is what makes the approximation closest. Geometry and optimization are saying the same thing in different languages.

The distinction between orthogonal and oblique projection matters for some neural data methods. PSID uses an oblique projection to separate behaviorally relevant and irrelevant subspaces, which need not be perpendicular. Van Overschee and De Moor [9] develop the theory of oblique projections for subspace identification, where they are essential for separating past and future information in time-series data.
What comes next
We now have a precise language for what a matrix does to a vector space. It reads the row-space component of the input, maps it to the column space, and ignores the rest. The rank tells you how many dimensions survive. The null space tells you what gets destroyed. When the target lies outside the column space, projection gives you the best approximation.
Projection chooses the nearest point in a fixed subspace. But we have not said how to choose the subspace itself. Which 10-dimensional subspace of a 100-dimensional neuron space should you project onto? The answer depends on the question. If you want to preserve variance, you want the subspace where the data spread out the most. If you want behavioral relevance, you want the subspace most predictive of a target variable.
Both questions lead to eigendecomposition. The directions of maximum variance turn out to be the eigenvectors of the covariance matrix. The directions of maximum behavioral relevance turn out to come from the eigenvectors of a cross-covariance matrix. In the next post, we will see what eigenvectors are, why they are the natural coordinates for symmetric matrices, and why every covariance matrix has a clean eigendecomposition that leads directly to PCA.
References
- Strang, G. Introduction to Linear Algebra, 6th ed. Wellesley-Cambridge Press, 2023.
- 3Blue1Brown. "Essence of Linear Algebra" video series, 2016.
- Churchland, M. M., Cunningham, J. P., Kaufman, M. T., et al. "Neural population dynamics during reaching," Nature, vol. 487, pp. 51-56, 2012.
- Cunningham, J. P. and Yu, B. M. "Dimensionality reduction for large-scale neural recordings," Nature Neuroscience, vol. 17, pp. 1500-1509, 2014.
- Axler, S. Linear Algebra Done Right, 4th ed. Springer, 2024.
- Strang, G. "The fundamental theorem of linear algebra," The American Mathematical Monthly, vol. 100, no. 9, pp. 848-855, 1993.
- Safaie, M., Chang, J. C., Park, J., et al. "Preserved neural dynamics across animals performing similar behaviour," Nature, vol. 623, pp. 765-771, 2023.
- Kaufman, M. T., Churchland, M. M., Ryu, S. I., and Shenoy, K. V. "Cortical activity in the null space: permitting preparation without movement," Nature Neuroscience, vol. 17, pp. 440-448, 2014.
- Van Overschee, P. and De Moor, B. Subspace Identification for Linear Systems. Kluwer Academic Publishers, 1996.