Matrices as linear maps

A matrix is not a grid of numbers. It is a rule that turns one vector into another.

Decoding a reach

You record from three neurons in motor cortex while a monkey reaches toward a target. At one moment during the reach, the firing rates are $x = (12,\; 7,\; 31)$ spikes/s. You want to predict the hand's velocity at that moment: two numbers, horizontal and vertical.

A collaborator hands you a decoder. She says: take the firing rates, multiply them by these weights, and add up. Horizontal velocity is $0.3(12) + 0.1(7) + (-0.2)(31) = 3.6 + 0.7 - 6.2 = -1.9$. Vertical velocity is $(-0.1)(12) + 0.4(7) + 0.1(31) = -1.2 + 2.8 + 3.1 = 4.7$. So the predicted velocity is $(-1.9,\; 4.7)$ cm/s.

You could write this more compactly as:

$$\begin{bmatrix} -1.9 \\ 4.7 \end{bmatrix} = \begin{bmatrix} 0.3 & 0.1 & -0.2 \\ -0.1 & 0.4 & 0.1 \end{bmatrix} \begin{bmatrix} 12 \\ 7 \\ 31 \end{bmatrix}$$
(1)

The grid of weights in the middle is a matrix. Call it $A$. The computation you just did is matrix-vector multiplication: $y = Ax$. Input: a 3-dimensional neural state. Output: a 2-dimensional velocity prediction. The matrix is the rule connecting them.
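The same computation in code, as a minimal NumPy sketch using the decoder weights and firing rates from the example above:

```python
import numpy as np

# Decoder weights from the example: rows are readout patterns
A = np.array([[0.3, 0.1, -0.2],
              [-0.1, 0.4, 0.1]])

# Firing rates of the three neurons, in spikes/s
x = np.array([12.0, 7.0, 31.0])

# Matrix-vector product: 3-D neural state in, 2-D velocity out
y = A @ x
print(np.round(y, 1))  # approximately (-1.9, 4.7) cm/s
```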

That is the entire idea of this post. A matrix is not a table of numbers you happen to store in a rectangle. It is a map from one vector space to another. It takes in a vector and gives back a vector, and the way it does so is completely determined by those numbers.

But this framing raises a question. The computation above was a specific recipe: multiply corresponding entries and add. Why that recipe? Why not something else? To understand what a matrix is really doing, we need to look at the product $Ax$ more carefully.

Two ways to read a matrix

There are two ways to understand what happens when you multiply a matrix by a vector. Both give the same answer. But they expose different structure, and for different problems, different pictures are more useful.

The row picture. Look at the computation we just did. Each output entry was a dot product: the horizontal velocity was a dot product of the first row of $A$ with the input, and the vertical velocity was a dot product of the second row with the input.

Think about what each row is doing. The first row, $(0.3,\; 0.1,\; -0.2)$, is a pattern of weights across the three neurons. Dotting the firing rates with this pattern produces a single number: a similarity score. How much does the current population state look like this particular pattern? The second row is a different pattern, producing a different score. Each row is a detector, and the matrix computes all the detectors at once. This is exactly how a linear decoder works in a BCI [8]. Each row of the decoding matrix is a "readout direction" in neural space. The decoded output on each channel is the projection of the population state onto that direction. It is also how the first layer of a neural network works: each row of the weight matrix is a feature detector.

The column picture. Now rearrange the same calculation. Write out the columns of $A$ separately: $a_1 = (0.3,\; -0.1)$, $a_2 = (0.1,\; 0.4)$, $a_3 = (-0.2,\; 0.1)$. The product is:

$$Ax = 12 \begin{bmatrix} 0.3 \\ -0.1 \end{bmatrix} + 7 \begin{bmatrix} 0.1 \\ 0.4 \end{bmatrix} + 31 \begin{bmatrix} -0.2 \\ 0.1 \end{bmatrix}$$
(2)

The output is a linear combination of the columns of $A$, with the entries of the input vector as the weights. Each column represents the contribution of one neuron to the output. The first column is what neuron 1 "pushes" when it fires; the input entry $x_1 = 12$ tells you how hard it pushes. The output is the combined effect of all neurons pushing at once.

Let's check that both pictures give the same answer. Column picture: $12(0.3, -0.1) + 7(0.1, 0.4) + 31(-0.2, 0.1) = (3.6, -1.2) + (0.7, 2.8) + (-6.2, 3.1) = (-1.9, 4.7)$. Same as before.
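Both readings can be checked numerically; a sketch, assuming NumPy:

```python
import numpy as np

A = np.array([[0.3, 0.1, -0.2],
              [-0.1, 0.4, 0.1]])
x = np.array([12.0, 7.0, 31.0])

# Row picture: each output entry is a dot product of one row with x
y_rows = np.array([A[0] @ x, A[1] @ x])

# Column picture: a linear combination of the columns of A,
# weighted by the entries of x
y_cols = x[0] * A[:, 0] + x[1] * A[:, 1] + x[2] * A[:, 2]

# Same answer as the built-in product
assert np.allclose(y_rows, A @ x)
assert np.allclose(y_cols, A @ x)
```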

Adjust the entries of $x$ to see the linear combination of columns. The output vector is always in the span of the columns.

The column picture is the one to internalize. It says something geometric: the output of a matrix-vector product is always a linear combination of the columns. That means the output always lies in the span of the columns, no matter what input you feed in. The set of all possible outputs has a name, the column space, which we will study in a later post. For now, the point is that the columns of a matrix define the building blocks of the output, and the input vector is a recipe for combining them.

One piece of notation before we move on. The transpose of a matrix, written $A^\top$, flips rows and columns: entry $(i, j)$ of $A^\top$ is entry $(j, i)$ of $A$. For column vectors, transposing turns a column into a row. This gives a compact notation for the dot product: $u \cdot v = u^\top v$. You will see this everywhere.
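In code, the transpose and the $u^\top v$ notation look like this (a small NumPy sketch):

```python
import numpy as np

A = np.array([[0.3, 0.1, -0.2],
              [-0.1, 0.4, 0.1]])

# Transposing flips the shape: entry (i, j) of A.T is entry (j, i) of A
assert A.shape == (2, 3) and A.T.shape == (3, 2)
assert A.T[0, 1] == A[1, 0]

# Dot product as u-transpose times v (columns reshaped explicitly)
u = np.array([1.0, 2.0, 3.0]).reshape(3, 1)
v = np.array([4.0, 5.0, 6.0]).reshape(3, 1)
assert (u.T @ v).item() == 32.0  # 1*4 + 2*5 + 3*6
```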

What a matrix does to basis vectors

What does our decoder matrix do to the standard basis vectors?

Feed in $e_1 = (1, 0, 0)$, meaning only neuron 1 fires at 1 spike/s and the rest are silent. The column picture gives $A e_1 = 1 \cdot a_1 + 0 \cdot a_2 + 0 \cdot a_3 = a_1 = (0.3, -0.1)$. The output is just the first column. Similarly, $Ae_2 = a_2$ and $Ae_3 = a_3$.

So the columns of $A$ are literally where the matrix sends the standard basis vectors. The first column is where $e_1$ goes. The second is where $e_2$ goes. And so on.
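This is easy to verify numerically; a NumPy sketch:

```python
import numpy as np

A = np.array([[0.3, 0.1, -0.2],
              [-0.1, 0.4, 0.1]])

# Feeding in each standard basis vector returns the matching column
for i in range(3):
    e_i = np.zeros(3)
    e_i[i] = 1.0
    assert np.allclose(A @ e_i, A[:, i])
```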

Now think about what this means. Any input is a linear combination of basis vectors: $x = x_1 e_1 + x_2 e_2 + x_3 e_3$. If you know what happens to each basis vector, you know what happens to every vector, because the map preserves linear combinations. That is: if doubling the input doubles the output, and adding inputs adds outputs, then knowing the output on the ingredients tells you the output on any recipe.

This connects directly to the previous post. A basis is a set of reference directions. A matrix is completely determined by where it sends those reference directions. Change the basis, and the same geometric map gets described by a different matrix. The map did not change. The description did. We will make this precise with change-of-basis matrices in a later post.

Linear transformations

Not every function from vectors to vectors can be written as a matrix. Only a special class: functions where doubling the input doubles the output, and where transforming a sum gives the sum of the transforms. These are linear transformations.

Rotations are linear. Reflections are linear. Stretching along one axis is linear. Shearing (tilting a grid) is linear. Projecting onto a subspace is linear. Translating (shifting everything by a fixed amount) is not. The test is simple: does the origin stay put? A linear transformation maps the zero vector to the zero vector. A translation moves it somewhere else.

A linear transformation [2] warps the coordinate grid, but in a very constrained way. Straight lines stay straight. The origin stays fixed. Parallel lines remain parallel. Grid lines stay evenly spaced. The grid can stretch, rotate, shear, or flip, but it cannot bend.

Pick a transformation (rotation, shear, reflection, projection) and watch the grid deform. The matrix updates to reflect where the basis vectors land.

Play with the figure. Try a rotation: the grid rotates rigidly. Try a shear: the grid tilts, but lines stay straight and parallel. Try a projection: the grid collapses onto a line, and information is lost. Every one of these is a matrix, and every matrix does one of these things.

For neural data, think of it this way. A linear decoder maps population activity into a behavioral readout. A projection matrix collapses a high-dimensional population state onto a low-dimensional subspace. A rotation matrix switches coordinate systems without distorting distances. A dynamics matrix maps the population state at time $t$ to its state at time $t+1$. All of these are linear transformations. All of them are matrices.

Composition

Suppose your analysis pipeline has two steps. First, you project the neural state into a 10-dimensional latent space: $z = Bx$, where $B$ is a 10-by-100 matrix. Then, you decode hand velocity from the latent state: $y = Az$, where $A$ is a 2-by-10 matrix.

Combining both steps: $y = A(Bx)$. Is there a single matrix that goes straight from the 100-dimensional neural state to the 2-dimensional velocity, doing both steps at once?

Yes. The product $AB$ is a 2-by-100 matrix, and $(AB)x = A(Bx)$. One matrix, one multiplication, same result. This is why matrix multiplication exists. The formula for computing it is not an arbitrary rule someone invented. It is the rule you are forced to use if you want composition of maps to work. This is worth verifying. If you define the product $AB$ by requiring $(AB)x = A(Bx)$ for all $x$, and you expand both sides using the column picture, the entry-by-entry formula falls out. The formula is a consequence of the requirement, not an independent definition.
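A dimension-for-dimension sketch of the pipeline above, with random placeholder weights standing in for fitted matrices:

```python
import numpy as np

rng = np.random.default_rng(1)
B = rng.normal(size=(10, 100))  # neural state -> 10-D latent
A = rng.normal(size=(2, 10))    # latent -> 2-D velocity
x = rng.normal(size=100)        # one population state

# Two steps, or one combined matrix: same result
y_two_steps = A @ (B @ x)
AB = A @ B
assert AB.shape == (2, 100)
assert np.allclose(AB @ x, y_two_steps)
```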

The entry in row $i$, column $j$ of the product is:

$$(AB)_{ij} = \sum_{k} A_{ik}\, B_{kj}$$
(3)

A dot product between the $i$-th row of $A$ and the $j$-th column of $B$. But the entry-by-entry formula is less useful than the column-level picture: each column of $AB$ is what $A$ does to the corresponding column of $B$.

$$AB = A \begin{bmatrix} \mid & \mid & & \mid \\[-4pt] b_1 & b_2 & \cdots & b_p \\[-4pt] \mid & \mid & & \mid \end{bmatrix} = \begin{bmatrix} \mid & \mid & & \mid \\[-4pt] Ab_1 & Ab_2 & \cdots & Ab_p \\[-4pt] \mid & \mid & & \mid \end{bmatrix}$$
(4)

This makes the composition tangible. The columns of $B$ are where $B$ sends the basis vectors. The columns of $AB$ are where the composition $A$-after-$B$ sends them: first $B$ moves each basis vector, then $A$ moves the result.
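The column-level picture of the product is also checkable in a few lines (a NumPy sketch with arbitrary matrices):

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.normal(size=(2, 3))
B = rng.normal(size=(3, 4))
AB = A @ B

# Each column of AB is A applied to the corresponding column of B
for j in range(B.shape[1]):
    assert np.allclose(AB[:, j], A @ B[:, j])
```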

Two transformations in sequence. Drag the basis vectors of $A$ and $B$ to see how $AB$ updates. Swap the order to confirm $AB \neq BA$.

Try swapping the order in the figure. Rotate then shear is different from shear then rotate: $AB \neq BA$ in general. Composition is order-dependent, which is why matrix multiplication is not commutative. But it is associative: $(AB)C = A(BC)$. You can regroup a chain of transformations without changing the result. Non-commutativity matters in neural data pipelines. "Project onto a latent space, then decode" is a different operation from "decode, then project." The order of matrix multiplications determines the analysis result, and swapping steps can give qualitatively different answers.
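A concrete check of both facts, using a rotation and a shear (a sketch):

```python
import numpy as np

rotate = np.array([[0.0, -1.0],   # 90-degree rotation
                   [1.0,  0.0]])
shear = np.array([[1.0, 1.0],     # horizontal shear
                  [0.0, 1.0]])
scale = np.array([[2.0, 0.0],     # stretch x, shrink y
                  [0.0, 0.5]])

# Order matters: shear-after-rotate differs from rotate-after-shear
assert not np.allclose(shear @ rotate, rotate @ shear)

# But regrouping a chain does not change the result
assert np.allclose((shear @ rotate) @ scale, shear @ (rotate @ scale))
```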

One algebraic fact is useful often enough to note now. The transpose of a product reverses the order: $(AB)^\top = B^\top A^\top$. Think of it as taking off two layers of clothing: if you put on a shirt then a jacket, you take off the jacket first, then the shirt. We will use this when computing covariance. You will see this identity constantly. The covariance matrix $\tfrac{1}{T}X^\top X$ uses it. The normal equations $A^\top A \hat{x} = A^\top b$ use it. The SVD derivation uses it. Every time a product gets transposed in this series, the reversal rule is doing the work.
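The reversal rule is one line to verify (NumPy sketch):

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.normal(size=(2, 3))
B = rng.normal(size=(3, 4))

# Transpose of a product reverses the order
assert np.allclose((A @ B).T, B.T @ A.T)
# Note that A.T @ B.T is not even defined here: shapes (3, 2) and (4, 3)
```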

Inverses

You apply your decoder to a population state and get a velocity prediction. Can you recover the original population state from the prediction? Can you go backward?

Think about what the decoder did. It took a 3-dimensional input and produced a 2-dimensional output. Three numbers went in; two came out. Some information was lost. Different population states could produce the same velocity. There is no way to tell which one you started from.

What about a square matrix, where the input and output have the same dimension? It depends. If the columns of the matrix are linearly independent, the transformation maps distinct inputs to distinct outputs. Nothing collapses. In that case, there exists an inverse matrix $A^{-1}$ that reverses the map: $A^{-1}(Ax) = x$ for every $x$.

But if the columns are dependent, some direction gets crushed. The transformation flattens a plane to a line, or a volume to a plane. Distinct points merge into the same output. Once that happens, the information is gone. No inverse exists.
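Both cases can be seen numerically; a sketch with hand-picked matrices:

```python
import numpy as np

# Independent columns: the map can be undone
A = np.array([[2.0, 1.0],
              [1.0, 1.0]])
x = np.array([3.0, -4.0])
A_inv = np.linalg.inv(A)
assert np.allclose(A_inv @ (A @ x), x)

# Dependent columns (the second is twice the first): information is lost.
# Two distinct inputs land on the same output, so no inverse can exist.
S = np.array([[1.0, 2.0],
              [2.0, 4.0]])
assert np.allclose(S @ np.array([2.0, 0.0]), S @ np.array([0.0, 1.0]))
```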

Drag the two column vectors. The parallelogram shows the area the transformation maps the unit square to. When the columns become collinear, the area collapses to zero and no inverse exists.

There is a single number that tells you whether a square matrix is invertible: the determinant. In two dimensions, it equals the signed area of the parallelogram spanned by the two columns. Determinant zero means the parallelogram has collapsed: the columns are dependent. Determinant nonzero means the transformation can be reversed. You rarely compute $A^{-1}$ explicitly in practice. Gaussian elimination, LU decomposition, and iterative solvers are faster and more numerically stable. The conceptual importance of the inverse is knowing whether a map can be undone, not the mechanics of undoing it.
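In NumPy this looks as follows (a sketch; note the use of solve() rather than an explicit inverse):

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 1.0]])   # independent columns
S = np.array([[1.0, 2.0],
              [2.0, 4.0]])   # collinear columns

# Signed area of the parallelogram spanned by the columns
assert np.isclose(np.linalg.det(A), 1.0)  # nonzero: invertible
assert np.isclose(np.linalg.det(S), 0.0)  # collapsed: no inverse

# In practice, solve a system instead of forming the inverse explicitly
b = np.array([1.0, 2.0])
assert np.allclose(np.linalg.solve(A, b), np.linalg.inv(A) @ b)
```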

In neural data, you almost never have a square, invertible matrix. You record 100 neurons over 500 time points and want to predict a 2-dimensional behavioral variable. The decoder is 2-by-100. Not square, not invertible. But you can still ask: what is the best decoder? What set of weights minimizes the prediction error? The answer is least squares, and its formula involves a construction called the pseudoinverse. That is for the next post.

There is one type of square matrix that deserves special mention. When the columns are orthonormal, the matrix is called orthogonal, and its inverse is its transpose: $Q^{-1} = Q^\top$. Free. No computation. Orthogonal matrices preserve lengths and angles: $\|Qx\| = \|x\|$ for any $x$. They are pure rotations (or reflections). This is why PCA's change of basis preserves the distances between data points: the basis-change matrix is orthogonal.
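A rotation matrix makes this concrete (sketch; the angle is arbitrary):

```python
import numpy as np

theta = 0.7  # an arbitrary angle, in radians
Q = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

# Orthonormal columns: the inverse is just the transpose
assert np.allclose(Q.T @ Q, np.eye(2))
assert np.allclose(np.linalg.inv(Q), Q.T)

# Lengths (and hence distances) are preserved
x = np.array([3.0, 4.0])
assert np.isclose(np.linalg.norm(Q @ x), np.linalg.norm(x))
```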

What comes next

With this machinery you can describe, at least in outline, every linear method in computational neuroscience. A linear decoder is a matrix: it maps neural states to behavioral predictions. A projection onto a low-dimensional subspace is a matrix: it maps the full population state onto a few coordinates. A change of basis is a matrix: it redescribes the same state in a different coordinate system. A linear dynamics model says that $x_{t+1} = Ax_t$: the state at the next time step is the current state, transformed by a matrix. Even PCA, which we have been building toward, is a matrix multiplication: you project your data onto the principal component directions, and those directions are the columns of a matrix.

What varies from method to method is what the matrix is optimized to do. PCA finds the matrix that captures the most variance. CCA finds the matrices that produce the most correlated projections across two datasets. PSID finds the matrix that separates behaviorally relevant dynamics from irrelevant ones. But the object, a linear map from one vector space to another, is the same.

We left one question open. When the decoder maps 100-dimensional neural activity to 2-dimensional velocity, what happens to the other 98 dimensions? Which directions survive the map, and which ones get crushed? To answer that, we need the concepts of column space, null space, and rank, which tell you exactly what a matrix preserves and what it destroys. That is the next post.

References

  1. Strang, G. Introduction to Linear Algebra, 6th ed. Wellesley-Cambridge Press, 2023.
  2. 3Blue1Brown. "Essence of Linear Algebra" video series, 2016.
  3. Churchland, M. M., Cunningham, J. P., Kaufman, M. T., et al. "Neural population dynamics during reaching," Nature, vol. 487, pp. 51-56, 2012.
  4. Cunningham, J. P. and Yu, B. M. "Dimensionality reduction for large-scale neural recordings," Nature Neuroscience, vol. 17, pp. 1500-1509, 2014.
  5. Axler, S. Linear Algebra Done Right, 4th ed. Springer, 2024.
  6. Strang, G. "The fundamental theorem of linear algebra," The American Mathematical Monthly, vol. 100, no. 9, pp. 848-855, 1993.
  7. Safaie, M., Chang, J. C., Park, J., et al. "Preserved neural dynamics across animals performing similar behaviour," Nature, vol. 623, pp. 765-771, 2023.
  8. Musallam, S., Corneil, B. D., Greger, B., Scherberger, H., and Andersen, R. A. "Cognitive control signals for neural prosthetics," Science, vol. 305, no. 5681, pp. 258-262, 2004.