Matrices as linear maps
A matrix is not a grid of numbers. It is a rule that turns one vector into another.
Decoding a reach
You record from three neurons in motor cortex while a monkey reaches toward a target. At one moment during the reach, the firing rates are $r_1$, $r_2$, and $r_3$ spikes/s. You want to predict the hand's velocity at that moment: two numbers, horizontal and vertical.
A collaborator hands you a decoder. She says: take the firing rates, multiply them by these weights, and add up. Horizontal velocity is $v_x = w_{11} r_1 + w_{12} r_2 + w_{13} r_3$. Vertical velocity is $v_y = w_{21} r_1 + w_{22} r_2 + w_{23} r_3$. The predicted velocity is the pair $(v_x, v_y)$, in cm/s.
You could write this more compactly as:

$$\begin{pmatrix} v_x \\ v_y \end{pmatrix} = \begin{pmatrix} w_{11} & w_{12} & w_{13} \\ w_{21} & w_{22} & w_{23} \end{pmatrix} \begin{pmatrix} r_1 \\ r_2 \\ r_3 \end{pmatrix}$$
The grid of weights in the middle is a matrix. Call it $W$. The computation you just did is matrix-vector multiplication: $v = Wr$. Input: a 3-dimensional neural state. Output: a 2-dimensional velocity prediction. The matrix is the rule connecting them.
That is the entire idea of this post. A matrix is not a table of numbers you happen to store in a rectangle. It is a map from one vector space to another. It takes in a vector and gives back a vector, and the way it does so is completely determined by those numbers.
But this framing raises a question. The computation above was a specific recipe: multiply corresponding entries and add. Why that recipe? Why not something else? To understand what a matrix is really doing, we need to look at the product more carefully.
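The recipe is easy to see in code. Here is a minimal sketch with NumPy, using an illustrative 2-by-3 weight matrix and firing rates (the numbers are hypothetical, chosen only to make the arithmetic concrete):

```python
import numpy as np

# Hypothetical decoder weights: 2 outputs (v_x, v_y) by 3 neurons.
W = np.array([[0.5, -0.2, 0.1],
              [0.3,  0.4, -0.6]])

# Firing rates of the three neurons, in spikes/s (illustrative values).
r = np.array([60.0, 40.0, 25.0])

# Matrix-vector multiplication: multiply corresponding entries and add.
v = W @ r
print(v)  # -> [24.5 19. ]  predicted (horizontal, vertical) velocity, cm/s
```

The `@` operator is exactly the "multiply by these weights and add up" recipe, applied to every output at once.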
Two ways to read a matrix
There are two ways to understand what happens when you multiply a matrix by a vector. Both give the same answer. But they expose different structure, and for different problems, different pictures are more useful.
The row picture. Look at the computation we just did. Each output entry was a dot product: the horizontal velocity was a dot product of the first row of $W$ with the input, and the vertical velocity was a dot product of the second row with the input.
Think about what each row is doing. The first row, $(w_{11}, w_{12}, w_{13})$, is a pattern of weights across the three neurons. Dotting the firing rates with this pattern produces a single number: a similarity score. How much does the current population state look like this particular pattern? The second row is a different pattern, producing a different score. Each row is a detector, and the matrix computes all the detectors at once. This is exactly how a linear decoder works in a BCI [8]. Each row of the decoding matrix is a "readout direction" in neural space. The decoded output on each channel is the projection of the population state onto that direction. It is also how the first layer of a neural network works: each row of the weight matrix is a feature detector.
The column picture. Now rearrange the same calculation. Write out the columns of $W$ separately: $w_1$, $w_2$, $w_3$. The product is:

$$Wr = r_1 w_1 + r_2 w_2 + r_3 w_3$$
The output is a linear combination of the columns of $W$, with the entries of the input vector as the weights. Each column represents the contribution of one neuron to the output. The first column is what neuron 1 "pushes" when it fires; the input entry tells you how hard it pushes. The output is the combined effect of all neurons pushing at once.
Let's check that both pictures give the same answer. Column picture: the first entry of $r_1 w_1 + r_2 w_2 + r_3 w_3$ is $r_1 w_{11} + r_2 w_{12} + r_3 w_{13}$, which is exactly the first row of $W$ dotted with the input. Same as before.
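You can verify the equivalence of the two pictures numerically. A quick sketch, reusing the same illustrative weights and rates as before:

```python
import numpy as np

W = np.array([[0.5, -0.2, 0.1],
              [0.3,  0.4, -0.6]])  # illustrative decoder
r = np.array([60.0, 40.0, 25.0])   # illustrative firing rates

# Row picture: each output entry is a dot product with a row of W.
row_picture = np.array([W[0] @ r, W[1] @ r])

# Column picture: the output is a linear combination of the columns of W,
# weighted by the entries of the input.
col_picture = r[0] * W[:, 0] + r[1] * W[:, 1] + r[2] * W[:, 2]

assert np.allclose(row_picture, col_picture)
assert np.allclose(row_picture, W @ r)  # both agree with W @ r
```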
The column picture is the one to internalize. It says something geometric: the output of a matrix-vector product is always a linear combination of the columns. That means the output always lies in the span of the columns, no matter what input you feed in. The set of all possible outputs has a name, the column space, which we will study in a later post. For now, the point is that the columns of a matrix define the building blocks of the output, and the input vector is a recipe for combining them.
One piece of notation before we move on. The transpose of a matrix, written $W^\top$, flips rows and columns: entry $(i, j)$ of $W^\top$ is entry $(j, i)$ of $W$. For column vectors, transposing turns a column into a row. This gives a compact notation for the dot product: $u \cdot v = u^\top v$. You will see this everywhere.
What a matrix does to basis vectors
What does our decoder matrix $W$ do to the standard basis vectors?
Feed in $e_1 = (1, 0, 0)^\top$, meaning only neuron 1 fires at 1 spike/s and the rest are silent. The column picture gives $W e_1 = 1 \cdot w_1 + 0 \cdot w_2 + 0 \cdot w_3 = w_1$. The output is just the first column. Similarly, $W e_2 = w_2$ and $W e_3 = w_3$.
So the columns of $W$ are literally where the matrix sends the standard basis vectors. The first column is where $e_1$ goes. The second is where $e_2$ goes. And so on.
Now think about what this means. Any input is a linear combination of basis vectors: $r = r_1 e_1 + r_2 e_2 + r_3 e_3$. If you know what happens to each basis vector, you know what happens to every vector, because the map preserves linear combinations. That is: if doubling the input doubles the output, and adding inputs adds outputs, then knowing the output on the ingredients tells you the output on any recipe.
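This determinacy is easy to check directly: feed in each basis vector, read off a column, and rebuild the output on an arbitrary input from the columns alone. A sketch, with the same hypothetical weights:

```python
import numpy as np

W = np.array([[0.5, -0.2, 0.1],
              [0.3,  0.4, -0.6]])  # illustrative decoder

# Columns of W are exactly the images of the standard basis vectors.
for j in range(3):
    e = np.zeros(3)
    e[j] = 1.0
    assert np.allclose(W @ e, W[:, j])

# Knowing only those images determines the map on any input:
# decompose r over the basis, push each piece through, and add.
r = np.array([60.0, 40.0, 25.0])
reconstructed = sum(r[j] * W[:, j] for j in range(3))
assert np.allclose(reconstructed, W @ r)
```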
This connects directly to the previous post. A basis is a set of reference directions. A matrix is completely determined by where it sends those reference directions. Change the basis, and the same geometric map gets described by a different matrix. The map did not change. The description did. We will make this precise with change-of-basis matrices in a later post.
Linear transformations
Not every function from vectors to vectors can be written as a matrix. Only a special class: functions where doubling the input doubles the output, and where transforming a sum gives the sum of the transforms. These are linear transformations.
Rotations are linear. Reflections are linear. Stretching along one axis is linear. Shearing (tilting a grid) is linear. Projecting onto a subspace is linear. Translating (shifting everything by a fixed amount) is not. The test is simple: does the origin stay put? A linear transformation maps the zero vector to the zero vector. A translation moves it somewhere else.
A linear transformation [2] warps the coordinate grid, but in a very constrained way. Straight lines stay straight. The origin stays fixed. Parallel lines remain parallel. Grid lines stay evenly spaced. The grid can stretch, rotate, shear, or flip, but it cannot bend.
Play with the figure. Try a rotation: the grid rotates rigidly. Try a shear: the grid tilts, but lines stay straight and parallel. Try a projection: the grid collapses onto a line, and information is lost. Every one of these is a matrix, and every matrix does one of these things.
For neural data, think of it this way. A linear decoder maps population activity into a behavioral readout. A projection matrix collapses a high-dimensional population state onto a low-dimensional subspace. A rotation matrix switches coordinate systems without distorting distances. A dynamics matrix maps the population state at time $t$ to its state at time $t+1$. All of these are linear transformations. All of them are matrices.
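The two defining properties (scaling and addition pass through the map, and the origin stays put) can be tested mechanically. A sketch with three standard 2D examples, plus a translation that fails the test:

```python
import numpy as np

theta = np.pi / 4
rotation = np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta),  np.cos(theta)]])
shear = np.array([[1.0, 0.5],
                  [0.0, 1.0]])
projection = np.array([[1.0, 0.0],
                       [0.0, 0.0]])  # collapse onto the horizontal axis

x = np.array([1.0, 2.0])
y = np.array([-3.0, 0.5])
for A in (rotation, shear, projection):
    # Linearity: doubling the input doubles the output...
    assert np.allclose(A @ (2 * x), 2 * (A @ x))
    # ...and transforming a sum gives the sum of the transforms.
    assert np.allclose(A @ (x + y), A @ x + A @ y)
    # The origin stays put.
    assert np.allclose(A @ np.zeros(2), np.zeros(2))

# A translation moves the origin, so no matrix can represent it.
translate = lambda v: v + np.array([1.0, 1.0])
assert not np.allclose(translate(np.zeros(2)), np.zeros(2))
```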
Composition
Suppose your analysis pipeline has two steps. First, you project the neural state into a 10-dimensional latent space: $z = Px$, where $P$ is a 10-by-100 matrix. Then, you decode hand velocity from the latent state: $v = Dz$, where $D$ is a 2-by-10 matrix.
Combining both steps: $v = D(Px)$. Is there a single matrix that goes straight from the 100-dimensional neural state to the 2-dimensional velocity, doing both steps at once?
Yes. The product $DP$ is a 2-by-100 matrix, and $v = (DP)x$. One matrix, one multiplication, same result. This is why matrix multiplication exists. The formula for computing it is not an arbitrary rule someone invented. It is the rule you are forced to use if you want composition of maps to work. This is worth verifying. If you define the product by requiring $(AB)x = A(Bx)$ for all $x$, and you expand both sides using the column picture, the entry-by-entry formula falls out. The formula is a consequence of the requirement, not an independent definition.
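A quick numerical sketch of the two-step pipeline, with random stand-ins for the projection and decoder matrices (the names and shapes are the hypothetical ones from the example, not a real dataset):

```python
import numpy as np

rng = np.random.default_rng(0)
P = rng.standard_normal((10, 100))  # projection: neural state -> latents
D = rng.standard_normal((2, 10))    # decoder: latents -> velocity
x = rng.standard_normal(100)        # one 100-dimensional population state

two_steps = D @ (P @ x)   # project, then decode
one_step = (D @ P) @ x    # a single 2-by-100 matrix does both at once

assert np.allclose(two_steps, one_step)
assert (D @ P).shape == (2, 100)
```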
The entry in row $i$, column $j$ of the product $AB$ is:

$$(AB)_{ij} = \sum_k a_{ik} b_{kj}$$
A dot product between the $i$-th row of $A$ and the $j$-th column of $B$. But the entry-by-entry formula is less useful than the column-level picture: each column of $AB$ is what $A$ does to the corresponding column of $B$.
This makes the composition tangible. The columns of $B$ are where $B$ sends the basis vectors. The columns of $AB$ are where the composition $A$-after-$B$ sends them: first $B$ moves each basis vector, then $A$ moves the result.
Try swapping the order in the figure. Rotate then shear is different from shear then rotate: $AB \neq BA$ in general. Composition is order-dependent, which is why matrix multiplication is not commutative. But it is associative: $(AB)C = A(BC)$. You can regroup a chain of transformations without changing the result. Non-commutativity matters in neural data pipelines. "Project onto a latent space, then decode" is a different operation from "decode, then project." The order of matrix multiplications determines the analysis result, and swapping steps can give qualitatively different answers.
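Both facts are one line each to check numerically. A sketch with a 90-degree rotation and a shear:

```python
import numpy as np

rotate = np.array([[0.0, -1.0],
                   [1.0,  0.0]])  # 90-degree counterclockwise rotation
shear = np.array([[1.0, 1.0],
                  [0.0, 1.0]])

# Order matters: rotate-then-shear differs from shear-then-rotate.
assert not np.allclose(shear @ rotate, rotate @ shear)

# But grouping does not: composition is associative.
scale = np.array([[2.0, 0.0],
                  [0.0, 0.5]])
assert np.allclose((shear @ rotate) @ scale, shear @ (rotate @ scale))
```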
One algebraic fact is useful often enough to note now. The transpose of a product reverses the order: $(AB)^\top = B^\top A^\top$. Think of it as taking off two layers of clothing: if you put on a shirt then a jacket, you take off the jacket first, then the shirt. We will use this when computing covariance. You will see this identity constantly. The covariance matrix uses it. The normal equations use it. The SVD derivation uses it. Every time a product gets transposed in this series, the reversal rule is doing the work.
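The reversal rule is also a shape constraint you can see immediately: if $A$ is 2-by-10 and $B$ is 10-by-100, then $(AB)^\top$ is 100-by-2, and only $B^\top A^\top$ has compatible shapes. A sketch with random matrices:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((2, 10))
B = rng.standard_normal((10, 100))

# The transpose of a product reverses the order.
assert np.allclose((A @ B).T, B.T @ A.T)

# The shapes only work out in the reversed order:
# B.T is 100-by-10 and A.T is 10-by-2, giving 100-by-2.
assert (B.T @ A.T).shape == (100, 2)
```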
Inverses
You apply your decoder to a population state and get a velocity prediction. Can you recover the original population state from the prediction? Can you go backward?
Think about what the decoder did. It took a 3-dimensional input and produced a 2-dimensional output. Three numbers went in; two came out. Some information was lost. Different population states could produce the same velocity. There is no way to tell which one you started from.
What about a square matrix, where the input and output have the same dimension? It depends. If the columns of the matrix are linearly independent, the transformation maps distinct inputs to distinct outputs. Nothing collapses. In that case, there exists an inverse matrix $A^{-1}$ that reverses the map: $A^{-1}(Ax) = x$ for every $x$.
But if the columns are dependent, some direction gets crushed. The transformation flattens a plane to a line, or a volume to a plane. Distinct points merge into the same output. Once that happens, the information is gone. No inverse exists.
There is a single number that tells you whether a square matrix is invertible: the determinant. In two dimensions, it equals the signed area of the parallelogram spanned by the two columns. Determinant zero means the parallelogram has collapsed: the columns are dependent. Determinant nonzero means the transformation can be reversed. You rarely compute $A^{-1}$ explicitly in practice. Gaussian elimination, LU decomposition, and iterative solvers are faster and more numerically stable. The conceptual importance of the inverse is knowing whether a map can be undone, not the mechanics of undoing it.
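The determinant test is concrete in code. A sketch contrasting a matrix with independent columns against one whose second column is a multiple of the first:

```python
import numpy as np

# Independent columns: nonzero determinant, the map can be undone.
A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
assert abs(np.linalg.det(A)) > 1e-12
x = np.array([4.0, -1.0])
assert np.allclose(np.linalg.inv(A) @ (A @ x), x)  # inverse recovers x

# Dependent columns: the parallelogram collapses to a line segment,
# the determinant is zero, and distinct inputs merge into one output.
B = np.array([[1.0, 2.0],
              [2.0, 4.0]])  # second column = 2 * first column
assert np.isclose(np.linalg.det(B), 0.0)
```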
In neural data, you almost never have a square, invertible matrix. You record 100 neurons over 500 time points and want to predict a 2-dimensional behavioral variable. The decoder is 2-by-100. Not square, not invertible. But you can still ask: what is the best decoder? What set of weights minimizes the prediction error? The answer is least squares, and its formula involves a construction called the pseudoinverse. That is for the next post.
There is one type of square matrix that deserves special mention. When the columns are orthonormal, the matrix is called orthogonal, and its inverse is its transpose: $Q^{-1} = Q^\top$. Free. No computation. Orthogonal matrices preserve lengths and angles: $\|Qx\| = \|x\|$ for any $x$. They are pure rotations (or reflections). This is why PCA's change of basis preserves the distances between data points: the basis-change matrix is orthogonal.
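Both properties are a one-line check each. A sketch with a 2D rotation matrix:

```python
import numpy as np

theta = 0.7
Q = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])  # a rotation: orthonormal columns

# The inverse is the transpose: Q^T Q = I, no solve required.
assert np.allclose(Q.T @ Q, np.eye(2))

# Lengths (and hence distances between points) are preserved.
x = np.array([3.0, -4.0])
assert np.isclose(np.linalg.norm(Q @ x), np.linalg.norm(x))
```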
What comes next
With this machinery you can describe, at least in outline, every linear method in computational neuroscience. A linear decoder is a matrix: it maps neural states to behavioral predictions. A projection onto a low-dimensional subspace is a matrix: it maps the full population state onto a few coordinates. A change of basis is a matrix: it redescribes the same state in a different coordinate system. A linear dynamics model says that $x_{t+1} = A x_t$: the state at the next time step is the current state, transformed by a matrix. Even PCA, which we have been building toward, is a matrix multiplication: you project your data onto the principal component directions, and those directions are the columns of a matrix.
What varies from method to method is what the matrix is optimized to do. PCA finds the matrix that captures the most variance. CCA finds the matrices that produce the most correlated projections across two datasets. PSID finds the matrix that separates behaviorally relevant dynamics from irrelevant ones. But the object, a linear map from one vector space to another, is the same.
We left one question open. When the decoder maps 100-dimensional neural activity to 2-dimensional velocity, what happens to the other 98 dimensions? Which directions survive the map, and which ones get crushed? To answer that, we need the concepts of column space, null space, and rank, which tell you exactly what a matrix preserves and what it destroys. That is the next post.
References
- Strang, G. Introduction to Linear Algebra, 6th ed. Wellesley-Cambridge Press, 2023.
- 3Blue1Brown. "Essence of Linear Algebra" video series, 2016.
- Churchland, M. M., Cunningham, J. P., Kaufman, M. T., et al. "Neural population dynamics during reaching," Nature, vol. 487, pp. 51-56, 2012.
- Cunningham, J. P. and Yu, B. M. "Dimensionality reduction for large-scale neural recordings," Nature Neuroscience, vol. 17, pp. 1500-1509, 2014.
- Axler, S. Linear Algebra Done Right, 4th ed. Springer, 2024.
- Strang, G. "The fundamental theorem of linear algebra," The American Mathematical Monthly, vol. 100, no. 9, pp. 848-855, 1993.
- Safaie, M., Chang, J. C., Park, J., et al. "Preserved neural dynamics across animals performing similar behaviour," Nature, vol. 623, pp. 765-771, 2023.
- Musallam, S., Corneil, B. D., Greger, B., Scherberger, H., and Andersen, R. A. "Cognitive control signals for neural prosthetics," Science, vol. 305, no. 5681, pp. 258-262, 2004.