Kevin Ray Scott | TensorCat

TLDR. A matrix can look full rank in ambient coordinates while the useful training motion lies near a smaller curved manifold. The important object is not only rank(W), but the effective dimension of the reachable weight trajectory.

The phrase "weight space is curved" is a useful way to think about modern large models. A neural network may have an enormous parameter vector W in an ambient space R^D, but the meaningful region explored by training can be much smaller than D. This is not the same as saying every matrix is low rank. It is saying that useful updates may lie near a structured manifold embedded inside the full parameter space.

Recent architecture discussions often use language such as curved weight spaces, intrinsic dimension, low-rank structure, and manifold-constrained update paths. In that spirit, a DeepSeek-style question is useful: what if a model appears full-rank in raw coordinates, but its useful training and adaptation trajectory occupies a much smaller geometric surface? The key idea is not that a model magically becomes simple, but that constraints on residual mixing, routing, low-rank projections, or update directions can reduce the effective degrees of freedom used by the system.

One way to express the distinction is:

W in R^D, but training path W(t) approximately lies on M, with dim(M) = d << D

If we only inspect a matrix locally, it may appear full rank. But full rank is an ambient-coordinate statement. It does not necessarily tell us how many directions the optimization process actually uses. The tangent space matters:

dW/dt in T_W M

where T_W M is the tangent space of the manifold at the current parameter point. If most useful updates live in that tangent space, then the effective optimization geometry can be far lower-dimensional than the raw parameter count suggests.

This explains why low-rank adaptation, mode connectivity, sparse subnetworks, and manifold-constrained training can all feel related. They are different tools for expressing a similar intuition: overparameterized networks may contain many redundant coordinates, while the task-relevant motion can be concentrated in a smaller set of directions.

For research, the practical question is not "is the matrix full rank?" The more interesting questions are:

Which directions change the model's behavior?
Which directions are redundant under the data distribution?
Which constraints preserve expressivity while improving stability?
Can adaptation be done in a small coordinate system without losing task performance?

This is why the statement "full rank is an illusion" is attractive, but it needs precision. Full rank may be real in linear algebra. The illusion is believing that full rank alone describes the geometry of learning.

DeepSeek V4 and Curved Weight Space: Full Rank Is Not the Whole Geometry