Matrix and Vector
Why matrices and vectors matter
Vectors give us a way to represent quantities with direction and magnitude. Matrices let us act on those vectors in controlled ways. Put the two together and you get a compact language for geometry, physics, graphics, optimization, and machine learning. A 2D point (x, y) becomes a column vector, and a matrix maps that point to a new one by a linear rule. In higher dimensions, nothing fundamental changes, only the bookkeeping.
Engineers often describe a matrix as distorting a coordinate system. That’s a helpful picture. A matrix can stretch space along some directions, rotate it, reflect it, shear it, or collapse it onto a lower-dimensional subspace. The same object that rotates a triangle on your screen also encodes the weighted sums inside a neural network layer.
This isn’t just philosophical. The properties that make matrices great for geometry also make them great for learning, because linearity gives us predictable composition and differentiation rules.
Matrices as linear operators
A matrix is a linear operator. That means it respects two rules. Add two vectors first or apply the matrix first and add later, you get the same result. Scale a vector first or apply the matrix first and scale later, you also get the same result. Formally, for any matrix A and vectors u, v, and scalar c, we have A(u + v) = Au + Av and A(cv) = c(Av).
Two immediate consequences follow.
- You can recover the action of A on any vector by knowing how A acts on a basis. For standard 2D coordinates, a basis is e₁ = (1, 0)ᵀ and e₂ = (0, 1)ᵀ. The first column of A is Ae₁, the second is Ae₂. That’s why the columns of a transformation matrix describe where the axes go.
- Composition is again a matrix. If you do B first, then A, the result is A ∘ B, represented by the product AB. The order matters. Rotate then shear is not the same as shear then rotate.
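Both consequences are easy to check numerically. The sketch below (illustrative matrices, not from the text) confirms that the columns of A are the images of the standard basis vectors, and that composition order matters.

```python
import numpy as np

# Any 2x2 matrix works for the demonstration.
A = np.array([[2.0, 1.0],
              [0.0, 3.0]])

e1 = np.array([1.0, 0.0])
e2 = np.array([0.0, 1.0])

# The columns of A are exactly A e1 and A e2.
assert np.allclose(A @ e1, A[:, 0])
assert np.allclose(A @ e2, A[:, 1])

# Composition: doing B first, then A, is the single matrix AB.
B = np.array([[0.0, -1.0],
              [1.0,  0.0]])   # 90-degree rotation
v = np.array([1.0, 2.0])
assert np.allclose(A @ (B @ v), (A @ B) @ v)

# But the order matters: AB and BA generally differ.
print(np.allclose(A @ B, B @ A))  # False for these two matrices
```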
Some geometric effects fall out of one number. The determinant tells you how areas (or volumes) scale. In 2D, if det(A) = 2, any shape’s area doubles after the transform. If det(A) is negative, there’s a handedness flip, meaning a reflection is involved.
Classic 2D transformations
Most 2D graphics is built from a small set of building blocks.
- Scaling by factors sₓ and sᵧ uses the matrix [[sₓ, 0], [0, sᵧ]]. Circles become ellipses, squares become rectangles.
- Rotation by angle θ uses [[cos θ, −sin θ], [sin θ, cos θ]]. Distances and angles are preserved. The determinant is 1.
- Shear along x with factor k uses [[1, k], [0, 1]]. Rectangles become parallelograms, areas are preserved.
- Reflection across the x-axis uses [[1, 0], [0, −1]]. The determinant is −1.
Combining them is just multiplication. Want a rotation then a scaling? Multiply the scaling matrix by the rotation matrix in that order, assuming column-vector convention (vectors on the right). If you’re coming from some graphics APIs that use row vectors, the order flips. Always check the convention in the library you’re using.
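The ordering caveat is worth a quick sanity check. A minimal sketch in the column-vector convention, comparing "rotate then scale" with "scale then rotate":

```python
import numpy as np

def rotation(theta):
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

S = np.diag([2.0, 0.5])          # scale x by 2, y by 0.5
R = rotation(np.deg2rad(30.0))

# Column-vector convention: the transform applied LAST goes on the left.
rotate_then_scale = S @ R
scale_then_rotate = R @ S

v = np.array([1.0, 0.0])
print(rotate_then_scale @ v)
print(scale_then_rotate @ v)
print(np.allclose(rotate_then_scale, scale_then_rotate))  # False
```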
Rotation matrices and their structure
The 2D rotation matrix deserves special attention. Call it R(θ).
R(θ) = [[cos θ, −sin θ], [sin θ, cos θ]]
Two facts are gold for both theory and practice.
- R(θ) is orthogonal, meaning R(θ)ᵀ R(θ) = I. Multiplying by R(θ) doesn’t change lengths or angles. The inverse is simply the transpose.
- Angles add under composition. Rotating by β, then by α, is equivalent to rotating by α + β. So R(α) R(β) = R(α + β).
That second fact will give us the standard trigonometric addition formulas with almost no trigonometry, only matrix algebra.
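Both facts are cheap to verify numerically before relying on them. A short NumPy check:

```python
import numpy as np

def rotation(theta):
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

R = rotation(0.7)

# Orthogonality: R^T R = I, so the inverse is just the transpose.
assert np.allclose(R.T @ R, np.eye(2))
assert np.allclose(np.linalg.inv(R), R.T)

# Lengths are preserved.
v = np.array([3.0, 4.0])
assert np.isclose(np.linalg.norm(R @ v), np.linalg.norm(v))

# Angles add under composition: R(a) R(b) = R(a + b).
assert np.allclose(rotation(0.3) @ rotation(0.4), rotation(0.7))
print("rotation checks passed")
```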
Proving trig addition formulas with matrices
We’ll use the property that rotations compose by adding angles. Write the product R(α)R(β) explicitly and compare it to R(α + β).
R(α)R(β) equals [[cos α, −sin α], [sin α, cos α]] · [[cos β, −sin β], [sin β, cos β]].
Multiply them. The top-left entry is cos α cos β − sin α sin β. The top-right entry is −(cos α sin β + sin α cos β). The bottom-left entry is sin α cos β + cos α sin β. The bottom-right entry is cos α cos β − sin α sin β.
But R(α + β) by definition is [[cos(α + β), −sin(α + β)], [sin(α + β), cos(α + β)]].
Matching corresponding entries gives two identities at once.
- cos(α + β) = cos α cos β − sin α sin β
- sin(α + β) = sin α cos β + cos α sin β
Engineers often memorize these from trigonometry classes. The matrix view explains why they’re true geometrically. Rotations add. The algebra rides along.
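You can spot-check the entry matching for arbitrary angles with a couple of lines:

```python
import numpy as np

# Random angles; the identities must hold for any choice.
rng = np.random.default_rng(0)
a, b = rng.uniform(0, 2 * np.pi, size=2)

# The two identities read directly off the matrix product.
assert np.isclose(np.cos(a + b), np.cos(a) * np.cos(b) - np.sin(a) * np.sin(b))
assert np.isclose(np.sin(a + b), np.sin(a) * np.cos(b) + np.cos(a) * np.sin(b))
print("addition formulas verified")
```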
If you’ve seen a minus sign on the second formula in some notes, that’s the subtraction version. For subtraction, R(α − β) = R(α)R(−β) and sin(α − β) = sin α cos β − cos α sin β. Positive for addition, negative for subtraction.
Vectors, coordinate frames, and change of basis
There’s a subtle but useful distinction. A vector is an object independent of how we describe it. A coordinate vector is the list of numbers we get when we write that object in a particular basis. A change-of-basis matrix C converts coordinates from one basis to another. If x has coordinates [x]_B in basis B and we want them in basis E, we apply C: [x]_E = C [x]_B. Suddenly, some matrices that looked complicated become diagonal in the right basis. That’s the idea behind eigenvectors and diagonalization.
Why bring this up? Because in optimization and learning, picking a good representation of your data can turn a hard problem into an easy one. Whitening inputs with PCA is exactly a change of basis to decorrelate features. The same linear algebra governs both geometric transforms and statistical preprocessing.
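As a concrete illustration (a sketch of the idea, not a full PCA implementation), whitening with the eigenvectors of the covariance matrix is literally a change of basis followed by a per-axis rescaling:

```python
import numpy as np

rng = np.random.default_rng(0)
# Correlated 2D data: the second feature is mostly a copy of the first.
X = rng.normal(size=(500, 2))
X[:, 1] = 0.9 * X[:, 0] + 0.1 * X[:, 1]
X -= X.mean(axis=0)

# Eigendecomposition of the covariance gives the new basis (columns of V).
cov = np.cov(X, rowvar=False)
eigvals, V = np.linalg.eigh(cov)

# Change of basis, then scale each new axis to unit variance.
X_white = (X @ V) / np.sqrt(eigvals)

# The whitened features are decorrelated with unit variance.
print(np.round(np.cov(X_white, rowvar=False), 2))  # ≈ 2x2 identity
```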
Neural networks through the lens of matrices
At the core of a dense neural layer sits an affine map. Given an input vector x in Rⁿ, we compute y = Wx + b, where W is an m × n weight matrix and b is a bias vector in Rᵐ. That’s a linear operator followed by a translation. We then pass y through a nonlinearity such as ReLU, GELU, or sigmoid. Stack many of these, and you get a feedforward network.
What is W doing in plain terms? It is forming m learned linear combinations of the n input features. Large-magnitude entries in a row of W say that row cares about those input directions. Small entries downweight unhelpful directions. If two input features are strongly correlated, a trained W may compress them into a single direction with a scaled sum, much like a principal component would. The network alternates between reshaping the space with matrices and carving with nonlinearities.
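In this notation, a dense layer is only a few lines of NumPy. The names and shapes below are illustrative, not taken from any particular framework:

```python
import numpy as np

def dense_layer(x, W, b):
    """Affine map followed by ReLU: m learned combinations of n inputs."""
    y = W @ x + b              # linear operator plus translation
    return np.maximum(y, 0.0)  # ReLU carves the space nonlinearly

rng = np.random.default_rng(0)
n, m = 4, 3                     # input features, output units
W = rng.normal(size=(m, n))     # each row is one learned linear combination
b = np.zeros(m)

x = rng.normal(size=n)
out = dense_layer(x, W, b)
print(out.shape)  # (3,)
```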
Backpropagation depends on this structure. The gradient of a loss L with respect to W and x runs backward through the same matrix chain with transposes. If z = Wx, then ∂L/∂x = Wᵀ ∂L/∂z and ∂L/∂W = (∂L/∂z) xᵀ. This is one reason frameworks like PyTorch and JAX represent layers as explicit affine maps under the hood. Compilers can fuse and optimize these GEMM calls using cuBLAS or MKL.
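The transpose rule is easy to verify with a finite-difference check on a toy loss. A sketch, assuming L(z) = ½‖z‖² so that ∂L/∂z = z:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))
x = rng.normal(size=4)

def loss(W, x):
    z = W @ x
    return 0.5 * np.dot(z, z)   # L = 0.5 ||Wx||^2, so dL/dz = z

z = W @ x
grad_x = W.T @ z                # dL/dx = W^T (dL/dz)
grad_W = np.outer(z, x)         # dL/dW = (dL/dz) x^T

# Central-difference check on dL/dx, one coordinate at a time.
eps = 1e-6
fd = np.array([
    (loss(W, x + eps * np.eye(4)[i]) - loss(W, x - eps * np.eye(4)[i])) / (2 * eps)
    for i in range(4)
])
print(np.allclose(grad_x, fd, atol=1e-5))  # True
```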
There’s a filtering analogy people use. A weight matrix can filter out noise or irrelevant directions by sending them to small values. That intuition holds as long as you remember it’s not a fixed filter. It’s learned jointly with all other layers and only makes sense relative to the nonlinearity that follows and the distribution of inputs.
Practical matrices in graphics and robotics
The same algebra that powers backprop draws your UI buttons. 2D and 3D graphics use homogeneous coordinates to fold translation into matrix multiplication. A 3 × 3 matrix in 2D or a 4 × 4 matrix in 3D can represent rotation, scaling, shear, and translation together.
A 2D homogeneous transform looks like this:

[[a, b, tₓ], [c, d, tᵧ], [0, 0, 1]]

Apply it to a 3-vector (x, y, 1)ᵀ to rotate, scale, and translate. In OpenGL and WebGL, you’ll push matrices for model, view, and projection stages. The GPU multiplies your vertex positions by these in a specific order. If your object is mirrored unexpectedly, check if your transform chain has a negative determinant.
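A short sketch showing translation folded into a single matrix multiply via homogeneous coordinates:

```python
import numpy as np

theta = np.deg2rad(90.0)
c, s = np.cos(theta), np.sin(theta)
tx, ty = 5.0, -2.0

# Rotation in the top-left 2x2 block, translation in the last column.
H = np.array([[c, -s, tx],
              [s,  c, ty],
              [0,  0,  1]])

p = np.array([1.0, 0.0, 1.0])   # the point (1, 0) in homogeneous form
q = H @ p
print(np.round(q[:2], 6))       # rotated to (0, 1), then shifted: [ 5. -1.]
```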
Robotics uses the same concepts with SE(3) for rigid motions and SO(3) for pure rotations. Direction cosine matrices, quaternions, and axis-angle are all cousins. Orthogonality checks like RᵀR ≈ I and det(R) ≈ 1 are diagnostics you run on a real robot as often as you do in a simulation.
Worked example in Python
Here is a short NumPy snippet that composes a rotation and a scaling, applies it to a set of points, and checks properties we expect.
```python
import numpy as np

# Column-vector convention

def R(theta):
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]], dtype=np.float64)

S = np.diag([2.0, 0.5])        # scale x by 2, y by 0.5
A = S @ R(np.deg2rad(30.0))    # rotate 30 degrees, then scale

# A should have det = det(S) * det(R) = 2 * 0.5 * 1 = 1.0
print('det(A) ≈', np.linalg.det(A))

# Apply to a square
square = np.array([[0, 0], [1, 0], [1, 1], [0, 1]]).T  # shape (2, 4)
transformed = A @ square

# Check area scaling using the parallelogram formed by two edges
area_original = 1.0
area_transformed = abs(np.linalg.det(A)) * area_original
print('area scaling ≈', area_transformed)

# Compose two rotations and verify angle addition
A1 = R(np.deg2rad(20)) @ R(np.deg2rad(15))
A2 = R(np.deg2rad(35))
print('||A1 - A2||_F ≈', np.linalg.norm(A1 - A2))
```
Even without any plotting, this prints a determinant near 1, an area scaling near 1, and a tiny Frobenius norm between A1 and A2, reflecting R(20°)R(15°) = R(35°).
Common pitfalls and sanity checks
Linear algebra gives you sharp tools. A few checks save hours of debugging.
- Multiplication order is not commutative. If your visuals or model outputs look wrong, reverse the order of two transforms and see whether the result matches your intent.
- Conventions can differ. NumPy uses row-major storage but is agnostic to column vs row vector math. PyTorch tensors follow the shapes you give them. Be explicit about whether your data is shaped as (batch, features) or (features, batch) and which side you multiply on.
- Rotations must be orthonormal. For any candidate rotation R, verify RᵀR ≈ I and det(R) ≈ 1. Numerical drift can break these over time in simulations; re-orthonormalize with SVD if needed.
- Scaling near zero collapses dimensions. Inverse transforms become unstable. In optimization, similar effects happen when your inputs have wildly different scales. Normalize features to keep conditioning manageable.
- Backprop shapes can bite you. The gradient of a loss with respect to a weight matrix has the same shape as that weight matrix. If broadcasting “helped” your code run, double-check whether you silently dropped a dimension.
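The SVD fix for drifted rotations is short in practice. A sketch: project a drifted matrix back onto the nearest orthogonal matrix in the Frobenius norm.

```python
import numpy as np

def nearest_rotation(M):
    """Project M onto the nearest rotation matrix (Frobenius norm) via SVD."""
    U, _, Vt = np.linalg.svd(M)
    R = U @ Vt
    # Force det(R) = +1 so it is a proper rotation, not a reflection.
    if np.linalg.det(R) < 0:
        U[:, -1] *= -1
        R = U @ Vt
    return R

# A rotation corrupted by small numerical drift.
theta = 0.3
R_true = np.array([[np.cos(theta), -np.sin(theta)],
                   [np.sin(theta),  np.cos(theta)]])
R_drifted = R_true + 1e-3 * np.random.default_rng(0).normal(size=(2, 2))

R_fixed = nearest_rotation(R_drifted)
print(np.allclose(R_fixed.T @ R_fixed, np.eye(2)))  # True
print(np.isclose(np.linalg.det(R_fixed), 1.0))      # True
```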
From trig to tensors
There’s a nice arc running through everything here. Rotation matrices explain trigonometric identities because adding angles means multiplying the corresponding matrices. The same notion of a matrix as an operator that remaps axes underpins graphics transforms. Neural networks use matrices to remap feature axes and combine them, then inject nonlinearity to move beyond pure geometry.
When a subject shows up naturally in so many places, it’s a sign you’ve got the right abstraction. If you can see a point or a data vector as a list of coordinates and a square array as a machine that redraws the axes, you’re set. The rest is just learning which machines to chain together, and in what order.
Extra: deriving sine and cosine subtraction
Since the composition property holds for any angles, set β to −β and repeat the same match of entries.
- cos(α − β) equals cos α cos β + sin α sin β
- sin(α − β) equals sin α cos β − cos α sin β
This comes straight from R(α − β) = R(α)R(−β), with sin(−β) = −sin β and cos(−β) = cos β. No geometric pictures needed, though they help build intuition.
Where to go next
If you want to push further, a few directions pay off quickly.
- Study eigenvectors, eigenvalues, and the spectral theorem. These tell you how a symmetric matrix acts by aligning with principal directions, which connects to PCA in machine learning.
- Learn homogeneous coordinates and projective transforms for 3D graphics. You’ll understand how cameras, perspective, and clipping are implemented in practice.
- Look at numerical conditioning and the role of SVD. This helps explain why some linear systems are hard to solve and why regularization stabilizes learning.
The tools are straightforward. NumPy and SciPy cover the basics in Python. PyTorch and JAX extend them with autodiff and GPU acceleration. On the graphics side, glm for C++ or gl-matrix for JavaScript give you well-tested routines for building and combining transform matrices.