Tensor Operations
Key Points
- A tensor is a multi-dimensional array that generalizes scalars (0-D), vectors (1-D), and matrices (2-D) to higher dimensions.
- Core tensor operations include indexing, elementwise operations, outer products, axis permutations (transposes), reshapes, and contractions (generalized dot products).
- Efficient tensor code relies on memory layout (row-major vs column-major) and strides, which map multi-indices to linear offsets.
- Broadcasting lets you perform elementwise operations on different-shaped tensors by virtually expanding size-1 dimensions without copying data.
- Contraction sums over matching axes and powers many algorithms, including matrix multiplication and neural network layers.
- Time complexity usually scales with the total number of elements (the product of dimensions), so shape choices and axis orders matter for performance.
- C++ implementations benefit from contiguous storage (std::vector), precomputed strides, and cache-friendly iteration orders.
- Understanding Einstein summation and stride math helps you write correct, fast tensor code without high-level libraries.
Prerequisites
- Arrays and pointers in C/C++ — Tensors are stored contiguously and indexed via pointer arithmetic; understanding arrays and memory is essential.
- Big-O notation — Analyzing tensor operation costs requires reasoning about products of dimensions and asymptotics.
- Matrix operations — Many tensor operations generalize matrix concepts like transpose and multiplication.
- Lambda functions and templates in C++ — Generic elementwise and contraction code uses templates and higher-order functions.
- Cache and memory hierarchy — Performance depends heavily on access order and locality in multi-dimensional traversals.
Detailed Explanation
01 Overview
Tensors are multi-dimensional arrays that extend familiar ideas of scalars (single numbers), vectors (1D lists), and matrices (2D grids) to more axes. A tensor’s order (also called rank) is the number of axes it has, and its shape lists the size along each axis. For example, a color image with height H, width W, and channels C is naturally a 3rd-order tensor of shape (H, W, C). In computing, we store tensors in linear memory and use strides to convert multi-indices into a single offset. This allows us to implement fast indexing, slicing, and bulk operations.

Common tensor operations fall into three broad categories. Elementwise operations (like add, multiply, ReLU) operate independently at each index and thus scale with the number of elements. Structural operations (like reshape and permute/transpose) reorder or reinterpret indices without changing values; when implemented as views they can be O(1). Algebraic operations combine or reduce values across axes: outer products expand dimensionality, while contractions (generalized dot products) reduce it by summing over matched axes. Broadcasting rules enable elementwise operations between different-shaped tensors by treating size-1 dimensions as repeatable without materializing copies.

Tensors show up across fields: in machine learning (mini-batches of data, parameters, activations), physics (coordinate transformations), graphics (volumes and animations), and scientific computing. Mastering shapes, strides, and algebraic rules is crucial for writing correct and efficient numeric code in C++.
02 Intuition & Analogies
Imagine a spreadsheet (a 2D matrix). Each cell has a row and column. Now stack many such spreadsheets—say, one for every day of the year—forming a 3D block. Each entry now needs three coordinates: day, row, and column. If you add more context—like multiple sensors—you might get a 4D block. That block is a tensor: a neatly organized collection of numbers addressable by multiple indices.

Elementwise operations are like applying a simple rule to every cell independently—color-correcting every pixel or adding a constant bias to each measurement. Broadcasting is the lazy assistant who refuses to copy the same bias for every day; instead, they remember one small row or column and pretend it repeats whenever needed. This keeps memory small and code fast.

Reshape is a new way of counting without moving boxes: if you have 12 chocolates in a 3×4 tray, you can repack them as a 2×2×3 box—same chocolates, new layout. Permuting axes is like rotating or reordering how you stack your trays—rows become columns, days become channels, etc. Outer products are like turning a shopping list (vector) and a price list (vector) into a full price table (matrix): every item with every price. Contractions are the opposite: they combine along matching categories, summing over one axis—like computing total cost by summing quantity × price over items.

Under the hood, strides are the step sizes you use to find the right chocolate in the box when counting along each axis in linear memory. With this mental model—boxes of boxes, repeated patterns (broadcasting), and careful counting (strides)—the formal math becomes much easier.
03 Formal Definition
04 When to Use
Use tensors whenever your data naturally has multiple axes and you want to preserve that structure for clarity and performance. Examples include:
- Machine learning: mini-batches of images (N, H, W, C), weight tensors for convolutions (K, R, S, C), and activations; broadcasting is used for adding biases or scaling channels.
- Scientific computing: discretized fields over space and time (T, X, Y, Z), or higher-order stiffness/compliance tensors in mechanics; contractions implement coordinate changes and physical laws.
- Computer graphics and imaging: color volumes, video (T, H, W, C), and filters; transposes and reshapes help adapt to algorithm-specific memory layouts.
- Data analytics: user × item × time tensors for recommendation systems; tensor factorizations (CP/Tucker) reveal latent structure.

Reach for elementwise operations when each entry can be processed independently, outer products to pair every element of one tensor with every element of another, and contractions when you need to reduce along matching categories (generalized dot products). Choose permutes and reshapes to adapt layout for libraries or to improve cache locality without changing values. Broadcasting is ideal to apply per-axis parameters (e.g., per-channel gain) without explicit tiling.
⚠️ Common Mistakes
- Ignoring memory layout: Mixing row-major (C/C++) and column-major (Fortran/Matlab) assumptions leads to wrong offsets and poor cache behavior. Always compute and test strides explicitly.
- Misusing broadcasting: Assuming two shapes are broadcastable when they are not, or expecting size-1 axes to duplicate values physically. Remember that broadcasting is a rule for indexing, not an automatic copy.
- Axis mismatch in contractions: Contracting axes with unequal lengths or inserting the summed index in the wrong position in the output shape. Validate shapes and be explicit about which axes are reduced.
- Confusing reshape with reorder: Reshape changes how indices map to the same linear memory; it does not move data. To reorder data values, you need a permutation (transpose) or explicit copy.
- Off-by-one and index order bugs: Building linear offsets with incorrect stride orders or 1-based indices. Adopt a consistent zero-based convention and test on small shapes.
- Hidden copies: Creating temporaries during views, slices, or broadcasting that inflate memory. Prefer views (O(1)) and in-place operations when safe.
- Numerical issues: Large outer products or deep contractions can overflow or accumulate floating-point error. Consider scaling, mixed precision, and stable reduction orders.
Key Formulas
Tensor size
N = d_1 × d_2 × ⋯ × d_n
Explanation: The total number of elements in an n-dimensional tensor equals the product of its axis lengths. This helps estimate memory and running time for elementwise operations.
Row-major strides and offset
s_k = d_{k+1} × d_{k+2} × ⋯ × d_n (so s_n = 1);  offset(i_1, …, i_n) = i_1·s_1 + i_2·s_2 + ⋯ + i_n·s_n
Explanation: In row-major order, the stride of axis k is the product of sizes of all later axes. The linear offset for index (i_1, …, i_n) is the dot product of indices with strides.
Elementwise operation
C[i_1, …, i_n] = f(A[i_1, …, i_n], B[i_1, …, i_n])
Explanation: An elementwise operation applies a scalar function f to corresponding entries. It generalizes addition, subtraction, multiplication, and activation functions.
Outer (tensor) product
C[i_1, …, i_m, j_1, …, j_p] = A[i_1, …, i_m] × B[j_1, …, j_p]
Explanation: The outer product multiplies every element of A with every element of B and concatenates indices, increasing order by p (the order of B).
Single-axis contraction
C[I, J] = Σ_{k=1}^{r} A[I, k] × B[k, J]   (I and J are the remaining multi-indices of A and B)
Explanation: Contracting A and B along one pair of axes of length r sums over that index. Matrix multiplication is a special case with r equal to the shared inner dimension.
Einstein notation for matmul
C_{jk} = A_{ji} B_{ik}
Explanation: Repeated index i is implicitly summed. This compactly expresses contractions without writing the summation symbol explicitly.
Axis permutation (transpose)
B[i_1, …, i_n] = A[i_{π(1)}, …, i_{π(n)}] for a permutation π of {1, …, n}
Explanation: A permutation of axes reorders how indices map to entries. It changes shape but preserves the multiset of values.
Memory usage
M = N × sizeof(T) = (d_1 × ⋯ × d_n) × sizeof(T) bytes
Explanation: The memory footprint of a tensor equals its number of elements times the size of its scalar type T. This is key to capacity planning.
Broadcasted offset
offset_B(i_1, …, i_n) = Σ_k i′_k × s_k, where i′_k = 0 if B has size 1 along axis k, and i′_k = i_k otherwise
Explanation: For broadcasting, if B’s dimension is 1 along axis k, its index there is fixed at 0; otherwise it matches the output index. This defines how to read from broadcasted operands.
Frobenius norm
‖A‖_F = √( Σ_{i_1, …, i_n} A[i_1, …, i_n]² )
Explanation: The Frobenius norm is the Euclidean norm of all elements of A. It is used to measure tensor magnitude and appears in optimization objectives.
Hadamard product
(A ⊙ B)[i_1, …, i_n] = A[i_1, …, i_n] × B[i_1, …, i_n]
Explanation: Elementwise multiplication requires broadcast-compatible shapes. It is common in gating and feature-wise scaling.
Kronecker product (matrix form)
A ⊗ B = [ a_{ij} · B ]   (for A of size m×n and B of size p×q, the result is an (mp)×(nq) block matrix)
Explanation: This formula shows how the Kronecker product expands each entry a_{ij} of A into the block a_{ij}·B, producing a larger matrix.
Elementwise time complexity
T(N) = Θ(N), with N = d_1 × ⋯ × d_n
Explanation: Any operation that touches each element once is linear in the number of elements. This bounds cost for simple passes like add or ReLU.
Contraction time complexity
cost = Θ( r × (N_A / r) × (N_B / r) ), where N_A and N_B are the element counts of A and B and r is the contracted axis length
Explanation: Contracting A along axis p with B along axis q (length r) costs proportional to r times the sizes of the remaining axes. Matrix multiply is a specific instance.
Complexity Analysis
Code Examples
```cpp
#include <bits/stdc++.h>
using namespace std;

// A simple row-major Tensor with dynamic rank.
// Focus: shape/strides, linear offset, broadcasting-aware elementwise ops.
template <typename T>
class Tensor {
public:
    vector<size_t> shape;   // sizes per axis
    vector<size_t> strides; // row-major strides
    vector<T> data;         // contiguous storage

    Tensor() = default;
    explicit Tensor(const vector<size_t>& shp) : shape(shp) {
        compute_strides();
        data.assign(numel(), T{});
    }

    size_t rank() const { return shape.size(); }
    size_t numel() const {
        return accumulate(shape.begin(), shape.end(), (size_t)1, multiplies<size_t>());
    }

    // Compute row-major strides: s_k = product of later dimensions.
    void compute_strides() {
        strides.assign(rank(), 1);
        for (int k = (int)rank() - 2; k >= 0; --k) {
            strides[k] = strides[k+1] * shape[k+1];
        }
    }

    // Convert a multi-index to a linear offset.
    size_t offset(const vector<size_t>& idx) const {
        size_t off = 0;
        for (size_t k = 0; k < idx.size(); ++k) {
            off += idx[k] * strides[k];
        }
        return off;
    }

    // Element accessors using a multi-index.
    T get(const vector<size_t>& idx) const { return data[offset(idx)]; }
    void set(const vector<size_t>& idx, const T& value) { data[offset(idx)] = value; }

    // Fill with a lambda f(flat_index).
    template <class F>
    void fill_flat(F f) {
        for (size_t i = 0; i < numel(); ++i) data[i] = f(i);
    }

    // Broadcasting utilities
    static vector<size_t> broadcast_shape(vector<size_t> a, vector<size_t> b) {
        // Right-align shapes by padding the shorter one with leading 1s.
        size_t na = a.size(), nb = b.size();
        size_t n = max(na, nb);
        a.insert(a.begin(), n - na, 1);
        b.insert(b.begin(), n - nb, 1);
        vector<size_t> out(n);
        for (size_t k = 0; k < n; ++k) {
            if (a[k] == b[k] || a[k] == 1) out[k] = b[k];
            else if (b[k] == 1) out[k] = a[k];
            else throw runtime_error("Shapes not broadcastable");
        }
        return out;
    }

    // Compute effective strides for broadcasting: stride is 0 if dim == 1.
    static vector<size_t> broadcast_strides(const vector<size_t>& in_shape,
                                            const vector<size_t>& in_strides,
                                            const vector<size_t>& out_shape) {
        size_t n = out_shape.size();
        vector<size_t> a = in_shape;
        a.insert(a.begin(), n - a.size(), 1);
        vector<size_t> s = in_strides;
        s.insert(s.begin(), n - s.size(), 0);
        vector<size_t> eff(n);
        for (size_t k = 0; k < n; ++k) eff[k] = (a[k] == 1 ? 0 : s[k]);
        return eff;
    }

    // Generic broadcasted binary elementwise op: out = f(A, B)
    template <class F>
    static Tensor apply_binary(const Tensor& A, const Tensor& B, F f) {
        vector<size_t> out_shape = broadcast_shape(A.shape, B.shape);
        Tensor out(out_shape);
        // Precompute effective strides aligned to out_shape.
        vector<size_t> sA = broadcast_strides(A.shape, A.strides, out_shape);
        vector<size_t> sB = broadcast_strides(B.shape, B.strides, out_shape);
        // Iterate linearly over the output using a mixed-radix counter.
        size_t N = out.numel();
        vector<size_t> idx(out_shape.size(), 0);
        for (size_t flat = 0; flat < N; ++flat) {
            // Compute linear offsets for A and B via effective strides.
            size_t offA = 0, offB = 0;
            for (size_t k = 0; k < idx.size(); ++k) {
                offA += idx[k] * sA[k];
                offB += idx[k] * sB[k];
            }
            out.data[flat] = f(A.data[offA], B.data[offB]);
            // Increment mixed-radix counter (row-major order).
            for (int k = (int)idx.size() - 1; k >= 0; --k) {
                if (++idx[k] < out_shape[k]) break;
                idx[k] = 0;
            }
        }
        return out;
    }
};

int main() {
    // Example: A (2,3) plus B (1,3) via broadcasting along axis 0.
    Tensor<double> A({2,3});
    Tensor<double> B({1,3});

    // Fill A with 0..5 in row-major order, and B with a bias per column.
    A.fill_flat([&](size_t t){ return (double)t; });
    B.set({0,0}, 100.0);
    B.set({0,1}, 200.0);
    B.set({0,2}, 300.0);

    auto C = Tensor<double>::apply_binary(A, B, [](double x, double y){ return x + y; });
    auto D = Tensor<double>::apply_binary(C, C, [](double x, double y){ return x * y; }); // Hadamard square

    // Print results to verify broadcasting and elementwise ops.
    cout << "C shape: (2,3)\n";
    for (size_t i = 0; i < 2; ++i) {
        for (size_t j = 0; j < 3; ++j) cout << C.get({i,j}) << ' ';
        cout << '\n';
    }
    cout << "\nD = C .* C\n";
    for (size_t i = 0; i < 2; ++i) {
        for (size_t j = 0; j < 3; ++j) cout << D.get({i,j}) << ' ';
        cout << '\n';
    }
}
```
We implement a dynamic-rank, row-major tensor using std::vector for data and store strides to compute linear offsets. The apply_binary function aligns shapes from the right, computes an output shape using broadcasting rules, and then iterates over the output using a mixed-radix counter. Effective strides become zero for broadcasted axes, so multiple output positions read the same input element without copying. This supports elementwise add, multiply, and arbitrary binary functions.
```cpp
#include <bits/stdc++.h>
using namespace std;

template <typename T>
struct Tensor {
    vector<size_t> shape, strides;
    vector<T> data;
    Tensor() {}
    explicit Tensor(const vector<size_t>& shp) : shape(shp) {
        strides.assign(shape.size(), 1);
        for (int k = (int)shape.size() - 2; k >= 0; --k) strides[k] = strides[k+1] * shape[k+1];
        data.assign(numel(), T{});
    }
    size_t numel() const { return accumulate(shape.begin(), shape.end(), (size_t)1, multiplies<size_t>()); }
    size_t offset(const vector<size_t>& idx) const {
        size_t off = 0;
        for (size_t k = 0; k < idx.size(); ++k) off += idx[k] * strides[k];
        return off;
    }
    T get(const vector<size_t>& idx) const { return data[offset(idx)]; }
    void set(const vector<size_t>& idx, const T& v) { data[offset(idx)] = v; }
};

// Iterate over all indices of a shape, calling fn(index).
void for_each_index(const vector<size_t>& shape, function<void(const vector<size_t>&)> fn) {
    vector<size_t> idx(shape.size(), 0);
    size_t N = accumulate(shape.begin(), shape.end(), (size_t)1, multiplies<size_t>());
    for (size_t flat = 0; flat < N; ++flat) {
        fn(idx);
        for (int k = (int)shape.size() - 1; k >= 0; --k) {
            if (++idx[k] < shape[k]) break;
            idx[k] = 0;
        }
    }
}

// Contract A along axisA with B along axisB. Returns a new tensor.
template <typename T>
Tensor<T> contract(const Tensor<T>& A, int axisA, const Tensor<T>& B, int axisB) {
    if (A.shape[axisA] != B.shape[axisB]) throw runtime_error("Contracted axes must match");
    size_t r = A.shape[axisA];
    // Output shape = A.shape without axisA, followed by B.shape without axisB.
    vector<size_t> out_shape;
    out_shape.reserve(A.shape.size() + B.shape.size() - 2);
    for (size_t i = 0; i < A.shape.size(); ++i) if ((int)i != axisA) out_shape.push_back(A.shape[i]);
    for (size_t j = 0; j < B.shape.size(); ++j) if ((int)j != axisB) out_shape.push_back(B.shape[j]);
    Tensor<T> C(out_shape);

    // Iterate over all output indices and accumulate over k in [0, r).
    for_each_index(out_shape, [&](const vector<size_t>& out_idx) {
        // Split out_idx into the parts for A (without axisA) and B (without axisB).
        vector<size_t> idxA_full(A.shape.size());
        vector<size_t> idxB_full(B.shape.size());
        // Fill idxA_full from the leading part of out_idx.
        size_t p = 0;
        for (size_t i = 0; i < A.shape.size(); ++i) {
            if ((int)i == axisA) continue;
            idxA_full[i] = out_idx[p++];
        }
        // Fill idxB_full from the remaining suffix of out_idx.
        size_t q = 0;
        for (size_t j = 0; j < B.shape.size(); ++j) {
            if ((int)j == axisB) continue;
            idxB_full[j] = out_idx[p + q++];
        }
        // Accumulate over the contracted index.
        T acc = T{};
        for (size_t k = 0; k < r; ++k) {
            idxA_full[axisA] = k;
            idxB_full[axisB] = k;
            acc += A.get(idxA_full) * B.get(idxB_full);
        }
        C.data[C.offset(out_idx)] = acc;
    });
    return C;
}

int main() {
    // Example: matrix multiplication via contraction.
    Tensor<double> A({2,3}); // (2x3)
    Tensor<double> B({3,4}); // (3x4)

    // Fill A and B with small values.
    for (size_t i = 0; i < 2; i++) for (size_t k = 0; k < 3; k++) A.set({i,k}, (i+1)*(k+1));
    for (size_t k = 0; k < 3; k++) for (size_t j = 0; j < 4; j++) B.set({k,j}, (k+1)+(j));

    // Contract A axis 1 with B axis 0 => output shape (2,4).
    auto C = contract(A, 1, B, 0);

    cout << "C (2x4) = A(2x3) * B(3x4)\n";
    for (size_t i = 0; i < 2; i++) {
        for (size_t j = 0; j < 4; j++) cout << C.get({i,j}) << ' ';
        cout << '\n';
    }
}
```
The contract function reduces one axis from A and one from B by summing the product over that index. We build the output shape by removing contracted axes, iterate over all output indices, and map them back to full index tuples for A and B by inserting the summation index. Matrix multiplication is recovered when contracting A’s columns with B’s rows.
```cpp
#include <bits/stdc++.h>
using namespace std;

template <typename T>
struct Tensor {
    vector<size_t> shape, strides;
    vector<T> data;
    Tensor() {}
    explicit Tensor(const vector<size_t>& shp) : shape(shp) {
        strides.assign(shape.size(), 1);
        for (int k = (int)shape.size() - 2; k >= 0; --k) strides[k] = strides[k+1] * shape[k+1];
        data.assign(numel(), T{});
    }
    size_t numel() const { return accumulate(shape.begin(), shape.end(), (size_t)1, multiplies<size_t>()); }
    size_t offset(const vector<size_t>& idx) const {
        size_t off = 0;
        for (size_t k = 0; k < idx.size(); ++k) off += idx[k] * strides[k];
        return off;
    }
};

// Return a new tensor whose axis order is A's axes permuted by `axes`.
template <typename T>
Tensor<T> permute(const Tensor<T>& A, const vector<int>& axes) {
    if (axes.size() != A.shape.size()) throw runtime_error("axes length mismatch");
    // New shape order
    vector<size_t> new_shape(axes.size());
    for (size_t i = 0; i < axes.size(); ++i) new_shape[i] = A.shape[axes[i]];
    Tensor<T> B(new_shape);
    // Iterate over all indices of B and map each back to A.
    vector<size_t> idxB(new_shape.size(), 0);
    size_t N = B.numel();
    for (size_t flat = 0; flat < N; ++flat) {
        vector<size_t> idxA(A.shape.size());
        for (size_t i = 0; i < axes.size(); ++i) idxA[axes[i]] = idxB[i];
        B.data[flat] = A.data[A.offset(idxA)];
        for (int k = (int)idxB.size() - 1; k >= 0; --k) {
            if (++idxB[k] < new_shape[k]) break;
            idxB[k] = 0;
        }
    }
    return B;
}

// Reshape with copy (for simplicity). A view-based reshape would share data.
template <typename T>
Tensor<T> reshape_copy(const Tensor<T>& A, const vector<size_t>& new_shape) {
    size_t newN = accumulate(new_shape.begin(), new_shape.end(), (size_t)1, multiplies<size_t>());
    if (newN != A.numel()) throw runtime_error("element count mismatch in reshape");
    Tensor<T> B(new_shape);
    B.data = A.data; // copy assignment; linear order preserved
    return B;
}

int main() {
    Tensor<int> A({2,3,4});
    // Fill A with 0..23 in row-major order.
    for (int i = 0; i < 24; ++i) A.data[i] = i;

    // Permute axes from (0,1,2) to (2,0,1): shape becomes (4,2,3).
    auto P = permute(A, {2,0,1});

    // Reshape P to (4,6); the linear order of P is preserved.
    auto R = reshape_copy(P, {4,6});

    cout << "First row of R, the (4,6) reshape of P with shape (4,2,3):\n";
    for (size_t j = 0; j < 6; ++j) cout << R.data[j] << ' ';
    cout << "\nTotal elements: " << R.numel() << "\n";
}
```
Permutation reorders axes and thus changes strides. We create a new tensor with permuted shape and copy elements by mapping each output index back to its corresponding input index. The reshape_copy function changes the shape while preserving linear order; for simplicity it copies data, but a high-performance library would return a view that shares storage and updates strides in O(1).