Tensor Operations
Key Points
- A tensor is a multi-dimensional array that generalizes scalars (0-D), vectors (1-D), and matrices (2-D) to higher dimensions.
- Core tensor operations include indexing, elementwise operations, outer products, axis permutations (transposes), reshapes, and contractions (generalized dot products).
- Efficient tensor code relies on memory layout (row-major vs column-major) and strides, which map multi-indices to linear offsets.
- Broadcasting lets you perform elementwise operations on different-shaped tensors by virtually expanding size-1 dimensions without copying data.
- Contraction sums over matching axes and powers many algorithms, including matrix multiplication and neural network layers.
- Time complexity usually scales with the total number of elements (the product of dimensions), so shape choices and axis orders matter for performance.
- C++ implementations benefit from contiguous storage (std::vector), precomputed strides, and cache-friendly iteration orders.
- Understanding Einstein summation and stride math helps you write correct, fast tensor code without high-level libraries.
Prerequisites
- Arrays and pointers in C/C++ — Tensors are stored contiguously and indexed via pointer arithmetic; understanding arrays and memory is essential.
- Big-O notation — Analyzing tensor operation costs requires reasoning about products of dimensions and asymptotics.
- Matrix operations — Many tensor operations generalize matrix concepts like transpose and multiplication.
- Lambda functions and templates in C++ — Generic elementwise and contraction code uses templates and higher-order functions.
- Cache and memory hierarchy — Performance depends heavily on access order and locality in multi-dimensional traversals.
Detailed Explanation
01 Overview
Tensors are multi-dimensional arrays that extend familiar ideas of scalars (single numbers), vectors (1D lists), and matrices (2D grids) to more axes. A tensor’s order (also called rank) is the number of axes it has, and its shape lists the size along each axis. For example, a color image with height H, width W, and channels C is naturally a 3rd-order tensor of shape (H, W, C). In computing, we store tensors in linear memory and use strides to convert multi-indices into a single offset. This allows us to implement fast indexing, slicing, and bulk operations.

Common tensor operations fall into three broad categories. Elementwise operations (like add, multiply, ReLU) operate independently at each index and thus scale with the number of elements. Structural operations (like reshape and permute/transpose) reorder or reinterpret indices without changing values; when implemented as views they can be O(1). Algebraic operations combine or reduce values across axes: outer products expand dimensionality, while contractions (generalized dot products) reduce it by summing over matched axes. Broadcasting rules enable elementwise operations between different-shaped tensors by treating size-1 dimensions as repeatable without materializing copies.

Tensors show up across fields: in machine learning (mini-batches of data, parameters, activations), physics (coordinate transformations), graphics (volumes and animations), and scientific computing. Mastering shapes, strides, and algebraic rules is crucial for writing correct and efficient numeric code in C++.
02 Intuition & Analogies
Imagine a spreadsheet (a 2D matrix). Each cell has a row and column. Now stack many such spreadsheets—say, one for every day of the year—forming a 3D block. Each entry now needs three coordinates: day, row, and column. If you add more context—like multiple sensors—you might get a 4D block. That block is a tensor: a neatly organized collection of numbers addressable by multiple indices.

Elementwise operations are like applying a simple rule to every cell independently—color-correcting every pixel or adding a constant bias to each measurement. Broadcasting is the lazy assistant who refuses to copy the same bias for every day; instead, they remember one small row or column and pretend it repeats whenever needed. This keeps memory small and code fast.

Reshape is a new way of counting without moving boxes: if you have 12 chocolates in a 3×4 tray, you can repack them as a 2×2×3 box—same chocolates, new layout. Permuting axes is like rotating or reordering how you stack your trays—rows become columns, days become channels, etc. Outer products are like turning a shopping list (vector) and a price list (vector) into a full price table (matrix): every item with every price. Contractions are the opposite: they combine along matching categories, summing over one axis—like computing total cost by summing quantity × price over items.

Under the hood, strides are the step sizes you use to find the right chocolate in the box when counting along each axis in linear memory. With this mental model—boxes of boxes, repeated patterns (broadcasting), and careful counting (strides)—the formal math becomes much easier.
03 Formal Definition
04 When to Use
Use tensors whenever your data naturally has multiple axes and you want to preserve that structure for clarity and performance. Examples include:
- Machine learning: mini-batches of images (N, H, W, C), weight tensors for convolutions (K, R, S, C), and activations; broadcasting is used for adding biases or scaling channels.
- Scientific computing: discretized fields over space and time (T, X, Y, Z), or higher-order stiffness/compliance tensors in mechanics; contractions implement coordinate changes and physical laws.
- Computer graphics and imaging: color volumes, video (T, H, W, C), and filters; transposes and reshapes help adapt to algorithm-specific memory layouts.
- Data analytics: user × item × time tensors for recommendation systems; tensor factorizations (CP/Tucker) reveal latent structure.

Reach for elementwise operations when each entry can be processed independently, outer products to pair every element of one tensor with every element of another, and contractions when you need to reduce along matching categories (generalized dot products). Choose permutes and reshapes to adapt layout for libraries or to improve cache locality without changing values. Broadcasting is ideal to apply per-axis parameters (e.g., per-channel gain) without explicit tiling.
⚠️ Common Mistakes
- Ignoring memory layout: Mixing row-major (C/C++) and column-major (Fortran/Matlab) assumptions leads to wrong offsets and poor cache behavior. Always compute and test strides explicitly.
- Misusing broadcasting: Assuming two shapes are broadcastable when they are not, or expecting size-1 axes to duplicate values physically. Remember that broadcasting is a rule for indexing, not an automatic copy.
- Axis mismatch in contractions: Contracting axes with unequal lengths or inserting the summed index in the wrong position in the output shape. Validate shapes and be explicit about which axes are reduced.
- Confusing reshape with reorder: Reshape changes how indices map to the same linear memory; it does not move data. To reorder data values, you need a permutation (transpose) or explicit copy.
- Off-by-one and index order bugs: Building linear offsets with incorrect stride orders or 1-based indices. Adopt a consistent zero-based convention and test on small shapes.
- Hidden copies: Creating temporaries during views, slices, or broadcasting that inflate memory. Prefer views (O(1)) and in-place operations when safe.
- Numerical issues: Large outer products or deep contractions can overflow or accumulate floating-point error. Consider scaling, mixed precision, and stable reduction orders.
Key Formulas
Tensor size
N = d_1 × d_2 × ⋯ × d_n
Explanation: The total number of elements in an n-dimensional tensor equals the product of its axis lengths. This helps estimate memory and running time for elementwise operations.
Row-major strides and offset
s_k = d_{k+1} × d_{k+2} × ⋯ × d_n (so s_n = 1);  offset(i_1, …, i_n) = i_1·s_1 + i_2·s_2 + ⋯ + i_n·s_n
Explanation: In row-major order, the stride of axis k is the product of sizes of all later axes. The linear offset for index (i_1, …, i_n) is the dot product of indices with strides.
Elementwise operation
C[i_1, …, i_n] = f(A[i_1, …, i_n], B[i_1, …, i_n])
Explanation: An elementwise operation applies a scalar function f to corresponding entries. It generalizes addition, subtraction, multiplication, and activation functions.
Outer (tensor) product
C[i_1, …, i_m, j_1, …, j_p] = A[i_1, …, i_m] × B[j_1, …, j_p]
Explanation: The outer product multiplies every element of A with every element of B and concatenates indices, increasing order by p (the order of B).
Single-axis contraction
C[I, J] = Σ_{k=1}^{r} A[I, k] × B[k, J]   (I and J are the remaining multi-indices of A and B)
Explanation: Contracting A and B along one pair of axes of length r sums over that index. Matrix multiplication is a special case with r equal to the shared inner dimension.
Einstein notation for matmul
C_{jk} = A_{ji} B_{ik}
Explanation: Repeated index i is implicitly summed. This compactly expresses contractions without writing the summation symbol explicitly.
Axis permutation (transpose)
B[i_1, …, i_n] = A[i_{π(1)}, …, i_{π(n)}] for a permutation π of {1, …, n}
Explanation: A permutation of axes reorders how indices map to entries. It changes shape but preserves the multiset of values.
Memory usage
M = N × sizeof(T) = (d_1 × ⋯ × d_n) × sizeof(T) bytes
Explanation: The memory footprint of a tensor equals its number of elements times the size of its scalar type T. This is key to capacity planning.
Broadcasted offset
offset_B(i_1, …, i_n) = Σ_k i′_k × s_k, where i′_k = 0 if B has size 1 along axis k, and i′_k = i_k otherwise
Explanation: For broadcasting, if B’s dimension is 1 along axis k, its index there is fixed at 0; otherwise it matches the output index. This defines how to read from broadcasted operands.
Frobenius norm
‖A‖_F = √( Σ_{i_1, …, i_n} A[i_1, …, i_n]² )
Explanation: The Frobenius norm is the Euclidean norm of all elements of A. It is used to measure tensor magnitude and appears in optimization objectives.
Hadamard product
(A ⊙ B)[i_1, …, i_n] = A[i_1, …, i_n] × B[i_1, …, i_n]
Explanation: Elementwise multiplication requires broadcast-compatible shapes. It is common in gating and feature-wise scaling.
Kronecker product (matrix form)
A ⊗ B = [ a_{ij} · B ]   (for A of size m×n and B of size p×q, the result is an (mp)×(nq) block matrix)
Explanation: This formula shows how the Kronecker product expands each entry a_{ij} of A into the block a_{ij}·B, producing a larger matrix.
Elementwise time complexity
T(N) = Θ(N), with N = d_1 × ⋯ × d_n
Explanation: Any operation that touches each element once is linear in the number of elements. This bounds cost for simple passes like add or ReLU.
Contraction time complexity
cost = Θ( r × (N_A / r) × (N_B / r) ), where N_A and N_B are the element counts of A and B and r is the contracted axis length
Explanation: Contracting A along axis p with B along axis q (length r) costs proportional to r times the sizes of the remaining axes. Matrix multiply is a specific instance.
Complexity Analysis
Code Examples
```cpp
#include <bits/stdc++.h>
using namespace std;

// A simple row-major Tensor with dynamic rank.
// Focus: shape/strides, linear offset, broadcasting-aware elementwise ops.
template <typename T>
class Tensor {
public:
    vector<size_t> shape;   // sizes per axis
    vector<size_t> strides; // row-major strides
    vector<T> data;         // contiguous storage

    Tensor() = default;
    explicit Tensor(const vector<size_t>& shp) : shape(shp) {
        compute_strides();
        data.assign(numel(), T{});
    }

    size_t rank() const { return shape.size(); }
    size_t numel() const {
        return accumulate(shape.begin(), shape.end(), (size_t)1, multiplies<size_t>());
    }

    // Compute row-major strides: s_k = product of later dimensions.
    void compute_strides() {
        strides.assign(rank(), 1);
        for (int k = (int)rank() - 2; k >= 0; --k) {
            strides[k] = strides[k+1] * shape[k+1];
        }
    }

    // Convert a multi-index to a linear offset.
    size_t offset(const vector<size_t>& idx) const {
        size_t off = 0;
        for (size_t k = 0; k < idx.size(); ++k) {
            off += idx[k] * strides[k];
        }
        return off;
    }

    // Element accessors using a multi-index.
    T get(const vector<size_t>& idx) const { return data[offset(idx)]; }
    void set(const vector<size_t>& idx, const T& value) { data[offset(idx)] = value; }

    // Fill with a lambda f(flat_index).
    template <class F>
    void fill_flat(F f) {
        for (size_t i = 0; i < numel(); ++i) data[i] = f(i);
    }

    // Broadcasting utilities
    static vector<size_t> broadcast_shape(vector<size_t> a, vector<size_t> b) {
        // Right-align shapes by padding the shorter one with leading 1s.
        size_t na = a.size(), nb = b.size();
        size_t n = max(na, nb);
        a.insert(a.begin(), n - na, 1);
        b.insert(b.begin(), n - nb, 1);
        vector<size_t> out(n);
        for (size_t k = 0; k < n; ++k) {
            if (a[k] == b[k] || a[k] == 1) out[k] = b[k];
            else if (b[k] == 1) out[k] = a[k];
            else throw runtime_error("Shapes not broadcastable");
        }
        return out;
    }

    // Compute effective strides for broadcasting: stride is 0 if dim == 1.
    static vector<size_t> broadcast_strides(const vector<size_t>& in_shape,
                                            const vector<size_t>& in_strides,
                                            const vector<size_t>& out_shape) {
        size_t n = out_shape.size();
        vector<size_t> a = in_shape;
        a.insert(a.begin(), n - a.size(), 1);
        vector<size_t> s = in_strides;
        s.insert(s.begin(), n - s.size(), 0);
        vector<size_t> eff(n);
        for (size_t k = 0; k < n; ++k) eff[k] = (a[k] == 1 ? 0 : s[k]);
        return eff;
    }

    // Generic broadcasted binary elementwise op: out = f(A, B)
    template <class F>
    static Tensor apply_binary(const Tensor& A, const Tensor& B, F f) {
        vector<size_t> out_shape = broadcast_shape(A.shape, B.shape);
        Tensor out(out_shape);
        // Precompute effective strides aligned to out_shape.
        vector<size_t> sA = broadcast_strides(A.shape, A.strides, out_shape);
        vector<size_t> sB = broadcast_strides(B.shape, B.strides, out_shape);
        // Iterate linearly over the output using a mixed-radix counter.
        size_t N = out.numel();
        vector<size_t> idx(out_shape.size(), 0);
        for (size_t flat = 0; flat < N; ++flat) {
            // Compute linear offsets for A and B via effective strides.
            size_t offA = 0, offB = 0;
            for (size_t k = 0; k < idx.size(); ++k) {
                offA += idx[k] * sA[k];
                offB += idx[k] * sB[k];
            }
            out.data[flat] = f(A.data[offA], B.data[offB]);
            // Increment mixed-radix counter (row-major order).
            for (int k = (int)idx.size() - 1; k >= 0; --k) {
                if (++idx[k] < out_shape[k]) break;
                idx[k] = 0;
            }
        }
        return out;
    }
};

int main() {
    // Example: A (2,3) plus B (1,3) via broadcasting along axis 0.
    Tensor<double> A({2,3});
    Tensor<double> B({1,3});

    // Fill A with 0..5 in row-major order, and B with a bias per column.
    A.fill_flat([&](size_t t){ return (double)t; });
    B.set({0,0}, 100.0);
    B.set({0,1}, 200.0);
    B.set({0,2}, 300.0);

    auto C = Tensor<double>::apply_binary(A, B, [](double x, double y){ return x + y; });
    auto D = Tensor<double>::apply_binary(C, C, [](double x, double y){ return x * y; }); // Hadamard square

    // Print results to verify broadcasting and elementwise ops.
    cout << "C shape: (2,3)\n";
    for (size_t i = 0; i < 2; ++i) {
        for (size_t j = 0; j < 3; ++j) cout << C.get({i,j}) << ' ';
        cout << '\n';
    }
    cout << "\nD = C .* C\n";
    for (size_t i = 0; i < 2; ++i) {
        for (size_t j = 0; j < 3; ++j) cout << D.get({i,j}) << ' ';
        cout << '\n';
    }
}
```
We implement a dynamic-rank, row-major tensor using std::vector for data and store strides to compute linear offsets. The apply_binary function aligns shapes from the right, computes an output shape using broadcasting rules, and then iterates over the output using a mixed-radix counter. Effective strides become zero for broadcasted axes, so multiple output positions read the same input element without copying. This supports elementwise add, multiply, and arbitrary binary functions.
```cpp
#include <bits/stdc++.h>
using namespace std;

template <typename T>
struct Tensor {
    vector<size_t> shape, strides;
    vector<T> data;
    Tensor() {}
    explicit Tensor(const vector<size_t>& shp) : shape(shp) {
        strides.assign(shape.size(), 1);
        for (int k = (int)shape.size() - 2; k >= 0; --k) strides[k] = strides[k+1] * shape[k+1];
        data.assign(numel(), T{});
    }
    size_t numel() const { return accumulate(shape.begin(), shape.end(), (size_t)1, multiplies<size_t>()); }
    size_t offset(const vector<size_t>& idx) const {
        size_t off = 0;
        for (size_t k = 0; k < idx.size(); ++k) off += idx[k] * strides[k];
        return off;
    }
    T get(const vector<size_t>& idx) const { return data[offset(idx)]; }
    void set(const vector<size_t>& idx, const T& v) { data[offset(idx)] = v; }
};

// Iterate over all indices of a shape, calling fn(index).
void for_each_index(const vector<size_t>& shape, function<void(const vector<size_t>&)> fn) {
    vector<size_t> idx(shape.size(), 0);
    size_t N = accumulate(shape.begin(), shape.end(), (size_t)1, multiplies<size_t>());
    for (size_t flat = 0; flat < N; ++flat) {
        fn(idx);
        for (int k = (int)shape.size() - 1; k >= 0; --k) {
            if (++idx[k] < shape[k]) break;
            idx[k] = 0;
        }
    }
}

// Contract A along axisA with B along axisB. Returns a new tensor.
template <typename T>
Tensor<T> contract(const Tensor<T>& A, int axisA, const Tensor<T>& B, int axisB) {
    if (A.shape[axisA] != B.shape[axisB]) throw runtime_error("Contracted axes must match");
    size_t r = A.shape[axisA];
    // Output shape = A.shape without axisA, followed by B.shape without axisB.
    vector<size_t> out_shape;
    out_shape.reserve(A.shape.size() + B.shape.size() - 2);
    for (size_t i = 0; i < A.shape.size(); ++i) if ((int)i != axisA) out_shape.push_back(A.shape[i]);
    for (size_t j = 0; j < B.shape.size(); ++j) if ((int)j != axisB) out_shape.push_back(B.shape[j]);
    Tensor<T> C(out_shape);

    // Iterate over all output indices and accumulate over k in [0, r).
    for_each_index(out_shape, [&](const vector<size_t>& out_idx) {
        // Split out_idx into the parts for A (without axisA) and B (without axisB).
        vector<size_t> idxA_full(A.shape.size());
        vector<size_t> idxB_full(B.shape.size());
        // Fill idxA_full from the leading part of out_idx.
        size_t p = 0;
        for (size_t i = 0; i < A.shape.size(); ++i) {
            if ((int)i == axisA) continue;
            idxA_full[i] = out_idx[p++];
        }
        // Fill idxB_full from the remaining suffix of out_idx.
        size_t q = 0;
        for (size_t j = 0; j < B.shape.size(); ++j) {
            if ((int)j == axisB) continue;
            idxB_full[j] = out_idx[p + q++];
        }
        // Accumulate over the contracted index.
        T acc = T{};
        for (size_t k = 0; k < r; ++k) {
            idxA_full[axisA] = k;
            idxB_full[axisB] = k;
            acc += A.get(idxA_full) * B.get(idxB_full);
        }
        C.data[C.offset(out_idx)] = acc;
    });
    return C;
}

int main() {
    // Example: matrix multiplication via contraction.
    Tensor<double> A({2,3}); // (2x3)
    Tensor<double> B({3,4}); // (3x4)

    // Fill A and B with small values.
    for (size_t i = 0; i < 2; i++) for (size_t k = 0; k < 3; k++) A.set({i,k}, (i+1)*(k+1));
    for (size_t k = 0; k < 3; k++) for (size_t j = 0; j < 4; j++) B.set({k,j}, (k+1)+(j));

    // Contract A axis 1 with B axis 0 => output shape (2,4).
    auto C = contract(A, 1, B, 0);

    cout << "C (2x4) = A(2x3) * B(3x4)\n";
    for (size_t i = 0; i < 2; i++) {
        for (size_t j = 0; j < 4; j++) cout << C.get({i,j}) << ' ';
        cout << '\n';
    }
}
```
The contract function reduces one axis from A and one from B by summing the product over that index. We build the output shape by removing contracted axes, iterate over all output indices, and map them back to full index tuples for A and B by inserting the summation index. Matrix multiplication is recovered when contracting A’s columns with B’s rows.
```cpp
#include <bits/stdc++.h>
using namespace std;

template <typename T>
struct Tensor {
    vector<size_t> shape, strides;
    vector<T> data;
    Tensor() {}
    explicit Tensor(const vector<size_t>& shp) : shape(shp) {
        strides.assign(shape.size(), 1);
        for (int k = (int)shape.size() - 2; k >= 0; --k) strides[k] = strides[k+1] * shape[k+1];
        data.assign(numel(), T{});
    }
    size_t numel() const { return accumulate(shape.begin(), shape.end(), (size_t)1, multiplies<size_t>()); }
    size_t offset(const vector<size_t>& idx) const {
        size_t off = 0;
        for (size_t k = 0; k < idx.size(); ++k) off += idx[k] * strides[k];
        return off;
    }
};

// Return a new tensor whose axis order is A's axes permuted by `axes`.
template <typename T>
Tensor<T> permute(const Tensor<T>& A, const vector<int>& axes) {
    if (axes.size() != A.shape.size()) throw runtime_error("axes length mismatch");
    // New shape order
    vector<size_t> new_shape(axes.size());
    for (size_t i = 0; i < axes.size(); ++i) new_shape[i] = A.shape[axes[i]];
    Tensor<T> B(new_shape);
    // Iterate over all indices of B and map each back to A.
    vector<size_t> idxB(new_shape.size(), 0);
    size_t N = B.numel();
    for (size_t flat = 0; flat < N; ++flat) {
        vector<size_t> idxA(A.shape.size());
        for (size_t i = 0; i < axes.size(); ++i) idxA[axes[i]] = idxB[i];
        B.data[flat] = A.data[A.offset(idxA)];
        for (int k = (int)idxB.size() - 1; k >= 0; --k) {
            if (++idxB[k] < new_shape[k]) break;
            idxB[k] = 0;
        }
    }
    return B;
}

// Reshape with copy (for simplicity). A view-based reshape would share data.
template <typename T>
Tensor<T> reshape_copy(const Tensor<T>& A, const vector<size_t>& new_shape) {
    size_t newN = accumulate(new_shape.begin(), new_shape.end(), (size_t)1, multiplies<size_t>());
    if (newN != A.numel()) throw runtime_error("element count mismatch in reshape");
    Tensor<T> B(new_shape);
    B.data = A.data; // copy assignment; linear order preserved
    return B;
}

int main() {
    Tensor<int> A({2,3,4});
    // Fill A with 0..23 in row-major order.
    for (int i = 0; i < 24; ++i) A.data[i] = i;

    // Permute axes from (0,1,2) to (2,0,1): shape becomes (4,2,3).
    auto P = permute(A, {2,0,1});

    // Reshape P to (4,6); the linear order of P is preserved.
    auto R = reshape_copy(P, {4,6});

    cout << "First row of R, the (4,6) reshape of P with shape (4,2,3):\n";
    for (size_t j = 0; j < 6; ++j) cout << R.data[j] << ' ';
    cout << "\nTotal elements: " << R.numel() << "\n";
}
```
Permutation reorders axes and thus changes strides. We create a new tensor with permuted shape and copy elements by mapping each output index back to its corresponding input index. The reshape_copy function changes the shape while preserving linear order; for simplicity it copies data, but a high-performance library would return a view that shares storage and updates strides in O(1).