Tensor Operations
A tensor is a multi-dimensional array generalizing scalars, vectors, and matrices:
| Rank | Name | Shape | Example |
|---|---|---|---|
| 0 | Scalar | $()$ | loss value |
| 1 | Vector | $(n)$ | embedding |
| 2 | Matrix | $(m, n)$ | weight matrix |
| 3 | 3-D tensor | $(b, m, n)$ | batch of matrices |
| 4 | 4-D tensor | $(b, c, h, w)$ | batch of images |
In ML frameworks (PyTorch, NumPy), tensors are the fundamental data structure.
Indexing and Slicing
T[i] # select along first dimension
T[i, j] # select element
T[:, 1:3] # all rows, cols 1 and 2 (end index is exclusive)
T[..., -1] # last element along last dim (ellipsis)
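A minimal sketch of how each indexing form changes the shape, assuming a hypothetical rank-3 tensor (on a rank-3 tensor, `T[i, j]` returns a vector rather than a single element, because each integer index removes one dimension):

```python
import torch

# Hypothetical example tensor: shape (2, 3, 4), values 0..23.
T = torch.arange(24).reshape(2, 3, 4)

first = T[0]        # integer index on dim 0 -> shape (3, 4)
elem = T[1, 2]      # integer indices on dims 0 and 1 -> shape (4,)
cols = T[:, 1:3]    # keep dim 0, slice dim 1 -> shape (2, 2, 4)
last = T[..., -1]   # last entry along the final dim -> shape (2, 3)

print(first.shape, elem.shape, cols.shape, last.shape)
```

Slices keep the dimension they act on; integer indices drop it.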
Reshaping
Reshape: change shape without changing data. Total elements must match.
T.reshape(a, b, c) # explicit shape
T.view(a, -1) # -1 infers the size
T.flatten() # 1-D
Squeeze / Unsqueeze: remove or add dimensions of size 1.
T.squeeze(dim) # remove dim if size=1
T.unsqueeze(dim) # insert dim of size 1
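The calls above can be sketched on a small tensor. The sizes are arbitrary, chosen only so the element counts match:

```python
import torch

T = torch.arange(12)        # shape (12,)

A = T.reshape(2, 3, 2)      # explicit shape: 2 * 3 * 2 == 12
B = T.view(3, -1)           # -1 is inferred as 4
C = A.flatten()             # back to shape (12,)

U = B.unsqueeze(0)          # insert a size-1 dim at the front: (1, 3, 4)
S = U.squeeze(0)            # remove it again: (3, 4)
N = B.squeeze(0)            # dim 0 has size 3, so nothing is removed
```

Note that `squeeze(dim)` is a no-op rather than an error when the chosen dimension is not of size 1.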
Transposing and Permuting
Transpose swaps two dimensions:
T.T # 2-D transpose; reverses all dims for other ranks (deprecated in PyTorch)
T.mT # transpose last two dims
T.transpose(0, 1)
Permute reorders all dimensions:
T.permute(2, 0, 1) # (a,b,c) → (c,a,b)
Common use: converting image format $(B, H, W, C) \to (B, C, H, W)$.
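A short sketch of the channels-last to channels-first conversion, using a hypothetical batch of random images. The argument to `permute` lists, for each output dimension, which input dimension it takes:

```python
import torch

# Hypothetical batch: 8 images, 32x48 pixels, 3 channels, channels-last.
nhwc = torch.rand(8, 32, 48, 3)

# Output dim k takes input dim perm[k]: (B, H, W, C) -> (B, C, H, W).
nchw = nhwc.permute(0, 3, 1, 2)

# transpose swaps exactly two dims and is its own inverse.
swapped = nhwc.transpose(1, 2)        # (B, W, H, C)
restored = swapped.transpose(1, 2)    # back to (B, H, W, C)

print(nchw.shape, swapped.shape)
```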
Broadcasting
Broadcasting allows elementwise operations on tensors with different shapes: the smaller tensor is “broadcast” across the larger.
Rules:
- Align shapes from the right
- Dimensions are compatible if equal or one of them is 1
- Size-1 dimensions are stretched to match
(3, 1, 5) + (2, 5)
→ (3, 1, 5) + (1, 2, 5) # align from right
→ (3, 2, 5) # broadcast
Broadcasting is zero-copy: the inputs are not duplicated in memory (stretched dimensions are read with stride 0); only the result is allocated.
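The worked shapes above can be checked directly. This sketch also uses `expand` to show what the stretching looks like at the stride level. The tensor values are arbitrary:

```python
import torch

a = torch.ones(3, 1, 5)
b = torch.arange(10.0).reshape(2, 5)

c = a + b                                  # result shape (3, 2, 5)

# broadcast_shapes applies the same rules without allocating any tensors.
shape = torch.broadcast_shapes((3, 1, 5), (2, 5))

# expand returns a broadcast view of b: the new leading dim has stride 0,
# so the same storage is read repeatedly instead of being copied.
view = b.expand(3, 2, 5)
print(tuple(shape), view.stride())
```

Incompatible shapes such as (3, 4) and (3,) raise a `RuntimeError` instead of broadcasting, since the trailing dims 4 and 3 are neither equal nor 1.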
Elementwise Operations
Applied independently to each element (same shape required, or broadcastable):
A + B, A * B, A / B # arithmetic
torch.exp(T), torch.log(T), torch.sqrt(T)
torch.relu(T), torch.sigmoid(T)
Reduction Operations
Reduce along one or more dimensions:
T.sum(dim=0) # sum over dim 0 (collapses rows: one value per column)
T.mean(dim=-1) # mean across last dim
T.max(dim=1) # max over dim 1; returns (values, indices)
T.norm(dim=2) # L2 norm across dim 2
keepdim=True preserves the reduced dimension as size 1 (useful for broadcasting).
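A small sketch of why `keepdim=True` helps with broadcasting: the reduced result keeps a size-1 dimension, so it can be subtracted from the original tensor without a reshape. The values are made up for illustration:

```python
import torch

T = torch.tensor([[1.0, 2.0, 3.0],
                  [4.0, 5.0, 6.0]])         # shape (2, 3)

col_sums = T.sum(dim=0)                     # shape (3,)
row_means = T.mean(dim=-1, keepdim=True)    # shape (2, 1)

# (2, 3) - (2, 1) broadcasts; each row of `centered` has mean 0.
centered = T - row_means

# max with a dim argument returns both the values and their indices.
values, indices = T.max(dim=1)
```

Without `keepdim=True`, `row_means` would have shape (2,), which does not broadcast against (2, 3).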
Matrix Multiplication
2-D: standard matmul
torch.mm(A, B) # (m,k) × (k,n) → (m,n)
Batched matmul: the last two dims are the matrix dims. torch.bmm requires 3-D inputs with the same batch size; torch.matmul and @ broadcast the batch dims.
torch.bmm(A, B) # (b,m,k) × (b,k,n) → (b,m,n)
torch.matmul(A, B) # general: handles any batch dims
A @ B # operator syntax
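A sketch comparing the three entry points on random inputs. All sizes are arbitrary, and the last case shows `matmul` broadcasting batch dimensions, which `bmm` does not do:

```python
import torch

A = torch.rand(4, 3)
B = torch.rand(3, 5)
two_d = torch.mm(A, B)               # (4, 3) x (3, 5) -> (4, 5)

Ab = torch.rand(10, 4, 3)
Bb = torch.rand(10, 3, 5)
batched = torch.bmm(Ab, Bb)          # (10, 4, 3) x (10, 3, 5) -> (10, 4, 5)

# Batch dims (2, 1) and (7,) broadcast to (2, 7).
X = torch.rand(2, 1, 4, 3)
Y = torch.rand(7, 3, 5)
general = torch.matmul(X, Y)         # -> (2, 7, 4, 5)

print(two_d.shape, batched.shape, general.shape)
```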
Einstein summation (einsum):
torch.einsum('bik,bjk->bij', Q, K) # attention scores
torch.einsum('ij,jk->ik', A, B) # matmul
torch.einsum('ii->', A) # trace
torch.einsum('ij->ji', A) # transpose
Einsum is expressive and usually efficient (it is typically lowered to batched matmul), which makes it a good fit for complex contractions.
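To read an einsum string: indices that appear in the inputs but not after `->` are summed over. A sketch checking each pattern above against its conventional equivalent, on random inputs with arbitrary sizes:

```python
import torch

A = torch.rand(3, 3)
B = torch.rand(3, 4)
Q = torch.rand(2, 5, 8)    # (batch, queries, features)
K = torch.rand(2, 6, 8)    # (batch, keys, features)

scores = torch.einsum('bik,bjk->bij', Q, K)   # k is summed: Q @ K^T per batch
prod = torch.einsum('ij,jk->ik', A, B)        # j is summed: A @ B
trace = torch.einsum('ii->', A)               # repeated index: diagonal sum
transposed = torch.einsum('ij->ji', A)        # nothing summed: reorder only
```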
Outer Product and Kronecker Product
torch.outer(a, b) # (n,) × (m,) → (n,m)
torch.kron(A, B) # Kronecker product
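A small worked example with made-up values: the outer product multiplies every pair of entries, and the Kronecker product replaces each entry of the first matrix with a scaled copy of the second:

```python
import torch

a = torch.tensor([1.0, 2.0, 3.0])     # (3,)
b = torch.tensor([10.0, 20.0])        # (2,)
outer = torch.outer(a, b)             # (3, 2), outer[i, j] = a[i] * b[j]

A = torch.tensor([[1.0, 2.0],
                  [3.0, 4.0]])
I = torch.eye(2)
kron = torch.kron(A, I)               # (4, 4), block (i, j) is A[i, j] * I
```

For inputs of shape (m, n) and (p, q), `torch.kron` returns shape (m*p, n*q).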
Concatenation and Stacking
torch.cat([A, B], dim=0) # concatenate along existing dim
torch.stack([A, B], dim=0) # stack along new dim
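The difference shows up in the output shapes. A sketch with two hypothetical (2, 3) tensors:

```python
import torch

A = torch.zeros(2, 3)
B = torch.ones(2, 3)

cat0 = torch.cat([A, B], dim=0)         # existing dim grows: (4, 3)
cat1 = torch.cat([A, B], dim=1)         # existing dim grows: (2, 6)
stacked = torch.stack([A, B], dim=0)    # new leading dim: (2, 2, 3)

# stack is equivalent to cat after each input gains a size-1 dim.
same = torch.cat([A.unsqueeze(0), B.unsqueeze(0)], dim=0)
```

`cat` needs the inputs to match on every dimension except `dim`; `stack` needs all shapes to be identical.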
Tensor Contraction
Generalized multiply-and-sum over shared indices. Einsum handles all contractions:
# Frobenius inner product: sum(A * B)
torch.einsum('ij,ij->', A, B)
# Batch outer product
torch.einsum('bi,bj->bij', x, y)
# Attention: softmax(QK^T / sqrt(d_k)) V   (needs `import math`)
scores = torch.einsum('bhid,bhjd->bhij', Q, K) / math.sqrt(d_k)
weights = torch.softmax(scores, dim=-1) # normalize over keys j
out = torch.einsum('bhij,bhjd->bhid', weights, V)
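As a check on the attention contraction, here is a self-contained sketch that applies the softmax over the key index explicitly and compares the einsum form with a plain matmul formulation. The sizes `b, h, n, d_k` are hypothetical:

```python
import math
import torch

torch.manual_seed(0)
b, h, n, d_k = 2, 4, 5, 8                    # hypothetical sizes
Q = torch.rand(b, h, n, d_k)
K = torch.rand(b, h, n, d_k)
V = torch.rand(b, h, n, d_k)

# Contract over the feature index d to get query-key scores.
scores = torch.einsum('bhid,bhjd->bhij', Q, K) / math.sqrt(d_k)
weights = torch.softmax(scores, dim=-1)      # each row over keys j sums to 1
# Contract over the key index j to mix the value vectors.
out = torch.einsum('bhij,bhjd->bhid', weights, V)

# Reference: the same computation written with matmul.
ref = torch.softmax(Q @ K.transpose(-2, -1) / math.sqrt(d_k), dim=-1) @ V
```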
Memory Layout (Contiguous vs Non-Contiguous)
Tensors are stored in memory as 1-D arrays with strides describing how to step through each dimension.
- After transpose/permute, a tensor may be non-contiguous (its strides no longer match the natural row-major order)
- .contiguous() makes a copy in natural row-major order (and returns the tensor itself if it is already contiguous)
- Non-contiguous tensors can cause errors with some ops. Use .contiguous() before .view().
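A minimal sketch of the failure mode and its two usual fixes, on a small example tensor. The `view_failed` flag is only there to make the behaviour visible:

```python
import torch

T = torch.arange(6).reshape(2, 3)    # contiguous, strides (3, 1)
Tt = T.transpose(0, 1)               # shape (3, 2), strides (1, 3), same storage

try:
    Tt.view(6)                       # view cannot express this layout
    view_failed = False
except RuntimeError:
    view_failed = True

flat_copy = Tt.contiguous().view(6)  # copy into row-major order, then view
flat_auto = Tt.reshape(6)            # reshape copies only when it has to

print(T.stride(), Tt.stride(), view_failed)
```

`reshape` is often the more convenient choice when it does not matter whether the result shares memory with the input.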
Sparse Tensors
Most entries are zero → store only non-zero values and their indices.
torch.sparse_coo_tensor(indices, values, size)
Used in: graph adjacency matrices, NLP bag-of-words, recommendation systems.
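A sketch of the COO layout with a hypothetical 3x4 matrix holding three non-zeros: `indices` has one column per stored entry, with one row per tensor dimension:

```python
import torch

# Hypothetical 3x4 matrix with three non-zero entries.
indices = torch.tensor([[0, 1, 2],    # row index of each non-zero
                        [1, 3, 0]])   # column index of each non-zero
values = torch.tensor([5.0, 7.0, 9.0])

S = torch.sparse_coo_tensor(indices, values, size=(3, 4))
dense = S.to_dense()                  # materialize for inspection

# Sparse-dense matmul: (3, 4) @ (4, 2) -> dense (3, 2).
W = torch.ones(4, 2)
out = torch.sparse.mm(S, W)
```

Only 3 values and 6 indices are stored here, instead of 12 dense entries. The saving grows with the fraction of zeros.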
Tensor Decompositions
Generalize matrix decompositions to higher-order tensors:
| Decomposition | Description | Use |
|---|---|---|
| CP (CANDECOMP/PARAFAC) | Sum of rank-1 tensors | Compression, knowledge graphs |
| Tucker | Core tensor + factor matrices | Multilinear PCA |
| TT (Tensor Train) | Chain of 3-D tensors | Compressing weight tensors |
Tensor-train decomposition has been used to compress large weight matrices in transformers, typically trading a small loss of expressivity for far fewer parameters.
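To make the "sum of rank-1 tensors" description of CP concrete, the sketch below builds a rank-R tensor from three factor matrices and compares parameter counts. It only reconstructs a tensor from given factors and does not fit a decomposition. The sizes and the rank are hypothetical:

```python
import torch

torch.manual_seed(0)
I, J, K, R = 6, 7, 8, 3               # hypothetical sizes and CP rank

# CP factor matrices: one per mode, R columns each.
A = torch.rand(I, R)
B = torch.rand(J, R)
C = torch.rand(K, R)

# X[i, j, k] = sum_r A[i, r] * B[j, r] * C[k, r]
X = torch.einsum('ir,jr,kr->ijk', A, B, C)

# The same tensor, built explicitly as a sum of R rank-1 outer products.
X_loop = sum(torch.einsum('i,j,k->ijk', A[:, r], B[:, r], C[:, r])
             for r in range(R))

dense_params = I * J * K              # entries in the full tensor
cp_params = R * (I + J + K)           # entries in the three factors
print(dense_params, cp_params)
```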
Gradient Flow Through Tensor Ops
Differentiable tensor operations in frameworks like PyTorch support autograd:
- Reshaping, permuting: gradients are reshaped back
- Reduction (sum, mean): gradient broadcasts back to the input shape (scaled by 1/n for mean)
- Elementwise: gradient multiplied elementwise by local derivative
- Matmul: see matrix calculus notes
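The first three rules can be observed directly by inspecting `.grad` after a backward pass. This is a minimal sketch on a made-up 2x3 input:

```python
import torch

x = torch.tensor([[0.0, 1.0, 2.0],
                  [3.0, 4.0, 5.0]], requires_grad=True)

# sum: every input element receives the upstream gradient (1 here).
x.sum().backward()
grad_sum = x.grad.clone()

x.grad = None
# mean: the upstream gradient is spread evenly, 1/6 per element.
x.mean().backward()
grad_mean = x.grad.clone()

x.grad = None
# permute, then elementwise square: the local derivative 2*x arrives
# back in x's original (2, 3) layout.
(x.permute(1, 0) ** 2).sum().backward()
grad_sq = x.grad.clone()
```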