Parallel Algorithm Theory
Key Points
- Parallel algorithm theory studies how to solve problems faster by coordinating many processors that share work and memory.
- The PRAM model abstracts ideal parallel machines and uses time T and work W to measure performance and efficiency.
- Work-efficient algorithms match the total work of the best sequential algorithm up to constants, avoiding wasted parallel effort.
- NC is the class of problems solvable in polylogarithmic time with a polynomial number of processors and captures highly parallelizable problems.
- P-complete problems are believed to be inherently sequential and unlikely to admit NC algorithms.
- Amdahl’s law limits speedup by the sequential fraction, while Gustafson’s law explains scaling when the problem size grows with processors.
- MapReduce expresses large-scale parallelism across machines via map and reduce stages with communication (shuffle) between them.
- GPUs implement massive SIMD-style data parallelism that accelerates ML training and inference when memory access is coalesced and work is regular.
Prerequisites
- Asymptotic analysis and Big-O notation — To interpret work, span, and time bounds like O(n log n) and O(log n).
- Basic concurrency in C++ (threads, futures, mutexes) — To implement parallel patterns and understand synchronization and overheads.
- Divide-and-conquer algorithms — Many parallel algorithms (e.g., merge sort, scan) are structured using recursive splitting.
- Matrix operations and linear algebra basics — To understand data-parallel computations ubiquitous in ML and matrix multiplication.
- CPU memory hierarchy (caches) and locality — Memory bandwidth and locality strongly affect parallel scalability in practice.
- Associativity and commutativity of operations — Parallel reductions and scans rely on associative (and often commutative) operators.
Detailed Explanation
01 Overview
Parallel algorithm theory asks: if I have many processors, how much faster can I solve a problem? To answer, we use clean machine models that strip away hardware quirks and focus on core limits. The most famous is PRAM (Parallel Random Access Machine), where many simple processors share a single, uniform-speed memory. In PRAM, we measure parallel time T (how many steps until the answer completes) and work W (total operations across all processors). A core goal is to design work-efficient algorithms: those that use about as many total operations as the best sequential solution, but finish much sooner by parallelizing steps. From this lens arises the complexity class NC (Nick’s Class), which contains problems solvable in polylogarithmic time using polynomially many processors—think highly parallel tasks like prefix sums, balanced tree operations, and some linear algebra. Not all problems parallelize well: P-complete problems, like general circuit evaluation or certain dynamic programs, appear inherently sequential and thus unlikely to be in NC. Practical models like MapReduce capture cluster-scale parallelism with communication phases, while GPUs embody data-parallel SIMD execution ideal for vectorizable workloads in AI/ML. Amdahl’s and Gustafson’s laws provide sanity checks on expected speedups. Together, these ideas help you predict which problems can scale on multicore CPUs, clusters, or GPUs, and how to structure algorithms to achieve near-ideal speedups.
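To make work and span concrete before the formal definitions, here is a minimal divide-and-conquer summation sketch (the name parallel_sum, the cutoff, and the depth limit are illustrative choices, not a standard API). It does O(n) total work, the same as a sequential loop, so it is work-efficient, while its longest chain of dependent steps is only the O(log n) recursion depth.

```cpp
#include <bits/stdc++.h>
using namespace std;

// Work-efficient parallel sum: total work is O(n), matching a sequential loop,
// while the chain of dependent steps (span) is only the O(log n) recursion depth.
long long parallel_sum(const vector<int>& a, size_t lo, size_t hi, int depth) {
    if (hi - lo < 100000 || depth == 0)      // cutoff: fall back to a sequential sum
        return accumulate(a.begin() + lo, a.begin() + hi, 0LL);
    size_t mid = lo + (hi - lo) / 2;
    auto left = async(launch::async,         // sum the left half concurrently
                      parallel_sum, cref(a), lo, mid, depth - 1);
    long long right = parallel_sum(a, mid, hi, depth - 1);
    return left.get() + right;
}

int main() {
    vector<int> a(1 << 22, 1);
    cout << parallel_sum(a, 0, a.size(), 4) << "\n"; // prints 4194304
}
```

On p cores, Brent-style reasoning suggests a running time of roughly n/p plus a logarithmic term, so the sketch scales until the span term dominates.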
02 Intuition & Analogies
Imagine a group project where each person can do a part independently. If the project splits into many equal, independent tasks (like labeling thousands of images), then adding people helps a lot: everyone labels a chunk, and you finish much sooner. That is data parallelism. Now imagine some parts depend on others (you can’t write the conclusion before the analysis). These dependencies form a chain that everyone must respect—the critical path. Even with a huge team, you cannot beat the time of that longest dependence chain. This is exactly the idea behind span (also called depth): the shortest possible completion time if you had infinitely many helpers is bounded by the longest chain of dependent steps. The PRAM model is like imagining a huge quiet library with many desks (processors) and one enormous bookshelf (shared memory). Everyone can read/write instantly without bumping into each other—clearly unrealistic, but great for understanding what is possible if coordination were free. If many students try to grab the same book at once, different PRAM variants say whether that’s allowed and how conflicts are resolved. Classes like NC are like saying: any project that reliably wraps up in a small number of meetings (logarithmic levels of coordination) can be done by a large but reasonable team. In contrast, P-complete tasks behave like projects with long, unbreakable dependency chains—more people just wait around. In the real world, organizations (clusters or GPUs) add constraints: communication delays (like passing papers around), memory contention (line for the copier), and coordination overhead (meetings). MapReduce is like first having everyone independently summarize their notes (map), then a well-organized meeting consolidates by topic (shuffle + reduce). GPUs are like stadiums of synchronized students all filling spreadsheets in lockstep—blazing fast if everyone does the same kind of work with neatly arranged data.
03 Formal Definition
04 When to Use
Use parallel algorithm theory when you need to reason about scalability before implementation details obscure the picture. If a task is naturally data-parallel (e.g., map/filter/scan on large arrays, image processing, vectorized ML operations), the PRAM view helps design work-efficient algorithms with small span, suggesting good speedups on CPUs and GPUs. Divide-and-conquer problems (e.g., merge sort, FFT, parallel prefix sums, tree traversals) often have logarithmic span and fit into NC-style solutions. When moving to distributed systems, MapReduce is appropriate for embarrassingly parallel maps with associative, commutative aggregations (e.g., word count, log analytics), where a shuffle-and-reduce consolidates partial results. In AI/ML, use data parallelism (replicating the model across devices and sharding data) when batch computations dominate and gradient aggregation is associative; use model/pipeline parallelism when models exceed single-device memory. GPU parallelism is ideal when the computation is SIMD-friendly: the same operation on many items, with regular control flow and memory access (e.g., matrix multiplication, convolutions). Conversely, avoid parallelization (or temper expectations) when the problem has high synchronization needs, heavy interdependencies (large span), irregular memory access, or small input sizes where overhead dominates. For P-complete tasks (e.g., general dynamic programming on arbitrary DAGs), consider approximation, special cases, or parallelizing across instances rather than within a single hard instance.
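As a rough single-machine sketch of the map/shuffle/reduce pattern described above (using std::thread and in-memory data rather than a real MapReduce framework; the name word_count and the strided sharding are illustrative choices): each thread maps its share of lines to local word counts, and a final reduce merges the per-thread tables, which is safe because adding counts is associative and commutative.

```cpp
#include <bits/stdc++.h>
using namespace std;

// "Map" phase: each thread counts words in its own shard of lines (no sharing, no locks).
// "Reduce" phase: merge the per-thread maps; adding counts is associative and commutative,
// so the merge order does not change the result.
unordered_map<string, long long> word_count(const vector<string>& lines, unsigned p = 4) {
    vector<unordered_map<string, long long>> partial(p);
    vector<thread> workers;
    for (unsigned t = 0; t < p; ++t) {
        workers.emplace_back([&, t] {
            for (size_t i = t; i < lines.size(); i += p) { // strided sharding of lines
                istringstream iss(lines[i]);
                string w;
                while (iss >> w) ++partial[t][w];          // update thread-local map only
            }
        });
    }
    for (auto& th : workers) th.join();

    unordered_map<string, long long> total;                // reduce: merge partial maps
    for (const auto& m : partial)
        for (const auto& [w, c] : m) total[w] += c;
    return total;
}

int main() {
    vector<string> lines = {"the quick brown fox", "the lazy dog", "the fox"};
    for (const auto& [w, c] : word_count(lines)) cout << w << ": " << c << "\n";
}
```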
⚠️ Common Mistakes
- Equating core count with linear speedup: real machines have cache hierarchies, memory bandwidth limits, and synchronization overhead. Use the work–span model and Brent’s theorem to bound expectations.
- Ignoring memory contention: PRAM assumes unit-time shared memory; in practice, false sharing and non-coalesced accesses on GPUs can dominate. Design for locality and coalescing.
- Over-parallelizing: spawning too many threads or tasks for tiny workloads can make performance worse than sequential due to scheduling overhead. Introduce cutoffs and batch small tasks.
- Misapplying Amdahl’s law: Amdahl’s upper bound is pessimistic when the problem size grows with p. Use Gustafson’s law to reason about scaled workloads.
- Neglecting load balance: uneven task sizes lead to idle processors and poor utilization. Use work-stealing or partition data to balance compute cost, not just item counts.
- Assuming associativity/commutativity incorrectly: MapReduce-style reductions and parallel scans require associative (and often commutative) operators; floating-point sums are not strictly associative, so use pairwise or Kahan summation to improve accuracy (see the sketch after this list).
- Expecting NC-like bounds on irregular problems: graph algorithms with high diameter or strong dependencies often have large span; consider heuristics or asynchronous algorithms.
- Forgetting determinism and races: concurrent writes need well-defined resolution; use atomics, barriers, or reductions to avoid data races.
- Measuring the wrong thing: focus on end-to-end throughput and scalability (speedup, efficiency), not just CPU utilization.
- Portability assumptions: C++ parallel patterns may need different backends (std::thread, OpenMP, TBB, CUDA) across platforms; separate algorithmic structure from implementation backend.
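The floating-point bullet above deserves a concrete illustration. Below is a minimal sketch of Kahan (compensated) summation, assuming each thread applies it to its own block before the partial block sums are merged; the name kahan_sum and the test values are illustrative.

```cpp
#include <bits/stdc++.h>
using namespace std;

// Kahan (compensated) summation: a correction term captures the low-order bits lost in
// each addition, so the accumulated rounding error stays small even for long sums.
// Each thread can sum its own block this way before the partial sums are combined.
double kahan_sum(const vector<double>& x) {
    double sum = 0.0, c = 0.0;   // c holds the running compensation (lost low-order bits)
    for (double v : x) {
        double y = v - c;        // apply the correction to the incoming value
        double t = sum + y;      // big + small: low-order bits of y may be lost here
        c = (t - sum) - y;       // recover what was lost in this step
        sum = t;
    }
    return sum;
}

int main() {
    vector<double> x(10000000, 0.1);
    double naive = 0.0;
    for (double v : x) naive += v;
    printf("naive: %.10f\n", naive);         // visible rounding drift from 1000000
    printf("kahan: %.10f\n", kahan_sum(x));  // much closer to 1000000
}
```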
Key Formulas
Work
W(n) = total number of primitive operations across all processors; work-efficient when W(n) = O(T_1(n)), the best sequential running time
Explanation: Work is the total number of primitive operations performed across all p processors. It should ideally match the best sequential complexity to be considered work-efficient.
Parallel Time
T_p(n) = running time on an input of size n using p processors
Explanation: T_p(n) measures how long an algorithm takes to run when using p processors. It is the key measure of speedup compared to the sequential time T_1(n).
Span / Depth
T_∞(n) = length of the longest chain of dependent operations (runtime with unlimited processors)
Explanation: Span is the runtime with infinitely many processors. It equals the longest chain of dependent operations and lower-bounds any achievable runtime.
Brent's Theorem
T_p ≤ W/p + T_∞
Explanation: Time on p processors is bounded by the sum of average work per processor and the span. This shows diminishing returns once p exceeds W/T_∞.
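For example (with illustrative numbers), if W = 10^6 operations and T_∞ = 20, then T_64 ≤ 10^6/64 + 20 ≈ 15,645 steps, while T_100000 ≤ 10 + 20 = 30; once p exceeds W/T_∞ = 50,000, the span term dominates and additional processors contribute little.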
Speedup
S_p = T_1 / T_p
Explanation: Speedup compares sequential time to parallel time. Ideal speedup is p, but overheads and span limit practical speedup.
Efficiency
E_p = S_p / p = T_1 / (p · T_p)
Explanation: Efficiency measures how well processors are utilized. Values near 1 indicate excellent parallel scaling; values near 0 indicate wasted resources.
Amdahl's Law
S_p ≤ 1 / (f + (1 - f)/p) ≤ 1/f
Explanation: If fraction f of a program is sequential, speedup is limited regardless of processor count. This guides expectations for fixed-size workloads.
Gustafson's Law
S_p = p - f(p - 1)
Explanation: For scaled workloads, the effective speedup grows nearly linearly with p, discounted by the fraction f of inherently sequential work.
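For example (with illustrative numbers), take a sequential fraction f = 0.05 and p = 32. Amdahl's bound for a fixed-size problem gives S ≤ 1/(0.05 + 0.95/32) ≈ 12.5, while Gustafson's scaled speedup gives 32 - 0.05·(32 - 1) ≈ 30.5; the two laws answer different questions about the same f.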
NC Class
NC = ∪_k NC^k, the problems solvable in O(log^k n) time using poly(n) processors
Explanation: NC contains problems solvable in polylogarithmic time using polynomially many processors on a PRAM. It formalizes highly parallelizable computations.
Parallel Prefix Sum
T = O(log n), W = O(n)
Explanation: The classic tree-based scan achieves logarithmic time with linear work on PRAM. It is a building block for many parallel algorithms.
Parallel Merge Sort (PRAM)
T = O(log^2 n), W = O(n log n)
Explanation: With parallel merging in O(log n) time, the overall time becomes O(log^2 n), maintaining optimal O(n log n) work.
Parallel Matrix Multiply (PRAM)
T = O(log n), W = O(n^3)
Explanation: Each output entry is an inner product; parallelizing the n multiplications per entry and using a tree reduction yields logarithmic time with cubic work.
Complexity Analysis
Code Examples
```cpp
#include <bits/stdc++.h>
using namespace std;

// Parallel exclusive scan using two-phase block method.
// Phase 1: Each thread computes exclusive scan on its block and records block sum.
// Phase 2: Compute prefix over block sums, then add offset to each block in parallel.

vector<long long> parallel_exclusive_scan(const vector<int>& a, unsigned num_threads = thread::hardware_concurrency()) {
    const size_t n = a.size();
    vector<long long> out(n, 0);
    if (n == 0) return out;

    if (num_threads == 0) num_threads = 4; // fallback
    num_threads = min<unsigned>(num_threads, static_cast<unsigned>(n));
    const size_t block_size = (n + num_threads - 1) / num_threads;

    vector<long long> blockSums(num_threads, 0);
    vector<thread> threads;
    threads.reserve(num_threads);

    // Phase 1: local exclusive scans per block
    for (unsigned t = 0; t < num_threads; ++t) {
        threads.emplace_back([&, t](){
            const size_t begin = t * block_size;
            const size_t end = min(n, begin + block_size);
            long long sum = 0;
            for (size_t i = begin; i < end; ++i) {
                out[i] = sum;   // exclusive: write previous sum
                sum += a[i];    // accumulate current element
            }
            blockSums[t] = sum; // total of this block
        });
    }
    for (auto &th : threads) th.join();

    // Compute offsets as exclusive prefix of blockSums (sequential, size = num_threads)
    vector<long long> offsets(num_threads, 0);
    long long acc = 0;
    for (unsigned t = 0; t < num_threads; ++t) {
        offsets[t] = acc;
        acc += blockSums[t];
    }

    // Phase 2: add offsets in parallel
    threads.clear();
    for (unsigned t = 0; t < num_threads; ++t) {
        threads.emplace_back([&, t](){
            const size_t begin = t * block_size;
            const size_t end = min(n, begin + block_size);
            const long long off = offsets[t];
            for (size_t i = begin; i < end; ++i) {
                out[i] += off;
            }
        });
    }
    for (auto &th : threads) th.join();

    return out;
}

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    vector<int> a = {3, 1, 7, 0, 4, 1, 6, 3};
    auto res = parallel_exclusive_scan(a);

    // Verify against sequential exclusive scan
    vector<long long> check(a.size(), 0);
    long long s = 0;
    for (size_t i = 0; i < a.size(); ++i) {
        check[i] = s;
        s += a[i];
    }

    cout << "Input: "; for (auto v: a) cout << v << ' '; cout << '\n';
    cout << "Parallel out:"; for (auto v: res) cout << ' ' << v; cout << '\n';
    cout << "Sequential :"; for (auto v: check) cout << ' ' << v; cout << '\n';

    bool ok = (res == check);
    cout << (ok ? "Match" : "Mismatch") << '\n';
    return ok ? 0 : 1;
}
```
This implementation divides the array into contiguous blocks. Each thread computes an exclusive scan on its block and records the block sum. After computing the prefix of block sums (small: one per thread), we add the corresponding offset to every element in each block in parallel. This structure mirrors PRAM’s tree-based scan but adapts to practical multicore constraints by reducing synchronization to two global phases.
```cpp
#include <bits/stdc++.h>
using namespace std;

// Parallel merge sort using std::async with a cutoff to limit task overhead.
// The algorithm remains work-efficient (O(n log n)) while reducing span through concurrency.

static const size_t CUTOFF = 1 << 14; // switch to sequential sort for small arrays

void merge_ranges(vector<int>& a, vector<int>& tmp, size_t l, size_t m, size_t r) {
    size_t i = l, j = m, k = l;
    while (i < m && j < r) tmp[k++] = (a[i] <= a[j]) ? a[i++] : a[j++];
    while (i < m) tmp[k++] = a[i++];
    while (j < r) tmp[k++] = a[j++];
    // copy back
    for (size_t t = l; t < r; ++t) a[t] = tmp[t];
}

void parallel_merge_sort_rec(vector<int>& a, vector<int>& tmp, size_t l, size_t r, unsigned depth) {
    const size_t n = r - l;
    if (n <= 1) return;
    if (n <= CUTOFF || depth == 0) {
        sort(a.begin() + l, a.begin() + r);
        return;
    }
    size_t m = l + n / 2;
    // Spawn one half asynchronously and sort the other in current thread
    auto fut = async(launch::async, [&]{ parallel_merge_sort_rec(a, tmp, l, m, depth - 1); });
    parallel_merge_sort_rec(a, tmp, m, r, depth - 1);
    fut.get();
    merge_ranges(a, tmp, l, m, r);
}

void parallel_merge_sort(vector<int>& a) {
    vector<int> tmp(a.size());
    unsigned hw = thread::hardware_concurrency();
    if (hw == 0) hw = 4;
    // Depth controls number of concurrent tasks: up to 2^depth tasks
    unsigned depth = static_cast<unsigned>(log2(hw));
    parallel_merge_sort_rec(a, tmp, 0, a.size(), depth);
}

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    vector<int> a(1 << 20);
    iota(a.begin(), a.end(), 0);
    // Shuffle to create a random permutation
    mt19937 rng(123);
    shuffle(a.begin(), a.end(), rng);

    vector<int> b = a;
    parallel_merge_sort(a);
    sort(b.begin(), b.end());

    cout << (a == b ? "Sorted correctly" : "Incorrectly sorted") << "\n";
    return a == b ? 0 : 1;
}
```
The array is recursively split into halves. While the subarray is large enough and depth budget remains, one half is sorted asynchronously while the other is processed in the current thread, and the two halves are then merged. A cutoff avoids excessive task creation. The algorithm maintains O(n log n) work; the PRAM variant with O(log n)-time parallel merging achieves roughly O(log^2 n) span, whereas this implementation merges sequentially, so its parallelism comes from sorting the halves concurrently, and in practice speed depends on the scheduler, cutoff, and memory bandwidth.
```cpp
#include <bits/stdc++.h>
using namespace std;

// Compute C = A * B where A, B, C are n x n matrices (dense), parallelized by rows.
// For better cache locality, we transpose B first so that inner loop reads contiguous memory.

typedef vector<double> Vec;
typedef vector<Vec> Mat;

Mat transpose(const Mat& B) {
    size_t n = B.size();
    Mat Bt(n, Vec(n));
    for (size_t i = 0; i < n; ++i)
        for (size_t j = 0; j < n; ++j)
            Bt[j][i] = B[i][j];
    return Bt;
}

Mat parallel_matmul(const Mat& A, const Mat& B, unsigned num_threads = thread::hardware_concurrency()) {
    size_t n = A.size();
    Mat C(n, Vec(n, 0.0));
    if (n == 0) return C;
    if (num_threads == 0) num_threads = 4;
    num_threads = min<unsigned>(num_threads, static_cast<unsigned>(n));

    Mat Bt = transpose(B); // improves memory access patterns

    size_t rows_per_thread = (n + num_threads - 1) / num_threads;
    vector<thread> threads;
    threads.reserve(num_threads);

    for (unsigned t = 0; t < num_threads; ++t) {
        threads.emplace_back([&, t]() {
            size_t r_begin = t * rows_per_thread;
            size_t r_end = min(n, r_begin + rows_per_thread);
            for (size_t i = r_begin; i < r_end; ++i) {
                for (size_t j = 0; j < n; ++j) {
                    const double* ai = A[i].data();
                    const double* btj = Bt[j].data();
                    double sum = 0.0;
                    // Inner product of A[i][:] and B[:][j] (via Bt[j][:])
                    for (size_t k = 0; k < n; ++k) sum += ai[k] * btj[k];
                    C[i][j] = sum;
                }
            }
        });
    }
    for (auto& th : threads) th.join();

    return C;
}

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    size_t n = 4;
    Mat A(n, Vec(n)), B(n, Vec(n));
    // Simple test: A = identity, C should equal B
    for (size_t i = 0; i < n; ++i) {
        for (size_t j = 0; j < n; ++j) {
            A[i][j] = (i == j) ? 1.0 : 0.0;
            B[i][j] = (i + j);
        }
    }
    Mat C = parallel_matmul(A, B);
    bool ok = true;
    for (size_t i = 0; i < n; ++i)
        for (size_t j = 0; j < n; ++j)
            ok = ok && (fabs(C[i][j] - B[i][j]) < 1e-9);

    cout << (ok ? "Matmul correct" : "Matmul incorrect") << "\n";
    return ok ? 0 : 1;
}
```
The computation assigns contiguous row blocks to threads. Each thread computes C[i][j] as an inner product; transposing B turns the innermost loop into a contiguous memory access, improving cache behavior. This data-parallel, SIMD-friendly structure also maps well to GPUs, where each output element can be computed independently.