📚 Theory · Intermediate

Parallel Algorithm Theory

Key Points

  • Parallel algorithm theory studies how to solve problems faster by coordinating many processors that share work and memory.
  • The PRAM model abstracts ideal parallel machines and uses time T and work W to measure performance and efficiency.
  • Work-efficient algorithms match the total work of the best sequential algorithm up to constants, avoiding wasted parallel effort.
  • NC is the class of problems solvable in polylogarithmic time with a polynomial number of processors and captures highly parallelizable problems.
  • P-complete problems are believed to be inherently sequential and unlikely to admit NC algorithms.
  • Amdahl’s law limits speedup by the sequential fraction, while Gustafson’s law explains scaling when the problem size grows with processors.
  • MapReduce expresses large-scale parallelism across machines via map and reduce stages with communication (shuffle) between them.
  • GPUs implement massive SIMD-style data parallelism that accelerates ML training and inference when memory access is coalesced and work is regular.

Prerequisites

  • Asymptotic analysis and Big-O notation: to interpret work, span, and time bounds like O(n log n) and O(log n).
  • Basic concurrency in C++ (threads, futures, mutexes): to implement parallel patterns and understand synchronization and overheads.
  • Divide-and-conquer algorithms: many parallel algorithms (e.g., merge sort, scan) are structured using recursive splitting.
  • Matrix operations and linear algebra basics: to understand the data-parallel computations ubiquitous in ML and matrix multiplication.
  • CPU memory hierarchy (caches) and locality: memory bandwidth and locality strongly affect parallel scalability in practice.
  • Associativity and commutativity of operations: parallel reductions and scans rely on associative (and often commutative) operators.

Detailed Explanation


01 Overview

Parallel algorithm theory asks: if I have many processors, how much faster can I solve a problem? To answer, we use clean machine models that strip away hardware quirks and focus on core limits. The most famous is PRAM (Parallel Random Access Machine), where many simple processors share a single, uniform-speed memory. In PRAM, we measure parallel time T (how many steps until the answer completes) and work W (total operations across all processors). A core goal is to design work-efficient algorithms: those that use about as many total operations as the best sequential solution, but finish much sooner by parallelizing steps. From this lens arises the complexity class NC (Nick’s Class), which contains problems solvable in polylogarithmic time using polynomially many processors—think highly parallel tasks like prefix sums, balanced tree operations, and some linear algebra. Not all problems parallelize well: P-complete problems, like general circuit evaluation or certain dynamic programs, appear inherently sequential and thus unlikely to be in NC. Practical models like MapReduce capture cluster-scale parallelism with communication phases, while GPUs embody data-parallel SIMD execution ideal for vectorizable workloads in AI/ML. Amdahl’s and Gustafson’s laws provide sanity checks on expected speedups. Together, these ideas help you predict which problems can scale on multicore CPUs, clusters, or GPUs, and how to structure algorithms to achieve near-ideal speedups.

02 Intuition & Analogies

Imagine a group project where each person can do a part independently. If the project splits into many equal, independent tasks (like labeling thousands of images), then adding people helps a lot: everyone labels a chunk, and you finish much sooner. That is data parallelism. Now imagine some parts depend on others (you can’t write the conclusion before the analysis). These dependencies form a chain that everyone must respect—the critical path. Even with a huge team, you cannot beat the time of that longest dependence chain. This is exactly the idea behind span (also called depth): the shortest possible completion time if you had infinitely many helpers is bounded by the longest chain of dependent steps. The PRAM model is like imagining a huge quiet library with many desks (processors) and one enormous bookshelf (shared memory). Everyone can read/write instantly without bumping into each other—clearly unrealistic, but great for understanding what is possible if coordination were free. If many students try to grab the same book at once, different PRAM variants say whether that’s allowed and how conflicts are resolved. Classes like NC are like saying: any project that reliably wraps up in a small number of meetings (logarithmic levels of coordination) can be done by a large but reasonable team. In contrast, P-complete tasks behave like projects with long, unbreakable dependency chains—more people just wait around. In the real world, organizations (clusters or GPUs) add constraints: communication delays (like passing papers around), memory contention (line for the copier), and coordination overhead (meetings). MapReduce is like first having everyone independently summarize their notes (map), then a well-organized meeting consolidates by topic (shuffle + reduce). GPUs are like stadiums of synchronized students all filling spreadsheets in lockstep—blazing fast if everyone does the same kind of work with neatly arranged data.

03 Formal Definition

In the PRAM (Parallel Random Access Machine) model, computation proceeds in synchronized steps across p processors with unit-time access to a shared memory of unbounded size. Variants differ by how they handle simultaneous access to the same memory address: EREW (Exclusive Read Exclusive Write) forbids both concurrent reads and concurrent writes; CREW allows concurrent reads but not concurrent writes; CRCW also allows concurrent writes, with subvariants for how write conflicts are resolved (e.g., arbitrary, priority, or common-write requiring equal values). For an input of size n, a parallel algorithm’s time on p processors is T_p(n), and its work is W(n) = p · T_p(n) when load-balanced; more generally, W(n) is the total number of primitive operations over all processors. The span (or depth) T_∞(n) is the time with an unbounded number of processors, equal to the length of the critical path in the algorithm’s dependency DAG. A parallel algorithm is work-efficient if W(n) = O(T_seq(n)), where T_seq(n) is the optimal sequential time. Brent’s theorem (also called the work–span bound) connects these measures: for any p, T_p(n) = O(W(n)/p + T_∞(n)). The class NC consists of decision problems solvable in T(n) = O((log n)^k) time on a PRAM with at most n^c processors, for some constants k, c ≥ 1; subclasses NC^i correspond to O((log n)^i) time. P-complete problems (under NC reductions) are believed not to be in NC, indicating likely inherent sequentiality. Practical models like MapReduce abstract massive parallelism with constrained communication phases, while GPU models assume SIMD/SIMT execution with high arithmetic throughput but structured memory access.
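For quick reference, the same definitions in compact LaTeX form (T_seq denotes the optimal sequential time; this restates the paragraph above and adds nothing new):

W(n) = \text{total primitive operations}, \qquad T_\infty(n) = \text{length of the critical path}
T_p(n) = O\big(W(n)/p + T_\infty(n)\big) \quad \text{(Brent's theorem)}
\text{work-efficient} \iff W(n) = O(T_{\mathrm{seq}}(n))
\mathrm{NC} = \bigcup_{k \ge 1} \mathrm{NC}^k, \qquad \mathrm{NC}^k : \text{solvable in } O((\log n)^k) \text{ time with } n^{O(1)} \text{ processors}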

04 When to Use

Use parallel algorithm theory when you need to reason about scalability before implementation details obscure the picture. If a task is naturally data-parallel (e.g., map/filter/scan on large arrays, image processing, vectorized ML operations), the PRAM view helps design work-efficient algorithms with small span, suggesting good speedups on CPUs and GPUs. Divide-and-conquer problems (e.g., merge sort, FFT, parallel prefix sums, tree traversals) often have logarithmic span and fit into NC-style solutions. When moving to distributed systems, MapReduce is appropriate for embarrassingly parallel maps with associative, commutative aggregations (e.g., word count, log analytics), where a shuffle-and-reduce consolidates partial results. In AI/ML, use data parallelism (replicating the model across devices and sharding data) when batch computations dominate and gradient aggregation is associative; use model/pipeline parallelism when models exceed single-device memory. GPU parallelism is ideal when the computation is SIMD-friendly: the same operation on many items, with regular control flow and memory access (e.g., matrix multiplication, convolutions). Conversely, avoid parallelization (or temper expectations) when the problem has high synchronization needs, heavy interdependencies (large span), irregular memory access, or small input sizes where overhead dominates. For P-complete tasks (e.g., general dynamic programming on arbitrary DAGs), consider approximation, special cases, or parallelizing across instances rather than within a single hard instance.
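To make the data-parallel case concrete, here is a minimal sketch using the C++17 parallel algorithms. std::execution::par, std::reduce, and std::transform_reduce are standard, but whether they actually run in parallel depends on the toolchain (e.g., GCC's libstdc++ needs to link against TBB); the sizes and values below are purely illustrative.

#include <execution>
#include <iostream>
#include <numeric>
#include <vector>

int main() {
    std::vector<double> x(1'000'000, 0.5), y(1'000'000, 2.0);

    // Data-parallel map + associative reduction: the dot product of x and y.
    // The par policy lets the library distribute the work across threads.
    double dot = std::transform_reduce(std::execution::par,
                                       x.begin(), x.end(), y.begin(),
                                       0.0); // defaults: plus<> reduce, multiplies<> transform

    // A plain associative reduction (sum) over one range.
    // Note: floating-point addition is not strictly associative, so the result
    // may differ slightly from a left-to-right sequential loop.
    double sum = std::reduce(std::execution::par, x.begin(), x.end(), 0.0);

    std::cout << "dot = " << dot << ", sum = " << sum << '\n';
    return 0;
}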

⚠️ Common Mistakes

• Equating core count with linear speedup: real machines have cache hierarchies, memory bandwidth limits, and synchronization overhead. Use the work–span model and Brent’s theorem to bound expectations.
• Ignoring memory contention: PRAM assumes unit-time shared memory; in practice, false sharing and non-coalesced accesses on GPUs can dominate. Design for locality and coalescing.
• Over-parallelizing: spawning too many threads or tasks for tiny workloads can make performance worse than sequential due to scheduling overhead. Introduce cutoffs and batch small tasks.
• Misapplying Amdahl’s law: Amdahl’s upper bound is pessimistic when the problem size grows with p. Use Gustafson’s law to reason about scaled workloads.
• Neglecting load balance: uneven task sizes lead to idle processors and poor utilization. Use work stealing or divide data to match compute cost, not just item counts.
• Assuming associativity/commutativity incorrectly: MapReduce-style reductions and parallel scans require associative (and often commutative) operators; floating-point sums are not strictly associative, so use pairwise or Kahan summation to improve accuracy (see the sketch after this list).
• Expecting NC-like bounds on irregular problems: graph algorithms with high diameter or strong dependencies often have large span; consider heuristics or asynchronous algorithms.
• Forgetting determinism and races: concurrent writes need well-defined resolution; use atomics, barriers, or reductions to avoid data races.
• Measuring the wrong thing: focus on end-to-end throughput and scalability (speedup, efficiency), not just CPU utilization.
• Portability assumptions: C++ parallel patterns may need different backends (std::thread, OpenMP, TBB, CUDA) across platforms; separate algorithmic structure from implementation backend.
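As a concrete mitigation for the floating-point associativity pitfall above, here is a minimal sketch of Kahan (compensated) summation; the function name and test values are illustrative, not taken from this article.

#include <iostream>
#include <vector>

// Kahan (compensated) summation: carries a running error term `c`
// so that small addends are not lost when added to a large running sum.
double kahan_sum(const std::vector<double>& v) {
    double sum = 0.0;
    double c = 0.0;             // compensation for lost low-order bits
    for (double x : v) {
        double y = x - c;       // subtract previously lost low-order bits
        double t = sum + y;     // low-order bits of y may be lost here
        c = (t - sum) - y;      // recover what was just lost
        sum = t;
    }
    return sum;
}

int main() {
    std::vector<double> v(1'000'000, 0.1);  // exact sum would be 100000
    double naive = 0.0;
    for (double x : v) naive += x;

    std::cout.precision(17);
    std::cout << "kahan: " << kahan_sum(v) << '\n';  // much closer to 100000
    std::cout << "naive: " << naive << '\n';
    return 0;
}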

Key Formulas

Work: W(n) = total primitive operations across all processors ≈ p · T_p(n) when load-balanced

Explanation: Work is the total number of primitive operations performed across all p processors. It should ideally match the best sequential complexity to be considered work-efficient.

Parallel Time: T_p(n)

Explanation: T_p(n) measures how long an algorithm takes to run when using p processors. It is the key measure of speedup compared to sequential time T_1(n).

Span / Depth: T_∞(n)

Explanation: Span is the runtime with infinitely many processors. It equals the longest chain of dependent operations and lower-bounds any achievable runtime.

Brent's Theorem: T_p(n) ≤ W(n)/p + T_∞(n)

Explanation: Time on p processors is bounded by the sum of average work per processor and the span. This shows diminishing returns once p exceeds W/T_∞.

Speedup: S_p = T_1(n) / T_p(n)

Explanation: Speedup compares sequential time to parallel time. Ideal speedup is p, but overheads and span limit practical speedup.

Efficiency: E_p = S_p / p = T_1(n) / (p · T_p(n))

Explanation: Efficiency measures how well processors are utilized. Values near 1 indicate excellent parallel scaling; values near 0 indicate wasted resources.

Amdahl's Law: S(p) = 1 / (f + (1 - f)/p) ≤ 1/f

Explanation: If fraction f of a program is sequential, speedup is limited regardless of processor count. This guides expectations for fixed-size workloads.

Gustafson's Law: S(p) = f + (1 - f)·p = p - f·(p - 1)

Explanation: For scaled workloads, the effective speedup grows nearly linearly with p, discounted by the fraction f of inherently sequential work (a small numeric sketch of both laws follows this formula list).

NC Class: NC = ⋃_{k≥1} NC^k, where NC^k = problems solvable in O((log n)^k) time with n^O(1) processors

Explanation: NC contains problems solvable in polylogarithmic time using polynomially many processors on a PRAM. It formalizes highly parallelizable computations.

Parallel Prefix Sum: T = O(log n), W = O(n)

Explanation: The classic tree-based scan achieves logarithmic time with linear work on PRAM. It is a building block for many parallel algorithms.

Parallel Merge Sort (PRAM): T = O(log^2 n), W = O(n log n)

Explanation: With parallel merging in O(log n), the overall time becomes O(log^2 n), maintaining optimal O(n log n) work.

Parallel Matrix Multiply (PRAM): T = O(log n), W = O(n^3)

Explanation: Each output entry is an inner product; parallelizing the n multiplications and using a tree reduction yields logarithmic time with cubic work.
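A small numeric sketch of the speedup laws above, assuming a 5% sequential fraction purely for illustration:

#include <cstdio>

// Amdahl: fixed problem size; Gustafson: problem size scales with p.
double amdahl(double f, double p)    { return 1.0 / (f + (1.0 - f) / p); }
double gustafson(double f, double p) { return p - f * (p - 1.0); }

int main() {
    const double f = 0.05;  // assumed sequential fraction (illustrative only)
    for (double p : {2.0, 8.0, 64.0, 1024.0}) {
        std::printf("p=%6.0f  Amdahl=%7.2f  Gustafson=%8.2f\n",
                    p, amdahl(f, p), gustafson(f, p));
    }
    return 0;
}

Even at p = 1024, Amdahl's bound stays below 1/f = 20, while the Gustafson estimate keeps growing because the parallel portion of the workload grows with p.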

Complexity Analysis

Parallel algorithms are judged by both their parallel time T_p(n) and total work W(n). Work-efficiency (W(n) = O(T_seq(n))) ensures we do not expend more total operations than necessary, while small span T_∞(n) enables strong speedups as p grows. Brent’s theorem, T_p(n) = O(W(n)/p + T_∞(n)), reveals two bottlenecks: insufficient parallelism (the W/p term) and inherent dependencies (the span term). For the block-parallel prefix sum shown below, each thread performs Θ(n/p) local work, and we incur an additional Θ(p) work for prefixing the block sums, giving T_p = Θ(n/p + p) and W = Θ(n), achieving work-efficiency for p ≪ n. For parallel merge sort, the total work remains O(n log n), but span is typically O(log^2 n) on PRAM due to parallel merges; thus T_p = O(n log n / p + log^2 n). In practice with std::async and cutoffs, scheduler overhead adds to the span, so we rely on coarse-grained tasks. For parallel matrix multiplication dividing rows among threads, work is Θ(n^3) and parallel time is roughly Θ(n^3/p) plus synchronization overhead; without algorithmic changes, we cannot reduce work below cubic, but we can improve constants with cache-aware blocking and data locality. Space complexity often increases to hold temporaries (e.g., O(n) extra for merge buffers or scans, O(n^2) for matrices). Real hardware introduces non-PRAM costs: cache misses, memory bandwidth, NUMA effects, and synchronization. Therefore, algorithmic bounds provide optimistic ceilings; measured runtimes depend heavily on memory access patterns and load balance. To approach theoretical performance, combine work-efficient designs with careful partitioning, cutoffs, locality optimizations, and associative/commutative reductions to minimize synchronization.
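As a worked instance of Brent's bound for the block scan (the minimization over p is an added observation consistent with the analysis above):

W(n) = \Theta(n), \qquad T_p(n) = \Theta\big(n/p + p\big)
\min_p \big(n/p + p\big) = \Theta(\sqrt{n}) \ \text{at } p = \Theta(\sqrt{n}), \qquad S_p = \Theta(n) / \Theta(n/p + p) \approx p \ \text{for } p \ll \sqrt{n}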

Code Examples

Parallel Prefix Sum (Exclusive Scan) via Block Decomposition
#include <bits/stdc++.h>
using namespace std;

// Parallel exclusive scan using two-phase block method.
// Phase 1: Each thread computes exclusive scan on its block and records block sum.
// Phase 2: Compute prefix over block sums, then add offset to each block in parallel.

vector<long long> parallel_exclusive_scan(const vector<int>& a, unsigned num_threads = thread::hardware_concurrency()) {
    const size_t n = a.size();
    vector<long long> out(n, 0);
    if (n == 0) return out;

    if (num_threads == 0) num_threads = 4; // fallback
    num_threads = min<unsigned>(num_threads, static_cast<unsigned>(n));
    const size_t block_size = (n + num_threads - 1) / num_threads;

    vector<long long> blockSums(num_threads, 0);
    vector<thread> threads;
    threads.reserve(num_threads);

    // Phase 1: local exclusive scans per block
    for (unsigned t = 0; t < num_threads; ++t) {
        threads.emplace_back([&, t]() {
            const size_t begin = t * block_size;
            const size_t end = min(n, begin + block_size);
            long long sum = 0;
            for (size_t i = begin; i < end; ++i) {
                out[i] = sum;   // exclusive: write previous sum
                sum += a[i];    // accumulate current element
            }
            blockSums[t] = sum; // total of this block
        });
    }
    for (auto& th : threads) th.join();

    // Compute offsets as exclusive prefix of blockSums (sequential, size = num_threads)
    vector<long long> offsets(num_threads, 0);
    long long acc = 0;
    for (unsigned t = 0; t < num_threads; ++t) {
        offsets[t] = acc;
        acc += blockSums[t];
    }

    // Phase 2: add offsets in parallel
    threads.clear();
    for (unsigned t = 0; t < num_threads; ++t) {
        threads.emplace_back([&, t]() {
            const size_t begin = t * block_size;
            const size_t end = min(n, begin + block_size);
            const long long off = offsets[t];
            for (size_t i = begin; i < end; ++i) {
                out[i] += off;
            }
        });
    }
    for (auto& th : threads) th.join();

    return out;
}

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    vector<int> a = {3, 1, 7, 0, 4, 1, 6, 3};
    auto res = parallel_exclusive_scan(a);

    // Verify against sequential exclusive scan
    vector<long long> check(a.size(), 0);
    long long s = 0;
    for (size_t i = 0; i < a.size(); ++i) {
        check[i] = s;
        s += a[i];
    }

    cout << "Input: ";       for (auto v : a) cout << v << ' ';      cout << '\n';
    cout << "Parallel out:"; for (auto v : res) cout << ' ' << v;    cout << '\n';
    cout << "Sequential :";  for (auto v : check) cout << ' ' << v;  cout << '\n';

    bool ok = (res == check);
    cout << (ok ? "Match" : "Mismatch") << '\n';
    return ok ? 0 : 1;
}

This implementation divides the array into contiguous blocks. Each thread computes an exclusive scan on its block and records the block sum. After computing the prefix of block sums (small: one per thread), we add the corresponding offset to every element in each block in parallel. This structure mirrors PRAM’s tree-based scan but adapts to practical multicore constraints by reducing synchronization to two global phases.

Time: O(n/p + p) parallel time; O(n) work. Space: O(n + p) extra space for the output and per-block metadata.
Parallel Merge Sort with std::async and Cutoff
#include <bits/stdc++.h>
using namespace std;

// Parallel merge sort using std::async with a cutoff to limit task overhead.
// The algorithm remains work-efficient (O(n log n)) while reducing span through concurrency.

static const size_t CUTOFF = 1 << 14; // switch to sequential sort for small arrays

void merge_ranges(vector<int>& a, vector<int>& tmp, size_t l, size_t m, size_t r) {
    size_t i = l, j = m, k = l;
    while (i < m && j < r) tmp[k++] = (a[i] <= a[j]) ? a[i++] : a[j++];
    while (i < m) tmp[k++] = a[i++];
    while (j < r) tmp[k++] = a[j++];
    // copy back
    for (size_t t = l; t < r; ++t) a[t] = tmp[t];
}

void parallel_merge_sort_rec(vector<int>& a, vector<int>& tmp, size_t l, size_t r, unsigned depth) {
    const size_t n = r - l;
    if (n <= 1) return;
    if (n <= CUTOFF || depth == 0) {
        sort(a.begin() + l, a.begin() + r);
        return;
    }
    size_t m = l + n / 2;
    // Spawn one half asynchronously and sort the other in the current thread
    auto fut = async(launch::async, [&] { parallel_merge_sort_rec(a, tmp, l, m, depth - 1); });
    parallel_merge_sort_rec(a, tmp, m, r, depth - 1);
    fut.get();
    merge_ranges(a, tmp, l, m, r);
}

void parallel_merge_sort(vector<int>& a) {
    vector<int> tmp(a.size());
    unsigned hw = thread::hardware_concurrency();
    if (hw == 0) hw = 4;
    // Depth controls number of concurrent tasks: up to 2^depth tasks
    unsigned depth = static_cast<unsigned>(log2(hw));
    parallel_merge_sort_rec(a, tmp, 0, a.size(), depth);
}

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    vector<int> a(1 << 20);
    iota(a.begin(), a.end(), 0);
    // Shuffle to create a random permutation
    mt19937 rng(123);
    shuffle(a.begin(), a.end(), rng);

    vector<int> b = a;
    parallel_merge_sort(a);
    sort(b.begin(), b.end());

    cout << (a == b ? "Sorted correctly" : "Incorrectly sorted") << "\n";
    return a == b ? 0 : 1;
}

The array is recursively split into halves. For sufficiently large subarrays and remaining depth budget, one half is sorted asynchronously while the other is processed in the current thread, then the two halves are merged. A cutoff avoids excessive task creation. The algorithm maintains O(n log n) work; on PRAM, parallelizing the merge step as well reduces span to roughly O(log^2 n) (a sketch of such a parallel merge follows the complexity note below), whereas this version keeps the merge sequential, so in practice speed depends on the scheduler, cutoff, and memory bandwidth.

Time: O(n log n) work; with sequential merges the critical path is Ω(n), so practical parallel time is roughly O(n log n / p + n) with cutoffs. Space: O(n) auxiliary buffer for merging.
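The O(log^2 n) PRAM span cited above relies on parallelizing the merge itself, which the example keeps sequential for simplicity. Below is a separate, minimal sketch of a divide-and-conquer parallel merge (split on the median of the larger run, locate it in the other run by binary search, and merge the two halves concurrently with std::async); the cutoff constant and test harness are illustrative assumptions.

#include <bits/stdc++.h>
using namespace std;

static const size_t MERGE_CUTOFF = 1 << 15; // below this, fall back to std::merge

// Merge sorted runs src[l1,r1) and src[l2,r2) into dst starting at `out`.
// The median of the larger run splits both runs; the halves merge concurrently.
void parallel_merge(const vector<int>& src, vector<int>& dst,
                    size_t l1, size_t r1, size_t l2, size_t r2, size_t out) {
    size_t n1 = r1 - l1, n2 = r2 - l2;
    if (n1 < n2) { swap(l1, l2); swap(r1, r2); swap(n1, n2); } // make run 1 the larger
    if (n1 == 0) return;
    if (n1 + n2 <= MERGE_CUTOFF) {
        merge(src.begin() + l1, src.begin() + r1,
              src.begin() + l2, src.begin() + r2, dst.begin() + out);
        return;
    }
    size_t m1 = l1 + n1 / 2;                                   // median of the larger run
    size_t m2 = lower_bound(src.begin() + l2, src.begin() + r2, src[m1]) - src.begin();
    size_t out_mid = out + (m1 - l1) + (m2 - l2);              // final position of src[m1]
    dst[out_mid] = src[m1];
    auto fut = async(launch::async,
                     [&] { parallel_merge(src, dst, l1, m1, l2, m2, out); });
    parallel_merge(src, dst, m1 + 1, r1, m2, r2, out_mid + 1);
    fut.get();
}

int main() {
    // Two sorted runs stored back to back, as they would appear inside merge sort.
    vector<int> src(1 << 18);
    mt19937 rng(7);
    for (int& x : src) x = static_cast<int>(rng() % 1000000);
    size_t mid = src.size() / 2;
    sort(src.begin(), src.begin() + mid);
    sort(src.begin() + mid, src.end());

    vector<int> dst(src.size());
    parallel_merge(src, dst, 0, mid, mid, src.size(), 0);

    cout << (is_sorted(dst.begin(), dst.end()) ? "Merged correctly" : "Merge failed") << '\n';
    return 0;
}

Because each recursive call removes at least the median element, the recursion terminates, and on PRAM this merge has O(log n) levels of O(log n) binary searches, which is how the O(log^2 n) sort span arises.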
Parallel Matrix Multiplication (Row-Partitioned Threads)
#include <bits/stdc++.h>
using namespace std;

// Compute C = A * B where A, B, C are n x n matrices (dense), parallelized by rows.
// For better cache locality, we transpose B first so that inner loop reads contiguous memory.

typedef vector<double> Vec;
typedef vector<Vec> Mat;

Mat transpose(const Mat& B) {
    size_t n = B.size();
    Mat Bt(n, Vec(n));
    for (size_t i = 0; i < n; ++i)
        for (size_t j = 0; j < n; ++j)
            Bt[j][i] = B[i][j];
    return Bt;
}

Mat parallel_matmul(const Mat& A, const Mat& B, unsigned num_threads = thread::hardware_concurrency()) {
    size_t n = A.size();
    Mat C(n, Vec(n, 0.0));
    if (n == 0) return C;
    if (num_threads == 0) num_threads = 4;
    num_threads = min<unsigned>(num_threads, static_cast<unsigned>(n));

    Mat Bt = transpose(B); // improves memory access patterns

    size_t rows_per_thread = (n + num_threads - 1) / num_threads;
    vector<thread> threads;
    threads.reserve(num_threads);

    for (unsigned t = 0; t < num_threads; ++t) {
        threads.emplace_back([&, t]() {
            size_t r_begin = t * rows_per_thread;
            size_t r_end = min(n, r_begin + rows_per_thread);
            for (size_t i = r_begin; i < r_end; ++i) {
                for (size_t j = 0; j < n; ++j) {
                    const double* ai = A[i].data();
                    const double* btj = Bt[j].data();
                    double sum = 0.0;
                    // Inner product of A[i][:] and B[:][j] (via Bt[j][:])
                    for (size_t k = 0; k < n; ++k) sum += ai[k] * btj[k];
                    C[i][j] = sum;
                }
            }
        });
    }
    for (auto& th : threads) th.join();

    return C;
}

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    size_t n = 4;
    Mat A(n, Vec(n)), B(n, Vec(n));
    // Simple test: A = identity, C should equal B
    for (size_t i = 0; i < n; ++i) {
        for (size_t j = 0; j < n; ++j) {
            A[i][j] = (i == j) ? 1.0 : 0.0;
            B[i][j] = (i + j);
        }
    }
    Mat C = parallel_matmul(A, B);
    bool ok = true;
    for (size_t i = 0; i < n; ++i)
        for (size_t j = 0; j < n; ++j)
            ok = ok && (fabs(C[i][j] - B[i][j]) < 1e-9);

    cout << (ok ? "Matmul correct" : "Matmul incorrect") << "\n";
    return ok ? 0 : 1;
}

The computation assigns contiguous row blocks to threads. Each thread computes C[i][j] as an inner product; transposing B turns the innermost loop into a contiguous memory access, improving cache behavior. This mirrors data-parallel SIMD-friendly structure that maps well to GPUs too, where each output element is computed independently.

Time: O(n^3/p) parallel time; O(n^3) work. Space: O(n^2) for the output plus an O(n^2) temporary for B^T.