27-4 Multithreading reductions and prefix computations
A $\otimes$-reduction of an array $x[1 \ldots n]$, where $\otimes$ is an associative operator, is the value
$$y = x[1] \otimes x[2] \otimes \cdots \otimes x[n].$$
The following procedure computes the $\otimes$-reduction of a subarray $x[i \ldots j]$ serially.
a. Use nested parallelism to implement a multithreaded algorithm $\text{P-REDUCE}$, which performs the same function with $\Theta(n)$ work and $\Theta(\lg n)$ span. Analyze your algorithm.
A related problem is that of computing a $\otimes$-prefix computation, sometimes called a $\otimes$-scan, on an array $x[1 \ldots n]$, where $\otimes$ is once again an associative operator. The $\otimes$-scan produces the array $y[1 \ldots n]$ given by
$$ \begin{aligned} y[1] & = x[1], \\ y[2] & = x[1] \otimes x[2], \\ y[3] & = x[1] \otimes x[2] \otimes x[3], \\ & \vdots \\ y[n] & = x[1] \otimes x[2] \otimes x[3] \otimes \cdots \otimes x[n], \end{aligned} $$
that is, all prefixes of the array $x$ "summed" using $\otimes$ operator. The following serial procedure $\text{SCAN}$ performs a $\otimes$-prefix computation:
SCAN(x) n = x.length let y[1..n] be a new array y[1] = x[1] for i = 2 to n y[i] = y[i - 1] ⊗ x[i] return y
Unfortunately, multithreading $\text{SCAN}$ is not straightforward. For example, changing the for loop to a parallel for loop would create races, since each iteration of the loop body depends on the previous iteration. The following procedure $\text{P-SCAN-1}$ performs the $\otimes$-prefix computation in parallel, albeit inefficiently.
b. Analyze the work, span, and parallelism of $\text{P-SCAN-1}$.
By using nested parallelism, we can obtain a more efficient $\otimes$-prefix computation:
P-SCAN-2-AUX(x, y, i, j) if i == j y[i] = x[i] else k = floor((i + j) / 2) spawn P-SCAN-2-AUX(x, y, i, k) P-SCAN-2-AUX(x, y, k + 1, j) sync parallel for l = k + 1 to j y[l] = y[k] ⊗ y[l]
c. Argue that $\text{P-SCAN-2}$ is correct, and analyze its work, span, and parallelism.
We can improve on both $\text{P-SCAN-1}$ and $\text{P-SCAN-2}$ by performing the $\otimes$-prefix computation in two distinct passes over the data. On the first pass, we gather the terms for various contiguous subarrays of $x$ into a temporary array $t$, and on the second pass we use the terms in $t$ to compute the final result $y$. The following pseudocode implements this strategy, but certain expressions have been omitted:
P-SCAN-3(x) n = x.length let y[1..n] and t[1..n] be new arrays y[1] = x[1] if n > 1 P-SCAN-UP(x, t, 2, n) P-SCAN-DOWN(x[1], x, t, y, 2, n) return y
P-SCAN-UP(x, t, i, j) if i == j return x[i] else k = floor((i + j) / 2) t[k] = spawn P-SCAN-UP(x, t, i, k) right = P-SCAN-UP(x, t, k + 1, j) sync return ____ // fill in the blank
P-SCAN-DOWN(v, x, t, y, i, j) if i == j y[i] = v ⊗ x[i] else k = floor((i + j) / 2) spawn P-SCAN-DOWN(____, x, t, y, i, k) // fill in the blank P-SCAN-DOWN(____, x, t, y, k + 1, j) // fill in the blank sync
d. Fill in the three missing expressions in line 8 of $\text{P-SCAN-UP}$ and lines 5 and 6 of $\text{P-SCAN-DOWN}$. Argue that with expressions you supplied, $\text{P-SCAN-3}$ is correct. ($\textit{Hint:}$ Prove that the value $v$ passed to $\text{P-SCAN-DOWN}(v, x, t, y, i, j)$ satisfies $v = x[1] \otimes x[2] \otimes \cdots \otimes x[i - 1]$.)
e. Analyze the work, span, and parallelism of $\text{P-SCAN-3}$.
(Removed)