# 27-4 Multithreading reductions and prefix computations

A $\otimes$-reduction of an array $x[1 \ldots n]$, where $\otimes$ is an associative operator, is the value

$$y = x \otimes x \otimes \cdots \otimes x[n].$$

The following procedure computes the $\otimes$-reduction of a subarray $x[i \ldots j]$ serially.

REDUCE(x, i, j)
y = x[i]
for k = i + 1 to j
y = y ⊗ x[k]
return y


a. Use nested parallelism to implement a multithreaded algorithm $\text{P-REDUCE}$, which performs the same function with $\Theta(n)$ work and $\Theta(\lg n)$ span. Analyze your algorithm.

A related problem is that of computing a $\otimes$-prefix computation, sometimes called a $\otimes$-scan, on an array $x[1 \ldots n]$, where $\otimes$ is once again an associative operator. The $\otimes$-scan produces the array $y[1 \ldots n]$ given by

\begin{aligned} y & = x, \\ y & = x \otimes x, \\ y & = x \otimes x \otimes x, \\ & \vdots \\ y[n] & = x \otimes x \otimes x \otimes \cdots \otimes x[n], \end{aligned}

that is, all prefixes of the array $x$ "summed" using $\otimes$ operator. The following serial procedure $\text{SCAN}$ performs a $\otimes$-prefix computation:

SCAN(x)
n = x.length
let y[1..n] be a new array
y = x
for i = 2 to n
y[i] = y[i - 1] ⊗ x[i]
return y


Unfortunately, multithreading $\text{SCAN}$ is not straightforward. For example, changing the for loop to a parallel for loop would create races, since each iteration of the loop body depends on the previous iteration. The following procedure $\text{P-SCAN-1}$ performs the $\otimes$-prefix computation in parallel, albeit inefficiently.

P-SCAN-1(x)
n = x.length
let y[1..n] be a new array
P-SCAN-1-AUX(x, y, 1, n)
return y

P-SCAN-1-AUX(x, y, i, j)
parallel for l = i to j
y[l] = P-REDUCE(x, 1, l)


b. Analyze the work, span, and parallelism of $\text{P-SCAN-1}$.

By using nested parallelism, we can obtain a more efficient $\otimes$-prefix computation:

P-SCAN-2(x)
n = x.length
let y[1..n] be a new array
P-SCAN-2-AUX(x, y, 1, n)
return y

P-SCAN-2-AUX(x, y, i, j)
if i == j
y[i] = x[i]
else k = floor((i + j) / 2)
spawn P-SCAN-2-AUX(x, y, i, k)
P-SCAN-2-AUX(x, y, k + 1, j)
sync
parallel for l = k + 1 to j
y[l] = y[k] ⊗ y[l]


c. Argue that $\text{P-SCAN-2}$ is correct, and analyze its work, span, and parallelism.

We can improve on both $\text{P-SCAN-1}$ and $\text{P-SCAN-2}$ by performing the $\otimes$-prefix computation in two distinct passes over the data. On the first pass, we gather the terms for various contiguous subarrays of $x$ into a temporary array $t$, and on the second pass we use the terms in $t$ to compute the final result $y$. The following pseudocode implements this strategy, but certain expressions have been omitted:

P-SCAN-3(x)
n = x.length
let y[1..n] and t[1..n] be new arrays
y = x
if n > 1
P-SCAN-UP(x, t, 2, n)
P-SCAN-DOWN(x, x, t, y, 2, n)
return y

P-SCAN-UP(x, t, i, j)
if i == j
return x[i]
else
k = floor((i + j) / 2)
t[k] = spawn P-SCAN-UP(x, t, i, k)
right = P-SCAN-UP(x, t, k + 1, j)
sync
return ____           // fill in the blank

P-SCAN-DOWN(v, x, t, y, i, j)
if i == j
y[i] = v ⊗ x[i]
else
k = floor((i + j) / 2)
spawn P-SCAN-DOWN(____, x, t, y, i, k)  // fill in the blank
P-SCAN-DOWN(____, x, t, y, k + 1, j)    // fill in the blank
sync


d. Fill in the three missing expressions in line 8 of $\text{P-SCAN-UP}$ and lines 5 and 6 of $\text{P-SCAN-DOWN}$. Argue that with expressions you supplied, $\text{P-SCAN-3}$ is correct. ($\textit{Hint:}$ Prove that the value $v$ passed to $\text{P-SCAN-DOWN}(v, x, t, y, i, j)$ satisfies $v = x \otimes x \otimes \cdots \otimes x[i - 1]$.)

e. Analyze the work, span, and parallelism of $\text{P-SCAN-3}$.

(Removed)