27-5 Multithreading a simple stencil calculation

Computational science is replete with algorithms that require the entries of an array to be filled in with values that depend on the values of certain already computed neighboring entries, along with other information that does not change over the course of the computation. The pattern of neighboring entries does not change during the computation and is called a stencil. For example, Section 15.4 presents a stencil algorithm to compute a longest common subsequence, where the value in entry $c[i, j]$ depends only on the values in $c[i - 1, j]$, $c[i, j - 1]$, and $c[i - 1, j - 1]$, as well as the elements $x_i$ and $y_j$ within the two sequences given as inputs. The input sequences are fixed, but the algorithm fills in the two-dimensional array $c$ so that it computes entry $c[i, j]$ after computing all three entries $c[i - 1, j]$, $c[i, j - 1]$, and $c[i - 1, j - 1]$.

In this problem, we examine how to use nested parallelism to multithread a simple stencil calculation on an $n \times n$ array $A$ in which, of the values in $A$, the value placed into entry $A[i, j]$ depends only on values in $A[i' , j']$, where $i' \le i$ and $j' \le j$ (and of course, $i' \ne i$ or $j' \ne j$). In other words, the value in an entry depends only on values in entries that are above it and/or to its left, along with static information outside of the array. Furthermore, we assume throughout this problem that once we have filled in the entries upon which $A[i, j]$ depends, we can fill in $A[i, j]$ in $\Theta(1)$ time (as in the $\text{LCS-LENGTH}$ procedure of Section 15.4).

We can partition the $n \times n$ array $A$ into four $n / 2 \times n / 2$ subarrays as follows:

$$ A = \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \tag{27.11} \end{pmatrix} . $$

Observe now that we can fill in subarray $A_{11}$ recursively, since it does not depend on the entries of the other three subarrays. Once $A_{11}$ is complete, we can continue to fill in $A_{12}$ and $A_{21}$ recursively in parallel, because although they both depend on $A_{11}$ , they do not depend on each other. Finally, we can fill in $A_{22}$ recursively.

a. Give multithreaded pseudocode that performs this simple stencil calculation using a divide-and-conquer algorithm $\text{SIMPLE-STENCIL}$ based on the decomposition $\text{(27.11)}$ and the discussion above. (Don't worry about the details of the base case, which depends on the specific stencil.) Give and solve recurrences for the work and span of this algorithm in terms of $n$. What is the parallelism?

b. Modify your solution to part (a) to divide an $n \times n$ array into nine $n / 3 \times n / 3$ subarrays, again recursing with as much parallelism as possible. Analyze this algorithm. How much more or less parallelism does this algorithm have compared with the algorithm from part (a)?

c. Generalize your solutions to parts (a) and (b) as follows. Choose an integer $b \ge 2$. Divide an $n \times n$ array into $b^2$ subarrays, each of size $n / b \times n / b$, recursing with as much parallelism as possible. In terms of $n$ and $b$, what are the work, span, and parallelism of your algorithm? Argue that, using this approach, the parallelism must be $o(n)$ for any choice of $b \ge 2$. ($\textit{Hint:}$ For this last argument, show that the exponent of $n$ in the parallelism is strictly less than $1$ for any choice of $b \ge 2$.)

d. Give pseudocode for a multithreaded algorithm for this simple stencil calculation that achieves $\Theta(n\lg n)$ parallelism. Argue using notions of work and span that the problem, in fact, has $\Theta(n)$ inherent parallelism. As it turns out, the divide-and-conquer nature of our multithreaded pseudocode does not let us achieve this maximal parallelism.

(Removed)