Skip to content

27.1 The basics of dynamic multithreading


Suppose that we spawn $\text{P-FIB}(n - 2)$ in line 4 of $\text{P-FIB}$, rather than calling it as is done in the code. What is the impact on the asymptotic work, span, and parallelism?



Draw the computation dag that results from executing $\text{P-FIB}(5)$. Assuming that each strand in the computation takes unit time, what are the work, span, and parallelism of the computation? Show how to schedule the dag on 3 processors using greedy scheduling by labeling each strand with the time step in which it is executed.

  • Work: $T_1 = 29$.
  • Span: $T_\infty = 10$.
  • Parallelism: $T_1 / T_\infty \approx 2.9$.


Prove that a greedy scheduler achieves the following time bound, which is slightly stronger than the bound proven in Theorem 27.1:

$$T_P \le \frac{T_1 - T_\infty}{P} + T_\infty. \tag{27.5}$$

Suppose that there are x incomplete steps in a run of the program. Since each of these steps causes at least one unit of work to be done, we have that there is at most $(T_1 - x)$ units of work done in the complete steps. Then, we suppose by contradiction that the number of complete steps is strictly greater than $\lfloor (T_1 - x) / P \rfloor$. Then, we have that the total amount of work done during the complete steps is

$$P \cdot (\lfloor (T_1 - x) / P \rfloor + 1) = P \lfloor (T_1 - x) / P = (T_1 - x) - ((T_1 - x) \mod P) + P > T_1 - x.$$

This is a contradiction because there are only $(T_1 - x)$ units of work done during complete steps, which is less than the amount we would be doing. Notice that since $T_\infty$ is abound on the total number of both kinds of steps, it is a bound on the number of incomplete steps, $x$, so,

$$T_P \le \lfloor (T_1 - x) / P \rfloor + x \le \lfloor (T_1 - T_\infty) / P \rfloor + T_\infty.$$

Where the second inequality comes by noting that the middle expression, as a function of $x$ is monotonically increasing, and so is bounded by the largest value of $x$ that is possible, namely $T_\infty$.


Construct a computation dag for which one execution of a greedy scheduler can take nearly twice the time of another execution of a greedy scheduler on the same number of processors. Describe how the two executions would proceed.

The computation is given in the image below. Let vertex $u$ have degree $k$, and assume that there are $m$ vertices in each vertical chain. Assume that this is executed on $k$ processors. In one execution, each strand from among the $k$ on the left is executed concurrently, and then the $m$ strands on the right are executed one at a time. If each strand takes unit time to execute, then the total computation takes $2m$ time. On the other hand, suppose that on each time step of the computation, $k - 1$ strands from the left (descendants of $u$) are executed, and one from the right (a descendant of $v$), is executed. If each strand take unit time to executed, the total computation takes $m + m / k$. Thus, the ratio of times is $2m / (m + m / k) = 2 / (1 + 1 / k)$. As $k$ gets large, this approaches $2$ as desired.


Professor Karan measures her deterministic multithreaded algorithm on $4$, $10$, and $64$ processors of an ideal parallel computer using a greedy scheduler. She claims that the three runs yielded $T_4 = 80$ seconds, $T_{10} = 42$ seconds, and $T_{64} = 10$ seconds. Argue that the professor is either lying or incompetent. ($\textit{Hint:}$ Use the work law $\text{(27.2)}$, the span law $\text{(27.3)}$, and inequality $\text{(27.5)}$ from Exercise 27.1-3.)



Give a multithreaded algorithm to multiply an $n \times n$ matrix by an $n$-vector that achieves $\Theta(n^2 / \lg n)$ parallelism while maintaining $\Theta(n^2)$ work.



Consider the following multithreaded pseudocode for transposing an $n \times n$ matrix $A$ in place:

    n = A.rows
    parallel for j = 2 to n
        parallel for i = 1 to j - 1
            exchange a[i, j] with a[j, i]

Analyze the work, span, and parallelism of this algorithm.



Suppose that we replace the parallel for loop in line 3 of $\text{P-TRANSPOSE}$ (see Exercise 27.1-7) with an ordinary for loop. Analyze the work, span, and parallelism of the resulting algorithm.



For how many processors do the two versions of the chess programs run equally fast, assuming that $T_P = T_1 / P + T_\infty$?