27.3 Multithreaded merge sort

27.3-1

Explain how to coarsen the base case of $\text{P-MERGE}$.

Replace the condition on line 2 with a check that $n < k$ for some base case size $k$. And instead of just copying over the particular element of $A$ to the right spot in $B$, you would call a serial sort on the remaining segment of $A$ and copy the result of that over into the right spots in $B$.

27.3-2

Instead of finding a median element in the larger subarray, as $\text{P-MERGE}$ does, consider a variant that finds a median element of all the elements in the two sorted subarrays using the result of Exercise 9.3-8. Give pseudocode for an efficient multithreaded merging procedure that uses this median-finding procedure. Analyze your algorithm.

By a slight modification of exercise 9.3-8 we can find we can find the median of all elements in two sorted arrays of total length $n$ in $O(\lg n)$ time. We'll modify $\text{P-MERGE}$ to use this fact. Let $\text{MEDIAN}(T, p_1, r_1, p_2, r_2)$ be the function which returns a pair, $q$, where $q.pos$ is the position of the median of all the elements $T$ which lie between positions $p_1$ and $r_1$, and between positions $p_2$ and $r_2$, and $q.arr$ is $1$ if the position is between $p_1$ and $r_1$, and $2$ otherwise.

P-MEDIAN-MERGE(T, p[1], r[1], p[2], r[2], A, p[3])
    n[1] = r[1] - p[1] + 1
    n[2] = r[2] - p[2] + 1
    if n[1] < n[2]              // ensure that n[1] ≥ n[2]
        exchange p[1] with p[2]
        exchange r[1] with r[2]
        exchange n[1] with n[2]
    if n[1] == 0              // both empty?
        return
    q = MEDIAN(T, p[1], r[1], p[2], r[2])
    if q.arr == 1
        q[2] = BINARY-SEARCH(T[q.pos], T, p[2], r[2])
        q[3] = p[3] + q.pos - p[1] + q[2] - p[2]
        A[q[3]] = T[q.pos]
        spawn P-MEDIAN-MERGE(T, p[1], q.pos - 1, p[2], q[2] - 1, A, p[3])
        P-MEDIAN-MERGE(T, q.pos + 1, r[1], q[2] + 1, r[2], A, p[3])
        sync
    else
        q[2] = BINARY-SEARCH(T[q.pos], T, p[1], r[1])
        q[3] = p[3] + q.pos - p[2] + q[2] - p[1]
        A[q[3]] = T[q.pos]
        spawn P-MEDIAN-MERGE(T, p[1], q[2] - 1, p[2], q.pos - 1, A, p[3])
        P-MEDIAN-MERGE(T, q[2] + 1, r[1], q.pos + 1, r[2], A, p[3])
        sync

The work is characterized by the recurrence $T_1(n) = O(\lg n) + 2T_1(n / 2)$, whose solution tells us that $T_1(n) = O(n)$. The work is at least $\Omega(n)$ since we need to examine each element, so the work is $\Theta(n)$. The span satisfies the recurrence

$$ \begin{aligned} T_\infty(n) & = O(\lg n) + O(\lg n / 2) + T_\infty(n / 2) \\ & = O(\lg n) + T_\infty(n / 2) \\ & = \Theta(\lg^2 n), \end{aligned} $$

by exercise 4.6-2.

27.3-3

Give an efficient multithreaded algorithm for partitioning an array around a pivot, as is done by the $\text{PARTITION}$ procedure on page 171. You need not partition the array in place. Make your algorithm as parallel as possible. Analyze your algorithm. ($\textit{Hint:}$ You may need an auxiliary array and may need to make more than one pass over the input elements.)

Suppose that there are $c$ different processors, and the array has length $n$ and you are going to use its last element as a pivot. Then, look at each chunk of size $\lceil \frac{n}{c} \rceil$ of entries before the last element, give one to each processor. Then, each counts the number of elements that are less than the pivot. Then, we compute all the running sums of these values that are returned. This can be done easily by considering all of the subarrays placed along the leaves of a binary tree, and then summing up adjacent pairs. This computation can be done in time $\lg(\min\{c, n\})$ since it's the log of the number of leaves. From there, we can compute all the running sums for each of the subarrays also in logarithmic time. This is by keeping track of the sum of all more left cousins of each internal node, which is found by adding the left sibling's sum vale to the left cousin value of the parent, with the root's left cousin value initiated to $0$. This also just takes time the depth of the tree, so is $\lg(\min\{c, n\})$. Once all of these values are computed at the root, it is the index that the subarray's elements less than the pivot should be put. To find the position where the subarray's elements larger than the root should be put, just put it at twice the sum value of the root minus the left cousin value for that subarray. Then, the time taken is just $O(\frac{n}{c})$. By doing this procedure, the total work is just $O(n)$, and the span is $O(\lg n)$, and so has parallelization of $O(\frac{n}{\lg n})$. This whole process is split across the several algoithms appearing here.

27.3-4

Give a multithreaded version of $\text{RECURSIVE-FFT}$ on page 911. Make your implementation as parallel as possible. Analyze your algorithm.

P-RECURSIVE-FFT(a)
    n = a.length
    if n == 1
        return a
    w[n] = e^{2 * π * i / n}
    w = 1
    a(0) = [a[0], a[2]..a[n - 2]]
    a(1) = [a[1], a[3]..a[n - 1]]
    y(0) = spawn P-RECURSIVE-FFT(a[0])
    y(1) = P-RECURSIVE-FFT(a[1])
    sync
    parallel for k = 0 to n / 2 - 1
        y[k] = y[k](0) + w * y[k](1)
        y[k + n / 2] = y[k](0) - w * y[k](1)
        w = w * w[n]
    return y

$\text{P-RECURSIVE-FFT}$ parallelized over the two recursive calls, having a parallel for works because each of the iterations of the for loop touch independent sets of variables. The span of the procedure is only $\Theta(\lg n)$ giving it a parallelization of $\Theta(n)$.

27.3-5 $\star$

Give a multithreaded version of $\text{RANDOMIZED-SELECT}$ on page 216. Make your implementation as parallel as possible. Analyze your algorithm. ($\textit{Hint:}$ Use the partitioning algorithm from Exercise 27.3-3.)

Randomly pick a pivot element, swap it with the last element, so that it is in the correct format for running the procedure described in 27.3-3. Run partition from problem 27.3-3. As an intermediate step, in that procedure, we compute the number of elements less than the pivot ($T.root.sum$), so keep track of that value after the end of partition. Then, if we have that it is less than $k$, recurse on the subarray that was greater than or equal to the pivot, decreasing the order statistic of the element to be selected by $T.root.sum$. If it is larger than the order statistic of the element to be selected, then leave it unchanged and recurse on the subarray that was formed to be less than the pivot. A lot of the analysis in section 9.2 still applies, except replacing the timer needed for partitioning with the runtime of the algorithm in problem 27.3-3. The work is unchanged from the serial case because when $c = 1$, the algorithm reduces to the serial algorithm for partitioning. For span, the $O(n)$ term in the equation half way down page 218 can be replaced with an $O(\lg n)$ term. It can be seen with the substitution method that the solution to this is logarithmic

$$E[T(n)] \le \frac{2}{n} \sum_{k = \lfloor n / 2 \rfloor}^{n - 1} C\lg k + O(\lg n) \le O(\lg n).$$

So, the total span of this algorithm will still just be $O(\lg n)$.

27.3-6 $\star$

Show how to multithread $\text{SELECT}$ from Section 9.3. Make your implementation as parallel as possible. Analyze your algorithm.

Let $\text{MEDIAN}(A)$ denote a brute force method which returns the median element of the array $A$. We will only use this to find the median of small arrays, in particular, those of size at most $5$, so it will always run in constant time. We also let $A[i..j]$ denote the array whose elements are $A[i], A[i + 1], \ldots, A[j]$. The function $\text{P-PARTITION}(A, x)$ is a multithreaded function which partitions $A$ around the input element $x$ and returns the number of elements in $A$ which are less than or equal to $x$. Using a parallel for loop, its span is logarithmic in the number of elements in $A$. The work is the same as the serialization, which is $\Theta(n)$ according to section 9.3. The span satisfies the recurrence

$$ \begin{aligned} T_\infty(n) & = \Theta(lg n / 5) + T_\infty(n / 5) + \Theta(\lg n) + T_\infty(7n / 10 + 6) \\ & \le \Theta(\lg n) + T_\infty(n / 5) + T_\infty(7n / 10 + 6). \end{aligned} $$

Using the substitution method we can show that $T_\infty(n) = O(n^\epsilon)$ for some $\epsilon < 1$. In particular, $\epsilon = 0.9$ works. This gives a parallelization of $\Omega(n^0.1)$.

P-SELECT(A, i)
    if n == 1
        return A[1]
    let T[1..floor(n / 5)] be a new array
    parallel for i = 0 to floor(n / 5) - 1
        T[i + 1] = MEDIAN(A[i * floor(n / 5)..i * floor(n / 5) + 4])
    if n / 5 is not an integer
        T[floor(n / 5)] = MEDIAN(A[5 * floor(n / 5)..n])
    x = P-SELECT(T, ceil(n / 5))
    k = P-PARTITION(A, x)
    if k == i
        return x
    else if i < k
        P-SELECT(A[1..k - 1], i)
    else
        P-SELECT(A[k + 1..n], i - k)