27-4 Multithreading reductions and prefix computations

A $\otimes$-reduction of an array $x[1 \ldots n]$, where $\otimes$ is an associative operator, is the value

$$y = x[1] \otimes x[2] \otimes \cdots \otimes x[n].$$

The following procedure computes the $\otimes$-reduction of a subarray $x[i \ldots j]$ serially.

1
2
3
4
5
REDUCE(x, i, j)
    y = x[i]
    for k = i + 1 to j
         y = y  x[k]
    return y

a. Use nested parallelism to implement a multithreaded algorithm $\text{P-REDUCE}$, which performs the same function with $\Theta(n)$ work and $\Theta(\lg n)$ span. Analyze your algorithm.

A related problem is that of computing a $\otimes$-prefix computation, sometimes called a $\otimes$-scan, on an array $x[1 \ldots n]$, where $\otimes$ is once again an associative operator. The $\otimes$-scan produces the array $y[1 \ldots n]$ given by

$$ \begin{aligned} y[1] & = x[1], \\ y[2] & = x[1] \otimes x[2], \\ y[3] & = x[1] \otimes x[2] \otimes x[3], \\ & \vdots \\ y[n] & = x[1] \otimes x[2] \otimes x[3] \otimes \cdots \otimes x[n], \end{aligned} $$

that is, all prefixes of the array $x$ "summed" using $\otimes$ operator. The following serial procedure $\text{SCAN}$ performs a $\otimes$-prefix computation:

1
2
3
4
5
6
7
SCAN(x)
    n = x.length
    let y[1..n] be a new array
    y[1] = x[1]
    for i = 2 to n
        y[i] = y[i - 1]  x[i]
    return y

Unfortunately, multithreading $\text{SCAN}$ is not straightforward. For example, changing the for loop to a parallel for loop would create races, since each iteration of the loop body depends on the previous iteration. The following procedure $\text{P-SCAN-1}$ performs the $\otimes$-prefix computation in parallel, albeit inefficiently.

1
2
3
4
5
P-SCAN-1(x)
    n = x.length
    let y[1..n] be a new array
    P-SCAN-1-AUX(x, y, 1, n)
    return y
1
2
3
P-SCAN-1-AUX(x, y, i, j)
    parallel for l = i to j
        y[l] = P-REDUCE(x, 1, l)

b. Analyze the work, span, and parallelism of $\text{P-SCAN-1}$.

By using nested parallelism, we can obtain a more efficient $\otimes$-prefix computation:

1
2
3
4
5
P-SCAN-2(x)
    n = x.length
    let y[1..n] be a new array
    P-SCAN-2-AUX(x, y, 1, n)
    return y
1
2
3
4
5
6
7
8
9
P-SCAN-2-AUX(x, y, i, j)
    if i == j
        y[i] = x[i]
    else k = floor((i + j) / 2)
        spawn P-SCAN-2-AUX(x, y, i, k)
        P-SCAN-2-AUX(x, y, k + 1, j)
        sync
        parallel for l = k + 1 to j
            y[l] = y[k]  y[l]

c. Argue that $\text{P-SCAN-2}$ is correct, and analyze its work, span, and parallelism.

We can improve on both $\text{P-SCAN-1}$ and $\text{P-SCAN-2}$ by performing the $\otimes$-prefix computation in two distinct passes over the data. On the first pass, we gather the terms for various contiguous subarrays of $x$ into a temporary array $t$, and on the second pass we use the terms in $t$ to compute the final result $y$. The following pseudocode implements this strategy, but certain expressions have been omitted:

1
2
3
4
5
6
7
8
P-SCAN-3(x)
    n = x.length
    let y[1..n] and t[1..n] be new arrays
    y[1] = x[1]
    if n > 1
        P-SCAN-UP(x, t, 2, n)
        P-SCAN-DOWN(x[1], x, t, y, 2, n)
    return y
1
2
3
4
5
6
7
8
9
P-SCAN-UP(x, t, i, j)
    if i == j
        return x[i]
    else
        k = floor((i + j) / 2)
        t[k] = spawn P-SCAN-UP(x, t, i, k)
        right = P-SCAN-UP(x, t, k + 1, j)
        sync
        return ____           // fill in the blank
1
2
3
4
5
6
7
8
P-SCAN-DOWN(v, x, t, y, i, j)
    if i == j
        y[i] = v  x[i]
    else
        k = floor((i + j) / 2)
        spawn P-SCAN-DOWN(____, x, t, y, i, k)  // fill in the blank
        P-SCAN-DOWN(____, x, t, y, k + 1, j)    // fill in the blank
        sync

d. Fill in the three missing expressions in line 8 of $\text{P-SCAN-UP}$ and lines 5 and 6 of $\text{P-SCAN-DOWN}$. Argue that with expressions you supplied, $\text{P-SCAN-3}$ is correct. ($\textit{Hint:}$ Prove that the value $v$ passed to $\text{P-SCAN-DOWN}(v, x, t, y, i, j)$ satisfies $v = x[1] \otimes x[2] \otimes \cdots \otimes x[i - 1]$.)

e. Analyze the work, span, and parallelism of $\text{P-SCAN-3}$.

a. Here is a multithreaded $\otimes$-reduction algorithm:

1
2
3
4
5
6
7
8
P-REDUCE(x, i, j)
    if i == j
        return x[i]
    else mid = floor((i + j) / 2)
        lh = spawn P-REDUCE(x, i, mid)
        rh = P-REDUCE(x, mid + 1, j)
        sync
        return lh  rh

If we denote the length $j - i + 1$ of the subarray $x[i..j]$ by $n$, then the work for the above algorithm is given by the recurrence $T_1(n) = 2T_1(n / 2) + \Theta(1) = \Theta(n)$. Because one of the recursive calls to $\text{P-REDUCE}$ is spawned and the procedure does constant work following the recursive calls and in the base case, the span is given by the recurrence $T_\infty(n) = T_\infty(n / 2) + \Theta(1) = \Theta(\lg n)$.

b. The work and span of $\text{P-SCAN-1-AUX}$ dominate the work and span of $\text{P-SCAN-1}$. We can calculate the work of $\text{P-SCAN-1-AUX}$ by replacing the parallel for loop with an ordinary for loop and noting that in each iteration, the running time of $\text{P-REDUCE}$ will be equal to $\Theta(l)$. Since $\text{P-SCAN-1}$ calls $\text{P-SCAN-1-AUX}$ with $1$ and $n$ as the last two arguments, the running time of $\text{P-SCAN-1}$, and hence its work, is $\Theta(1 + 2 + \cdots + n) = \Theta(n^2)$.

As we noted earlier, the parallel for loop in $\text{P-SCAN-1-AUX}$ undergoes $n$ iterations; therefore, the span of $\text{P-SCAN-1-AUX}$ is given by $\Theta(\lg n)$ for the recursive splitting of the loop iterations plus the span of the iteration that has maximum span. Among the loop iterations, the call to $\text{P-REDUCE}$ in the last iteration (when $l = n$) has the maximum span, equal to $\Theta(\lg n)$. Thus, $\text{P-SCAN-1}$ has $\Theta(\lg n)$ span and $\Theta(n^2 / \lg n)$ parallelism.

c. In $\text{P-SCAN-2-AUX}$, before the parallel for loop in lines 7 and 8 executes, the following invariant is satisfied: $y[l] = x[i] \otimes x[i + 1] \otimes \cdots \otimes x[l]$ for $l = i, i + 1, \ldots, k$ and $y[l] = x[k + 1] \otimes x[k + 2] \otimes \cdots \otimes x[k]$ for $l = k + 1, k + 2, \ldots, j$. The parallel for loop need not update $y[i], \ldots, y[k]$, since they have the correct values after the call to $\text{P-SCAN-2-AUX}(x, y, i, k)$. For $l = k + 1, k + 2, \ldots, j$, the parallel for loop sets

$$ \begin{aligned} y[l] & = y[k] \otimes y[l] \\ & = x[i] \otimes \cdots \otimes x[k] \otimes x[k + 1] \otimes \cdots \otimes x[l] \\ & = x[i] \otimes \cdots \otimes x[l]. \end{aligned} $$

as desired. We can run this loop in parallel because the $l$th iteration depends only on the values of $y[k]$, which is the same in all iterations, and $y[l]$. Therefore, when the call to $\text{P-SCAN-2-AUX}$ from $\text{P-SCAN-2}$ returns, array $y$ represents the $\otimes$-prefix computation of array $x$.

Because the work and span of $\text{P-SCAN-2-AUX}$ dominate the work and span of $\text{P-SCAN-2}$, we will concentrate on calculating these values for $\text{P-SCAN-2-AUX}$ working on an array of size $n$. The work $PS2A_1(n)$ of $\text{P-SCAN-2-AUX}$ is given by the recurrence $PS2A_1(n) = 2PS2A_1(n / 2) + \Theta(\lg n)$, which equals $\Theta(n\lg n)$ by case 2 of the master theorem. The span $PS2A_\infty(n)$ of $\text{P-SCAN-2-AUX}$ is given by the recurrence $PS2A_\infty(n) = PS2A_\infty(n / 2) + \Theta(\lg n)$, which equals $\Theta(\lg^2 n)$ per Exercise 4.6-2. That is, the work, span, and parallelism of $\text{P-SCAN-2}$ are $\Theta(n\lg n)$, $\Theta(\lg^2 n)$, and $\Theta(n / \lg n)$, respectively.

d. The missing expression in line 8 of $\text{P-SCAN-UP}$ is $t[k] \otimes right$ right. The missing expressions in lines 5 and 6 of $\text{P-SCAN-DOWN}$ are $v$ and $v \otimes t[k]$, respectively.

As suggested in the hint, we will prove that the value $v$ passed to $\text{P-SCAN-DOWN}(v, x, t, y, i, j)$ satisfies $v = x[1] \otimes x[2] \otimes \cdots \otimes x[i - 1]$, so that the value $v \otimes x[i]$ stored into $y[i]$ in the base case of $\text{P-SCAN-DOWN}$ is correct.

In order to compute the arguments that are passed to $\text{P-SCAN-DOWN}$, we must first understand what $t[k]$ holds as a result of the call to $\text{P-SCAN-UP}$. A call to $\text{P-SCAN-UP}(x, t, i, j)$ returns $x[i] \otimes \cdots \otimes x[j]$; because $t[k]$ stores the return value of $\text{P-SCAN-UP}(x, t, i, k)$, we can say that $t[k] = x[i] \otimes \cdots \otimes x[k]$.

The value $v = x[1]$ when $\text{P-SCAN-DOWN}(x[1], x, t, y, 2, n)$ is called from $\text{P-SCAN-3}$ clearly satisifies $v = x[1] \otimes \cdots \otimes x[i - 1]$. Let us suppose that $v = x[1] \otimes x[2] \otimes \cdots \otimes x[i - 1]$ in a call of $\text{P-SCAN-DOWN}(v, x, t, y, i, j)$. Therefore, $v$ meets the required condition in the first recursive call, with $i$ and $k$ as the last two arguments, in $\text{P-SCAN-DOWN}$. If we can prove that the value $v \otimes t[k]$ passed to the second recursive call in $\text{P-SCAN-DOWN}$ equals $x[1] \otimes x[2] \otimes \cdots \otimes x[k]$, we would have proved the required condition on $v$ for all calls to $\text{P-SCAN-DOWN}$. Earlier, we proved that $t[k] = x[i] \otimes \cdots \otimes x[k]$; therefore,

$$ \begin{aligned} v \otimes t[k] & = x[1] \otimes x[2] \otimes \cdots \otimes x[i - 1] \otimes x[i] \otimes \cdots x[k] \\ & = x[1] \otimes x[2] \otimes \cdots \otimes x[k]. \end{aligned} $$

Thus, the value $v$ passed to $\text{P-SCAN-DOWN}(v, x, t, y, i, j)$ satisfies $v = x[1] \otimes x[2] \otimes \cdots \otimes x[i - 1]$.

e. Let $PSU_1(n)$ and $PSU_\infty(n)$ denote the work and span of $\text{P-SCAN-UP}$ and let $PSD_1(n)$ and $PSD_\infty(n)$ denote the work and span of $\text{P-SCAN-DOWN}$. Then the expressions $T_1(n) = PSU_1(n) + PSD_1(n) + \Theta(1)$ and $T_\infty(n) = PSU_\infty(n) + PSD_\infty(n) + \Theta(1)$ characterize the work and span of $\text{P-SCAN-3}$.

The work $PSU_1(n)$ of $\text{P-SCAN-UP}$ is given by the recurrence

$$PSU_1(n) = 2PSU_1(n / 2) + \Theta(1),$$

and its span is defined by the recurrence

$$PSU_\infty(n) = PSU_\infty(n / 2) + \Theta(1).$$

Using the master theorem to solve these recurrences, we get $PSU_1(n) = \Theta(\lg n)$ and $PSU_\infty(n) = \Theta(\lg n)$.

Similarly, the recurrences

$$ \begin{aligned} PSD_1 & = 2PSD_1(n / 2) + \Theta(1), & \text{(*)} \\ PSD_\infty & = PSD_\infty(n / 2) + \Theta(1) & \text{(**)} \end{aligned} $$

define the work and span of $\text{P-SCAN-DOWN}$, and they evaluate to $PSD_1(n) = \Theta(\lg n)$ and $PSD_\infty(n) = \Theta(\lg n)$.

Applying the results for the work and span of $\text{P-SCAN-UP}$ and $\text{P-SCAN-DOWN}$ obtained above in the expressions for the work and span of $\text{P-SCAN-3}$, we get $T_1(n) = \Theta(\lg n)$ and $T_\infty(n) = \Theta(\lg n)$. Hence, $\text{P-SCAN-3}$ has $\Theta(n / \lg n)$ parallelism. $\text{P-SCAN-3}$ performs less work than $\text{P-SCAN-1}$, but with the same span, and it has the same parallelism as $\text{P-SCAN-2}$ with less work and a lower span.