27-4 Multithreading reductions and prefix computations

A $\otimes$-reduction of an array $x[1 \ldots n]$, where $\otimes$ is an associative operator, is the value

$$y = x[1] \otimes x[2] \otimes \cdots \otimes x[n].$$

The following procedure computes the $\otimes$-reduction of a subarray $x[i \ldots j]$ serially.
REDUCE(x, i, j)
    y = x[i]
    for k = i + 1 to j
         y = y ⊗ x[k]
    return y
a. Use nested parallelism to implement a multithreaded algorithm $\text{P-REDUCE}$, which performs the same function with $\Theta(n)$ work and $\Theta(\lg n)$ span. Analyze your algorithm.

A related problem is that of computing a $\otimes$-prefix computation, sometimes called a $\otimes$-scan, on an array $x[1 \ldots n]$, where $\otimes$ is once again an associative operator. The $\otimes$-scan produces the array $y[1 \ldots n]$ given by

$$ \begin{aligned} y[1] & = x[1], \\ y[2] & = x[1] \otimes x[2], \\ y[3] & = x[1] \otimes x[2] \otimes x[3], \\ & \vdots \\ y[n] & = x[1] \otimes x[2] \otimes x[3] \otimes \cdots \otimes x[n], \end{aligned} $$

that is, all prefixes of the array $x$ "summed" using $\otimes$ operator. The following serial procedure $\text{SCAN}$ performs a $\otimes$-prefix computation:
SCAN(x)
    n = x.length
    let y[1..n] be a new array
    y[1] = x[1]
    for i = 2 to n
        y[i] = y[i - 1] ⊗ x[i]
    return y
Unfortunately, multithreading $\text{SCAN}$ is not straightforward. For example, changing the for loop to a parallel for loop would create races, since each iteration of the loop body depends on the previous iteration. The following procedure $\text{P-SCAN-1}$ performs the $\otimes$-prefix computation in parallel, albeit inefficiently.
P-SCAN-1(x)
    n = x.length
    let y[1..n] be a new array
    P-SCAN-1-AUX(x, y, 1, n)
    return y
P-SCAN-1-AUX(x, y, i, j)
    parallel for l = i to j
        y[l] = P-REDUCE(x, 1, l)
b. Analyze the work, span, and parallelism of $\text{P-SCAN-1}$.

By using nested parallelism, we can obtain a more efficient $\otimes$-prefix computation:
P-SCAN-2(x)
    n = x.length
    let y[1..n] be a new array
    P-SCAN-2-AUX(x, y, 1, n)
    return y
P-SCAN-2-AUX(x, y, i, j)
    if i == j
        y[i] = x[i]
    else k = floor((i + j) / 2)
        spawn P-SCAN-2-AUX(x, y, i, k)
        P-SCAN-2-AUX(x, y, k + 1, j)
        sync
        parallel for l = k + 1 to j
            y[l] = y[k] ⊗ y[l]
c. Argue that $\text{P-SCAN-2}$ is correct, and analyze its work, span, and parallelism.

We can improve on both $\text{P-SCAN-1}$ and $\text{P-SCAN-2}$ by performing the $\otimes$-prefix computation in two distinct passes over the data. On the first pass, we gather the terms for various contiguous subarrays of $x$ into a temporary array $t$, and on the second pass we use the terms in $t$ to compute the final result $y$. The following pseudocode implements this strategy, but certain expressions have been omitted:
P-SCAN-3(x)
    n = x.length
    let y[1..n] and t[1..n] be new arrays
    y[1] = x[1]
    if n > 1
        P-SCAN-UP(x, t, 2, n)
        P-SCAN-DOWN(x[1], x, t, y, 2, n)
    return y
P-SCAN-UP(x, t, i, j)
    if i == j
        return x[i]
    else
        k = floor((i + j) / 2)
        t[k] = spawn P-SCAN-UP(x, t, i, k)
        right = P-SCAN-UP(x, t, k + 1, j)
        sync
        return ____           // fill in the blank
P-SCAN-DOWN(v, x, t, y, i, j)
    if i == j
        y[i] = v ⊗ x[i]
    else
        k = floor((i + j) / 2)
        spawn P-SCAN-DOWN(____, x, t, y, i, k)  // fill in the blank
        P-SCAN-DOWN(____, x, t, y, k + 1, j)    // fill in the blank
        sync
d. Fill in the three missing expressions in line 8 of $\text{P-SCAN-UP}$ and lines 5 and 6 of $\text{P-SCAN-DOWN}$. Argue that with expressions you supplied, $\text{P-SCAN-3}$ is correct. ($\textit{Hint:}$ Prove that the value $v$ passed to $\text{P-SCAN-DOWN}(v, x, t, y, i, j)$ satisfies $v = x[1] \otimes x[2] \otimes \cdots \otimes x[i - 1]$.)

e. Analyze the work, span, and parallelism of $\text{P-SCAN-3}$.

(Removed)