27-5 Multithreading a simple stencil calculation

Computational science is replete with algorithms that require the entries of an array to be filled in with values that depend on the values of certain already computed neighboring entries, along with other information that does not change over the course of the computation. The pattern of neighboring entries does not change during the computation and is called a stencil. For example, Section 15.4 presents a stencil algorithm to compute a longest common subsequence, where the value in entry $c[i, j]$ depends only on the values in $c[i - 1, j]$, $c[i, j - 1]$, and $c[i - 1, j - 1]$, as well as the elements $x_i$ and $y_j$ within the two sequences given as inputs. The input sequences are fixed, but the algorithm fills in the two-dimensional array $c$ so that it computes entry $c[i, j]$ after computing all three entries $c[i - 1, j]$, $c[i, j - 1]$, and $c[i - 1, j - 1]$.

In this problem, we examine how to use nested parallelism to multithread a simple stencil calculation on an $n \times n$ array $A$ in which, of the values in $A$, the value placed into entry $A[i, j]$ depends only on values in $A[i' , j']$, where $i' \le i$ and $j' \le j$ (and of course, $i' \ne i$ or $j' \ne j$). In other words, the value in an entry depends only on values in entries that are above it and/or to its left, along with static information outside of the array. Furthermore, we assume throughout this problem that once we have filled in the entries upon which $A[i, j]$ depends, we can fill in $A[i, j]$ in $\Theta(1)$ time (as in the $\text{LCS-LENGTH}$ procedure of Section 15.4).

We can partition the $n \times n$ array $A$ into four $n / 2 \times n / 2$ subarrays as follows:

$$ A = \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \tag{27.11} \end{pmatrix} . $$

Observe now that we can fill in subarray $A_{11}$ recursively, since it does not depend on the entries of the other three subarrays. Once $A_{11}$ is complete, we can continue to fill in $A_{12}$ and $A_{21}$ recursively in parallel, because although they both depend on $A_{11}$ , they do not depend on each other. Finally, we can fill in $A_{22}$ recursively.

a. Give multithreaded pseudocode that performs this simple stencil calculation using a divide-and-conquer algorithm $\text{SIMPLE-STENCIL}$ based on the decomposition $\text{(27.11)}$ and the discussion above. (Don't worry about the details of the base case, which depends on the specific stencil.) Give and solve recurrences for the work and span of this algorithm in terms of $n$. What is the parallelism?

b. Modify your solution to part (a) to divide an $n \times n$ array into nine $n / 3 \times n / 3$ subarrays, again recursing with as much parallelism as possible. Analyze this algorithm. How much more or less parallelism does this algorithm have compared with the algorithm from part (a)?

c. Generalize your solutions to parts (a) and (b) as follows. Choose an integer $b \ge 2$. Divide an $n \times n$ array into $b^2$ subarrays, each of size $n / b \times n / b$, recursing with as much parallelism as possible. In terms of $n$ and $b$, what are the work, span, and parallelism of your algorithm? Argue that, using this approach, the parallelism must be $o(n)$ for any choice of $b \ge 2$. ($\textit{Hint:}$ For this last argument, show that the exponent of $n$ in the parallelism is strictly less than $1$ for any choice of $b \ge 2$.)

d. Give pseudocode for a multithreaded algorithm for this simple stencil calculation that achieves $\Theta(n\lg n)$ parallelism. Argue using notions of work and span that the problem, in fact, has $\Theta(n)$ inherent parallelism. As it turns out, the divide-and-conquer nature of our multithreaded pseudocode does not let us achieve this maximal parallelism.

a. In this part of the problem, we will assume that $n$ is an exact power of $2$, so that in a recursive step, when we divide the $n \times n$ matrix $A$ into four $n / 2 \times n / 2$ matrices, we will be guaranteed that $n / 2$ is an integer, for all $n \ge 2$. We make this assumption simply to avoid introducing $\lfloor n / 2 \rfloor$ and $\lceil n / 2 \rceil$ terms in the pseudocode and the analysis that follow. In the pseudocode below, we assume that we have a procedure $\text{BASE-CASE}$ available to us, which calculates the base case of the stencil.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
SIMPLE-STENCIL(A, i, j, n)
    if n == 1
        A[i, j] = BASE-CASE(A, i, j)
    else // Calculate submatrix A[1, 1].
        SIMPLE-STENCIL(A, i, j, n / 2)
        // Calculate submatrices A[1, 2] and A[2, 1] in parallel.
        spawn SIMPLE-STENCIL(A, i, j + n / 2, n / 2)
        SIMPLE-STENCIL(A, i + n / 2, j, n / 2)
        sync
        // Calculate submatrix A[2, 2].
        SIMPLE-STENCIL(A, i + n / 2, j + n / 2, n / 2)

To perform a simple stencil calculation on an $n \times n$ matrix $A$, we call $\text{SIMPLE-STENCIL}(A, 1, 1, n)$. The recurrence for the work is $T_1(n) = 4T_1(n / 2) + \Theta(1) = \Theta(n^2)$. Of the four recursive calls in the algorithm above, only two run in parallel. Therefore, the recurrence for the span is $T_\infty(n) = 3T_\infty(n / 2) + \Theta(1) = \Theta(n^{\lg 3})$, and the parallelism is $\Theta(n^{2 - \lg 3}) \approx \Theta(n^{0.415})$.

b. Similar to $\text{SIMPLE-STENCIL}$ of the previous part, we present $\text{P-STENCIL-3}$, which divides $A$ into nine submatrices, each of size $n / 3 \times n / 3$, and solves them recursively. To perform a stencil calculation on an $n \times n$ matrix $A$, we call $\text{P-STENCIL-3}(A, 1, 1, n)$.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
P-STENCIL-3(A, i, j, n)
    if n == 1
        A[i, j] = BASE-CASE(A, i, j)
    else // Group 1: compute submatrix A[1, 1].
        P-STENCIL-3(A, i, j, n / 3)
        // Group 2: compute submatrices A[1, 2] and A[2, 1].
        spawn P-STENCIL-3(A, i, j + n / 3, n / 3)
        P-STENCIL-3(A, i + n / 3, j, n / 3)
        sync
        // Group 3: compute submatrices A[1, 3], A[2, 2], and A[3, 1].
        spawn P-STENCIL-3(A, i, j + 2n / 3, n / 3)
        spawn P-STENCIL-3(A, i + n / 3, j + n / 3, n / 3)
        P-STENCIL-3(A, i + 2n / 3, j, n / 3)
        sync
        // Group 4: compute submatrices A[2, 3] and A[3, 2].
        spawn P-STENCIL-3(A, i + n / 3, j + 2n / 3, n / 3)
        P-STENCIL-3(A, i + 2n / 3, j + n / 3, n / 3)
        sync
        // Group 5: compute submatrix A[3, 3].
        P-STENCIL-3(A, i + 2n / 3, j + 2n / 3, n / 3)

From the pseudocode, we can informally say that we can solve the nine subproblems in five groups, as shown in the following matrix:

$$ \begin{pmatrix} 1 & 2 & 3 \\ 2 & 3 & 4 \\ 3 & 4 & 5 \end{pmatrix} . $$

Each entry in the above matrix specifies the group of the corresponding $n / 3 \times n / 3$ submatrix of $A$; we can compute in parallel the entries of all submatrices that fall in the same group. In general, for $i = 2, 3, 4, 5$, we can calculate group $i$ after completing the computation of group $i - 1$.

The recurrence for the work is $$T_1(n) = 9T_1(n / 3) + \Theta(1) = \Theta(n^2).$$

The recurrence for the span is

$$T_\infty(n) = 5T_\infty(n / 3) + \Theta(1) = \Theta(n^{\log_3 5}).$$

Therefore, the parallelism is

$$\Theta(n^{2 - \log_3 5}) \approx \Theta(n^{0.535}).$$

c. Similar to the previous part, we can solve the $b^2$ subproblems in $2b - 1$ groups:

$$ \begin{pmatrix} 1 & 2 & 3 & \cdots & b - 2 & b - 1 & b \\ 2 & 3 & 4 & \cdots & b - 1 & b & b + 1 \\ 3 & 4 & 5 & \cdots & b & b + 1 & b + 2 \\ \vdots & \vdots & \vdots & \ddots & \vdots & \vdots & \vdots \\ b - 2 & b - 1 & b & \cdots & 2b - 5 & 2b - 4 & 2b - 3 \\ b - 1 & b & b + 1 & \cdots & 2b - 4 & 2b - 3 & 2b - 2 \\ b & b + 1 & b + 2 & \cdots & 2b - 3 & 2b - 2 & 2b - 1 \end{pmatrix} . $$

The recurrence for the work is

$$T_1(n) = b^2 T_1(n / b) + \Theta(1) = \Theta(n^2).$$

The recurrence for the span is

$$T_\infty(n) = (2b - 1) T_\infty(n / b) + \Theta(1) = \Theta(n^{\log_b (2b - 1)}).$$

The parallelism is

$$\Theta(n^{2 - \log_b (2b - 1)}).$$

As the hint suggests, in order to show that the parallelism must be $o(n)$ for any choice of $b \ge 2$, we need to show that $2 - \log_b (2b - 1)$, which is the exponent of $n$ in the parallelism, is strictly less than $1$ for any choice of $b \ge 2$. Since $b \ge 2$, we know that $2b - 1 > b$, which implies that $\log_b (2b - 1) > \log_b b = 1$. Hence, $2 - \log_b (2b - 1) < 2 - 1 = 1$.

d. The idea behind achieving $\Theta(n / \lg n)$ parallelism is similar to that presented in the previous part, except without recursive division. We will compute $A[1, 1]$ serially, which will enable us to compute entries $A[1, 2]$ and $A[2, 1]$ in parallel, after which we can compute entries $A[1, 3]$, $A[2, 2]$ and $A[3, 1]$ in parallel, and so on. Here is the pseudocode:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
P-STENCIL(A)
    n = A.rows
    // Calculate all entries on the antidiagonal and above it.
    for i = 1 to n
        parallel for j = 1 to i
            A[i - j + 1, j] = BASE-CASE(A, i - j + 1, j)
    // Calculate all entries below the antidiagonal.
    for i = 2 to n
        parallel for j = i to n
            A[n + i - j, j] = BASE-CASE(A, n + i - j, j)

For each value of index $i$ of the first serial for loop, the inner loop iterates $i$ times, doing constant work in each iteration. Because index $i$ ranges from $1$ to $n$ in the first for loop, we require $\Theta(1 + 2 + \cdots + n) = \Theta(n^2)$ work to calculate all entries on the antidiagonal and above it. For each value of index $i$ of the second serial for loop, the inner loop iterates $n - i + 1$ times, doing constant work in each iteration. Because index $i$ ranges from $2$ to $n$ in the second for loop, we require $\Theta((n - 1) + (n - 2) + \cdots + 1) = \Theta(n^2)$ work to calculate all entries on the antidiagonal and above it. Therefore, the work of $\text{P-STENCIL}$ is $T_1(n) = \Theta(n^2)$.

Note that both for loops in $\text{P-STENCIL}$, which execute parallel for loops within, are serial. Therefore, in order to calculate the span of $\text{P-STENCIL}$, we must add the spans of all the parallel for loops. Given that any parallel for loop in $\text{P-STENCIL}$ does constant work in each iteration, the span of a parallel for loop with $n'$ iterations is $\Theta(\lg n')$. Hence,

$$ \begin{aligned} T_\infty(n) & = \Theta((\lg 1 + \lg 2 + \cdots + \lg n) + (\lg(n - 1) + \cdots + 1)) \\ & = \Theta(\lg(n!) + \lg(n - 1)!) \\ & = \Theta(n\lg n), \end{aligned} $$

giving us $\Theta(n / \lg n)$ parallelism.