# 27-1 Implementing parallel loops using nested parallelism

Consider the following multithreaded algorithm for performing pairwise addition on $n$-element arrays $A[1..n]$ and $B[1..n]$, storing the sums in $C[1..n]$:

 1 2 3 SUM-ARRAYS(A, B, C) parallel for i = 1 to A.length C[i] = A[i] + B[i]

a. Rewrite the parallel loop in $\text{SUM-ARRAYS}$ using nested parallelism (spawn and sync) in the manner of $\text{MAT-VEC-MAIN-LOOP}$. Analyze the parallelism of your implementation.

Consider the following alternative implementation of the parallel loop, which contains a value $grain\text-size$ to be specified:

 1 2 3 4 5 6 7 SUM-ARRAYS'(A, B, C) n = A.length grain-size = ? // to be determined r = ceil(n / grain-size) for k = 0 to r - 1 spawn ADD-SUBARRAY(A, B, C, k * grain-size + 1, min((k + 1) * grain-size, n)) sync
 1 2 3 ADD-SUBARRAY(A, B, C, i, j) for k = i to j C[k] = A[k] + B[k]

b. Suppose that we set $grain\text -size = 1$. What is the parallelism of this implementation?

c. Give a formula for the span of $\text{SUM-ARRAYS}'$ in terms of $n$ and $grain\text-size$. Derive the best value for grain-size to maximize parallelism.

a. Similar to $\text{MAT-VEC-MAIN-LOOP}$, the required procedure, which we name $\text{NESTED-SUM-ARRAYS}$, will take parameters $i$ and $j$ to specify the range of the array that is being computed in parallel. In order to perform the pairwise addition of two $n$-element arrays $A$ and $B$ and store the result into array $C$, we call $\text{NESTED-SUM-ARRAYS}(A, B, C, 1, A.length)$.

 1 2 3 4 5 6 NESTED-SUM-ARRAYS(A, B, C, i, j) if i == j C[i] = A[i] + B[i] else k = floor((i + j) / 2) spawn NESTED-SUM-ARRAYS(A, B, C, i, k) NESTED-SUM-ARRAYS(A, B, C, k + 1, j) sync

The work of $\text{NESTED-SUM-ARRAYS}$ is given by the recurrence

\begin{aligned} T_1(n) & = 2T_1(n / 2) + \Theta(1) \\ & = \Theta(n), \end{aligned}

by case 1 of the master theorem. The span of the procedure is given by the recurrence

\begin{aligned} T_\infty(n) & = T_\infty(n / 2) + \Theta(1) \\ & = \Theta(\lg n), \end{aligned}

by case 2 of the master theorem. Therefore, the above algorithm has $\Theta(n / \lg n)$ parallelism.

b. Because $\text{ADD-SUBARRAY}$ is serial, we can calculate both its work and span to be $\Theta(j - i + 1)$, which based on the arguments from the call in $\text{SUM-ARRAYS}'$ is $\Theta(grain\text-size)$, for all but the last call (which is $O(grain\text-size)$).

If $grain\text-size = 1$, the procedure $\text{SUM-ARRAYS}'$ calculates $r$ to be $n$, and each of the $n$ iterations of the serial for loop spawns $\text{ADD-SUBARRAY}$ with the same value, $k + 1$, for the last two arguments. For example, when $k = 0$, the last two arguments to $\text{ADD-SUBARRAY}$ are $1$, when $k = 1$, the last two arguments are $2$, and so on. That is, in each call to $\text{ADD-SUBARRAY}$, its for loop iterates once and calculates a single value in the array $C$. When $grain\text-size = 1$, the for loop in $\text{SUM-ARRAYS}'$ iterates $n$ times and each iteration takes $\Theta(1)$ time, resulting in $\Theta(n)$ work.

Although the for loop in $\text{SUM-ARRAYS}'$ looks serial, note that each iteration spawns the call to $\text{ADD-SUBARRAY}$ and the procedure waits for all its spawned children at the end of the for loop. That is, all loop iterations of $\text{SUM-ARRAYS}'$ execute in parallel. Therefore, one might be tempted to say that the span of $\text{SUM-ARRAYS}'$ is equal to the span of a single call to $\text{ADD-SUBARRAY}$ plus the constant work done by the ﬁrst three lines in $\text{SUM-ARRAYS}'$, giving $\Theta(1)$ span and $\Theta(n)$ parallelism. This calculation of span and parallelism would be wrong, however, because there are $r$ spawns of $\text{ADD-SUBARRAY}$ in $\text{SUM-ARRAYS}'$, where $r$ is not a constant. Hence, we must add a $\Theta(r)$ term to the span of $\text{SUM-ARRAYS}'$ in order to account for the overhead of spawning $r$ calls to $\text{ADD-SUBARRAY}$.

Based on the above discussion, the span of $\text{SUM-ARRAYS}'$ is $\Theta(r) + \Theta(grain\text-size) + \Theta(1)$. When $grain\text-size = 1$, we get $r = n$; therefore, $\text{SUM-ARRAYS}'$ has $\Theta(n)$ span and $\Theta(1)$ parallelism.

c. For a general $grain\text-size$, each iteration of the for loop in $\text{SUM-ARRAYS}'$ except for the last results in $grain\text-size$ iterations of the for loop in $\text{ADD-SUBARRAY}$. In the last iteration of $\text{SUM-ARRAYS}'$, the for loop in $\text{ADD-SUBARRAY}$ iterates $n \mod grain\text-size$ times. Therefore, we can claim that the span of $\text{ADD-SUBARRAY}$ is

$$\Theta(\max(grain\text-size, n \mod grain\text-size)) = \Theta(grain\text-size).$$

$\text{SUM-ARRAYS}'$ achieves maximum parallelism when its span, given by $\Theta(r) + \Theta(grain\text-size) + \Theta(1)$, is minimum. Since $r = \lceil n / grain\text-size \rceil$, the minimum occurs when $r \approx grain\text-size$, i.e., when $grain\text-size \approx \sqrt n$.