# 27.1 The basics of dynamic multithreading

## 27.1-1

Suppose that we spawn $\text{P-FIB}(n - 2)$ in line 4 of $\text{P-FIB}$, rather than calling it as is done in the code. What is the impact on the asymptotic work, span, and parallelism?

There will be no change in the asymptotic work, span, or parallelism of $\text{P-FIB}$ even if we were to spawn the recursive call to $\text{P-FIB}(n - 2)$. The serialization of $\text{P-FIB}$ under consideration would yield the same recurrence as that for $\text{FIB}$; we can, therefore, calculate the work as $T_1(n) = \Theta(\phi^n)$. Similarly, because the spawned calls to $\text{P-FIB}(n - 1)$ and $\text{P-FIB}(n - 2)$ can run in parallel, we can calculate the span in exactly the same way as in the text, $T_\infty(n) = \Theta(n)$, resulting in $\Theta(\phi^n / n)$ parallelism.

## 27.1-2

Draw the computation dag that results from executing $\text{P-FIB}(5)$. Assuming that each strand in the computation takes unit time, what are the work, span, and parallelism of the computation? Show how to schedule the dag on 3 processors using greedy scheduling by labeling each strand with the time step in which it is executed.

• Work: $T_1 = 29$.
• Span: $T_\infty = 10$.
• Parallelism: $T_1 / T_\infty \approx 2.9$.

## 27.1-3

Prove that a greedy scheduler achieves the following time bound, which is slightly stronger than the bound proven in Theorem 27.1:

$$T_P \le \frac{T_1 - T_\infty}{P} + T_\infty. \tag{27.5}$$

Suppose that there are x incomplete steps in a run of the program. Since each of these steps causes at least one unit of work to be done, we have that there is at most $(T_1 - x)$ units of work done in the complete steps. Then, we suppose by contradiction that the number of complete steps is strictly greater than $\lfloor (T_1 - x) / P \rfloor$. Then, we have that the total amount of work done during the complete steps is

$$P \cdot (\lfloor (T_1 - x) / P \rfloor + 1) = P \lfloor (T_1 - x) / P = (T_1 - x) - ((T_1 - x) \mod P) + P > T_1 - x.$$

This is a contradiction because there are only $(T_1 - x)$ units of work done during complete steps, which is less than the amount we would be doing. Notice that since $T_\infty$ is abound on the total number of both kinds of steps, it is a bound on the number of incomplete steps, $x$, so,

$$T_P \le \lfloor (T_1 - x) / P \rfloor + x \le \lfloor (T_1 - T_\infty) / P \rfloor + T_\infty.$$

Where the second inequality comes by noting that the middle expression, as a function of $x$ is monotonically increasing, and so is bounded by the largest value of $x$ that is possible, namely $T_\infty$.

## 27.1-4

Construct a computation dag for which one execution of a greedy scheduler can take nearly twice the time of another execution of a greedy scheduler on the same number of processors. Describe how the two executions would proceed.

The computation is given in the image below. Let vertex $u$ have degree $k$, and assume that there are $m$ vertices in each vertical chain. Assume that this is executed on $k$ processors. In one execution, each strand from among the $k$ on the left is executed concurrently, and then the $m$ strands on the right are executed one at a time. If each strand takes unit time to execute, then the total computation takes $2m$ time. On the other hand, suppose that on each time step of the computation, $k - 1$ strands from the left (descendants of $u$) are executed, and one from the right (a descendant of $v$), is executed. If each strand take unit time to executed, the total computation takes $m + m / k$. Thus, the ratio of times is $2m / (m + m / k) = 2 / (1 + 1 / k)$. As $k$ gets large, this approaches $2$ as desired.

## 27.1-5

Professor Karan measures her deterministic multithreaded algorithm on $4$, $10$, and $64$ processors of an ideal parallel computer using a greedy scheduler. She claims that the three runs yielded $T_4 = 80$ seconds, $T_{10} = 42$ seconds, and $T_{64} = 10$ seconds. Argue that the professor is either lying or incompetent. ($\textit{Hint:}$ Use the work law $\text{(27.2)}$, the span law $\text{(27.3)}$, and inequality $\text{(27.5)}$ from Exercise 27.1-3.)

By the work law for $P = 4$, we have $80 = T_4 \ge T_1 / 4$, or $T_1 \le 320$. By the span law for $P = 64$, we have $T_\infty \le T_{64} = 10$. Now we will use inequality $\text{(27.5)}$ from Exercise 27.1-3 to derive a contradiction. For $P = 10$, we have

\begin{aligned} 42 & = T_{10} \\ & \le \frac{320 - T_\infty}{10} + T_\infty \\ & = 32 + \frac{9}{10} T_\infty \end{aligned}

or, equivalently,

\begin{aligned} T_\infty & \ge \frac{10}{9} \cdot 10 \\ & > 10, \end{aligned}

which contradicts $T_\infty \le 10$.

Therefore, the running times reported by the professor are suspicious.

## 27.1-6

Give a multithreaded algorithm to multiply an $n \times n$ matrix by an $n$-vector that achieves $\Theta(n^2 / \lg n)$ parallelism while maintaining $\Theta(n^2)$ work.

 1 2 3 4 5 6 7 8 FAST-MAT-VEC(A, x) n = A.rows let y be a new vector of length n parallel for i = 1 to n y[i] = 0 parallel for i = 1 to n y[i] = MAT-SUB-LOOP(A, x, i, 1, n) return y 
 1 2 3 4 5 6 7 8 MAT-SUB-LOOP(A, x, i, j, j') if j == j' return a[i, j] * x[j] else mid = floor((j + j') / 2) lhalf = spawn MAT-SUB-LOOP(A, x, i, j, mid) uhalf = MAT-SUB-LOOP(A, x, i, mid + 1, j') sync return lhalf + uhalf 

We calculate the work $T_1(n)$ of $\text{FAST-MAT-VEC}$ by computing the running time of its serialization, i.e., by replacing the parallel for loop by an ordinary for loop. Therefore, we have $T_1(n) = n T_1'(n)$, where $T_1'(n)$ denotes the work of $\text{MAT-SUB-LOOP}$ to compute a given output entry $y_i$. The work of $\text{MAT-SUB-LOOP}$ is given by the recurrence

$$T_1'(n) = 2T_1'(n / 2) + \Theta(1).$$

By applying case 1 of the master theorem, we have $T_1'(n) = \Theta(n)$. Therefore, $T_1(n) = \Theta(n^2)$.

To calculate the span, we use

$$T_\infty(n) = \Theta(\lg n) + \max_{1 \le i \le n} iter_\infty (i).$$

Note that each iteration of the second parallel for loop calls procedure $\text{MAT-SUB-LOOP}$ with the same parameters, except for the index $i$. Because $\text{MAT-SUB-LOOP}$ recursively halves the space between its last two parameters ($1$ and $n$), does constant-time work in the base case, and spawns one of the recursive calls in parallel with the other, it has span $\Theta(\lg n)$. The procedure $\text{FAST-MAT-VEC}$, therefore, has a span of $\Theta(\lg n)$ and $\Theta(n^2 / \lg n)$ parallelism.

## 27.1-7

Consider the following multithreaded pseudocode for transposing an $n \times n$ matrix $A$ in place:

 1 2 3 4 5 P-TRANSPOSE(A) n = A.rows parallel for j = 2 to n parallel for i = 1 to j - 1 exchange a[i, j] with a[j, i] 

Analyze the work, span, and parallelism of this algorithm.

We analyze the work of $\text{P-TRANSPOSE}$, as usual, by computing the running time of its serialization, where we replace both the parallel for loops with simple for loops. We can compute the work of $\text{P-TRANSPOSE}$ using the summation

\begin{aligned} T_1(n) & = \Theta \Bigg( \sum_{j = 2}^n (j - 1) \Bigg) \\ & = \Theta \Bigg( \sum_{j = 1}^{n - 1} j \Bigg) \\ & = \Theta(n^2). \end{aligned}

The span of $\text{P-TRANSPOSE}$ is determined by the span of the doubly nested parallel for loops. Although the number of iterations of the inner loop depends on the value of the variable $j$ of the outer loop, each iteration of the inner loop does constant work. Let $iter_\infty(j)$ denote the span of the $j$th iteration of the outer loop and $iter'_\infty(i)$ denote the span of the ith iteration of the inner loop. We characterize the span $T_\infty(n)$ of $\text{P-TRANSPOSE}$ as

$$T_\infty(n) = \Theta(\lg n) + \max_{2 \le j \le n} iter_\infty(j).$$

The maximum occurs when $j = n$, and in this case,

$$iter_\infty(n) = \Theta(\lg n) + \max_{1 \le i \le n - 1} iter'_\infty(i).$$

As we noted, each iteration of the inner loop does constant work, and therefore $iter'_\infty(i) = \Theta(1)$ for all $i$. Thus, we have

\begin{aligned} \Theta(\lg n) & = \Theta(\lg n) + \Theta(\lg n) + \Theta(1) \\ & = \Theta(\lg n). \end{aligned}

Since the work $\text{P-TRANSPOSE}$ is $\Theta(n^2)$ and its span is $\Theta(\lg n)$, the parallelism of $\text{P-TRANSPOSE}$ is $\Theta(n^2 / \lg n)$.

## 27.1-8

Suppose that we replace the parallel for loop in line 3 of $\text{P-TRANSPOSE}$ (see Exercise 27.1-7) with an ordinary for loop. Analyze the work, span, and parallelism of the resulting algorithm.

If we were to replace the inner parallel for loop of $\text{P-TRANSPOSE}$ with an ordinary for loop, the work would still remain $\Theta(n^2)$. The span, however, would increase to $\Theta(n)$ because the last iteration of the parallel for loop, which dominates the span of the computation, would lead to $(n - 1)$ iterations of the inner, serial for loop. The parallelism, therefore, would reduce to $\Theta(n^2) / \Theta(n) = \Theta(n)$.

## 27.1-9

For how many processors do the two versions of the chess programs run equally fast, assuming that $T_P = T_1 / P + T_\infty$?

Based on the values of work and span given for the two versions of the chess program, we solve for $P$ using

$$\frac{2048}{P} + 1 = \frac{1024}{P} + 8.$$

The solution gives $P$ between $146$ and $147$.