# 27.1 The basics of dynamic multithreading

## 27.1-1

Suppose that we spawn $\text{P-FIB}(n - 2)$ in line 4 of $\text{P-FIB}$, rather than calling it as is done in the code. What is the impact on the asymptotic work, span, and parallelism?

(Removed)

## 27.1-2

Draw the computation dag that results from executing $\text{P-FIB}(5)$. Assuming that each strand in the computation takes unit time, what are the work, span, and parallelism of the computation? Show how to schedule the dag on 3 processors using greedy scheduling by labeling each strand with the time step in which it is executed.

• Work: $T_1 = 29$.
• Span: $T_\infty = 10$.
• Parallelism: $T_1 / T_\infty \approx 2.9$.

## 27.1-3

Prove that a greedy scheduler achieves the following time bound, which is slightly stronger than the bound proven in Theorem 27.1:

$$T_P \le \frac{T_1 - T_\infty}{P} + T_\infty. \tag{27.5}$$

Suppose that there are x incomplete steps in a run of the program. Since each of these steps causes at least one unit of work to be done, we have that there is at most $(T_1 - x)$ units of work done in the complete steps. Then, we suppose by contradiction that the number of complete steps is strictly greater than $\lfloor (T_1 - x) / P \rfloor$. Then, we have that the total amount of work done during the complete steps is

$$P \cdot (\lfloor (T_1 - x) / P \rfloor + 1) = P \lfloor (T_1 - x) / P = (T_1 - x) - ((T_1 - x) \mod P) + P > T_1 - x.$$

This is a contradiction because there are only $(T_1 - x)$ units of work done during complete steps, which is less than the amount we would be doing. Notice that since $T_\infty$ is abound on the total number of both kinds of steps, it is a bound on the number of incomplete steps, $x$, so,

$$T_P \le \lfloor (T_1 - x) / P \rfloor + x \le \lfloor (T_1 - T_\infty) / P \rfloor + T_\infty.$$

Where the second inequality comes by noting that the middle expression, as a function of $x$ is monotonically increasing, and so is bounded by the largest value of $x$ that is possible, namely $T_\infty$.

## 27.1-4

Construct a computation dag for which one execution of a greedy scheduler can take nearly twice the time of another execution of a greedy scheduler on the same number of processors. Describe how the two executions would proceed.

The computation is given in the image below. Let vertex $u$ have degree $k$, and assume that there are $m$ vertices in each vertical chain. Assume that this is executed on $k$ processors. In one execution, each strand from among the $k$ on the left is executed concurrently, and then the $m$ strands on the right are executed one at a time. If each strand takes unit time to execute, then the total computation takes $2m$ time. On the other hand, suppose that on each time step of the computation, $k - 1$ strands from the left (descendants of $u$) are executed, and one from the right (a descendant of $v$), is executed. If each strand take unit time to executed, the total computation takes $m + m / k$. Thus, the ratio of times is $2m / (m + m / k) = 2 / (1 + 1 / k)$. As $k$ gets large, this approaches $2$ as desired.

## 27.1-5

Professor Karan measures her deterministic multithreaded algorithm on $4$, $10$, and $64$ processors of an ideal parallel computer using a greedy scheduler. She claims that the three runs yielded $T_4 = 80$ seconds, $T_{10} = 42$ seconds, and $T_{64} = 10$ seconds. Argue that the professor is either lying or incompetent. ($\textit{Hint:}$ Use the work law $\text{(27.2)}$, the span law $\text{(27.3)}$, and inequality $\text{(27.5)}$ from Exercise 27.1-3.)

(Removed)

## 27.1-6

Give a multithreaded algorithm to multiply an $n \times n$ matrix by an $n$-vector that achieves $\Theta(n^2 / \lg n)$ parallelism while maintaining $\Theta(n^2)$ work.

(Removed)

## 27.1-7

Consider the following multithreaded pseudocode for transposing an $n \times n$ matrix $A$ in place:

 1 2 3 4 5 P-TRANSPOSE(A) n = A.rows parallel for j = 2 to n parallel for i = 1 to j - 1 exchange a[i, j] with a[j, i] 

Analyze the work, span, and parallelism of this algorithm.

(Removed)

## 27.1-8

Suppose that we replace the parallel for loop in line 3 of $\text{P-TRANSPOSE}$ (see Exercise 27.1-7) with an ordinary for loop. Analyze the work, span, and parallelism of the resulting algorithm.

(Removed)

## 27.1-9

For how many processors do the two versions of the chess programs run equally fast, assuming that $T_P = T_1 / P + T_\infty$?

(Removed)