# 27.2 Multithreaded matrix multiplication

## 27.2-1

Draw the computation dag for computing $\text{P-SQUARE-MATRIX-MULTIPLY}$ on $2 \times 2$ matrices, labeling how the vertices in your diagram correspond to strands in the execution of the algorithm. Use the convention that spawn and call edges point downward, continuation edges point horizontally to the right, and return edges point upward. Assuming that each strand takes unit time, analyze the work, span, and parallelism of this computation.

(Omit!)

## 27.2-2

Repeat Exercise 27.2-1 for $\text{P-MATRIX-MULTIPLY-RECURSIVE}$.

(Omit!)

## 27.2-3

Give pseudocode for a multithreaded algorithm that multiplies two $n \times n$ matrices with work $\Theta(n^3)$ but span only $\Theta(\lg n)$. Analyze your algorithm.

1 2 3 4 5 6 7 | P-FAST-MATRIX-MULTIPLY(A, B) n = A.rows let C be a new n × n matrix parallel for i = 1 to n parallel for j = 1 to n c[i, j] = MATRIX-MULT-SUBLOOP(A, B, i, j, 1, n) return C |

1 2 3 4 5 6 7 8 | MATRIX-MULT-SUBLOOP(A, B, i, j, k, k') if k == k' return a[i, k] * b[k, j] else mid = floor((k + k') / 2) lhalf = spawn MATRIX-MULT-SUBLOOP(A, B, i, j, k, mid) uhalf = MATRIX-MULT-SUBLOOP(A, B, i, j, mid + 1, k') sync return lhalf + uhalf |

We calculate the work $T_1(n)$ of $\text{P-FAST-MATRIX-MULTIPLY}$ by computing the running time of its serialization, i.e., by replacing the **parallel for** loops by ordinary **for** loops. Therefore, we have $T_1(n) = n^2T_1'(n)$, where $T_1'(n)$ denotes the work of $\text{MATRIX-MULT-SUBLOOP}$ to compute a given output entry $c_{ij}$. The work of $\text{MATRIX-MULT-SUBLOOP}$ is given by the recurrence

$$T_1'(n) = 2T_1'(n / 2) + \Theta(1).$$

By applying case 1 of the master theorem, we have $T_1'(n) = \Theta(n)$. Therefore,

$$T_1(n) = \Theta(n^3).$$

To calculate the span, we use

$$T_\infty(n) = \Theta(\lg n) + \max_{1 \le i \le n} iter_\infty(i).$$

Note that each iteration of the outer **parallel for** loop does the same amount of work: it calls the inner **parallel for** loop. Similarly, each iteration of the inner **parallel for** loop calls procedure $\text{MATRIX-MULT-SUBLOOP}$ with the same parameters, except for the indices $i$ and $j$. Because $\text{MATRIX-MULT-SUBLOOP}$ recursively halves the space between its last two parameters ($1$ and $n$), does constanttime work in the base case, and spawns one of the recursive calls in parallel with the other, it has span $\Theta(\lg n)$. Since each iteration of the inner **parallel for** loop, which has $n$ iterations, has span $\Theta(\lg n)$, the inner **parallel for** loop has span $\Theta(\lg n)$. By similar logic, the outer **parallel for** loop, and hence procedure $\text{P-FAST-MATRIX-MULTIPLY}$, has span $\Theta(\lg n)$ and $\Theta(n^3 / \lg n)$ parallelism.

## 27.2-4

Give pseudocode for an efficient multithreaded algorithm that multiplies a $p \times q$ matrix by a $q \times r$ matrix. Your algorithm should be highly parallel even if any of $p$, $q$, and $r$ are $1$. Analyze your algorithm.

We can efficiently multiply a $p \times q$ matrix by a $q \times r$ matrix in parallel by using the solution to Exercise 27.2-3 as a base. We will replace the upper limits of the nested **parallel for** loops with $p$ and $r$ respectively and we will pass $q$ as the last argument to the call of $\text{MATRIX-MULT-SUBLOOP}$. We present the pseudocode for a multithreaded algorithm for multiplying a $p \times q$ matrix by a $q \times r$ matrix in procedure $\text{P-GEN-MATRIX-MULTIPLY}$ below. Because the pseudocode for procedure $\text{MATRIX-MULT-SUBLOOP}$ (which $\text{P-GEN-MATRIX-MULTIPLY}$ calls) remains the same as was presented in the solution to Exercise 27.2-3, we do not repeat it here.

1 2 3 4 5 6 7 8 9 | P-GEN-MATRIX-MULTIPLY(A, B) p = A.rows q = A.columns r = B.columns let C be a new p × r matrix parallel for i = 1 to p parallel for j = 1 to r c[i, j] = MATRIX-MULT-SUBLOOP(A, B, i, j, 1, q) return C |

To calculate the work for $\text{P-GEN-MATRIX-MULTIPLY}$, we replace the **parallel for** loops with ordinary **for** loops. As before, we can calculate the work of $\text{MATRIX-MULT-SUBLOOP}$ to be $\Theta(q)$ (because the input size to the procedure is $q$ here). Therefore, the work of $\text{P-GEN-MATRIX-MULTIPLY}$ is $T_1 = \Theta(pqr)$.

We can analyze the span of $\text{P-GEN-MATRIX-MULTIPLY}$ as we did in the solution to Exercise 27.2-3, but we must take into account the different number of loop iterations. Each of the $p$ iterations of the outer **parallel for** loop executes the inner **parallel for** loop, and each of the $r$ iterations of the inner **parallel for** loop calls $\text{MATRIX-MULT-SUBLOOP}$, whose span is given by $\Theta(\lg q)$. We know that, in general, the span of a **parallel for** loop with $n$ iterations, where the $i$th iteration has span $iter_\infty(i)$ is given by

$$T_\infty = \Theta(\lg n) + \max_{1 \le i \le n} iter_\infty(i).$$

Based on the above observations, we can calculate the span of $\text{P-GEN-MATRIX-MULTIPLY}$ as

$$ \begin{aligned} T_\infty & = \Theta(\lg p) + \Theta(\lg r) + \Theta(\lg q) \\ & = \Theta(\lg(pqr)). \end{aligned} $$

The parallelism of the procedure is, therefore, given by $\Theta(pqr / \lg(pqr))$. To check whether this analysis is consistent with Exercise 27.2-3, we note that if $p = q = r = n$, then the parallelism of $\text{P-GEN-MATRIX-MULTIPLY}$ would be $\Theta(n^3 / \lg n^3) = \Theta(n^3 / \lg n)$.

## 27.2-5

Give pseudocode for an efficient multithreaded algorithm that transposes an $n \times n$ matrix in place by using divide-and-conquer to divide the matrix recursively into four $n / 2 \times n / 2$ submatrices. Analyze your algorithm.

1 2 3 4 5 6 7 8 9 | P-MATRIX-TRANSPOSE-RECURSIVE(A, r, c, s) // Transpose the s × s submatrix starting at a[r, c]. if s == 1 return else s' = floor(s / 2) spawn P-MATRIX-TRANSPOSE-RECURSIVE(A, r, c, s') spawn P-MATRIX-TRANSPOSE-RECURSIVE(A, r + s', c + s', s - s') P-MATRIX-TRANSPOSE-SWAP(A, r, c + s', r + s', c, s', s - s') sync |

1 2 3 4 5 6 7 8 9 10 | P-MATRIX-TRANSPOSE-SWAP(A, r[1], c[1], r[2], c[2], s[1], s[2]) // Transpose the s[1] × s[2] submatrix starting at a[r[1], c[1]] with the s[2] × s[1] submatrix starting at a[r[2], c[2]]. if s[1] < s[2] P-MATRIX-TRANSPOSE-SWAP(A, r[2], c[2], r[1], c[1], s[2], s[1]) else if s[1] == 1 // since s[1] ≥ s[2], must have that s[2] equals 1 exchange a[r[1], c[1]] with a[r[2], c[2]] else s' = floor(s[1] / 2) spawn P-MATRIX-TRANSPOSE-SWAP(A, r[2], c[2], r[1], c[1], s[2], s') P-MATRIX-TRANSPOSE-SWAP(A, r[2], c[2] + s', r[1] + s', c[1], s[2], s[1] - s') sync |

In order to transpose an $n \times n$ matrix $A$, we call $\text{P-MATRIX-TRANSPOSE-RECURSIVE}(A, 1, 1, n)$.

Let us first calculate the work and span of $\text{P-MATRIX-TRANSPOSE-SWAP}$ so that we can plug in these values into the work and span calculations of $\text{P-MATRIX-TRANSPOSE-RECURSIVE}$. The work $T_1'(N)$ of $\text{P-MATRIX-TRANSPOSE-SWAP}$ on an $N$-element matrix is the running time of its serialization. We have the recurrence

$$ \begin{aligned} T_1'(N) & = 2T_1'(N / 2) + \Theta(1) \\ & = \Theta(N). \end{aligned} $$

The span $T_\infty'(N)$ is similarly described by the recurrence

$$ \begin{aligned} T_\infty'(N) & = T_\infty'(N / 2) + \Theta(1) \\ & = \Theta(\lg N). \end{aligned} $$

In order to calculate the work of $\text{P-MATRIX-TRANSPOSE-RECURSIVE}$, we calculate the running time of its serialization. Let $T_1(N)$ be the work of the algorithm on an $N$-element matrix, where $N = n^2$, and assume for simplicity that $n$ is an exact power of $2$. Because the procedure makes two recursive calls with square submatrices of sizes $n / 2 \times n / 2 = N / 4$ and because it does $\Theta(n^2) = \Theta(N)$ work to swap all the elements of the other two submatrices of size $n / 2 \times n / 2$, its work is given by the recurrence

$$ \begin{aligned} T_1(N) & = 2T_1(N / 4) + \Theta(N) \\ & = \Theta(N). \end{aligned} $$

The two parallel recursive calls in $\text{P-MATRIX-TRANSPOSE-RECURSIVE}$ execute on matrices of size $n / 2 \times n / 2$. The span of the procedure is given by maximum of the span of one of these two recursive calls and the $\Theta(\lg N)$ span of $\text{P-MATRIX-TRANSPOSE-SWAP}$, plus $\Theta(1)$. Since the recurrence

$$T_\infty(N) = T_\infty(N / 4) + \Theta(1)$$

has the solution $T_\infty(N) = \Theta(\lg N)$ by case 2 of Theorem 4.1, the span of the recursive call is asymptotically the same as the span of $\text{P-MATRIX-TRANSPOSE-SWAP}$, and hence the span of $\text{P-MATRIX-TRANSPOSE-RECURSIVE}$ is $\Theta(\lg N)$.

Thus, $\text{P-MATRIX-TRANSPOSE-RECURSIVE}$ has parallelism $\Theta(N / \lg N) = \Theta(n^2 / \lg n)$.

## 27.2-6

Give pseudocode for an efficient multithreaded implementation of the Floyd-Warshall algorithm (see Section 25.2), which computes shortest paths between all pairs of vertices in an edge-weighted graph. Analyze your algorithm.

1 2 3 4 5 6 7 8 9 10 | P-FLOYD-WARSHALL(W) n = W.rows parallel for i = 1 to n parallel for j = 1 to n d[i, j] = w[i, j] for k = 1 to n parallel for i = 1 to n parallel for j = 1 to n d[i, j] = min(d[i, j], d[i, k] + d[k, j]) return D |

By Exercise 25.2-4, we can compute all the $d_{ij}$ values in parallel.

The work of $\text{P-FLOYD-WARSHALL}$ is the same as the running time of its serialization, which we computed as $\Theta(n^3)$ in Section 25.2. The span of the doubly nested **parallel for** loops, which do constant work inside, is $\Theta(\lg n)$. Note, however, that the second set of doubly nested **parallel for** loops is executed within each of the $n$ iterations of the outermost serial **for** loop. Therefore, $\text{P-FLOYD-WARSHALL}$ has span $\Theta(n\lg n)$ and $\Theta(n^2 / \lg n)$ parallelism.