# 27-2 Saving temporary space in matrix multiplication

The $\text{P-MATRIX-MULTIPLY-RECURSIVE}$ procedure has the disadvantage that it must allocate a temporary matrix $T$ of size $n \times n$, which can adversely affect the constants hidden by the $\Theta$-notation. The $\text{P-MATRIX-MULTIPLY-RECURSIVE}$ procedure does have high parallelism, however. For example, ignoring the constants in the $\Theta$-notation, the parallelism for multiplying $1000 \times 1000$ matrices comes to approximately $1000^3 / 10^2 = 10^7$, since $\lg 1000 \approx 10$. Most parallel computers have far fewer than 10 million processors.

a. Describe a recursive multithreaded algorithm that eliminates the need for the temporary matrix $T$ at the cost of increasing the span to $\Theta(n)$. ($\textit{Hint:}$ Compute $C = C + AB$ following the general strategy of $\text{P-MATRIX-MULTIPLY-RECURSIVE}$, but initialize $C$ in parallel and insert a sync in a judiciously chosen location.)

b. Give and solve recurrences for the work and span of your implementation.

c. Analyze the parallelism of your implementation. Ignoring the constants in the $\Theta$-notation, estimate the parallelism on $1000 \times 1000$ matrices. Compare with the parallelism of $\text{P-MATRIX-MULTIPLY-RECURSIVE}$.

a. We initialize the output matrix $C$ using doubly nested parallel for loops and then call $\text{P-MATRIX-MULTIPLY-RECURSIVE}'$, defined below.

 1 2 3 4 5 6 P-MATRIX-MULTIPLY-LESS-MEM(C, A, B) n = A.rows parallel for i = 1 to n parallel for j = 1 to n c[i, j] = 0 P-MATRIX-MULTIPLY-RECURSIVE'(C, A, B) 
  1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 P-MATRIX-MULTIPLY-RECURSIVE'(C, A, B) n = A.rows if n == 1 c[1, 1] = c[1, 1] + a[1, 1] * b[1, 1] else partition A, B, and C into n / 2 × n / 2 submatrices A[1, 1], A[1, 2], A[2, 1], A[2, 2]; B[1, 1], B[1, 2], B[2, 1], B[2, 2]; and C[1, 1], C[1, 2], C[2, 1], C[2, 2] spwan P-MATRIX-MULTIPLY-RECURSIVE'(C[1, 2], A[1, 1], B[1, 1]) spwan P-MATRIX-MULTIPLY-RECURSIVE'(C[1, 2], A[1, 1], B[1, 2]) spwan P-MATRIX-MULTIPLY-RECURSIVE'(C[2, 1], A[2, 1], B[1, 1]) P-MATRIX-MULTIPLY-RECURSIVE'(C[2, 2], A[2, 1], B[1, 2]) sync spwan P-MATRIX-MULTIPLY-RECURSIVE'(C[1, 1], A[1, 2], B[2, 1]) spwan P-MATRIX-MULTIPLY-RECURSIVE'(C[1, 2], A[1, 2], B[2, 2]) spwan P-MATRIX-MULTIPLY-RECURSIVE'(C[2, 1], A[2, 2], B[2, 1]) P-MATRIX-MULTIPLY-RECURSIVE'(C[2, 2], A[2, 2], B[2, 2]) sync 

b. The procedure $\text{P-MATRIX-MULTIPLY-LESS-MEM}$ performs $\Theta(n^2)$ work in the doubly nested parallel for loops, and then it calls the procedure $\text{P-MATRIX-MULTIPLY-RECURSIVE}'$. The recurrence for the work $M_1'(n)$ of $\text{P-MATRIX-MULTIPLY-RECURSIVE}'$ is $8M_1'(n / 2) + \Theta(1)$, which gives us $M_1'(n) = \Theta(n^3)$. Therefore, $T_1(n) = \Theta(n^3)$.

The span of the doubly nested parallel for loops that initialize the output array $C$ is $\Theta(\lg n)$. In $\text{P-MATRIX-MULTIPLY-RECURSIVE}'$, there are two groups of spawned recursive calls; therefore, the span $M_\infty'(n)$ of $\text{P-MATRIX-MULTIPLY-RECURSIVE}'$ is given by the recurrence $M_\infty'(n) = 2M_\infty'(n / 2) + \Theta(1)$, which gives us $M_\infty'(n) = \Theta(n)$. Because the span $\Theta(n)$ of $\text{P-MATRIX-MULTIPLY-RECURSIVE}'$ dominates, we have $T_\infty(n) = \Theta(n)$.

c. The parallelism of $\text{P-MATRIX-MULTIPLY-LESS-MEM}$ is $\Theta(n^3 / n) = \Theta(n^2)$. Ignoring the constants in the $\Theta$-notation, the parallelism for multiplying $1000 \times 1000$ matrices is $1000^2 = 10^6$, which is only a factor of $10$ less than that of $\text{P-MATRIX-MULTIPLY-RECURSIVE}$. Although the parallelism of the new procedure is much less than that of $\text{P-MATRIX-MULTIPLY-RECURSIVE}$, the algorithm still scales well for a large number of processors.