27-3 Multithreaded matrix algorithms

a. Parallelize the $\text{LU-DECOMPOSITION}$ procedure on page 821 by giving pseudocode for a multithreaded version of this algorithm. Make your implementation as parallel as possible, and analyze its work, span, and parallelism.

b. Do the same for $\text{LUP-DECOMPOSITION}$ on page 824.

c. Do the same for $\text{LUP-SOLVE}$ on page 817.

d. Do the same for a multithreaded algorithm based on equation $\text{(28.13)}$ for inverting a symmetric positive-definite matrix.

a. For the algorithm $\text{LU-DECOMPOSITION}(A)$ on page 821, the inner for loops can be parallelized, since no iteration of them reads a value that another iteration of the same loop writes. The outermost for loop, however, cannot be parallelized, because each of its iterations reads the updates made to the matrices by the previous iterations. With the inner loops parallelized, the span is $\Theta(n \lg n)$, the work is still $\Theta(n^3)$, and so the parallelism is $\Theta(\frac{n^3}{n\lg n}) = \Theta(\frac{n^2}{\lg n})$.
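
Since the problem asks for pseudocode, here is a minimal serial Python sketch of this structure (the function name and the use of Python in place of pseudocode are my own choices); the comments mark the loops that a multithreaded version would turn into parallel for loops.

```python
def lu_decomposition(A):
    """LU decomposition without pivoting, following the structure of the
    CLRS procedure.  Loops marked 'parallel for' are the ones a
    multithreaded version would spawn; the outer k loop must stay serial."""
    n = len(A)
    A = [row[:] for row in A]                # work on a copy of the input
    L = [[float(i == j) for j in range(n)] for i in range(n)]
    U = [[0.0] * n for _ in range(n)]
    for k in range(n):                       # serial: reads updates from iteration k-1
        U[k][k] = A[k][k]
        for i in range(k + 1, n):            # parallel for: iterations are independent
            L[i][k] = A[i][k] / U[k][k]
            U[k][i] = A[k][i]
        for i in range(k + 1, n):            # parallel for
            for j in range(k + 1, n):        # parallel for (nested spawn)
                A[i][j] -= L[i][k] * U[k][j]
    return L, U
```

With the marked loops spawned in parallel, each iteration of the outer loop has span $\Theta(\lg n)$, which gives the $\Theta(n \lg n)$ span claimed above.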

b. The for loop on lines 7-10 takes the maximum of a set of values while recording the index at which that maximum occurs. It can therefore be replaced by a parallel reduction of span $\Theta(\lg n)$: arrange the $n$ elements at the leaves of an almost-balanced binary tree and let each internal node hold the maximum of its two children, so that the span is just the depth of the tree (see the sketch below). Summed over the $n$ iterations of the outer loop, these reductions contribute only $\Theta(n\lg n)$ to the span, which is dominated by the span of the elimination step. The for loop on lines 14-15 and the implicit for loop on line 15 involve no concurrent writes, so they too can be made parallel, with span $\Theta(\lg n)$. While the for loop on lines 18-19 can be made parallel, the loop containing it cannot be without creating data races. Therefore, the total span of this naively parallelized algorithm is $\Theta(n^2\lg n)$ and the work is still $\Theta(n^3)$, so the parallelism is $\Theta(\frac{n}{\lg n})$. This is not as parallel as part (a), but it is still a significant improvement.
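
As a sketch of the pivot search just described (the helper name `pivot_argmax` and the recursive formulation are my own), the reduction can be written as a divide-and-conquer over the candidate rows; in a multithreaded version the first recursive call would be spawned, giving span proportional to the depth of the recursion tree, i.e. $\Theta(\lg n)$, with $\Theta(n)$ work.

```python
def pivot_argmax(A, k, lo, hi):
    """Return the row index i in [lo, hi) maximizing |A[i][k]|.
    The two halves are independent, so a multithreaded version would
    spawn one recursive call and sync before the final comparison."""
    if hi - lo == 1:
        return lo
    mid = (lo + hi) // 2
    left = pivot_argmax(A, k, lo, mid)       # spawn in the parallel version
    right = pivot_argmax(A, k, mid, hi)
    return left if abs(A[left][k]) >= abs(A[right][k]) else right
```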

c. We can parallelize the computation of the sums on lines 4 and 6, but we cannot also parallelize the for loops containing them, since each of their iterations reads values of $y$ or $x$ written by earlier iterations. This means that the span is $\Theta(n\lg n)$, the work is still $\Theta(n^2)$, and so the parallelism is $\Theta(\frac{n}{\lg n})$.
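
A minimal serial Python sketch along these lines (the function name is mine, and Python's built-in `sum` stands in for the parallel reduction):

```python
def lup_solve(L, U, pi, b):
    """Solve LUx = Pb by forward then back substitution.  The outer loops
    stay serial because y[i] and x[i] depend on previously computed
    entries; each sum(...) stands in for a parallel reduction whose span
    would be Theta(lg n)."""
    n = len(L)
    y = [0.0] * n
    x = [0.0] * n
    for i in range(n):                       # serial: reads y[0..i-1]
        y[i] = b[pi[i]] - sum(L[i][j] * y[j] for j in range(i))
    for i in reversed(range(n)):             # serial: reads x[i+1..n-1]
        x[i] = (y[i] - sum(U[i][j] * x[j] for j in range(i + 1, n))) / U[i][i]
    return x
```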

d. The work of this procedure is governed by the recurrence

$$I(n) \le 2I(n / 2) + 4M(n / 2) + O(n^2).$$

However, the two inversions that we need to perform are independent, and the span of a parallelized matrix multiplication is only $O(\lg n)$. Likewise, the $\Theta(n^2)$ work of taking a transpose and of adding and subtracting matrices has a span of only $O(\lg n)$. Therefore, the span satisfies the recurrence

$$I_\infty(n) \le I_\infty(n / 2) + O(\lg n).$$
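
Unrolling the recurrence down to $I_\infty(1) = O(1)$ makes the bound explicit:

$$I_\infty(n) \le \sum_{i=0}^{\lg n - 1} O\!\left(\lg\frac{n}{2^i}\right) = O\!\left(\sum_{i=0}^{\lg n - 1}(\lg n - i)\right) = O(\lg^2 n).$$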

This recurrence has the solution $I_\infty(n) = \Theta(\lg^2 n)$ by exercise 4.6-2. Therefore, the span of the inversion algorithm obtained from the procedure detailed on page 830 is $\Theta(\lg^2 n)$. Since the work is $\Theta(M(n))$, the parallelism is $\Theta(M(n) / \lg^2 n)$, where $M(n)$ is the work needed to compute a matrix product.
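
For concreteness, here is a serial Python sketch of the recursive inversion, assuming that equation (28.13) is the standard Schur-complement block inverse of a symmetric positive-definite matrix and that $n$ is a power of $2$ (the function name, the use of NumPy, and the block bookkeeping are my own); the recursive calls and the matrix products are where a multithreaded version would spawn parallel work.

```python
import numpy as np

def spd_inverse(A):
    """Invert a symmetric positive-definite matrix recursively via its
    Schur complement S = D - C B^{-1} C^T, where A = [[B, C^T], [C, D]].
    Assumes A is an (n, n) NumPy array with n a power of 2."""
    n = A.shape[0]
    if n == 1:
        return np.array([[1.0 / A[0, 0]]])
    m = n // 2
    B, CT = A[:m, :m], A[:m, m:]
    C, D = A[m:, :m], A[m:, m:]
    B_inv = spd_inverse(B)                   # recursive inversion
    W = C @ B_inv                            # matrix product; Theta(lg n) span when parallelized
    S = D - W @ CT                           # Schur complement
    S_inv = spd_inverse(S)                   # recursive inversion
    V = S_inv @ W                            # = S^{-1} C B^{-1}
    return np.block([[B_inv + W.T @ V, -V.T],
                     [-V,              S_inv]])
```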