The algorithm sweeps down the matrix from the top left corner to the bottom right corner, leaving zero subdiagonal elements behind it.

What is parallel in the algorithm?

- 1.
`MAXLOC`

: reduction operation on the row and column defined by the mask`lpiv`

, then broadcast within that row and column- 2.
- scale factors require
*N*-*n*independent operations within column`icol`

- 3.
- scale factor and a pivot row value must be broadcast within each column and row respectively
- 4.
- the reductions require independent operations

Attributes of the computation:

- There is little locality in communication, apart from broadcasts and reductions in rows and columns.
- Computation is clustered: much of it is performed in a single row and column, and then, once we get to the reductions, in the bottom right hand corner.
- A
`BLOCK`

distribution is not going to be advantageous: it would result in many processors being idle!

- Suggested distribution for a small number of processors:
!HPF$ ALIGN B(:,:) WITH A(:,:) !HPF$ DISTRIBUTE A(*,CYCLIC)

- If a large number of processors is available:
!HPF$ ALIGN B(:,:) WITH A(:,:) !HPF$ DISTRIBUTE A(CYCLIC,CYCLIC)