# Orthogonalization performance

Methods like GMRES and Jacobi-Davidson construct an orthogonal basis for their search subspace. It is well-known that classical Gram-Schmidt (CGS) is vulnerable for loss of orthogonality due to rounding errors. Modified Gram-Schmidt (MGS) is usually the fix for this, yet it is not free of rounding errors and is memory-bound when it comes to performance. In this post we'll look into iterative or refined orthogonalization methods and their performance.

June 18, 2017 - 10 minute read -

In GMRES it is necessary to build an orthonormal basis for the Krylov subspace, because the basis vectors $v, Av, A^2v, \cdots$ become quickly linearly dependent in finite precision. In Jacobi-Davidson we require orthonormality of the search subspace and the converged Schur vectors as well, so that converged Schur vector do not re-enter the search subspace. Also, throughout both algorithms we simplify expressions via equalities like $V^*V = I$. Due to rounding errors this identity might however not hold exactly. Therefore it is important to have some guarantee of orthogonality.

The benchmarks in this post are performed on an Intel i5-4460 @ 3.20GHz using 16GB of DDR3 memory at 1600MHz.

## Classical and Modified Gram-Schmidt

Given a matrix $V \in \mathbb{C}^{n \times m}$ where $n \gg m$ and a vector $w \in \mathbb{C}^n$, our goal is to remove the components of $w$ spanned by the columns of $V$.

This comes down to updating $w := (I - VV^*)w$ and subsequently normalizing $w := w / \|w\|$. In GMRES we want to store the projection $V^*w$ as well.

Classical Gram-Schmidt in Julia would look like this:

It is well-known that due to rounding errors the above procedure is not stable. Modified Gram-Schmidt tries to “solve” this problem by performing the projection sequentially:

Performance-wise this for-loop is a major bottleneck. The problem is that we have gone from BLAS-2 (matrix-vector product) to BLAS-1 (inner products).

If we bench the above with m = 20 and n = 10_000_000 we’ll quickly see the problem:

This outputs:

which shows that MGS is more than twice as slow. The reason for this is that BLAS-1 is memory bound: all data is touched only once, and there is a lot of data in memory. For this reason it is almost useless to have threading enabled.

## Iterative / refined orthogonalization

Even though MGS is an easy trick to ensure better orthogonality, it does not necessarily generate vectors that are orthogonal up to machine precision. For instance, if we do

it outputs

which basically means that the projection is not idempotent. After the first pass of MGS there are still components in the direction of $V$ and only in the second pass these are removed almost up to machine precision.

If we really need to maintain orthogonality, one idea is to repeat MGS twice or even more. In Krylov subspace there is the rule of thumb that “twice is enough” due to Kahan and Parlett, but we can just as well decide whether to repeat orthogonalization by using the Daniel-Gragg-Kaufman-Stewart criterion. This criterion states that we should repeat orthogonalization if the tangent between $w$ and $span(V)$ is smaller than $\eta$. In ARPACK it seems that this parameter is chosen $\eta = 1 / \sqrt{2}$.

I’m going to assume reorthogonalization is only necessary once, but the if can be replaced by a while.

Similarly for MGS:

We benchmark these methods as well:

and their results are:

The good news is that repeated classical Gram-Schmidt is still faster than modified Gram-Schmidt without repetition.

## Conclusion

If orthogonality is so important that repeated orthogonalization is necessary, then repeated classical Gram-Schmidt seems a good candidate. Even if repeated orthogonalization is not a necessity, repeated classical Gram-Schmidt can even be faster than modified Gram-Schmidt without repetition.

Note that this implies that the orthonormal basis $V$ is indeed best represented as a Matrix rather than a Vector of Vectors.

### Remark

In the benchmarks above I normalize a vector by first computing t = 1 / norm(w) and only then computing w *= t. The good thing is that multiplication is a lot faster than division, but the downside is that we round numbers twice. I don’t know if the latter has significant consequences for orthogonality.