## Thursday, January 26, 2012

### When double precision is not enough

I was doing a finite element (FE) calculation and needed the sum of the lowest 7 eigenvalues of a symmetric matrix (which comes from the FE assembly) to converge to at least 1e-8 accuracy, so that I could check a calculation done by another solver of mine, which computes the same quantity but doesn't use FE. In reality I wanted the value rounded to 8 decimal digits to be correct, so I really needed 1e-9 accuracy (it's ok if it is, say, 2e-9, but not ok if it is 9e-9). With my FE solver, I couldn't get it to converge to better than roughly 5e-7, no matter how hard I tried. Now what?

When testing convergence, I take a good mesh and keep increasing "p" (the polynomial order) until the result converges. For my particular problem, it is fully converged at about p=25 (the solver supports orders up to 64). Increasing "p" further does not improve the accuracy, which stays at the level of 5e-7 for the sum of the lowest 7 eigenvalues. For optimal meshes it converges at p=25; for non-optimal meshes it converges at higher "p", but in all cases it doesn't get below 5e-7.

I know from experience that for simpler problems the FE solver can easily converge to 1e-10 or better using double precision. So I know it is doable; the question is what the problem is. There are a few possible reasons:

- The FE quadrature is not accurate enough
- The condition number of the matrix is high, so LAPACK doesn't return very accurate eigenvalues
- A bug in the assembly/solver (such as a single/double mix-up in Fortran, or some other subtle bug)

When using the same solver for a simpler potential, it converged nicely to 1e-10, which suggests there is no bug in the assembly or the solver itself. It is possible that the quadrature is not accurate enough, but again, since it converges for simple problems, that is probably not it. So it seems the culprit is the ill-conditioned matrix. I printed the residuals (computed directly in Fortran from the matrix and the eigenvectors returned by LAPACK), and they were only around 1e-9; for simpler problems they can easily reach 1e-14. So that must be it. How do we fix it?

Obviously by making the matrix less ill-conditioned. The ill-conditioning is caused by the mesh for this problem (the ratio of the longest to the shortest element is 1e9), but for my problem I really needed such a mesh. So the other option is to increase the accuracy of the real numbers.

In my Fortran code all real variables are declared as real(dp), where dp is an integer defined in a single place in the project. There are several ways to define it; its value is 8 for gfortran, which means double precision. So I changed it to 16 (quadruple precision) and recompiled. Now the whole program calculates in quadruple precision (more than 30 significant digits). I also had to recompile LAPACK with the "-fdefault-real-8" gfortran option, which promotes all double precision numbers to quadruple precision, and I kept using the "d" versions (double precision, now promoted to quadruple) of the LAPACK routines.

I reran the calculation, and suddenly the LAPACK residuals are around 1e-13, and the solver easily converges to 1e-10 (for the sum of the lowest 7 eigenvalues). Problem solved.

Turning my Fortran program to quadruple precision was as easy as changing one variable and recompiling. Turning LAPACK to quadruple precision is easy with a single gfortran flag (LAPACK uses the old F77 style for double precision; if it used real(dp), I would simply change dp as in my own program). The whole calculation got at least 10x slower with quadruple precision. The reason is that the gfortran runtime uses the libquadmath library, which emulates quadruple precision in software (current CPUs only support double precision natively).

I actually discovered a few bugs in my program in the process (typically some constants in older code didn't use the "dp" suffix but had double precision hardwired). The compiler warns about all such cases, where real variables have incompatible precision.

It is amazing how easy it is to work with different precisions in Fortran (literally one change and a recompile). How could this be done in C++? This Wikipedia page suggests that "long double" is only 80-bit in most cases (quadruple is 128-bit), but gcc offers __float128, so it seems I would have to manually change all "double" to "__float128" in the whole C++ program (which could be done with a single "sed" command).

Marius Gedminas said...

Could you be sneaky and #define double __float128?

Anonymous said...

I suppose the cleanest solution is to use a typedef. If you want to get into trouble, you may use #define double __float128.

The C++-ish way to define something similar to Fortran's real type would be a simple class template. But it feels a lot more clumsy.

You can use static assertions to get better error messages in case something goes wrong and other techniques to improve compatibility. Furthermore, you can use preprocessor macros to work around different compilers, implementations, etc.

I don't know if this has been done already.

```cpp
template <int byte_size>
struct real;

template <> struct real<4> {
    typedef float t;
};
template <> struct real<8> {
    typedef double t;
};
template <> struct real<10> {
    static_assert(sizeof(long double) == 10, "long double is not 80-bit here");
    typedef long double t;
};
#ifdef __GNUC__  // __float128 is a GNU extension
template <> struct real<16> {
    typedef __float128 t;
};
#endif

// FLOAT_BYTE_SIZE is the one place you change to switch precision:
typedef real<FLOAT_BYTE_SIZE>::t my_real;
```

Unknown said...

On x86/amd64, gfortran also supports the 80-bit "extended precision" of the processor via real(kind=10). It is a little slower than the standard double precision real(kind=8), but much faster than software-emulated quadruple precision real(kind=16). So if 80 bits are enough in your case, you might want to try this.

CarlK said...

Does anyone know how David H. Bailey's double-double and quad-double package (based on pairs of IEEE doubles) compares to libquadmath? See: http://crd-legacy.lbl.gov/~dhbailey/mpdist/

Unknown said...

sojrhthu, that is a great idea, but I don't know how to force LAPACK to use it. For my own code it would work well.

Markus, thanks for the tip. This looks really cool. And I guess you can also hook in MPFR automatically to get any precision. I like Fortran because it makes things easy for scientific computing, but I also like C++, because almost anything is possible if one really wants it.

Unknown said...

CarlK --- I didn't try his double-double and quad-double packages, but I tried his MPFUN90 and compared it with MPFR. On the sin(pi/10)**2 + cos(pi/10)**2 example, MPFUN90 takes about 6 s on my computer, while MPFR is *instant*. I used the same precision for both; I forgot which, I think it was 10000, maybe that was too much.

Pavel Holoborodko said...

Properly designed C++ mathematical code would use templates to be invariant to the scalar type (along with type traits and other meta-magic).

Then changing the scalar type (float, double, whatever) is not a difficult operation.

Thanks to this approach I was able to convert Eigen, the C++ matrix library, to arbitrary precision using my custom-made multiprecision floating-point scalar type, mpreal (a C++ wrapper around MPFR/MPIR).

I use similar OOP techniques to enable Matlab to run computations in arbitrary precision without changes in source code:
Multiprecision Computing Toolbox for MATLAB

Unknown said...

Pavel, thanks a lot for the feedback and the links. Indeed, I think the correct C++ approach is to use templates and classes for scalar types and also for arrays. Compared to Fortran it is more heavyweight, but it does the job and is also more general (i.e., as you wrote, you can then easily plug in MPFR, while in Fortran a few more modifications would be needed).