Double-precision to string
Posted: Tue May 14, 2024 4:24 pm
Recently I've been working on sample programs for "The Art of ARM Assembly, Volume 2." When working on the numeric conversion examples, I chose not to simply copy the algorithms I used in "The Art of 64-bit Assembly" and "The Art of ARM Assembly, Volume 1." Instead, I experimented with some new algorithms in an attempt to improve the performance of the code.
I was fairly successful in converting the algorithm for double-precision floating-point to string (actually, the other numeric conversions were also much improved as well, I've discussed them elsewhere). I wondered if the ARM code could be translated to x86 code to see similar improvements. So I set about translating the 32-bit ARM code into 64-bit x86 assembly language. The conversion was very simplistic, I just did a line-by-line (instruction-by-instruction) from ARM to x86. So the resulting code is nowhere near as fast as it could be. However, comparing the new x86 code against the algorithm i used in "The Art of 64-bit Assembly Language (volume 1)" the result is better than 2.6x faster. I have attached some sample code that demonstrates the new algorithm (alongside the original FPU-based code, with several bug fixes, from Ao64A).
Comparing the performance of these two algorithms is slightly an apples-to-oranges comparison. The new algorithm handles 64-bit double-precision values only (using SSE instructions), whereas the original algorithm uses the FPU and supports real10 (80-bit) extended-precision values. Nevertheless, FPU usage has fallen out of favor recently, so having code that supports SSE/AVX instructions is probably more useful, anyway.
I have also included two text files providing the output of the two algorithms. They largely match for the sample data I used. The differences fall into two categories: differences due to the 64-bit vs 80-bit floating-point precision, and certain special values (NaN, Inf, and denormalized values) which the FPU-based code doesn't support.
The new algorithm achieves the performance gain by using several *really large* lookup tables. In particular, I use a 617-entry table (of double-precision floats) to convert a normalized double value to a value in the range 1.0e16 to just less than 1.0e17. I use a second 617-entry table (of dwords) to determine the decimal exponent. Finally, i use a 40,000-byte lookup table to convert a Radix-10,000 number (integers in the range 0 to 9,999) to a string of four ASCII characters (for fast mantissa-to-string conversions). All told, these tables (and the code) consume over 100K memory (a huge amount for an algorithm such as this, but still nothing on a modern CPU).
If you're wondering why it's worth the effort to create faster library code, keep in mind that library code should always be as fast as possible. Just because you're not writing speed-sensitive code using the library doesn't mean the next person doesn't need that speed. Library code should always be as fast as possible to handle the (often outlier) cases where the caller really does need the speed.
The attached source code assembles using the MASM assembler.
Cheers,
Randy Hyde
I was fairly successful in converting the algorithm for double-precision floating-point to string (actually, the other numeric conversions were also much improved as well, I've discussed them elsewhere). I wondered if the ARM code could be translated to x86 code to see similar improvements. So I set about translating the 32-bit ARM code into 64-bit x86 assembly language. The conversion was very simplistic, I just did a line-by-line (instruction-by-instruction) from ARM to x86. So the resulting code is nowhere near as fast as it could be. However, comparing the new x86 code against the algorithm i used in "The Art of 64-bit Assembly Language (volume 1)" the result is better than 2.6x faster. I have attached some sample code that demonstrates the new algorithm (alongside the original FPU-based code, with several bug fixes, from Ao64A).
Comparing the performance of these two algorithms is slightly an apples-to-oranges comparison. The new algorithm handles 64-bit double-precision values only (using SSE instructions), whereas the original algorithm uses the FPU and supports real10 (80-bit) extended-precision values. Nevertheless, FPU usage has fallen out of favor recently, so having code that supports SSE/AVX instructions is probably more useful, anyway.
I have also included two text files providing the output of the two algorithms. They largely match for the sample data I used. The differences fall into two categories: differences due to the 64-bit vs 80-bit floating-point precision, and certain special values (NaN, Inf, and denormalized values) which the FPU-based code doesn't support.
The new algorithm achieves the performance gain by using several *really large* lookup tables. In particular, I use a 617-entry table (of double-precision floats) to convert a normalized double value to a value in the range 1.0e16 to just less than 1.0e17. I use a second 617-entry table (of dwords) to determine the decimal exponent. Finally, i use a 40,000-byte lookup table to convert a Radix-10,000 number (integers in the range 0 to 9,999) to a string of four ASCII characters (for fast mantissa-to-string conversions). All told, these tables (and the code) consume over 100K memory (a huge amount for an algorithm such as this, but still nothing on a modern CPU).
If you're wondering why it's worth the effort to create faster library code, keep in mind that library code should always be as fast as possible. Just because you're not writing speed-sensitive code using the library doesn't mean the next person doesn't need that speed. Library code should always be as fast as possible to handle the (often outlier) cases where the caller really does need the speed.
The attached source code assembles using the MASM assembler.
Cheers,
Randy Hyde