A Comparative Analysis of .NET and Java Formatting APIs, and Optimizing the Conversion of Strings to Numbers and Vice Versa

Because Internet protocols such as XML and HTML are text-based, software programs spend a considerable amount of time converting text to numbers (integers and floats) and vice versa. To optimize performance, Eyal Eliahu Alaluf, CTO of Mainsoft, developed high-performance algorithms for integers and real numbers that deliver up to 2.6x the number of conversions per second in .NET, and typically deliver 3x more conversions per second than the equivalent Java APIs.

Background

Mainsoft for Java EE software implements the .NET 2.0 APIs to format primitive types on top of the Java VM. To improve the performance of C# and Visual Basic server applications running on Java using Mainsoft software, we recently conducted a comparative analysis of the .NET and Java APIs that convert integers and real numbers to strings, and we tested their performance. Then, we designed two algorithms that format integers and real numbers significantly faster than Microsoft's implementation of the .NET APIs, and with one exception, our algorithms also deliver 3x as many conversions per second as the Java APIs.

What follows are the results of our performance tests and a description of the algorithms we developed. The algorithms are included in the Mainsoft for Java EE v 2.x release and will be included in future Mono releases.

Performance testing: .NET and Java formatting APIs

Both .NET and Java provide a set of APIs that format primitive types into strings and vice versa. While their semantics differ, both define format types (decimal format, scientific format, fixed point format, etc.) and the required precision, and they support locale-dependent information, such as which negative sign and which decimal point to use.

During Q4 2007, Mainsoft conducted a comparative analysis and performance testing of the .NET and Java APIs' integer-to-string and real-number-to-string conversion rates (see the table below). We used a 1.8 GHz Pentium M with 2GB RAM to test the APIs, and we used a Sun J2SE 6.0 server virtual machine to measure J2SE 6.0 performance.

Conversion                                                   .NET 2.0 Conversions*   J2SE 6.0 Conversions
                                                             (000's per second)      (000's per second)
Integers
  Integers using default formatting (e.g. 12345)                     3,722                  11,557
  Integers using currency formatting (e.g. $12,345)                  2,250                   1,174
Real numbers
  Doubles using default formatting (e.g. 0.12345)                    1,871                   1,036
  Small doubles using default formatting (e.g. 1.2345E-200)          1,633                     142
  Large doubles using default formatting (e.g. 1.2345E+200)          1,583                     165

*The implementation of the integer and real-number formatting APIs is identical across .NET 2.0, .NET 3.0, and .NET 3.5.

The results indicate Java delivers 3x as many conversions per second as .NET in the simplest case, which uses default formatting for integers. As soon as you add culture-specific formatting, such as currencies, Java performance falls by a factor of ten.

When formatting real numbers into strings, Microsoft's implementation of the .NET APIs outperforms the Java APIs. The performance differential between .NET and Java is even more pronounced for very large and very small double values.

Mainsoft's integer formatting algorithm

Overview

Mainsoft's integer formatting algorithm follows a 3-step process that is dictated by the semantics of the .NET APIs:

  1. The format is parsed to retrieve the type of format and the precision.

  2. The number is converted to a list of digits.

  3. The list of digits is converted to a string by applying the logic of the specific format and the culture-specific information.
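As a rough illustration, the three steps reduce to a digits-then-string pipeline. The sketch below is mine, not Mainsoft's actual code, and it simplifies away the format parsing of step 1 and the culture-specific rules of step 3:

```java
public class IntFormatSketch {
    // Minimal sketch of the digits-then-string pipeline; real .NET-style
    // formatting would also parse the format string and apply culture rules.
    static String format(int value) {
        char[] buf = new char[11];            // enough for -2147483648
        int pos = buf.length;
        boolean negative = value < 0;
        long v = Math.abs((long) value);      // widen to avoid overflow on MIN_VALUE
        do {
            buf[--pos] = (char) ('0' + (v % 10)); // step 2: extract next digit
            v /= 10;
        } while (v != 0);
        if (negative) buf[--pos] = '-';
        return new String(buf, pos, buf.length - pos); // step 3: render the string
    }

    public static void main(String[] args) {
        System.out.println(format(12345)); // prints "12345"
    }
}
```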

We improve the conversion performance by:

  • Representing the digits so that they can be manipulated with bitwise operations, which CPU registers are optimized for, rather than with array traversals.
  • Using thread-specific storage to minimize memory allocations.

Digits representation

The conventional representation of digits uses a byte array in which each digit occupies one cell. Allocating a byte array and transforming an integer is a resource-intensive operation when you consider the extremely high demands placed on this algorithm.

Rather than using a byte array, Mainsoft's algorithm encodes the digits list using binary-coded decimals (BCD), which represents the number 12345 as the number 0x12345 (which has a decimal value of 74565). Then, the algorithm stores the digits' BCD values within four integer fields. This representation works for up to 32 digits, which accommodates the .NET decimal data type.

Using BCD, digit manipulations and conversions to characters are performed with simple and fast bitwise operations. In addition, the operation of converting an integer to its BCD representation is an efficient and simple arithmetic operation.
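A small sketch of how a BCD-encoded digit list behaves; the helper names here are hypothetical, not Mainsoft's actual code:

```java
public class BcdSketch {
    // Encode a non-negative value (up to 15 digits) as BCD, one digit per
    // 4-bit nibble, so decimal 12345 becomes 0x12345 (decimal 74565).
    static long toBcd(long n) {
        long bcd = 0;
        int shift = 0;
        do {
            bcd |= (n % 10) << shift;  // place the next digit in its nibble
            n /= 10;
            shift += 4;
        } while (n != 0);
        return bcd;
    }

    // Reading digit i (0 = least significant) is a shift and a mask,
    // with no array traversal required.
    static int digitAt(long bcd, int i) {
        return (int) ((bcd >>> (4 * i)) & 0xF);
    }

    public static void main(String[] args) {
        System.out.println(Long.toHexString(toBcd(12345))); // prints "12345"
    }
}
```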

Converting an integer to a binary-coded decimal

The BCD conversion algorithm minimizes the number of comparatively slow division and remainder operations for 8- to 64-bit signed and unsigned integers.

  • The number is divided iteratively by 100,000,000, processing eight decimal digits per iteration.
  • The remainder of that division is divided again by 10,000.
  • The remainder of this last operation is converted to BCD using a pre-calculated cache.
  • The quotient of the last operation is also smaller than 10,000, and is converted to BCD using the same pre-calculated cache.
  • The two BCD results are stored in the lower 16 bits and upper 16 bits of the first (or second, or third, according to the current iteration) BCD integer field.
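The steps above can be sketched as follows, one 32-bit BCD field per 8-digit chunk; the cache size and field layout here are assumptions based on the description, not Mainsoft's actual layout:

```java
public class IntToBcdSketch {
    // Pre-calculated cache: the BCD encoding of every value in [0, 10000).
    static final int[] BCD_CACHE = new int[10000];
    static {
        for (int n = 0; n < 10000; n++) {
            BCD_CACHE[n] = (n % 10) | ((n / 10 % 10) << 4)
                         | ((n / 100 % 10) << 8) | ((n / 1000) << 12);
        }
    }

    // Convert a non-negative long into up to four 32-bit BCD fields
    // (8 digits each), using only two divisions per 8-digit chunk.
    static int[] toBcdFields(long value) {
        int[] fields = new int[4];
        int i = 0;
        do {
            int chunk = (int) (value % 100_000_000L); // next 8 decimal digits
            value /= 100_000_000L;
            int low = BCD_CACHE[chunk % 10000];       // lower 4 digits via cache
            int high = BCD_CACHE[chunk / 10000];      // upper 4 digits via cache
            fields[i++] = low | (high << 16);
        } while (value != 0);
        return fields;
    }

    public static void main(String[] args) {
        System.out.println(Integer.toHexString(toBcdFields(12345)[0])); // prints "12345"
    }
}
```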

Minimizing memory allocations

Mainsoft's integer formatting algorithm uses a class to maintain the data about the formatted number. The class contains the format, precision, the four BCD fields, and the character array used to create the resulting string.

Allocating memory for this conversion class, and for its character array, carries significant overhead relative to the required throughput. This issue is typical of high-throughput operations, where memory allocations can cause significant performance degradation.

To dramatically reduce the use of memory allocations, Mainsoft's algorithm places the conversion class as a member of the current thread class, allowing both the class and character array to be reused for different numbers. The algorithm also allows for further optimizing the lookup of the culture-specific formatting information by having the thread update the class whenever the thread culture information changes.
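In standard Java, the closest analogue of attaching the conversion state to the current thread is a ThreadLocal. This is only an illustrative sketch under that assumption; Mainsoft's code attaches the state to its own thread class:

```java
public class FormatterState {
    // Reusable per-conversion state: format, precision, the four BCD
    // fields, and the output character buffer.
    int format;
    int precision;
    final int[] bcd = new int[4];
    final char[] buffer = new char[32];

    // One instance per thread: every conversion on this thread reuses it
    // instead of allocating a fresh object and buffer.
    private static final ThreadLocal<FormatterState> CURRENT =
            ThreadLocal.withInitial(FormatterState::new);

    static FormatterState current() {
        return CURRENT.get();
    }
}
```

Culture-specific formatting data could be cached on the same object and refreshed only when the thread's culture changes, avoiding a lookup per conversion.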

Mainsoft's real numbers formatting algorithm

Real numbers in both .NET and Java are defined by the IEEE 754 floating point standard.

Mainsoft's algorithm for implementing the .NET APIs:

  1. Converts a double precision floating point into a 64-bit integer and a power of ten using a very efficient O(1) conversion formula.

  2. Applies the integer formatting algorithm to the 64-bit integer, taking into account where the decimal point should be placed according to the given power of ten.

We'll use an example to demonstrate the issues and efficiencies of the O(1) conversion formula.

Converting 0.12345

Consider the real number 0.12345. Its double precision representation is 8895509983982204 * 2 ** -56. In order to transform this representation into 0.12345, the algorithm pre-calculates 2 ** -56 as 1387778780781445675 * 10 ** -35. The result of the multiplication is 12345000000000000412733332592767700 * 10 ** -35. When we convert the result into digits, round the number, and place the decimal point in the correct place, we get the desired 0.12345.

In the calculation above, converting the result of the multiplication into digits is a resource-intensive operation, since the result (in this case) has 35 digits while the required precision is only 17 digits. We avoid this overhead by:

  • Pre-calculating 2 ** -56 as 2560000000000000000 * 2 ** -64 * 10 ** -16. Multiplying by 8895509983982204 gives 22772505558994442240000000000000000 * 2 ** -64 * 10 ** -16.
  • Shifting the product right by 64 bits, which yields 1234500000000000 * 10 ** -16.
  • Placing the decimal point at the right place to produce the desired result: 0.12345.
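The bullet points above can be checked with exact arithmetic. This sketch uses BigInteger for clarity, whereas the actual algorithm would keep only the upper 64 bits of a 64x64-bit multiplication:

```java
import java.math.BigInteger;

public class ShiftDemo {
    // Upper 64 bits of the 128-bit product a * b (both non-negative).
    static long upper64(long a, long b) {
        return BigInteger.valueOf(a).multiply(BigInteger.valueOf(b))
                         .shiftRight(64).longValueExact();
    }

    public static void main(String[] args) {
        // 0.12345 == 8895509983982204 * 2^-56, and 2^-56 is pre-calculated
        // as 2560000000000000000 * 2^-64 * 10^-16.
        long upper = upper64(8895509983982204L, 2560000000000000000L);
        System.out.println(upper); // prints 1234500000000000, i.e. 0.12345 * 10^16
    }
}
```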

The conversion algorithm

The IEEE 754 standard divides the bits of a double precision floating point into a sign S, an exponent E, and a fraction F. The real number's value is (-1) ** S * 2 ** (E - 1075) * 1F, where 1F denotes the binary number created by prefixing F with its implicit leading 1. To simplify matters, we ignore the special cases E=0 and E=2047.
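A quick way to see 1F and E - 1075 for a concrete double; the helper names are mine, not from the article:

```java
public class Ieee754Demo {
    // The 53-bit significand 1F: the 52-bit fraction plus the implicit leading 1.
    static long significand(double d) {
        long bits = Double.doubleToLongBits(d);
        return (bits & ((1L << 52) - 1)) | (1L << 52);
    }

    // The unbiased power of two, E - 1075, for a normalized double.
    static int exponent(double d) {
        long bits = Double.doubleToLongBits(d);
        return (int) ((bits >>> 52) & 0x7FF) - 1075;
    }

    public static void main(String[] args) {
        // prints "8895509983982204 * 2^-56"
        System.out.println(significand(0.12345) + " * 2^" + exponent(0.12345));
    }
}
```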

The algorithm pre-calculates the value of 2 ** (E - 1075) for the valid range of E (i.e. [0..2047]) as I * 2 ** -64 * 10 ** T, where I is a 64-bit integer and T is a 32-bit integer. The multiplication then becomes ((1F * I) * 2 ** -64) * 10 ** T, which yields a 64-bit integer (the upper 64 bits of 1F * I) and a power of ten (T).

In order to preserve the desired precision, the algorithm must ensure that the 64-bit integer has exactly 17 decimal digits. It does this by multiplying 1F * I by 10 until its upper 64 bits have 17 decimal digits. Note that, by the same precision requirement, I must always be bigger than 2 ** 64 / 10.
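A sketch of that normalization step, again with BigInteger standing in for the 128-bit product; tracking the matching adjustment of the ten's power T is left to the caller:

```java
import java.math.BigInteger;

public class NormalizeSketch {
    private static final long TEN_POW_16 = 10_000_000_000_000_000L; // 10^16

    // Multiply the product by 10 until its upper 64 bits hold exactly
    // 17 decimal digits; each multiplication also decrements the
    // ten's power T tracked by the caller.
    static long upper17(BigInteger product) {
        long upper = product.shiftRight(64).longValueExact();
        while (upper < TEN_POW_16) {          // fewer than 17 digits
            product = product.multiply(BigInteger.TEN);
            upper = product.shiftRight(64).longValueExact();
        }
        return upper;
    }

    public static void main(String[] args) {
        // The 0.12345 example: one extra multiplication by 10 is needed.
        BigInteger product = BigInteger.valueOf(8895509983982204L)
                .multiply(BigInteger.valueOf(2560000000000000000L));
        System.out.println(upper17(product)); // prints 12345000000000000
    }
}
```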

Performance testing

We measured the performance of Mainsoft's implementation of the .NET APIs using the same hardware and Sun J2SE 6.0 server virtual machine that was used to measure conversion performance of J2SE 6.0. The results are listed in the right-hand column, in the table below.

Conversion                                                   .NET 2.0             J2SE 6.0             Mainsoft
                                                             (000's per second)   (000's per second)   (000's per second)
Integers
  Integers using default formatting (e.g. 12345)                    3,722               11,557               9,714
  Integers using currency formatting (e.g. $12,345)                 2,250                1,174               3,027
Real numbers
  Doubles using default formatting (e.g. 0.12345)                   1,871                1,036               2,717
  Small doubles using default formatting (e.g. 1.2345E-200)         1,633                  142               2,324
  Large doubles using default formatting (e.g. 1.2345E+200)         1,583                  165               2,307

Mainsoft's implementation of the .NET APIs significantly outperforms Microsoft's .NET implementation in all cases. And, apart from the specific case of formatting integers with default formatting, where Java uses a specialized algorithm, Mainsoft's algorithms deliver 3 to 15 times as many conversions as the Java APIs.

Conclusion

I invite you to review our implementation—posted at NumberFormatter.cs and NumberFormatter.jvm.cs—and provide feedback and questions. I'd also welcome any questions you may have regarding Mainsoft's commitment to deliver equivalent performance of C# and Visual Basic applications running on the Java VM. Please provide your comments in our developer forums.


I look forward to hearing from you!


