Friday, March 4, 2011

StringBuilder vs StringBuffer vs String.concat - done right

StringBuilder vs StringBuffer vs String.concat - done right

illustration
How long is a piece of String?

Introduction

Concatenation of Strings is very easy in Java - all you need is a '+'. It can't get any easier than that, right? Unfortunately there are a few pitfalls. One thing you should remember from your first Java lessons is a small albeit important detail: String objects are immutable. Once constructed they cannot be changed anymore.
Whenever you "change" the value of a String you create a new object and make that variable reference this new object. Appending a String to another existing one is the same kind of deal: a new String containing the stuff from both is created and the old one is dropped.
You might wonder why Strings are immutable in first place. There are two very compelling reasons for it:
  1. Immutable basic types makes things easier. If you pass a String to a function you can be sure that its value won't change.
  2. Security. With mutable Strings one could bypass security checks by changing the value right after the check. (Same thing as the first point, really.)

The performance impact of String.concat()

Each time you append something via '+' (String.concat()) a new String is created, the old stuff is copied, the new stuff is appended, and the old String is thrown away. The bigger the String gets the longer it takes - there is more to copy and more garbage is produced.
Creating a String with a length of 65536 (character by character) already takes about 22 seconds on an AMD64 X2 4200+. The following diagram illustrates the exponentially growing amount of required time:
String.concat() - exponential growth
Figure 1: StringBuilder vs StringBuffer vs String.concat
StringBuilder and StringBuffer are also shown, but at this scale they are right onto the x-axis. As you can see String.concat() is slow. Amazingly slow in fact. It's so bad that the guys over at FindBugs added a detector for String.concat inside loops to their static code analysis tool.

When to use '+'

Using the '+' operator for concatenation isn't bad per se though. It's very readable and it doesn't necessarily affect performance. Let's take a look at the kind of situations where you should use '+'.
a) Multi-line Strings:
String text=
    "line 1\n"+
    "line 2\n"+
    "line 3";
Since Java doesn't feature a proper multi-line String construct like other languages, this kind of pattern is often used. If you really have to you can embed massive blocks of text this way and there are no downsides at all. The compiler creates a single String out of this mess and no concatenation happens at runtime.
b) Short messages and the like:
System.out.println("x:"+x+" y:"+y);
The compiler transforms this to:
System.out.println((new StringBuilder()).append("x:").append(x).append(" y:").append(y).toString());
Looks pretty silly, doesn't it? Well, it's great that you don't have to write that kind of code yourself. ;)
If you're interested in byte code generation: Accordingly to Arno Unkrig (the amazing dude behind Janino) the optimal strategy is to use String.concat() for 2 or 3 operands, and StringBuilder for 4 or more operands (if available - otherwise StringBuffer). Sun's compiler always uses StringBuilder/StringBuffer though. Well, the difference is pretty negligible.

When to use StringBuilder and StringBuffer

This one is easy to remember: use 'em whenever you assembe a String in a loop. If it's a short piece of example code, a test program, or something completely unimportant you won't necessarily need that though. Just keep in mind that '+' isn't always a good idea.

StringBuilder and StringBuffer compared

StringBuilder is rather new - it was introduced with 1.5. Unlike StringBuffer it isn't synchronized, which makes it a tad faster:
StringBuilder compared with StringBuffer
Figure 2: StringBuilder vs StringBuffer
As you can see the graphs are sort of straight with a few bumps here and there caused by re-allocation. Also StringBuilder is indeed quite a bit faster. Use that one if you can.

Initial capacity

Both - StringBuilder and StringBuffer - allow you to specify the initial capacity in the constructor. Of course this was also a thing I had to experiment with. Creating a 0.5mb String 50 times with different initial capacities:
different initial capacities compared
Figure 3: StringBuilder and StringBuffer with different initial capacities
The step size was 8 and the default capacity is 16. So, the default is the third dot. 16 chars is pretty small and as you can see it's a very sensible default value.
If you take a closer look you can also see that there is some kind of rhythm: the best initial capacities (local optimum) are always a power of two. And the worst results are always just before the next power of two. The perfect results are of course achieved if the required size is used from the very beginning (shown as dashed lines in the diagram) and no resizing happens at all.

Some insight

That "PoT beat" is of course specific to Sun's implementations of StringBuilder and StringBuffer. Other implementations may show a slightly different behavior. However, if these particular implementations are taken as target one can derive two golden rules from these results:
  1. If you set the capacity use a power of two value.
  2. Do not use the String/CharSequence constructors ever. They set the capacity to the length of the given String/CharSequence + 16, which can be virtually anything.

Benchmarking method

In order to get meaningful results I took care of a few things:
  • VM warmup
  • separate runs for each test
  • each sample is the median of 5 runs
  • inner loops were inside of each bench unit

2 comments: