Showing posts with label substring. Show all posts

Friday, 24 June 2011

String and memory leaks

Most Java users are probably aware that a String object is more complex than just an array of char. To make strings in Java more efficient, additional measures were taken: for instance, the String pool was created to save memory by reusing the same String objects instead of allocating new ones. Another optimization, the one I want to talk about today, is the addition of the offset and count fields to String instances.
Why were those fields added? Their purpose is to let already allocated structures be reused by some string operations, such as taking a substring of a given string. The idea is that instead of creating a new char array for a substring we can just 'reuse' the old one. To be exact, this is what the String.substring() method does: instead of copying a char array for the returned object, it creates a new String that reuses the char[] of the old one. Only the values of the offset and count fields (which indicate the beginning and the length of the new string) are different. (This describes the JDK implementation at the time of writing; from Java 7 update 6 onward substring() copies the characters instead.) For background, see the earlier post: How substring() works in java? Because substring is used so often, this mechanism helps to save a lot of memory. It is important to add that this can work only because String objects are immutable. See the following snippet:

public static void sendEmail(String emailUrl) {
    String email = emailUrl.substring(7); // the "mailto:" prefix has 7 characters
    int at = email.indexOf("@");
    String userName = email.substring(0, at);
    String domainName = email.substring(at + 1); // skip the '@' itself
}

public static void main(String[] args) {
    sendEmail("mailto:user_name@domain_name.com");
}

Thanks to the way substring() is implemented, when we extract email, userName and domainName, three new String objects are created but the char array is not copied. All three new strings reuse the character array from emailUrl. For really long URLs this saves approximately 2/3 of the memory we would otherwise use. Great, right?
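To make the offset/count idea concrete, here is a minimal sketch of a string-like class that shares its backing array between substrings. The class SharedChars and all of its methods are hypothetical names invented for this illustration; this is a model of the technique, not the actual java.lang.String source.

```java
/** Hypothetical model of the offset/count layout; not the real java.lang.String. */
final class SharedChars {
    private final char[] value;  // backing array, possibly shared with other instances
    private final int offset;    // index in value where this string starts
    private final int count;     // number of characters in this string

    SharedChars(String s) {
        this(s.toCharArray(), 0, s.length());
    }

    private SharedChars(char[] value, int offset, int count) {
        this.value = value;
        this.offset = offset;
        this.count = count;
    }

    /** Returns a new instance that reuses the same array; only offset and count differ. */
    SharedChars substring(int begin, int end) {
        if (begin < 0 || end > count || begin > end) {
            throw new StringIndexOutOfBoundsException();
        }
        return new SharedChars(value, offset + begin, end - begin);
    }

    SharedChars substring(int begin) {
        return substring(begin, count);
    }

    /** True when both instances point at the very same char[] object. */
    boolean sharesArrayWith(SharedChars other) {
        return value == other.value;
    }

    /** Length of the backing array, which may be much larger than count. */
    int backingLength() {
        return value.length;
    }

    @Override
    public String toString() {
        return new String(value, offset, count);
    }
}
```

With this model, new SharedChars("mailto:user_name@domain_name.com").substring(7) prints as user_name@domain_name.com while still pointing at the original 32-character array. That is exactly the sharing that makes the optimization cheap, and it only works because the array is never modified.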

Ok… now the dark side of that optimization! Check out this snippet:

public static final String START_TAG = "<title>";
public static final String END_TAG = "</title>";

public static String getPageTitle(String pageUrl) {
    // retrieve the HTML with a helper function:
    String htmlPage = getPageContent(pageUrl);

    // parse the page content to get the title
    int start = htmlPage.indexOf(START_TAG);
    start = start + START_TAG.length();
    int end = htmlPage.indexOf(END_TAG, start); // search for the end tag only after the start tag
    String title = htmlPage.substring(start, end);
    return title;
}

Here we extract the title from an HTML page. Can you see the problem with this code? It looks simple and correct, right?

Now, try to imagine that the htmlPage String is huge, more than 100,000 characters, but the title of the page has only 50 characters. Because of the optimization mentioned above, the returned object reuses the char array of htmlPage instead of creating a new one, and this means that instead of a small String object you get back a String that holds on to a 100,000-character array! If your code invokes getPageTitle() many times, you may find that you have stored only a thousand titles and are already out of memory! Scary, right?

Of course there is an easy solution for that: instead of returning title directly, you can return new String(title). The String(String) constructor copies only the part of the underlying char array that the string actually uses, so the returned title will really hold just 50 characters. Now we are safe :)
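Here is a small sketch of the fixed extraction. To keep it self-contained I assume the HTML is passed in directly rather than fetched by getPageContent(); the class name TitleExtractor, the method extractTitle, and the choice to return null for a missing title are my own assumptions, not part of the original code.

```java
/** Hypothetical helper illustrating the defensive copy; names are assumptions. */
final class TitleExtractor {
    private static final String START_TAG = "<title>";
    private static final String END_TAG = "</title>";

    private TitleExtractor() {}

    /** Returns the page title, or null when no complete title element is found. */
    static String extractTitle(String htmlPage) {
        int start = htmlPage.indexOf(START_TAG);
        if (start < 0) {
            return null; // no opening tag
        }
        start += START_TAG.length();
        int end = htmlPage.indexOf(END_TAG, start);
        if (end < 0) {
            return null; // no closing tag after the opening tag
        }
        // Defensive copy: where substring() shares the page's char[],
        // this keeps only the title's characters reachable.
        return new String(htmlPage.substring(start, end));
    }
}
```

The extra allocation is tiny compared with the page itself, so for long-lived results like cached titles the copy should pay for itself quickly.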

So what is the lesson here? Always use new String(String)? No. In general the String optimizations are really helpful and it is worth taking advantage of them. You just have to be careful with what you are doing and be aware of what is going on 'under the hood' of your code. The String class API is not intuitive in some situations, so beware!

Tuesday, 31 May 2011

How can final fields appear to change their values in the old Java Memory Model

One of the best examples of how final fields' values can be seen to change involves one particular implementation of the String class.
The String class is immutable in Java, and so are wrapper classes such as Integer and Double. Some of them have mutable counterparts, for example StringBuffer for String (BitSet is sometimes used as a mutable stand-in for Integer bit flags). One reason these classes are immutable is security: methods in the IO libraries take a String parameter rather than a StringBuffer, so the value cannot be changed after it has been checked.
A String can be implemented as an object with three fields (the real String class has more, but consider three for now): a character array, an offset into that array, and a length. So they are:
final char[] buffer = new char[1000];//say 1000;
final int offset;
final int length;


The rationale for implementing String this way, instead of having only the character array, is that it lets multiple String and StringBuffer objects share the same character array and avoid additional object allocation and copying.

Suppose now that you call the substring() method on this String. The call returns a new string, but String.substring() can be implemented by creating a new String that shares the same character array with the original and differs only in the length and offset fields. For a String, these fields are all final. See - How substring works in Java?

String s1 = "Kinshuk";
String s2 = s1.substring(4); // returns the string "huk"


This is the one-parameter substring method, which takes a beginIndex. The substring starts at index 4 (numbering starts from 0), which is the character h.

Because substring() returns a new String, a constructor is called. Before the constructor body runs, the JVM allocates memory for the object and default-initializes its fields, so offset and length start out as 0; the constructor then assigns the real values, and the reference to the new string is returned. As I said, the array stays the same and only offset and length change. Now suppose the reference to s2 reaches another thread T2 without any synchronization. T2 may then see s2 as empty, because it reads offset and length as 0 (the default values) rather than 4 and 3 respectively. In another case T2 reads length correctly but not offset, so it sometimes reads Kin and sometimes huk.

The original Java Memory Model allowed this behavior; several JVMs have exhibited this behavior. The new Java Memory Model makes this illegal.
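The scenario can be sketched as follows. ThreeFieldString and RacyPublication are hypothetical names I made up for this illustration, and the reference is deliberately published through a plain static field with no synchronization. Under the new memory model, a reader that sees the reference is guaranteed to see the final fields fully initialized, so on a current JVM the only possible observations are null (not published yet) or huk. The broken values described above, the empty string or Kin, belong to the old model and I would not expect to reproduce them today.

```java
/** Hypothetical three-field string; names are illustrative only. */
final class ThreeFieldString {
    private final char[] buffer;
    private final int offset;
    private final int length;

    ThreeFieldString(char[] buffer, int offset, int length) {
        this.buffer = buffer;
        this.offset = offset;
        this.length = length;
    }

    /** Shares the buffer; only offset and length differ in the new object. */
    ThreeFieldString substring(int begin) {
        return new ThreeFieldString(buffer, offset + begin, length - begin);
    }

    @Override
    public String toString() {
        return new String(buffer, offset, length);
    }
}

final class RacyPublication {
    // Deliberately neither volatile nor synchronized: a data race on the reference.
    static ThreeFieldString shared;

    private RacyPublication() {}

    /** Returns what the reader thread observed: null if nothing was published yet, else the text. */
    static String publishAndRead() {
        shared = null;
        final String[] seen = new String[1];
        Thread writer = new Thread(() -> {
            ThreeFieldString s1 = new ThreeFieldString("Kinshuk".toCharArray(), 0, 7);
            shared = s1.substring(4); // unsynchronized publication
        });
        Thread reader = new Thread(() -> {
            ThreeFieldString s = shared; // read the racy reference exactly once
            seen[0] = (s == null) ? null : s.toString();
        });
        writer.start();
        reader.start();
        try {
            writer.join();
            reader.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return seen[0];
    }
}
```

If the three fields were not final, the new model would give no such guarantee for a racy read, which is why immutable classes should declare their fields final rather than merely never reassign them.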

So what is the solution for this?

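Some suggest making the constructor synchronized, but that is not a proper solution (Java does not even allow the synchronized modifier on a constructor). Instead, Java introduced a new memory model (JSR-133, in Java 5), which solved this and some other issues.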
We'll see this here - How does final fields work under new JVM?

Saturday, 14 May 2011

How substring() works in java?

substring() is a method that returns part of a string. The String class is immutable in Java.
A String can be implemented as an object with three fields (the real String class has more, but consider three for now): a character array, an offset into that array, and a length.
Consider the following code:
String s1 = "Monday";
String s = s1.substring(0,3);
or
s1.substring(0,3).equals("Mon");

substring() creates a new String and returns it. But substring is clever: it does not make a deep copy of the characters the way most languages do. It just creates a pointer into the original immutable String, i.e. it points to the value char[] of the base string and tracks the offset where the substring starts and the count of how long it is. So only length and offset differ per string, while the character array is shared with the original String. This is shown in the figure:
[Figure: substring-buffer-java]
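The same 'view into a shared array' idea can be observed with a real API that still behaves this way: java.nio.CharBuffer. This is only an analogue of the String layout described above, not how String itself is written, and the class SharedViewDemo and its method view are names I invented for the sketch.

```java
import java.nio.CharBuffer;

/** Hypothetical demo class: CharBuffer used as an analogue of a sharing substring. */
final class SharedViewDemo {
    private SharedViewDemo() {}

    /** Returns a view over chars[begin, end) that shares the array instead of copying it. */
    static CharBuffer view(char[] chars, int begin, int end) {
        // wrap() does not copy the array; subSequence() returns a buffer
        // that shares the same content with a narrower position and limit.
        return CharBuffer.wrap(chars).subSequence(begin, end);
    }
}
```

For the characters of "Monday", view(chars, 0, 3) reads as Mon, and if chars[0] is later changed to 'S' the view reads Son, because nothing was copied. A String can afford this kind of sharing safely only because its array can never change.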


The downside of this cleverness is that a tiny substring of a giant base String can prevent garbage collection of that big String's data even when the whole String is no longer needed. (Strictly, it is the value char[] array that is held in memory; the base String object itself can be collected.)
If you know that a tiny substring is holding in memory a giant string that would otherwise be garbage collected, you can break the bond by using:

String s = new String(s1.substring(0,3));