Showing posts with label String. Show all posts

Wednesday, 29 June 2011

Beware of String functions - replaceAll and replace et. al

I was working with the replaceAll function and had the following string:
String str = "com.vaani.src.dynamic.CompilationHelloWorld";

I had to replace all the dots with /, so I tried:
str = str.replaceAll(".", "/");

But what I got was a string of nothing but slashes, one per character of the original:
///////////////////////////////////////////

The reason is simple. replaceAll treats its first argument as a regex, and in a regex a dot (.) matches any character, so every character was replaced with /.
So I solved it by escaping the dot with \\:
str = str.replaceAll("\\.", "/");

Another solution: since I am replacing a single character, I can skip regex entirely and use the replace method, which takes char arguments in '' quotes rather than Strings in "":
str = str.replace('.', '/');
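The difference between the regex and plain-text calls can be checked with a small sketch like the one below. The class name `DotReplaceDemo` and its method names are made up for illustration; `Pattern.quote` is shown only as an alternative to escaping the dot by hand.

```java
import java.util.regex.Pattern;

class DotReplaceDemo {
    // Regex version: an unescaped dot matches every character.
    static String regexUnescaped(String s) {
        return s.replaceAll(".", "/");
    }

    // Regex version with the dot quoted, so it is matched literally.
    static String regexQuoted(String s) {
        return s.replaceAll(Pattern.quote("."), "/");
    }

    // Plain-text version: replace(char, char) never interprets regex.
    static String literal(String s) {
        return s.replace('.', '/');
    }

    public static void main(String[] args) {
        String str = "com.vaani.src.dynamic.CompilationHelloWorld";
        System.out.println(regexUnescaped(str)); // slashes only
        System.out.println(regexQuoted(str));    // dots turned into slashes
        System.out.println(literal(str));        // same result, no regex involved
    }
}
```

Since Strings are immutable, all three methods return a new String and leave the argument untouched.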

We discussed something similar here as well.

Monday, 27 June 2011

JDK 7 : String in Switch

A long overdue and seemingly basic feature, but better late than never. Java now has a feature that .NET has had for a while.
public class StringsInSwitch {

public static void main(String[] args) {
for (String a : new String[]{"foo", "bar", "baz"}) {
switch (a) {
case "foo":
System.out.println("received foo!");
break;
case "bar":
System.out.println("received bar!");
break;
case "baz":
System.out.println("received baz!");
break;
}
}
}

}

Friday, 24 June 2011

String and memory leaks

Probably most Java users are aware that a String object is more complex than just an array of char. To make strings more efficient, additional measures were taken: for instance, the String pool was created to save memory by reusing the same String objects instead of allocating new ones. Another optimization that I want to talk about today is the offset and count fields in String instances. (These fields exist in JDK versions up to Java 7 update 5; from update 6 onwards substring() copies the characters instead.)
Why were those fields added? Their purpose is to reuse already allocated structures in some string operations, such as taking substrings of a given string. The idea is that instead of creating a new char array for a substring we can just ‘reuse’ the old one. To be exact, this is what the String.substring() method does: instead of copying the char array for the returned object, it creates a new String that reuses the char[] of the old one. Only the values of the offset and count fields (which indicate the beginning and the length of the new string) differ. See here first to see how the substring function works. Because substring is used so often, this mechanism helps to save a lot of memory. It is important to add that this can work only because String objects are immutable. See the following snippet:

public static void sendEmail(String emailUrl) {
String email = emailUrl.substring(7); // 'mailto:' prefix has 7 letters
String userName = email.substring(0, email.indexOf("@"));
String domainName = email.substring(email.indexOf("@"));
}

public static void main(String[] args) {
sendEmail("mailto:user_name@domain_name.com");
}

Thanks to the way substring() is implemented, when we extract email, userName and domainName three new String objects are created, but the char array is not copied. All the new string variables reuse the character array from emailUrl. Thanks to that reuse, for really long URLs we can save approximately 2/3 of the memory we would use otherwise. Great, right?

Ok… now the dark side of that optimization! Check out this snippet:

public final static String START_TAG = "<title>";
public final static String END_TAG = "</title>";

public static String getPageTitle(String pageUrl) {
// retrieve the HTML with a helper function:
String htmlPage = getPageContent(pageUrl);

// parse the page content to get the title
int start = htmlPage.indexOf(START_TAG);
start = start + START_TAG.length();
int end = htmlPage.indexOf(END_TAG);
String title = htmlPage.substring(start, end);
return title;
}

In here we are extracting from the HTML page its title – can you see the problem with this code? Looks simple and correct, right?

Now, try to imagine that the htmlPage String is huge, more than 100,000 characters, but the title of this page has only 50 characters. Because of the optimization mentioned above, the returned object will reuse the char array of htmlPage instead of creating a new one… and this means that instead of returning a small String object you get back a String that holds on to a 100,000-character array!! If your code invokes getPageTitle() many times, you may find that you have stored only a thousand titles and are already out of memory!! Scary, right?

Of course there is an easy solution for that: instead of returning title directly at the end of getPageTitle(), you can return new String(title). The String(String) constructor copies only the part of the underlying char array that the string actually uses, so the returned title will really hold only 50 characters. Now we are safe :)

So what is the lesson here? Always use new String(String)? No… In general the String optimizations are really helpful and it is worth taking advantage of them. You just have to be careful with what you are doing and be aware of what is going on ‘under the hood’ of your code. The String class API is in some situations not intuitive, so beware!
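As a minimal sketch of the defensive copy discussed in this post, the title extraction could look like the code below. The class `PageTitleDemo` and the method `extractTitle` are hypothetical; the method takes the HTML directly instead of fetching it, and the check for a missing title element is an addition that the original snippet does not have.

```java
class PageTitleDemo {
    static final String START_TAG = "<title>";
    static final String END_TAG = "</title>";

    // Returns the page title, or null when no well-formed title element is found.
    static String extractTitle(String htmlPage) {
        int start = htmlPage.indexOf(START_TAG);
        int end = htmlPage.indexOf(END_TAG);
        if (start < 0 || end < 0 || end < start + START_TAG.length()) {
            return null;
        }
        String title = htmlPage.substring(start + START_TAG.length(), end);
        // new String(...) makes sure the result does not keep the whole
        // page's char[] alive on JDKs whose substring() shares the array.
        return new String(title);
    }
}
```

On JDKs where substring() already copies, the extra new String(...) is redundant but harmless.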

Wednesday, 22 June 2011

Beware of String functions - they may use simple strings as regex

Some String functions look simple and usually do what you expect, but they can surprise you. For example, consider the split function:

public class LineParser {
private final String[] values;

public LineParser(String line, String separator) {
values = line.split(separator);
}

public String getValue(int index) {
return values[index];
}
}

It’s a simple class that encapsulates parsing a text line and stores the result. Let's see.
public static void main(String[] args) {
LineParser parser1 = new LineParser("A,B,C", ",");
System.out.println("parser1:" + parser1.getValue(1));

LineParser parser2 = new LineParser("A B C", " ");
System.out.println("parser2:" + parser2.getValue(1));

LineParser parser3 = new LineParser("A|B|C", "|");
System.out.println("parser3:" + parser3.getValue(1));

LineParser parser4 = new LineParser("A\\B\\C", "\\");
System.out.println("parser4:" + parser4.getValue(1));
}

Output
For the first and second parser there is no surprise: the second value is ‘B’ and that’s exactly what gets printed. The third one, instead of the second value, prints ‘A’, the first one… (On Java 8 and later it prints ‘|’ instead, which is just as wrong.) If that’s not strange enough, the last parser throws an exception! That’s really unexpected!!

So where’s the catch? What’s wrong? Some of you already knew it, some probably start to suspect it… It’s all because of the String.split() method: instead of taking a separator String as a parameter (which I tried to silently imply in the code) it takes a regular expression. Because of that the last two parsers failed: both the pipe and the backslash have a special meaning in Java regexps!
Mystery solved, so the problem is gone… or is it? Of course you might be tempted just to fix the snippet above by writing the regexps correctly, and that would be fine for this code. Now go home and check your code: do you use user-provided values in String.split()? What about String.replaceAll()? If you do, you might be in real trouble… The real lesson is that some of the String methods take plain Strings as a parameter (e.g. String.regionMatches()) while others expect a String with a regular expression (e.g. String.matches()). Beware and double check!
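One way to make a parser like this safe for arbitrary separators is to quote the separator before handing it to split(). The sketch below is a variation of the LineParser class from this post, renamed `LiteralLineParser` (a made-up name) to keep the two apart; `Pattern.quote` is a standard JDK method.

```java
import java.util.regex.Pattern;

class LiteralLineParser {
    private final String[] values;

    LiteralLineParser(String line, String separator) {
        // Pattern.quote turns the separator into a literal pattern,
        // so characters like | and \ lose their regex meaning.
        values = line.split(Pattern.quote(separator));
    }

    String getValue(int index) {
        return values[index];
    }
}
```

With this change the pipe and backslash examples both return ‘B’ for index 1.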


Checking whether string is parseable to integer or double

This seems basic, right? In most cases it is, but like almost everything in Java this problem has its subtle pitfalls. That is mainly because Java does not provide a simple utility method that can answer this question. Today I want to share with you several ways of solving this problem and describe their good and bad sides.

Why should you care?

In many cases such a check is unnecessary. If the format of the data is defined and its contract states that the string is an integer, you can just parse it and deal with the unlikely exception if an error occurs. The problem arises when there is no such contract and you have to decide what to do next based on whether the string is an integer. In that case a plain try-catch check may be too expensive for you:

public boolean isInteger(String string) {
try {
Integer.valueOf(string);
return true;
} catch (NumberFormatException e) {
return false;
}
}

This method’s execution cost is high because of two factors: one is that to determine if string is an integer we have to do the whole parsing and throw away the result. Second is that we use exception throwing (which is expensive) to direct the program flow. The good thing about this code is its simplicity – you can at a glance say the method is correct.

Let’s use RegExp!

Much faster is to create a regular expression and use it to check whether string contains an integer or double. The good thing about this approach is that the regexp can be precompiled and used several times after:

private static Pattern doublePattern = Pattern.compile("-?\\d+(\\.\\d*)?");

public boolean isDouble(String string) {
return doublePattern.matcher(string).matches();
}

Unfortunately this method has important flaws: the pattern above works for the most basic string representation of a Double, but what about more advanced ones like “1.23E-12”? Even if you improve this pattern (believe me, it's difficult) there are still some checks that it will not be able to perform, for instance checking whether an integer is above Integer.MAX_VALUE.


What about Scanner?

There is a way of combining the two approaches shown above: first check with a regexp whether the string could possibly be an integer and, if it seems to be one, try to perform the actual parsing. If the regexp is ‘good enough’, the number of false positives resulting in a NumberFormatException will be acceptable. The good news is that this approach is already implemented by the Scanner class. See the following example:

public static void main(String[] args) {
Scanner scanner = new Scanner("Test string: 12.3 dog 12345 cat 1.2E-3");

while (scanner.hasNext()) {
if (scanner.hasNextDouble()) {
Double doubleValue = scanner.nextDouble();
} else {
String stringValue = scanner.next();
}
}
}

In essence Scanner breaks down the given string into tokens around whitespace and allows you to iterate through them. It gives you useful access methods like ‘hasNextDouble()’ to check whether the next token is a Double or not, and lets you get it in parsed form as a Double with the ‘nextDouble()’ method.

The internals of Scanner show that it in fact combines both the regexp and the exception catching methods, which makes it quite efficient. The downside is that the Scanner object itself is heavy and designed for parsing larger texts, so it may be inefficient if you need to use it on simple strings like “123”.

Wait! It does not work for me!!

It is possible that you start using one of the methods above on real-life data and at some point things stop making sense… Why? Because we forgot about something important: the string representation of numbers is locale-sensitive and differs from country to country. For instance ten thousand in the US is 10,000, in Poland 10 000 and in Italy 10.000. Note that none of the methods above can successfully parse either Polish or Italian numbers! What can you do in those cases? You have to parse with a NumberFormat for the specified locale:

private static final NumberFormat italianDouble =
        NumberFormat.getNumberInstance(Locale.ITALIAN);

public boolean isItalianDouble(String string) {
    try {
        // parse(String) throws ParseException when no number can be read
        italianDouble.parse(string);
        return true;
    } catch (ParseException e) {
        return false;
    }
}

Now you can finally see that 10.000 is a valid number. Unfortunately with NumberFormat you get another set of problems: it is too liberal in parsing numbers! The method above returns true for 10.000 and false for both abc and x1, but it also returns true for 10abc, because it only requires a parseable prefix of the string, not a total match.
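One way to reject inputs such as 10abc is to parse with a ParsePosition and require that the whole string was consumed. The following is only a sketch: the class `StrictNumberCheck` and the method `isNumber` are made-up names, while `NumberFormat.parse(String, ParsePosition)` is part of the JDK.

```java
import java.text.NumberFormat;
import java.text.ParsePosition;
import java.util.Locale;

class StrictNumberCheck {
    // Returns true only when the parser consumes the entire string.
    static boolean isNumber(String string, Locale locale) {
        if (string == null || string.isEmpty()) {
            return false;
        }
        // A fresh instance per call, because NumberFormat is not thread-safe.
        NumberFormat format = NumberFormat.getNumberInstance(locale);
        ParsePosition position = new ParsePosition(0);
        Number parsed = format.parse(string, position);
        // parsed is null when nothing could be read; the index shows
        // how far the parser got before it stopped.
        return parsed != null && position.getIndex() == string.length();
    }
}
```

This keeps the locale awareness of NumberFormat while turning the prefix match into a total match.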

Conclusion

As you can see, none of the solutions shown above is perfect: each of the methods above has its flaws and advantages. Because of that, the choice of which one is best for you strongly depends on the context of your program. The important factors are: how often you need to do a check like that, what the false result ratio is, whether you parse long human-readable text or just a few given values, and whether you care about locale-specific issues. It is also possible that in your code you’ll need a combination of them, or to add some specific tweaks to one of them.

String utility : Merging 2 string arrays

public static String[] mergeStringArrays(String[] array1, String[] array2) {
    if (array1 == null || array1.length == 0)
        return array2;
    if (array2 == null || array2.length == 0)
        return array1;
    List<String> array2List = Arrays.asList(array2);
    // Start from array1 and drop every element that also occurs in array2 ...
    List<String> result = new ArrayList<String>(Arrays.asList(array1));
    result.removeAll(array2List);
    // ... then append all of array2, so elements present in both arrays are not duplicated
    result.addAll(array2List);
    return result.toArray(new String[result.size()]);
}

String utility function : Converting int[] to String

Method 1 - Writing our own function

public static String toString(int[] intArray) {
    String separator = ",";
    StringBuilder sb = new StringBuilder();
    if (intArray != null && intArray.length > 0) {
        for (int i = 0; i < intArray.length; i++) {
            sb.append(intArray[i]);
            if (i < (intArray.length - 1)) {
                sb.append(separator);
            }
        }
    }
    return sb.toString();
}

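Method 2 - Using Arrays.toString()
Arrays.toString(intArray)

Note that Arrays.asList(intArray).toString() does not work for an int[]: the whole array becomes a single list element, so the output shows the array's identity (something of the form [[I@<hash>]) rather than the numbers. Arrays.toString(intArray) returns them in the form [1, 2, 3].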
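If a custom separator is needed without a hand-written loop, the stream API offers another route. The sketch below assumes Java 8 or later; `IntArrayJoin` and `join` are made-up names, and returning an empty string for null mirrors the hand-written method above.

```java
import java.util.Arrays;
import java.util.stream.Collectors;

class IntArrayJoin {
    // Joins the numbers with the given separator; null yields an empty string.
    static String join(int[] intArray, String separator) {
        if (intArray == null) {
            return "";
        }
        return Arrays.stream(intArray)          // IntStream over the values
                .mapToObj(String::valueOf)      // each int as its decimal text
                .collect(Collectors.joining(separator));
    }
}
```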

Some things to remember about saving memory with the String.intern method

Imagine that you have a flat file in CSV format with 100 million rows, from which you are about to read data to store and process in your app.
The data is in the format (orderId, storeIdentifier, amountDue).
What optimization can you do here?
Note that the storeIdentifier is going to be repeated a lot. Every time you read a record, split it into strings and possibly store them in an in-memory data structure, you create a new String object. So 100 million String objects will be created for the storeIdentifier. But you know that there are only (say) 100 stores in all! So there is a massive amount of wasted memory.
What you can do here is this: right after you have read the storeIdentifier string, do this -
storeIdentifier = storeIdentifier.intern();
That puts the store identifier into the String pool and keeps the number of String instances with the same data minimal: the first invocation for a given value adds the string to the pool, and every later invocation returns that pooled instance.
Points to Note
  1. Use intern() only if you really need to use it. And only if you know the extra instances are going to be a problem. And only if you really understand how it works.
  2. Older JVMs had a problem collecting interned strings. Newer JVMs handle this fine. Don’t worry about leaks due to a growing pool. If other references are gone, interned strings will be collected by the GC.
  3. Interned strings go into the PermGen Space area of memory in some JVMs. This is not part of the normal heap. If you send too many strings here, an OutOfMemoryError will hit you even though your heap may have several GB available.
  4. Interned strings can be compared with == rather than .equals(). This is a bit faster. But it is rarely worth the brittle code.
  5. Calling String.intern() can be a performance hit. It takes CPU cycles to maintain the pool and do the comparisons. Are you sure you are saving enough memory to make it worth the CPU? Measure. Don’t guess.
  6. Use String.intern() only if the set of possible Strings that will be interned has a bound tight enough such that the set of different strings is much smaller than the total number of strings that will be read.
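The effect of intern() on object identity can be seen in a small sketch. The class `InternDemo`, its method names and the sample rows are invented for illustration; the comparison with == is used here only to show which calls return the same instance.

```java
class InternDemo {
    // Simulates reading the store identifier field from one CSV row.
    static String readStoreId(String row) {
        // split() creates a fresh String object for every field.
        return row.split(",")[1];
    }

    // Two rows with the same store yield two distinct String objects.
    static boolean sameInstanceWithoutIntern() {
        String a = readStoreId("1001,STORE-42,19.99");
        String b = readStoreId("1002,STORE-42,5.00");
        return a == b;
    }

    // After intern() both references point to the single pooled instance.
    static boolean sameInstanceWithIntern() {
        String a = readStoreId("1001,STORE-42,19.99").intern();
        String b = readStoreId("1002,STORE-42,5.00").intern();
        return a == b;
    }
}
```

Whether the saved memory is worth the cost of the pool lookups is exactly what point 5 asks you to measure.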

String utility : unquoting the string

This is another simple Java string utility method, which can be used to unquote a string value. This method takes care of single and double quotes both along with handling the null string. It returns the string as it is when the string is not quoted.

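public static String unquote(String s) {
    // the length check keeps a lone quote character from being treated as a quoted string
    if (s != null && s.length() >= 2
            && ((s.startsWith("\"") && s.endsWith("\""))
            || (s.startsWith("'") && s.endsWith("'")))) {
        s = s.substring(1, s.length() - 1);
    }
    return s;
}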

Generating random unique strings with JAVA

Method 1
If you need randomly generated Strings in your Java code you can use the functions below. This function uses the SecureRandom class to get its work done.
public String generateRandomString(String s) {
    try {
        SecureRandom prng = SecureRandom.getInstance("SHA1PRNG");
        String randomNum = Integer.toString(prng.nextInt());
        randomNum += s;
        MessageDigest sha = MessageDigest.getInstance("SHA-1");
        byte[] result = sha.digest(randomNum.getBytes());
        return hexEncode(result);
    } catch (NoSuchAlgorithmException e) {
        // fall back to a timestamp-based value: not random, only unlikely to repeat
        return System.currentTimeMillis() + "_" + s;
    }
}

The classical hexEncode method:
protected String hexEncode(byte[] aInput) {
StringBuffer result = new StringBuffer();
char[] digits = { '0', '1', '2', '3', '4', '5', '6', '7', '8', '9',
'a', 'b', 'c', 'd', 'e', 'f' };
for (int idx = 0; idx < aInput.length; ++idx) {
byte b = aInput[idx];
result.append(digits[(b & 0xf0) >> 4]);
result.append(digits[b & 0x0f]);
}
return result.toString();
}

Suppose that you need to generate unique random session ids for your logged in users. You can use the above function as follows :
String sessionId = generateRandomString(username);

Method 2 : Using Long.toHexString()
Long.toHexString(Double.doubleToLongBits(Math.random()));

Alternatively, draw characters from a fixed alphabet of digits and letters to get a random string of the desired length:
static final String AB = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ";
static Random rnd = new Random();

String randomString( int len )
{
StringBuilder sb = new StringBuilder( len );
for( int i = 0; i < len; i++ )
sb.append( AB.charAt( rnd.nextInt(AB.length()) ) );
return sb.toString();
}


Method 3

Apache classes also provide ways to generate random string using org.apache.commons.lang.RandomStringUtils (commons-lang).
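For values such as session ids, where predictability matters, java.util.Random and Math.random() are not designed to be unpredictable, so the alphabet-based approach from Method 2 is better paired with SecureRandom. The sketch below does that; the class `SecureRandomString` and the method `next` are made-up names.

```java
import java.security.SecureRandom;

class SecureRandomString {
    static final String ALPHABET = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ";
    // One shared generator; SecureRandom is safe to use from several threads.
    private static final SecureRandom RANDOM = new SecureRandom();

    // Builds a string of the requested length from the alphabet above.
    static String next(int len) {
        StringBuilder sb = new StringBuilder(len);
        for (int i = 0; i < len; i++) {
            sb.append(ALPHABET.charAt(RANDOM.nextInt(ALPHABET.length())));
        }
        return sb.toString();
    }
}
```

Random strings are not guaranteed to be unique; a longer length only makes collisions less likely.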



A short unique string identifier for shorten URL

How to shorten URLs?
As you probably know, there are a lot of short URL redirection services, such as "bit.ly", which convert a long URL into a short one.

The mechanism seems simple. A sequential key for each long URL is enough.
For example, "Wa0e" is the key in "http://bit.ly/Wa0e"
for "http://www.beachbody.com/product/fitness_programs/p90x.do?code=P90XDOTCOM".
The mod_rewrite module could be used to hide the file extension and parameters (e.g., short.php?key=Wa0e)

The key looks like a random string, but I guess it's just a sequential key, because that approach is simple and in the end covers the same total number of combinations as random keys would.

How to generate a sequential key? Below is my example code.

INDEX.length is 62, so (62^4 - 1) URLs can be stored using combinations of four characters.

private static String[] INDEX = new String[] { "0", "1", "2", "3", "4", "5",
"6", "7", "8", "9", "a", "b", "c", "d", "e", "f", "g", "h", "i", "j",
"k", "l", "m", "n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x",
"y", "z", "A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K", "L",
"M", "N", "O", "P", "Q", "R", "S", "T", "U", "V", "W", "X", "Y", "Z" };

private static String getNextURL(String in) {
if (in == null) {
return INDEX[0];
}

char[] result = new char[in.length()];
boolean rounded = false;

for (int i = in.length() - 1; i > -1; i--) {
String subStr = Character.toString(in.charAt(i));
try {
if (!rounded) {
result[i] = getNext(subStr).charAt(0);
rounded = true;
} else {
result[i] = in.charAt(i);
}
} catch (ArrayIndexOutOfBoundsException e) {
result[i] = INDEX[0].charAt(0);
}
}

return new String(result);
}

private static String getNext(String subStr) {
if (subStr.equals("Z")) {
throw new ArrayIndexOutOfBoundsException();
}

int x = 0;
for (int i = 0; i < INDEX.length; i++) {
if (INDEX[i].equals(subStr))
break;

x++;
}

return INDEX[x + 1];
}
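A different way to get sequential keys over the same 62 characters is to keep a numeric counter (for example a database id) and convert it to base 62. This is a sketch of that alternative, not the method used in the code above; `Base62` and `encode` are made-up names, and unlike getNextURL the key grows by one character when the current length is exhausted.

```java
class Base62 {
    // Same character order as the INDEX array: digits, lowercase, uppercase.
    static final String DIGITS =
            "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";

    // Converts a non-negative counter value into its base-62 key.
    static String encode(long id) {
        if (id < 0) {
            throw new IllegalArgumentException("id must be non-negative");
        }
        if (id == 0) {
            return "0";
        }
        StringBuilder sb = new StringBuilder();
        while (id > 0) {
            // The least significant digit comes first, so reverse at the end.
            sb.append(DIGITS.charAt((int) (id % 62)));
            id /= 62;
        }
        return sb.reverse().toString();
    }
}
```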




Saturday, 11 June 2011

String empty check is more easy now with JDK6 and StringUtils

Prior to JDK 6, we could check whether a string is empty in two ways:
if(s != null && s.length() == 0)

//OR

if(("").equals(s))

Support from JDK6
Checking its length is more readable and may be a little faster. Starting from JDK 6, String class has a new convenience method isEmpty():

boolean isEmpty()
Returns true if, and only if, length() is 0.

It is just a shorthand for checking the length. Of course, if the String is null, you will still get a NullPointerException.
I don't see much value in adding this convenience method. Instead,
I'd like to see a static utility method that also handles null values:

public static boolean notEmpty(String s) {
return (s != null && s.length() > 0);
}



Support from StringUtils
Another option is StringUtils.isEmpty(String str) from Apache Commons, which can be downloaded from http://commons.apache.org/

It also checks for a null string and returns true for null or empty:
public static boolean isEmpty(String str) {
    return str == null || str.length() == 0;
}


Insufficient memory problem with StringBuffer

Using StringBuffer without selecting the proper constructor can waste memory.

Let's have a look at the default constructor of StringBuffer:

Constructs a string buffer with no characters in it and an initial capacity of 16 characters.
public StringBuffer() {
super(16);
}


Suppose you are creating StringBuffer objects in a loop, and the number of objects may change depending upon the input. Every time, it will create an object with room for at least 16 characters. You may not need all 16, and in that case the remaining space stays unused and cannot be allocated for other purposes.

At some point this unused memory may contribute to an out-of-memory problem.

Instead of that we can use another constructor

Constructs a string buffer with no characters in it and the specified initial capacity.
public StringBuffer(int capacity) {
super(capacity);
}
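The effect of the two constructors can be observed through capacity(). This is a small sketch; the class `BufferCapacityDemo` and its methods are invented for the purpose.

```java
class BufferCapacityDemo {
    // Capacity reserved by the no-argument constructor.
    static int defaultCapacity() {
        return new StringBuffer().capacity();
    }

    // Capacity reserved when the expected size is passed in explicitly.
    static int sizedCapacity(int expectedLength) {
        return new StringBuffer(expectedLength).capacity();
    }
}
```

A buffer still grows automatically when more characters are appended than its capacity allows, so a small initial capacity does not limit the final length.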

Tuesday, 31 May 2011

How can final fields appear to change their values in old Java Memory Model

One of the best examples of how final fields' values can be seen to change involves one particular implementation of the String class.
The String class is immutable in Java, and so are wrapper classes like Integer and Double. String has a mutable counterpart, StringBuffer. One reason these classes are immutable is security: methods in IO take a String as a parameter, not a StringBuffer.
A String can be implemented as an object with three fields (the normal String class has even more, but consider three for now): a character array, an offset into that array, and a length. So they are:
final char[] buffer = new char[1000];//say 1000;
final int offset;
final int length;


The rationale for implementing String this way, instead of having only the character array, is that it lets multiple String and StringBuffer objects share the same character array and avoid additional object allocation and copying.

Suppose now you call substring() method on this String. A new string is returned by this call.But String.substring() can be implemented by creating a new string which shares the same character array with the original String and merely differs in the length and offset fields. For a String, these fields are all final fields. See - How substring works in Java?

String s1="Kinshuk";
String s2=s1.substring(4);//returns string huk


This is the one-parameter substring method, which takes a beginIndex. The substring starts from index 4 (numbering starts from 0), which is h.

So what happens now? Since substring() returns a new String, it calls the String constructor. The constructor in turn calls the Object class constructor; the JVM allocates memory, performs default initialization, and returns the address (or reference) of the new string. But as I said, the array remains the same and just the offset and length change. Due to default initialization, length and offset initially have the value 0. So it is possible for another thread, say T2, to see the substring s2 as empty, because offset and length are still 0 from default initialization rather than 4 and 3 respectively. Another case: the thread reads length correctly but not offset. So it may sometimes read Kin and sometimes huk.

The original Java Memory Model allowed this behavior; several JVMs have exhibited this behavior. The new Java Memory Model makes this illegal.

So what is the solution for this?

Some say to make the constructor synchronized, but this is not a proper solution. Instead, Java came up with a new JMM that solved some of these issues.
We'll see this here - How does final fields work under new JVM?

Sunday, 22 May 2011

String concatenation operator in java

String concatenation operator is +.
eg. "ab" + "cd" = "abcd"

i.e.

String s1 = "ab";
String s2 = "cd";
String s3 = s1+s2;

Saturday, 14 May 2011

How substring() works in java?

substring() is a function to get substring from string. A String class is immutable in java.
A String can be implemented as an object with three fields ( normal string class have even more, but consider 3 for now) -- a character array, an offset into that array, and a length.
Consider the following code
String s1 = "Monday";
String s = s1.substring(0,3);
or
s1.substring(0,3).equals("Mon");

substring() creates a new String and returns it. But substring is clever. It does not make a deep copy of the substring the way most languages do. It just creates a pointer into the original immutable String, i.e. it points to the value char[] of the base string, and tracks the offset where the substring starts and the count of how long the substring is. So only length and offset differ per string, while the character array is shared with the original string. This is shown in the figure:
[Figure: substring-buffer-java]


The downside of this cleverness is a tiny substring of a giant base String could suppress garbage collection of that big String in memory even if the whole String were no longer needed. (actually its value char[] array is held in RAM; the String object itself could be collected.)
If you know a tiny substring is holding a giant string in RAM, that would otherwise be garbage collected, you can break the bond by using

String s = new String(s1.substring(0,3));

Saturday, 30 April 2011

Difference between concat and append function

The concat function is present in the String class, but the append function is present in the StringBuffer/StringBuilder classes.
The concat function concatenates the string on which it is invoked with the string passed as a parameter. The result returned is a new String with both strings concatenated.


The append function appends a String to the character sequence represented by the StringBuffer/StringBuilder object on which the append() method is invoked. The result returned is the same StringBuffer/StringBuilder object on which the method was invoked.
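The difference in what the two methods return can be checked with a short sketch. The class `ConcatAppendDemo` and its method names are made up for illustration.

```java
class ConcatAppendDemo {
    // concat() leaves the receiver unchanged and returns a different String.
    static boolean concatReturnsNewString() {
        String a = "ab";
        String b = a.concat("cd");
        return b.equals("abcd") && a.equals("ab") && a != b;
    }

    // append() modifies the builder in place and returns that same builder.
    static boolean appendReturnsSameBuilder() {
        StringBuilder sb = new StringBuilder("ab");
        StringBuilder returned = sb.append("cd");
        return returned == sb && sb.toString().equals("abcd");
    }
}
```

Because append() returns the same object, calls can be chained, as in sb.append("a").append("b").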

How to convert a String array to ArrayList?

Method 1:

String[] words = {"ace", "boom", "crew", "dog", "eon"};
List<String> wordList = Arrays.asList(words);


Method 2:


List<String> list = new ArrayList<String>(words.length);
for (String s : words) {
list.add(s);
}



Method 3:


List<String> myList = new ArrayList<String>();
Collections.addAll(myList, words);


Note: In the method no 1 Arrays.asList() is efficient because it doesn't need to copy the content of the array. This method returns a List that is a "view" onto the array - a wrapper that makes the array look like a list. When you change an element in the list, the element in the original array is also changed. Note that the list is fixed size - if you try to add elements to the list, you'll get an exception.



Use method 1 only if you need read access to the array as if it were a List and you don't want to add or remove elements.
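The view behaviour of Arrays.asList() can be demonstrated with a short sketch. The class `AsListViewDemo` and its method names are made up; the last method shows the usual way to get an independent, resizable list when one is needed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class AsListViewDemo {
    // Changing an element through the list changes the backing array too.
    static boolean writesThrough() {
        String[] words = {"ace", "boom"};
        List<String> view = Arrays.asList(words);
        view.set(0, "axe");
        return words[0].equals("axe");
    }

    // The list has a fixed size, so add() is rejected.
    static boolean addIsRejected() {
        List<String> view = Arrays.asList("ace", "boom");
        try {
            view.add("crew");
            return false;
        } catch (UnsupportedOperationException e) {
            return true;
        }
    }

    // Copying into an ArrayList gives a list that can grow and shrink.
    static List<String> independentCopy(String[] words) {
        return new ArrayList<String>(Arrays.asList(words));
    }
}
```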

Thursday, 28 April 2011

Easy Java Bean toString() using BeanUtils

It's always good to write a toString() function for a bean, but doing it manually, concatenating the fields of the class one by one, is really tedious for lazy developers. This code snippet using Apache Commons (a.k.a. Jakarta Commons) is very helpful for just such occasions:
public String toString() {
try {
return BeanUtils.describe(this).toString();
} catch (Exception e) {
Logger.getLogger(this.getClass()).error("Error converting object to String", e);
}
return super.toString();
}

How Many String Objects Are Created ?

Consider the code fragment shown below:

String s1="abc";
String s2="def";
System.out.println(s1 + " " + s2);


The question is how many String objects are created by the time the last line has executed.

To solve questions like this, one can use a decompiler (such as JAD, available as an Eclipse plugin) to see how the Java compiler actually translated that third line.

1) Any Java developer can easily tell that the first two lines create one object each, hence two objects in the first two lines.

2) The third line in the above code is interpreted by the JDK 1.5 compiler as:
System.out.println((new StringBuilder(String.valueOf(s1))).append(" ").append(s2).toString());

This clearly tells us that 2 more String objects (" " and "abc def") are created in the last line.
3) Hence we can say that 4 String objects and 1 StringBuilder object are created by that piece of code.

So the following code shows it all:
String s1 = "abc";
String s2 = "def";
System.out.println((new StringBuilder(String.valueOf(s1))).append(" ").append(s2).toString());