Java Rumblings: regex / regular expression


Wednesday, 29 June 2011

Beware of String functions - replaceAll and replace et al.

I was going through the replaceAll function. I had the following string:

String str = "com.vaani.src.dynamic.CompilationHelloWorld";

I had to replace all the dots with /, so I tried:

str = str.replaceAll(".", "/");

But what I got was a string converted to ///////////////////////////////////////////

The reason is simple. replaceAll treats its first argument as a regex, and in a regex a dot (.) matches any character, so every character was replaced with /.
So I solved it by putting \\ before the dot:

str = str.replaceAll("\\.", "/");

Another solution, because I am replacing a single character only, is to use the replace method, which does not use regex at all. It accepts char arguments in '' quotes (replaceAll only accepts Strings in ""), i.e.
you can do the following:

str = str.replace('.', '/');

We discussed something similar here as well.
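To make the difference concrete, here is a minimal sketch comparing the literal and the regex-based replacement. The class and method names (DotReplaceDemo, toPath, toPathWithRegex) are made up for this illustration.

```java
class DotReplaceDemo {
    // Literal replacement: String.replace never interprets its arguments as regex.
    static String toPath(String className) {
        return className.replace('.', '/');
    }

    // Regex replacement: the dot must be escaped to be taken literally.
    static String toPathWithRegex(String className) {
        return className.replaceAll("\\.", "/");
    }

    public static void main(String[] args) {
        String str = "com.vaani.src.dynamic.CompilationHelloWorld";
        System.out.println(toPath(str));
        System.out.println(toPathWithRegex(str));
        // An unescaped dot matches every character, so only slashes remain.
        System.out.println(str.replaceAll(".", "/"));
    }
}
```

Both helper methods give the same result; the literal version is usually the safer default when no pattern matching is needed.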

Wednesday, 22 June 2011

Beware of String functions - they may use simple strings as regex

Some String functions may look simple and perform their task accordingly, but surprise you sometimes. E.g. consider the split method:

public class LineParser {
    private final String[] values;

    public LineParser(String line, String separator) {
        values = line.split(separator);
    }

    public String getValue(int index) {
        return values[index];
    }
}

It’s a simple class that encapsulates parsing a text line and stores the result. Let's see.

public static void main(String[] args) {
    LineParser parser1 = new LineParser("A,B,C", ",");
    System.out.println("parser1:" + parser1.getValue(1));

    LineParser parser2 = new LineParser("A B C", " ");
    System.out.println("parser2:" + parser2.getValue(1));

    LineParser parser3 = new LineParser("A|B|C", "|");
    System.out.println("parser3:" + parser3.getValue(1));

    LineParser parser4 = new LineParser("A\\B\\C", "\\");
    System.out.println("parser4:" + parser4.getValue(1));
}

Output
For the first and second parser there is no surprise: the second value is ‘B’ and that’s exactly what gets printed. The third one prints ‘A’, the first value, instead of the second (on Java 8 and later it prints ‘|’, because a zero-width match at the start of the input no longer produces a leading empty string). If that’s not strange enough, the last parser throws a PatternSyntaxException! That’s really unexpected!!

So where’s the catch? What’s wrong? Some of you already knew it, some probably start to suspect it… It’s all because of the String.split() method: instead of taking a separator String as a parameter (which I tried to silently imply in the code) it takes a regular expression. That is why the last two parsers failed: both the pipe and the backslash have a special meaning in Java regexps!
Mystery solved, so the problem is gone… is it really? Of course you might be tempted just to fix the snippet above by writing the regexps correctly, and that would be fine for this code. Now go home and check your code: do you use user-provided values in String.split()? What about String.replaceAll()? If you do, you might be in real trouble… The real lesson is that some of the String methods take plain Strings as a parameter (e.g. String.regionMatches()) while others expect a String with a regular expression (e.g. String.matches()). Beware and double check!
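One way to split on a separator that comes from outside is to quote it first. The sketch below uses Pattern.quote from the standard library; the class name LiteralSplit and the method splitLiteral are invented for the example.

```java
import java.util.regex.Pattern;

class LiteralSplit {
    // Pattern.quote wraps the separator so the regex engine treats every character literally.
    static String[] splitLiteral(String line, String separator) {
        return line.split(Pattern.quote(separator));
    }

    public static void main(String[] args) {
        System.out.println(splitLiteral("A|B|C", "|")[1]);     // pipe is no longer "or"
        System.out.println(splitLiteral("A\\B\\C", "\\")[1]);  // backslash is no longer an escape
    }
}
```

For String.replaceAll() the replacement text has its own special characters ($ and \), which Matcher.quoteReplacement() escapes in the same spirit.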

Checking whether string is parseable to integer or double

This seems basic, right? In most cases it is, but as almost everything in Java this problem has its subtle pitfalls and problems. It is mainly because Java does not provide a simple utility method that can answer this question. Today I wanted to share with you several ways of solving this problem and describe their good and bad sides.

Why should you care?

In many cases this check is unnecessary. If the format of the data is defined and its contract states that the string is an integer, you can just parse it and deal with the unlikely exception if an error occurs. The problem is when there is no such contract and you have to decide what to do next based on whether the string is an integer. In that case a plain try-catch check may be too expensive for you:

public boolean isInteger(String string) {
    try {
        Integer.valueOf(string);
        return true;
    } catch (NumberFormatException e) {
        return false;
    }
}

This method’s execution cost is high because of two factors: one is that to determine if string is an integer we have to do the whole parsing and throw away the result. Second is that we use exception throwing (which is expensive) to direct the program flow. The good thing about this code is its simplicity – you can at a glance say the method is correct.

Let’s use RegExp!

Much faster is to create a regular expression and use it to check whether string contains an integer or double. The good thing about this approach is that the regexp can be precompiled and used several times after:

private static Pattern doublePattern = Pattern.compile("-?\\d+(\\.\\d*)?");

public boolean isDouble(String string) {
    return doublePattern.matcher(string).matches();
}

Unfortunately this method has important flaws: the pattern above will work for the most basic string representation of a Double, but what about more advanced ones like "1.23E-12"? Even if you improve this pattern (believe me, it's difficult) there are still some checks that it will not be able to perform, for instance checking if the integer is above Integer.MAX_VALUE.

What about Scanner?

There is a way of combining the two approaches shown above: first check with a regexp whether the string could possibly be an integer and, if it seems to be one, try to perform the actual parsing. If the regexp is ‘good enough’, the number of false positives resulting in a NumberFormatException will be acceptable. The good news is that this approach is already implemented by the Scanner class. See the following example:

public static void main(String[] args) {
    Scanner scanner = new Scanner("Test string: 12.3 dog 12345 cat 1.2E-3");

    while (scanner.hasNext()) {
        if (scanner.hasNextDouble()) {
            Double doubleValue = scanner.nextDouble();
        } else {
            String stringValue = scanner.next();
        }
    }
}

In essence Scanner breaks down the given string into tokens around whitespace and allows you to iterate through them. It gives you useful access methods like ‘hasNextDouble()’ to check whether the next token is a Double or not, and allows you to get it in parsed form as a Double with the ‘nextDouble()’ method.

Internals of Scanner show that it in fact combines both the regexp and the exception catching methods, which makes it quite efficient. The downside is that the Scanner object itself is heavy and designed for parsing larger texts, so it may be inefficient if you need to use it on simple strings like "123".
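For short strings, the same "regexp first, parse second" idea can be sketched by hand without a Scanner. This is only an illustration under stated assumptions: the class IntegerCheck and the pattern INT_SHAPE are invented here, and the pattern is deliberately loose (a sign and up to ten digits), leaving the range check to Integer.parseInt.

```java
import java.util.regex.Pattern;

class IntegerCheck {
    // Cheap pre-check: optional sign followed by 1 to 10 ASCII digits.
    private static final Pattern INT_SHAPE = Pattern.compile("[-+]?\\d{1,10}");

    static boolean isInteger(String s) {
        if (s == null || !INT_SHAPE.matcher(s).matches()) {
            return false; // rejected without throwing an exception
        }
        try {
            Integer.parseInt(s); // the real parse catches out-of-range values
            return true;
        } catch (NumberFormatException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isInteger("12345"));
        System.out.println(isInteger("dog"));
        System.out.println(isInteger("2147483648")); // one above Integer.MAX_VALUE
    }
}
```

The exception path is only reached for digit strings that are too large, so most invalid input is rejected by the precompiled pattern alone.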

Wait! It does not work for me!!

It is possible that you start using one of the methods above on real-life data and at some point things stop making sense… Why? Because we forgot about something important: numbers are locale-sensitive and their string representation differs from country to country. For instance ten thousand is written 10,000 in the US, 10 000 in Poland and 10.000 in Italy. None of the methods above can parse either the Polish or the Italian form! What can you do in those cases? You have to parse with a NumberFormat for the specified locale:

private static final NumberFormat italianDouble =
        NumberFormat.getNumberInstance(Locale.ITALIAN);

public boolean isItalianDouble(String string) {
    try {
        // parse(String) throws java.text.ParseException instead of returning null
        italianDouble.parse(string);
        return true;
    } catch (ParseException e) {
        return false;
    }
}

Now you can finally see that 10.000 is a valid integer. Unfortunately with NumberFormat you get another set of problems: it is too liberal in parsing numbers! The method above returns true for 10.000 and false for both abc and x1, but it also returns true for 10abc, because parse() only requires a numeric prefix of the string, not a total match.
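One possible way around the prefix problem is the two-argument NumberFormat.parse(String, ParsePosition), which reports how far parsing got. The sketch below assumes that "the whole string was consumed" is the check you want; the class name StrictLocaleNumber and the method isNumber are hypothetical.

```java
import java.text.NumberFormat;
import java.text.ParsePosition;
import java.util.Locale;

class StrictLocaleNumber {
    static boolean isNumber(String s, Locale locale) {
        if (s == null || s.isEmpty()) {
            return false;
        }
        // A fresh instance per call, because NumberFormat objects are not thread-safe.
        NumberFormat format = NumberFormat.getNumberInstance(locale);
        ParsePosition position = new ParsePosition(0);
        Number parsed = format.parse(s, position);
        // Accept only if something was parsed and nothing was left over.
        return parsed != null && position.getIndex() == s.length();
    }

    public static void main(String[] args) {
        System.out.println(isNumber("10.000", Locale.ITALIAN));
        System.out.println(isNumber("10abc", Locale.ITALIAN));
        System.out.println(isNumber("abc", Locale.ITALIAN));
    }
}
```

This version still inherits the leniency of NumberFormat in other respects (for example, how strictly grouping separators are checked), so it is a tightening rather than a complete validator.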

Conclusion

As you can see, none of the solutions shown above is perfect: each of the methods above has its flaws and advantages. Because of that, the choice of which one is best for you strongly depends on the context of your program. The important factors are: how often you need to do a check like that, what the false result ratio is, whether you parse long human-readable text or just a few given values, and whether you care about locale-specific issues. It is also possible that in your code you’ll need a combination of them, or to add some specific tweaks to one of them.

Wednesday, 4 May 2011

Using Pattern and Matcher class in regular expression

Some commonly used regular expressions

Regex tutorial in java

Using The Pattern Class in regular expression matching

In Java, you compile a regular expression by using the Pattern.compile() class factory. This factory returns an object of type Pattern. E.g.:

Pattern myPattern = Pattern.compile("regex");

You can specify certain options as an optional second parameter.

Pattern.compile("regex", 
Pattern.CASE_INSENSITIVE 
| Pattern.DOTALL 
| Pattern.MULTILINE);

makes the regex case insensitive for ASCII characters, causes the dot to match line breaks and causes the start and end of string anchors to match at embedded line breaks as well.

When working with Unicode strings, specify Pattern.UNICODE_CASE if you want to make the regex case insensitive for all characters in all languages. You should always specify Pattern.CANON_EQ to ignore differences in Unicode encodings, unless you are sure your strings contain only ASCII characters and you want to increase performance.

If you will be using the same regular expression often in your source code, you should create a Pattern object to increase performance. Creating a Pattern object also allows you to pass matching options as a second parameter to the Pattern.compile() class factory. If you use one of the String methods instead (matches(), split(), replaceAll()), the only way to specify options is to embed a mode modifier into the regex. Putting (?i) at the start of the regex makes it case insensitive. (?m) is the equivalent of Pattern.MULTILINE, (?s) equals Pattern.DOTALL and (?u) is the same as Pattern.UNICODE_CASE. Unfortunately, Pattern.CANON_EQ does not have an embedded mode modifier equivalent.

Use myPattern.split("subject") to split the subject string using the compiled regular expression. This call has exactly the same results as myString.split("regex"). The difference is that the former is faster since the regex was already compiled.
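As a small sketch of the points above, the following compares a compile-time flag with the equivalent embedded modifier and reuses a precompiled pattern for splitting. The class FlagDemo and its fields are names chosen for this example only.

```java
import java.util.regex.Pattern;

class FlagDemo {
    // Compiled once and reused; the flag is passed as the second argument.
    private static final Pattern HELLO = Pattern.compile("hello", Pattern.CASE_INSENSITIVE);
    private static final Pattern COMMA = Pattern.compile("\\s*,\\s*");

    static boolean viaFlag(String s) {
        return HELLO.matcher(s).matches();
    }

    // String.matches() has no flag parameter, so the modifier is embedded in the regex.
    static boolean viaEmbeddedModifier(String s) {
        return s.matches("(?i)hello");
    }

    // Same result as line.split("\\s*,\\s*"), without recompiling the regex on every call.
    static String[] fields(String line) {
        return COMMA.split(line);
    }

    public static void main(String[] args) {
        System.out.println(viaFlag("HeLLo"));
        System.out.println(viaEmbeddedModifier("HELLO"));
        System.out.println(String.join("|", fields("a , b,c")));
    }
}
```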

Regular expression Using The Matcher Class

Except for splitting a string (see previous paragraph), you need to create a Matcher object from the Pattern object. The Matcher will do the actual work. The advantage of having two separate classes is that you can create many Matcher objects from a single Pattern object, and thus apply the regular expression to many subject strings simultaneously.

To create a Matcher object, simply call matcher() on your Pattern object like this:

Matcher myMatcher = myPattern.matcher("subject");

If you already created a Matcher object from the same pattern, call myMatcher.reset("newsubject") instead of creating a new matcher object, for reduced garbage and increased performance. Either way, myMatcher is now ready for duty.

To find the first match of the regex in the subject string, call myMatcher.find(). To find the next match, call myMatcher.find() again. When myMatcher.find() returns false, there are no further matches. According to the Javadoc, a find() call that follows a failed one starts again from the beginning of the input, but if you want to restart the search, calling myMatcher.reset() explicitly is the safer choice.

The Matcher object holds the results of the last match. Call its methods start(), end() and group() to get details about the entire regex match and the matches between capturing parentheses. Each of these methods accepts a single int parameter indicating the number of the capturing group. Omit the parameter to get information about the entire regex match. start() is the index of the first character in the match. end() is the index of the first character after the match. Both are relative to the start of the subject string. So the length of the match is end() - start(). group() returns the string matched by the regular expression or pair of capturing parentheses.

myMatcher.replaceAll("replacement") has exactly the same results as myString.replaceAll("regex", "replacement"). Again, the difference is speed.
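The find(), start(), end() and group() calls described above can be sketched as follows. The pattern for key=value pairs and the names GroupDemo and describe are assumptions made for the illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupDemo {
    // Group 1 captures the key, group 2 the numeric value.
    private static final Pattern PAIR = Pattern.compile("(\\w+)=(\\d+)");

    static List<String> describe(String input) {
        List<String> result = new ArrayList<>();
        Matcher matcher = PAIR.matcher(input);
        while (matcher.find()) {
            // start() and end() without arguments refer to the entire match.
            result.add(matcher.group(1) + "->" + matcher.group(2)
                    + "@" + matcher.start() + "-" + matcher.end());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(describe("a=1, bb=22")); // [a->1@0-3, bb->22@5-10]
    }
}
```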

The Matcher class allows you to do a search-and-replace and compute the replacement text for each regex match in your own code. You can do this with the appendReplacement() and appendTail() methods. Here is how:

StringBuffer myStringBuffer = new StringBuffer();
Matcher myMatcher = myPattern.matcher("subject");
while (myMatcher.find()) {
    if (checkIfThisMatchShouldBeReplaced()) {
        // Appends the text since the previous replacement, then the replacement itself.
        myMatcher.appendReplacement(myStringBuffer, computeReplacementString());
    }
}
// Appends whatever follows the last replacement, including any skipped matches.
myMatcher.appendTail(myStringBuffer);

Obviously, checkIfThisMatchShouldBeReplaced() and computeReplacementString() are placeholders for methods that you supply. The first returns true or false indicating if a replacement should be made at all. Note that skipping replacements is way faster than replacing a match with exactly the same text as was matched. computeReplacementString() returns the actual replacement string.
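Here is a runnable sketch of the same loop with concrete stand-ins for the two placeholders: every number is replaced by its doubled value. The class AppendReplacementDemo and the method doubleNumbers are invented for the example, and it assumes the numbers fit into an int.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AppendReplacementDemo {
    private static final Pattern NUMBER = Pattern.compile("\\d+");

    static String doubleNumbers(String input) {
        Matcher matcher = NUMBER.matcher(input);
        StringBuffer buffer = new StringBuffer();
        while (matcher.find()) {
            // Assumption: the matched digits fit into an int.
            int value = Integer.parseInt(matcher.group());
            // quoteReplacement guards against $ and \ in computed replacement text.
            matcher.appendReplacement(buffer, Matcher.quoteReplacement(String.valueOf(value * 2)));
        }
        matcher.appendTail(buffer);
        return buffer.toString();
    }

    public static void main(String[] args) {
        System.out.println(doubleNumbers("3 apples and 10 pears")); // 6 apples and 20 pears
    }
}
```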

Example

public static boolean checkDate(String date, boolean isEnglish) {
    // Month 01-12. The earlier "[0-1][1-9]" rejected 10 and accepted 13-19.
    String monthExpression = "(0[1-9]|1[0-2])";
    String dayExpression = "(0[1-9]|[12][0-9]|3[01])";
    // Optional separator: hyphen, space or slash.
    String separator = "[- /]?";
    String yearExpression = "(18|19|20|21)\\d{2}";
    // RegEx to validate date in US format: month, day, year.
    String expression = monthExpression + separator + dayExpression + separator + yearExpression;
    if (isEnglish) {
        // RegEx to validate date in day-first format: day, month, year.
        expression = dayExpression + separator + monthExpression + separator + yearExpression;
    }
    // matches() must cover the whole input, so no ^ or $ anchors are needed.
    return Pattern.compile(expression).matcher(date).matches();
}

Validating email address with java regex

/** isEmailValid: Validate email address using Java regex.
 * This method checks if the input string is a valid email address.
 * @param email String. Email address to validate
 * @return boolean: true if email address is valid, false otherwise.
 */
public static boolean isEmailValid(String email) {
    /*
    Email format: a valid email address has the following format:
        [\\w\\.-]+ : begins with word characters (may include periods and hyphens).
        @ : an '@' symbol must follow the initial characters.
        ([\\w\\-]+\\.)+ : the '@' must be followed by more word characters (may include hyphens),
            with a "." to separate domain and subdomain names.
        [A-Z]{2,4}$ : must end with two to four letters
            (this allows domain names with 2, 3 and 4 characters, e.g. pa, com, net, wxyz).

    Examples: the following email addresses pass validation:
        abc@xyz.net; ab.c@tx.gov
    */
    String expression = "^[\\w\\.-]+@([\\w\\-]+\\.)+[A-Z]{2,4}$";
    // Make the comparison case-insensitive.
    Pattern pattern = Pattern.compile(expression, Pattern.CASE_INSENSITIVE);
    return pattern.matcher(email).matches();
}

Way to check if a java string is a number

We have seen here how to convert a Java string to a number. The conversion itself is one way of checking whether the given string is a number or not. We can simply write:

int age = Integer.parseInt(ageString);

But there is a possibility that the user entered an invalid number. Perhaps they entered "thirty" as their age.
If you try to convert "thirty" to a number you will get a NumberFormatException. One way to deal with this is to catch and handle the NumberFormatException, but this is not the most elegant way to check the input.

Another approach is to validate the input string with a Java regular expression before performing the conversion. I like this second approach because it is more elegant and keeps your code clean.

So here is a function which takes a String and checks whether it is a number or not using a regular expression:

public static boolean isStringANumber(String number) {
    /* Explanation:
         [-+]?   : an optional - or + sign at the beginning.
         [0-9]*  : any number of digits between 0 and 9.
         \\.?    : an optional decimal point.
         [0-9]+$ : the string must end with at least one digit.
       If you want to consider x. as a valid number, change
       the expression as follows (but I treat this as an invalid number).
       Note that this variant also accepts strings such as "1..2":
         String expression = "[-+]?[0-9]*\\.?[0-9\\.]+$";
    */
    String expression = "[-+]?[0-9]*\\.?[0-9]+$";
    Pattern pattern = Pattern.compile(expression);
    return pattern.matcher(number).matches();
}

Java regular expressions with java.util.regex

The java.util.regex package primarily consists of three classes: Pattern, Matcher, and PatternSyntaxException.

A Pattern object is a compiled representation of a regular expression. The Pattern class provides no public constructors. To create a pattern, you must first invoke one of its public static compile methods, which will then return a Pattern object. These methods accept a regular expression as the first argument; the first few lessons of this trail will teach you the required syntax.
A Matcher object is the engine that interprets the pattern and performs match operations against an input string. Like the Pattern class, Matcher defines no public constructors. You obtain a Matcher object by invoking the matcher method on a Pattern object.
A PatternSyntaxException object is an unchecked exception that indicates a syntax error in a regular expression pattern.
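A brief sketch of how the three classes interact: Pattern.compile() either returns a Pattern or throws a PatternSyntaxException, which can be used to check whether a string is a valid regex at all. The class name RegexValidity and the method isValidRegex are hypothetical.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

class RegexValidity {
    static boolean isValidRegex(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            // Unchecked exception; getDescription() and getIndex() tell what went wrong and where.
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isValidRegex("a+"));
        System.out.println(isValidRegex("\\"));    // a lone backslash is an incomplete escape
        System.out.println(isValidRegex("[a-Z]")); // illegal character range
    }
}
```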

Tuesday, 3 May 2011

Using Regular Expressions with String.matches()

s.matches("regex") evaluates to true if the WHOLE string s can be matched by the regex.

Now the regex will contain character classes to define the pattern to be searched. These character classes are defined over here. With their help we can see some simple regular expression handling in Java.

So to match bad or Bad we can use the regex "[Bb]ad":

str.matches("[Bb]ad");

To match alternatives for a whole string, we use the pipe:

str.matches("bad|ugly");

Using ranges

To match a hexadecimal digit:

[0-9A-F]

To match a hexadecimal digit in a case-insensitive way:
[0-9A-Fa-f]
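As a quick illustration, the range-based class can be combined with the + quantifier to test whether a whole string consists of hexadecimal digits. HexCheck and isHex are names made up for this sketch.

```java
class HexCheck {
    // One or more characters, each a digit or a letter a-f in either case.
    static boolean isHex(String s) {
        return s.matches("[0-9A-Fa-f]+");
    }

    public static void main(String[] args) {
        System.out.println(isHex("CAFE"));
        System.out.println(isHex("cafe42"));
        System.out.println(isHex("xyz"));
    }
}
```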

Here are the complete examples:

public class StringMatcher {
    // Returns true if the string matches exactly "true"
    public static boolean isTrue(String s) {
        return s.matches("true");
    }

    // Returns true if the string matches exactly "true" or "True"
    public static boolean isTrueVersion2(String s) {
        return s.matches("[tT]rue");
    }

    // Returns true if the string matches exactly "true" or "True"
    // or "yes" or "Yes"
    public static boolean isTrueOrYes(String s) {
        return s.matches("[tT]rue|[yY]es");
    }

    // Returns true if the string contains "true" anywhere
    public static boolean containsTrue(String s) {
        return s.matches(".*true.*");
    }

    // Returns true if the string consists of exactly three letters
    public static boolean isThreeLetters(String s) {
        return s.matches("[a-zA-Z]{3}");
        // Longer equivalent form (note that [a-Z] is not a legal range):
        //        return s.matches("[a-zA-Z][a-zA-Z][a-zA-Z]");
    }

    // Returns true if the string is non-empty and does not start with a digit
    public static boolean isNoNumberAtBeginning(String s) {
        return s.matches("^[^\\d].*");
    }

    // Returns true if the string consists only of word characters other than b
    public static boolean isIntersection(String s) {
        return s.matches("([\\w&&[^b]])*");
    }

    // Returns true if the string contains a number less than 300
    public static boolean isLessThenThreeHundret(String s) {
        return s.matches("[^0-9]*[12]?[0-9]{1,2}[^0-9]*");
    }
}

Using Regular expressions with String in java

Class String provides several methods for performing regular expression operations.
Three methods provided by String are:

s.matches("regex")
s.split("regex")
s.replaceAll("regex", "replacement")

Basic regex expressions with String.matches()

Regular Expression introduction in java

Regular expressions are sequences of characters and symbols that define a set of strings. They are useful for validating input and ensuring that data is in a particular format. For example, a ZIP code must consist of five digits, and a last name must contain only letters, spaces, apostrophes and hyphens.

Operations of Regular expression

There are various operations which can be performed with help of regular expressions:

Searching
Splitting expressions after searching the regex
Replacing expression with other expression at which regex matches
Counting the number of time regex is found in expression

Regular expressions helping symbols

Let X and Z be two regexes to be searched.

Symbol	Description
.	Matches any character
^X	X must match at the beginning of the line
X$	X must match at the end of the line
[abc]	Set definition, matches the letter a or b or c. Note that it matches only 1 character.
[^abc]	When a "^" appears as the first character inside [], it negates the set. This matches any single character except a or b or c
[abc[vz]]	Union of sets, matches a single character that is a or b or c or v or z
[a-d]	Range between a and d: a, b, c, d. The range is inclusive, so it includes a and d as well.
[a-d1-3]	Ranges combined: a letter between a and d, or a digit between 1 and 3, i.e. 1, 2, 3
X|Z	Finds X or Z
XZ	Finds X directly followed by Z
$	Checks if a line end follows

Also java supports predefined patterns as well as quantifiers. Read here for more on this.
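A short sketch to check a few of these symbols in practice. In Java a nested class such as [abc[vz]] is a union, so it matches exactly one character, and ranges are inclusive at both ends. The class CharClassDemo and its method names are invented for the illustration.

```java
class CharClassDemo {
    // Union: a single character out of a, b, c, v, z.
    static boolean isOneOfUnion(String s) {
        return s.matches("[abc[vz]]");
    }

    // Two ranges in one class: a letter a-d or a digit 1-3.
    static boolean isInRanges(String s) {
        return s.matches("[a-d1-3]");
    }

    // Negated set: any single character except a, b, c.
    static boolean isNotAbc(String s) {
        return s.matches("[^abc]");
    }

    public static void main(String[] args) {
        System.out.println(isOneOfUnion("v"));  // a single character from the union
        System.out.println(isOneOfUnion("av")); // two characters, so no whole-string match
        System.out.println(isInRanges("3"));
        System.out.println(isInRanges("4"));
        System.out.println(isNotAbc("x"));
    }
}
```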

Built-in support for Regex with String in Java
Class String provides several methods for performing regular expression operations.
Three methods provided by String are:

s.matches("regex")
s.split("regex")
s.replaceAll("regex", "replacement")

matches() evaluates to true if the WHOLE string s can be matched by "regex".
split() creates an array with substrings of s divided at each occurrence of "regex". The text matched by "regex" is not included in the result.
replaceAll() replaces every match of "regex" with "replacement". (The similarly named replace() treats its first argument as literal text, not as a regex.)
See here for regular expressions with strings in java.

Using Pattern and Matcher class in regular expression
For advanced regular expressions the classes java.util.regex.Pattern and java.util.regex.Matcher are used.
See here for Pattern class and here for Matcher class.

The following steps are followed to get regular expression matches in the text.
1. Compile the pattern
2. Use a Matcher object to perform operations like find, group, replaceFirst, replaceAll.

Example

String source = "hello mr. DJ, i like only PJ";

Pattern pattern = Pattern.compile("\\w+");
// In case you would like to ignore case sensitivity you could use this
// statement
// Pattern pattern = Pattern.compile("\\w+", Pattern.CASE_INSENSITIVE);
Matcher matcher = pattern.matcher(source);
// Check all occurrences
while (matcher.find()) {
        System.out.print("Start index: " + matcher.start());
        System.out.print(" End index: " + matcher.end() + " ");
        System.out.println(matcher.group());
 }

Also see some regex examples.

Common symbols to represent Regex pattern in java

Regular expressions helping symbols

Let X and Z be two regexes to be searched.

Symbol	Description
.	Matches any character
^X	X must match at the beginning of the line
X$	X must match at the end of the line
[abc]	Set definition, matches the letter a or b or c. Note that it matches only 1 character.
[^abc]	When a "^" appears as the first character inside [], it negates the set. This matches any single character except a or b or c
[abc[vz]]	Union of sets, matches a single character that is a or b or c or v or z
[a-d]	Range between a and d: a, b, c, d. The range is inclusive, so it includes a and d as well.
[a-d1-3]	Ranges combined: a letter between a and d, or a digit between 1 and 3, i.e. 1, 2, 3
X|Z	Finds X or Z
XZ	Finds X directly followed by Z
$	Checks if a line end follows

Common predefined patterns

\d	any digit
\D	any non digit
\w	any word character
\W	any non-word character
\s	any white space
\S	any non white space
\S+	One or more non-whitespace characters

Quantifiers

Symbol	Description	Example
*	Occurs zero or more times, short for {0,}	X* - finds zero or more letters X, .* - any character sequence
+	Occurs one or more times, short for {1,}	X+ - finds one or more letters X
?	Occurs zero or one time, short for {0,1}	X? - finds zero or exactly one letter X
{X}	Occurs exactly X times; {} applies to the preceding element	\d{3} - three digits, .{10} - any character sequence of length 10
{X,Y}	Occurs between X and Y times	\d{1,4} - \d must occur at least once and at most four times
*?	A ? after a quantifier makes it a "reluctant quantifier": it tries to find the smallest match.	.*? - the shortest possible character sequence
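The difference between a greedy and a reluctant quantifier is easiest to see side by side. The following is a minimal sketch; QuantifierDemo and firstMatch are names chosen for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuantifierDemo {
    // Returns the first match of the regex in the input, or null if there is none.
    static String firstMatch(String regex, String input) {
        Matcher matcher = Pattern.compile(regex).matcher(input);
        return matcher.find() ? matcher.group() : null;
    }

    public static void main(String[] args) {
        // Greedy: .+ grabs as much as possible and backs off to the last '>'.
        System.out.println(firstMatch("<.+>", "<a><b>"));  // <a><b>
        // Reluctant: .+? stops at the first '>' that completes a match.
        System.out.println(firstMatch("<.+?>", "<a><b>")); // <a>
    }
}
```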

Note:

The backslash is an escape character in Java Strings, i.e. the backslash has a predefined meaning in Java. You have to write "\\" to define a single backslash. If you want the regex \w, you must write "\\w" in your Java source. If you want to match a literal backslash, you have to type "\\\\", as \ is also an escape character in regular expressions.
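The double escaping can be tried out with a Windows-style path, where the separator is a literal backslash. This is a small sketch; BackslashDemo and splitPath are hypothetical names.

```java
class BackslashDemo {
    // "\\\\" in Java source is the two-character regex \\ which matches one literal backslash.
    static String[] splitPath(String path) {
        return path.split("\\\\");
    }

    public static void main(String[] args) {
        // "C:\\dir\\file" in Java source is the string C:\dir\file
        System.out.println(String.join(" / ", splitPath("C:\\dir\\file")));
        // "\\w+" in Java source is the regex \w+
        System.out.println("hello_42".matches("\\w+"));
    }
}
```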

Applications of Regular expressions or regex

There are various applications of regular expressions. To list a few:

One application of regular expressions is to facilitate the construction of a compiler. Often, a large and complex regular expression is used to validate the syntax of a program. If the program code does not match the regular expression, the compiler knows that there is a syntax error within the code.
