SDSU CS 596 Java Programming
Fall Semester, 1998
Java Regular Expressions
© 1998, All Rights Reserved, SDSU & Roger Whitney
San Diego State University -- This page last updated 13-Nov-98

Contents of Doc 23, Java Regular Expressions


Reference:

http://www.cacas.org/~wes/java/syntax.html

Doc 23, Java Regular Expressions Slide # 2

Regular Expressions


A regular expression is a string in which some characters have special meanings. Regular expressions are used to search and replace text. This set of notes gives some examples of using the gnu.regexp Java classes that are part of the SDSU Java class library. The gnu.regexp library can also be downloaded at http://www.cacas.org/~wes/java/.

An on-line reference manual for regular expressions is at http://www.cs.utah.edu/csinfo/texinfo/regex/regex_toc.html. Mastering Regular Expressions, by Friedl (O'Reilly & Associates), is a text on how to use regular expressions. This set of notes is not meant to be a tutorial on regular expressions.

Using the gnu.regexp Package
In this example, we show how to use a regular expression to replace "at" by "ow". The substitute() method replaces the first instance of the pattern in the text. The substituteAll() method replaces all instances of the pattern in the text.

import gnu.regexp.REException;
import gnu.regexp.RE;
public class RegularExpression 
   {
   public static void main( String args[] ) throws REException 
      {
      String text = "cat sat that hat";
      String pattern = "at";
      String replacement = "ow";
      
      RE magic = new RE( pattern );
      
      String result = magic.substitute( text, replacement );
      System.out.println( result );
      
      String all = magic.substituteAll( text, replacement );
      System.out.println( all );
      }
   }
Output
cow sat that hat
cow sow thow how

Getting all Matches
import java.util.Enumeration;
import gnu.regexp.RE;
import gnu.regexp.REException;
import gnu.regexp.REMatch;

public class AccessingMatches  {
   public static void main( String args[] )  throws REException {
      String text = "cat sat that hat";
      String pattern = "at";
      String replacement = "ow";
      
      RE magic = new RE( pattern );
      // Show how to access all the matches
      REMatch[] allMatches = magic.getAllMatches( text );
      
      REMatch lastMatch = allMatches[ allMatches.length - 1 ];
      // Replace only the last match: keep the text before it unchanged,
      // then substitute starting at the index of the last match
      int patternIndex = lastMatch.getStartIndex();
      String last = text.substring( 0, patternIndex ) + 
         magic.substitute( text, replacement, patternIndex );
      System.out.println( last );
      
      // Visit the matches one at a time
      Enumeration matches = magic.getMatchEnumeration( text );
      while ( matches.hasMoreElements() ) {
         REMatch aMatch = (REMatch) matches.nextElement();
         System.out.println( "A match at location: " + 
               aMatch.getStartIndex() );
      }
   }
}
Output
cat sat that how
A match at location: 1
A match at location: 5
A match at location: 10
A match at location: 14
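
For readers who do not have gnu.regexp on the class path, the same two ideas (collecting the start of every match, and replacing only the last match) can be sketched with the standard java.util.regex package. That package is not the library these notes describe; it is used here only as a stand-in, and the class and method names are made up for the illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for AccessingMatches; uses java.util.regex, not gnu.regexp.
class MatchPositions {
    // Returns the start index of every match of pattern in text.
    static List<Integer> startIndexes(String text, String pattern) {
        List<Integer> starts = new ArrayList<>();
        Matcher matcher = Pattern.compile(pattern).matcher(text);
        while (matcher.find()) {
            starts.add(matcher.start());
        }
        return starts;
    }

    // Replaces only the last match; returns text unchanged if nothing matches.
    // The replacement is inserted literally (no $1 expansion).
    static String replaceLast(String text, String pattern, String replacement) {
        Matcher matcher = Pattern.compile(pattern).matcher(text);
        int start = -1;
        int end = -1;
        while (matcher.find()) {
            start = matcher.start();
            end = matcher.end();
        }
        if (start < 0) {
            return text;
        }
        return text.substring(0, start) + replacement + text.substring(end);
    }

    public static void main(String[] args) {
        String text = "cat sat that hat";
        System.out.println(startIndexes(text, "at"));      // [1, 5, 10, 14]
        System.out.println(replaceLast(text, "at", "ow")); // cat sat that how
    }
}
```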

Parameters of RE methods
The previous examples of RE methods (substitute, substituteAll, getAllMatches, getMatchEnumeration) pass a String as the text argument. The methods of RE can also take a char[], a StringBuffer or an InputStream for the text.
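
The idea of one pattern working over several kinds of text can be tried without gnu.regexp. The sketch below uses the standard java.util.regex package as a stand-in (an assumption of this example, not the library described here); it matches against any CharSequence, which covers String, StringBuffer, and a char[] wrapped in a CharBuffer. It does not cover InputStream input. The class and method names are hypothetical.

```java
import java.nio.CharBuffer;
import java.util.regex.Pattern;

// Stand-in sketch using java.util.regex; gnu.regexp itself is not used here.
class TextSources {
    static final Pattern AT = Pattern.compile("at");

    // Any CharSequence (String, StringBuffer, CharBuffer, ...) can be the text.
    static String replaceAt(CharSequence text) {
        return AT.matcher(text).replaceAll("ow");
    }

    public static void main(String[] args) {
        System.out.println(replaceAt("cat sat"));                    // String
        System.out.println(replaceAt(new StringBuffer("that hat"))); // StringBuffer
        char[] chars = { 'b', 'a', 't' };
        System.out.println(replaceAt(CharBuffer.wrap(chars)));       // char[] via CharBuffer
    }
}
```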

Regular Expression Dialects

The gnu.regexp package supports a number of different dialects of regular expressions. The dialects differ in which special characters they use. The following code segment shows how to create an RE object that uses a different dialect. The default dialect is Perl 5. See the class RESyntax for a list of supported dialects and an explanation of the second argument.

new RE( pattern, 0, RESyntax.RE_SYNTAX_EMACS );



Special Characters

The rest of the slides contain tables that show the special characters used in gnu.regexp. The tables and text are copied directly from http://www.cacas.org/~wes/java/syntax.html

To illustrate how some of the special characters operate, I added some examples. The examples were all produced using the following method. Note that the double quote character is added to the strings in the println calls to show where the strings start and end.

   public static void replaceAll(String text, 
                                 String pattern, 
                                 String replacement) throws REException
      {
      RE magic = new RE( pattern );
      String result = magic.substituteAll( text, replacement );
      System.out.println( "Text\t\"" + text + "\"" );
      System.out.println( "Pattern\t\"" + pattern + "\"");
      System.out.println( "Replacement\t\"" + replacement + "\"");
      System.out.println( "Result\t\"" + result + "\"");
      }
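
The patterns in the tables that follow are shown as the regular expression engine sees them. In Java source a backslash inside a string literal must itself be escaped, so the pattern \scat is written "\\scat". As a rough way to try the examples without gnu.regexp, here is a sketch of the same helper on top of the standard java.util.regex package. That package is a stand-in chosen for this sketch, not the library these notes describe, and its syntax differs from gnu.regexp in places. The class name is hypothetical, and unlike the method above this version also returns the result.

```java
import java.util.regex.Pattern;

// Hypothetical stand-in for the replaceAll method above, built on java.util.regex.
class ReplaceAllDemo {
    // Replaces every match of pattern in text and prints the four values.
    // $1, $2, ... in the replacement refer to saved subexpressions.
    static String replaceAll(String text, String pattern, String replacement) {
        String result = Pattern.compile(pattern).matcher(text).replaceAll(replacement);
        System.out.println("Text\t\"" + text + "\"");
        System.out.println("Pattern\t\"" + pattern + "\"");
        System.out.println("Replacement\t\"" + replacement + "\"");
        System.out.println("Result\t\"" + result + "\"");
        return result;
    }

    public static void main(String[] args) {
        // The pattern \scat needs a doubled backslash in Java source.
        replaceAll("cat cat cat", "\\scat", " dog");
    }
}
```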


Positional Operators

^     matches at the beginning of a line
$     matches at the end of a line
\A    matches the start of the entire string
\Z    matches the end of the entire string

Examples

Text          "cat cat cat"
Pattern       "cat$"
Replacement   "dog"
Result        "cat cat dog"

Text          "cat cat cat"
Pattern       "^cat"
Replacement   "dog"
Result        "dog cat cat"
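
The difference between the line anchors (^ and $) and the string anchors (\A and \Z) only shows up when the text contains more than one line. A minimal sketch, using the standard java.util.regex package as a stand-in for gnu.regexp (an assumption of this example): in that package ^ and $ match at line boundaries only when the MULTILINE flag is set, which may not be how gnu.regexp behaves by default. The class name is hypothetical.

```java
import java.util.regex.Pattern;

// Stand-in sketch using java.util.regex with MULTILINE switched on.
class AnchorDemo {
    static String replaceAll(String text, String pattern, String replacement) {
        return Pattern.compile(pattern, Pattern.MULTILINE)
                      .matcher(text)
                      .replaceAll(replacement);
    }

    public static void main(String[] args) {
        String text = "cat cat\ncat cat";   // two lines
        // ^ matches at the start of each line
        System.out.println(replaceAll(text, "^cat", "dog"));   // "dog cat\ndog cat"
        // \A matches only at the start of the whole string
        System.out.println(replaceAll(text, "\\Acat", "dog")); // "dog cat\ncat cat"
        // $ matches at the end of each line
        System.out.println(replaceAll(text, "cat$", "dog"));   // "cat dog\ncat dog"
        // \Z matches only at the end of the whole string
        System.out.println(replaceAll(text, "cat\\Z", "dog")); // "cat cat\ncat dog"
    }
}
```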


One-Character Operators

.     matches any single character
\d    matches any decimal digit
\D    matches any non-digit
\n    matches a newline character
\r    matches a return character
\s    matches any whitespace character
\S    matches any non-whitespace character
\t    matches a horizontal tab character
\w    matches any word (alphanumeric) character
\W    matches any non-word (alphanumeric) character
\x    matches the character x, if x is not one of the above listed escape sequences.

Examples
Text          "cat cat cat"
Pattern       "\scat"
Replacement   " dog"
Result        "cat dog dog"

Text          "cat bat sat"
Pattern       "\s.at"
Replacement   " dog"
Result        "cat dog dog"
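
A few more of the one-character operators can be tried with the sketch below. It uses the standard java.util.regex package as a stand-in for gnu.regexp (an assumption of this example; the class name is hypothetical). Note the contrast between . and \. in the last two calls: the escaped form matches only a literal period.

```java
import java.util.regex.Pattern;

// Stand-in sketch using java.util.regex; backslashes are doubled in Java source.
class OneCharDemo {
    static String replaceAll(String text, String pattern, String replacement) {
        return Pattern.compile(pattern).matcher(text).replaceAll(replacement);
    }

    public static void main(String[] args) {
        String text = "pi is 3.14";
        System.out.println(replaceAll(text, "\\d", "#")); // pi is #.##
        System.out.println(replaceAll(text, "\\s", "_")); // pi_is_3.14
        System.out.println(replaceAll(text, "\\W", "_")); // pi_is_3_14
        System.out.println(replaceAll(text, "\\.", ",")); // pi is 3,14
        System.out.println(replaceAll(text, ".", "*"));   // **********
    }
}
```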


Character Class Operator

[abc]     matches any character in the set a, b or c
[^abc]    matches any character not in the set a, b or c
[a-z]     matches any character in the range a to z, inclusive

A leading or trailing dash will be interpreted literally.

Examples
Text          "cat bat mat"
Pattern       "[cm]at"
Replacement   "dog"
Result        "dog bat dog"

Text          "cat bat mat"
Pattern       "[^bc]at"
Replacement   "dog"
Result        "cat bat dog"
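
Ranges, negated ranges and a literal dash can be illustrated the same way. This is a sketch on the standard java.util.regex package, used as a stand-in for gnu.regexp (an assumption of this example; the class name is hypothetical). Only the leading-dash case is shown.

```java
import java.util.regex.Pattern;

// Stand-in sketch using java.util.regex.
class CharClassDemo {
    static String replaceAll(String text, String pattern, String replacement) {
        return Pattern.compile(pattern).matcher(text).replaceAll(replacement);
    }

    public static void main(String[] args) {
        String text = "a-b c_d e5";
        // Range: any of a, b, c
        System.out.println(replaceAll(text, "[a-c]", "X"));   // X-X X_d e5
        // Negated class: anything that is not a lowercase letter or a space
        System.out.println(replaceAll(text, "[^a-z ]", "?")); // a?b c?d e?
        // A leading dash is a literal dash, not a range
        System.out.println(replaceAll(text, "[-_]", " "));    // a b c d e5
    }
}
```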

Within a character class expression, the following sequences have special meaning if the syntax bit RE_CHAR_CLASSES is on:

[:alnum:]     Any alphanumeric character
[:alpha:]     Any alphabetical character
[:blank:]     A space or horizontal tab
[:cntrl:]     A control character
[:digit:]     A decimal digit
[:graph:]     A non-space, non-control character
[:lower:]     A lowercase letter
[:print:]     Same as graph, but also space and tab
[:punct:]     A punctuation character
[:space:]     Any whitespace character, including newline and return
[:upper:]     An uppercase letter
[:xdigit:]    A valid hexadecimal digit
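
The named classes cannot be tried directly with the standard java.util.regex package, which spells them differently: it writes \p{Upper}, \p{Punct}, \p{Digit} and so on instead of [:upper:], [:punct:], [:digit:]. The sketch below uses that spelling purely as a stand-in to show what three of the classes select; it is not gnu.regexp syntax, and the class name is hypothetical.

```java
import java.util.regex.Pattern;

// Stand-in sketch: java.util.regex names for three of the POSIX classes.
class NamedClassDemo {
    static String replaceAll(String text, String pattern, String replacement) {
        return Pattern.compile(pattern).matcher(text).replaceAll(replacement);
    }

    public static void main(String[] args) {
        String text = "Hi, Bob! 42";
        System.out.println(replaceAll(text, "\\p{Upper}", "_")); // _i, _ob! 42
        System.out.println(replaceAll(text, "\\p{Punct}", ""));  // Hi Bob 42
        System.out.println(replaceAll(text, "\\p{Digit}", "#")); // Hi, Bob! ##
    }
}
```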


Branching (Alternation) Operator


a|b    matches whatever the expression a would match, or whatever the expression b would match.

Example
Text          "cat chet here het"
Pattern       "c(a|he)t"
Replacement   "dog"
Result        "dog dog here het"
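
The parentheses in c(a|he)t matter: without them the | operator splits the whole pattern in two. The sketch below compares the grouped and ungrouped forms on the same text, using the standard java.util.regex package as a stand-in for gnu.regexp (an assumption of this example; the class name is hypothetical).

```java
import java.util.regex.Pattern;

// Stand-in sketch using java.util.regex.
class AlternationDemo {
    static String replaceAll(String text, String pattern, String replacement) {
        return Pattern.compile(pattern).matcher(text).replaceAll(replacement);
    }

    public static void main(String[] args) {
        String text = "cat chet here het";
        // Grouped: "c", then "a" or "he", then "t"
        System.out.println(replaceAll(text, "c(a|he)t", "dog")); // dog dog here het
        // Ungrouped: either "ca" or "het"
        System.out.println(replaceAll(text, "ca|het", "dog"));   // dogt cdog here dog
        // Ungrouped: either "cat" or "het"
        System.out.println(replaceAll(text, "cat|het", "dog"));  // dog cdog here dog
    }
}
```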


Subexpressions and Backreferences


(abc)      matches whatever the expression abc would match, and saves it as a subexpression. Also used for grouping.
(?:...)    pure grouping operator, does not save contents
(?#...)    embedded comment, ignored by engine
\n         where 0 < n < 10, matches the same thing the nth subexpression matched.

Examples
Text          "cat chet her"
Pattern       "c(a|he)t"
Replacement   "d$1g"
Result        "dag dheg her"

Text          "cat cata cathe chethe cheta"
Pattern       "c(a|he)t\1"
Replacement   "d$1g"
Result        "cat dag cathe dheg cheta"

Text          "catty cote code"
Pattern       "c(.)t(.)"
Replacement   "d$2g$1"
Result        "dtgay dego code"
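
Two further cases are worth trying: a backreference that finds a doubled character, and the pure grouping operator, which groups without taking a subexpression number. The sketch uses the standard java.util.regex package as a stand-in for gnu.regexp (an assumption of this example; the class name is hypothetical). The embedded comment form (?#...) is left out because that stand-in package does not accept it.

```java
import java.util.regex.Pattern;

// Stand-in sketch using java.util.regex.
class BackreferenceDemo {
    static String replaceAll(String text, String pattern, String replacement) {
        return Pattern.compile(pattern).matcher(text).replaceAll(replacement);
    }

    public static void main(String[] args) {
        // (\w)\1 matches a word character followed by the same character;
        // replacing the pair with $1 collapses each doubled letter.
        System.out.println(replaceAll("bookkeeper", "(\\w)\\1", "$1")); // bokeper

        // (?:a|he) groups but is not saved, so (t) is subexpression 1.
        System.out.println(replaceAll("cat chet", "c(?:a|he)(t)", "<$1>")); // <t> <t>
    }
}
```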



Repeating Operators

These symbols operate on the previous atomic expression.

?        matches the preceding expression or the null string
*        matches the null string or any number of repetitions of the preceding expression
+        matches one or more repetitions of the preceding expression
{m}      matches exactly m repetitions of the preceding expression
{m,n}    matches between m and n repetitions of the preceding expression, inclusive
{m,}     matches m or more repetitions of the preceding expression

Examples
Text          "ct cat caat caaat caaaat"
Pattern       "ca+t"
Replacement   "dog"
Result        "ct dog dog dog dog"

Text          "ct cat caat caaat caaaat"
Pattern       "ca*t"
Replacement   "dog"
Result        "dog dog dog dog dog"

Text          "ct cat caat caaat caaaat"
Pattern       "ca{2,3}t"
Replacement   "dog"
Result        "ct cat dog dog caaaat"

Text          "ct cat caat caaat caaaat"
Pattern       "ca?t"
Replacement   "dog"
Result        "dog dog caat caaat caaaat"
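
The remaining forms, {m} and {m,}, and a repeat applied to a parenthesized group rather than a single character, can be tried on the same text. The sketch uses the standard java.util.regex package as a stand-in for gnu.regexp (an assumption of this example; the class name is hypothetical).

```java
import java.util.regex.Pattern;

// Stand-in sketch using java.util.regex.
class RepeatDemo {
    static String replaceAll(String text, String pattern, String replacement) {
        return Pattern.compile(pattern).matcher(text).replaceAll(replacement);
    }

    public static void main(String[] args) {
        String text = "ct cat caat caaat caaaat";
        // Exactly two a's
        System.out.println(replaceAll(text, "ca{2}t", "dog"));  // ct cat dog caaat caaaat
        // Three or more a's
        System.out.println(replaceAll(text, "ca{3,}t", "dog")); // ct cat caat dog dog
        // The repeat applies to the whole group: one or more pairs of a's
        System.out.println(replaceAll(text, "c(aa)+t", "dog")); // ct cat dog caaat dog
    }
}
```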


Copyright © 1998 SDSU & Roger Whitney, 5500 Campanile Drive, San Diego, CA 92182-7700 USA.
All rights reserved.
