RE is an efficient, lightweight regular expression evaluator/matcher
class. Regular expressions are pattern descriptions which enable
sophisticated matching of strings. In addition to being able to
match a string against a pattern, you can also extract parts of the
match. This is especially useful in text parsing! Details on the
syntax of regular expression patterns are given below.
To compile a regular expression (RE), you can simply construct an RE
matcher object from the string specification of the pattern, like this:
RE r = new RE("a*b");
Once you have done this, you can call either of the RE.match methods to
perform matching on a String. For example:
boolean matched = r.match("aaaab");
will cause the boolean matched to be set to true because the
pattern "a*b" matches the string "aaaab".
If you were interested in the number of a's which matched the first
part of our example expression, you could change the expression to
"(a*)b". Then when you compiled the expression and matched it against
something like "xaaaab", you would get results like this:
RE r = new RE("(a*)b"); // Compile expression
boolean matched = r.match("xaaaab"); // Match against "xaaaab"
String wholeExpr = r.getParen(0); // wholeExpr will be 'aaaab'
String insideParens = r.getParen(1); // insideParens will be 'aaaa'
int startWholeExpr = r.getParenStart(0); // startWholeExpr will be index 1
int endWholeExpr = r.getParenEnd(0); // endWholeExpr will be index 6
int lenWholeExpr = r.getParenLength(0); // lenWholeExpr will be 5
int startInside = r.getParenStart(1); // startInside will be index 1
int endInside = r.getParenEnd(1); // endInside will be index 5
int lenInside = r.getParenLength(1); // lenInside will be 4
You can also refer to the contents of a parenthesized expression
within a regular expression itself. This is called a
'backreference'. The first backreference in a regular expression is
denoted by \1, the second by \2 and so on. So the expression:
([0-9]+)=\1
will match any string of the form n=n (like 0=0 or 2=2).
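To make the backreference rule concrete, here is a minimal sketch. It assumes the JDK's java.util.regex as a stand-in so that it runs without this package; it is not RE itself, and the class name BackrefSketch and the method sameOnBothSides are hypothetical.

```java
import java.util.regex.Pattern;

final class BackrefSketch {
    // Same pattern as in the text; the backslash is doubled for the Java string literal.
    private static final Pattern SAME = Pattern.compile("([0-9]+)=\\1");

    // True when the whole input has the form n=n.
    static boolean sameOnBothSides(String input) {
        return SAME.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(sameOnBothSides("2=2"));   // prints true
        System.out.println(sameOnBothSides("12=13")); // prints false
    }
}
```

The second call fails because \1 must repeat exactly the text captured by the first group.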
The full regular expression syntax accepted by RE is described here:
Characters
unicodeChar Matches any identical unicode character
\ Used to quote a meta-character (like '*')
\\ Matches a single '\' character
\0nnn Matches a given octal character
\xhh Matches a given 8-bit hexadecimal character
\uhhhh Matches a given 16-bit hexadecimal character
\t Matches an ASCII tab character
\n Matches an ASCII newline character
\r Matches an ASCII return character
\f Matches an ASCII form feed character
Character Classes
[abc] Simple character class
[a-zA-Z] Character class with ranges
[^abc] Negated character class
NOTE: Incomplete ranges will be interpreted as "starts
from zero" or "ends with last character".
I.e. [-a] is the same as [\u0000-a], [a-] is the same as [a-\uFFFF],
and [-] means "all characters".
Standard POSIX Character Classes
[:alnum:] Alphanumeric characters.
[:alpha:] Alphabetic characters.
[:blank:] Space and tab characters.
[:cntrl:] Control characters.
[:digit:] Numeric characters.
[:graph:] Characters that are printable and are also visible.
(A space is printable, but not visible, while an
'a' is both.)
[:lower:] Lower-case alphabetic characters.
[:print:] Printable characters (characters that are not
control characters.)
[:punct:] Punctuation characters (characters that are not letters,
digits, control characters, or space characters).
[:space:] Space characters (such as space, tab, and formfeed,
to name a few).
[:upper:] Upper-case alphabetic characters.
[:xdigit:] Characters that are hexadecimal digits.
Non-standard POSIX-style Character Classes
[:javastart:] Start of a Java identifier
[:javapart:] Part of a Java identifier
Predefined Classes
. Matches any character other than newline
\w Matches a "word" character (alphanumeric plus "_")
\W Matches a non-word character
\s Matches a whitespace character
\S Matches a non-whitespace character
\d Matches a digit character
\D Matches a non-digit character
Boundary Matchers
^ Matches only at the beginning of a line
$ Matches only at the end of a line
\b Matches only at a word boundary
\B Matches only at a non-word boundary
Greedy Closures
A* Matches A 0 or more times (greedy)
A+ Matches A 1 or more times (greedy)
A? Matches A 1 or 0 times (greedy)
A{n} Matches A exactly n times (greedy)
A{n,} Matches A at least n times (greedy)
A{n,m} Matches A at least n but not more than m times (greedy)
Reluctant Closures
A*? Matches A 0 or more times (reluctant)
A+? Matches A 1 or more times (reluctant)
A?? Matches A 0 or 1 times (reluctant)
Logical Operators
AB Matches A followed by B
A|B Matches either A or B
(A) Used for subexpression grouping
(?:A) Used for subexpression clustering (just like grouping but
no backrefs)
Backreferences
\1 Backreference to 1st parenthesized subexpression
\2 Backreference to 2nd parenthesized subexpression
\3 Backreference to 3rd parenthesized subexpression
\4 Backreference to 4th parenthesized subexpression
\5 Backreference to 5th parenthesized subexpression
\6 Backreference to 6th parenthesized subexpression
\7 Backreference to 7th parenthesized subexpression
\8 Backreference to 8th parenthesized subexpression
\9 Backreference to 9th parenthesized subexpression
All closure operators (+, *, ?, {m,n}) are greedy by default, meaning
that they match as many elements of the string as possible without
causing the overall match to fail. If you want a closure to be
reluctant (non-greedy), you can simply follow it with a '?'. A
reluctant closure will match as few elements of the string as
possible when finding matches. {m,n} closures don't currently
support reluctancy.
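The difference between greedy and reluctant closures is easiest to see on one input. The sketch below again assumes java.util.regex as a runnable stand-in for RE; ClosureSketch and firstMatch are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ClosureSketch {
    // Returns the first match of regex in input, or null if there is none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Greedy: .* takes as much as it can while still letting the final b match.
        System.out.println(firstMatch("a.*b", "xaabab"));  // prints aabab
        // Reluctant: .*? takes as little as it can.
        System.out.println(firstMatch("a.*?b", "xaabab")); // prints aab
    }
}
```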
Line terminators
A line terminator is a one- or two-character sequence that marks
the end of a line of the input character sequence. The following
are recognized as line terminators:
- A newline (line feed) character ('\n'),
- A carriage-return character followed immediately by a newline character ("\r\n"),
- A standalone carriage-return character ('\r'),
- A next-line character ('\u0085'),
- A line-separator character ('\u2028'), or
- A paragraph-separator character ('\u2029').
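The list above can be read as a small scanning rule. The following is a hand-written sketch of that rule, not code taken from RE; the names LineTerminatorSketch, isTerminatorChar and splitLines are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class LineTerminatorSketch {
    // True for each single-character terminator in the list above.
    static boolean isTerminatorChar(char c) {
        return c == '\n' || c == '\r' || c == '\u0085' || c == '\u2028' || c == '\u2029';
    }

    // Splits text into lines, treating a CR immediately followed by LF as one terminator.
    static List<String> splitLines(String text) {
        List<String> lines = new ArrayList<>();
        int start = 0;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (!isTerminatorChar(c)) {
                continue;
            }
            lines.add(text.substring(start, i));
            if (c == '\r' && i + 1 < text.length() && text.charAt(i + 1) == '\n') {
                i++; // skip the LF half of the two-character terminator
            }
            start = i + 1;
        }
        lines.add(text.substring(start)); // text after the last terminator
        return lines;
    }

    public static void main(String[] args) {
        System.out.println(splitLines("a\r\nb\rc\nd")); // prints [a, b, c, d]
    }
}
```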
RE runs programs compiled by the RECompiler class. But the RE
matcher class does not include the actual regular expression compiler
for reasons of efficiency. In fact, if you want to pre-compile one
or more regular expressions, the 'recompile' class can be invoked
from the command line to produce compiled output like this:
// Pre-compiled regular expression "a*b"
char[] re1Instructions =
{
0x007c, 0x0000, 0x001a, 0x007c, 0x0000, 0x000d, 0x0041,
0x0001, 0x0004, 0x0061, 0x007c, 0x0000, 0x0003, 0x0047,
0x0000, 0xfff6, 0x007c, 0x0000, 0x0003, 0x004e, 0x0000,
0x0003, 0x0041, 0x0001, 0x0004, 0x0062, 0x0045, 0x0000,
0x0000,
};
REProgram re1 = new REProgram(re1Instructions);
You can then construct a regular expression matcher (RE) object from
the pre-compiled expression re1 and thus avoid the overhead of
compiling the expression at runtime. If you require more dynamic
regular expressions, you can construct a single RECompiler object and
re-use it to compile each expression. Similarly, you can change the
program run by a given matcher object at any time. However, RE and
RECompiler are not threadsafe (for efficiency reasons, and because
requiring thread safety in this class is deemed to be a rare
requirement), so you will need to construct a separate compiler or
matcher object for each thread (unless you do thread synchronization
yourself). Once an expression is compiled into an REProgram object, that
REProgram can be safely shared across multiple threads and RE objects.
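The sharing rule (one compiled program, one matcher per thread) follows the same shape as the JDK's java.util.regex, where a Pattern is immutable and a Matcher is not thread-safe. The sketch below uses that JDK analogue, with Pattern standing in for REProgram and Matcher for RE; it does not call RE, and all names in it are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class SharedProgramSketch {
    // Compiled once and shared by every thread (the role of the REProgram).
    private static final Pattern SHARED = Pattern.compile("a*b");

    // Each call builds its own Matcher (the role of a per-thread RE object).
    static int countMatches(String input) {
        Matcher m = SHARED.matcher(input);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    // Runs one worker thread per input and collects the counts.
    static int[] countInParallel(String[] inputs) {
        int[] results = new int[inputs.length];
        Thread[] workers = new Thread[inputs.length];
        for (int i = 0; i < inputs.length; i++) {
            final int slot = i;
            workers[i] = new Thread(() -> results[slot] = countMatches(inputs[slot]));
            workers[i].start();
        }
        try {
            for (Thread t : workers) {
                t.join(); // join makes each worker's write visible here
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return results;
    }

    public static void main(String[] args) {
        int[] counts = countInParallel(new String[] {"aab b ab", "xyz"});
        System.out.println(counts[0] + " " + counts[1]); // prints 3 0
    }
}
```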
ISSUES:
- com.weusours.util.re is not currently compatible with all
standard POSIX regcomp flags
- com.weusours.util.re does not support POSIX equivalence classes
([=foo=] syntax) (I18N/locale issue)
- com.weusours.util.re does not support nested POSIX character
classes (definitely should, but not completely trivial)
- com.weusours.util.re does not support POSIX character collation
concepts ([.foo.] syntax) (I18N/locale issue)
- Should there be different matching styles (simple, POSIX, Perl etc?)
- Should RE support character iterators (for backwards RE matching!)?
- Should RE support reluctant {m,n} closures (does anyone care)?
- Not *all* possibilities are considered for greediness when backreferences
are involved (as POSIX suggests should be the case). The POSIX RE
"(ac*)c*d[ac]*\1", when matched against "acdacaa" should yield a match
of acdacaa where \1 is "a". This is not the case in this RE package,
and actually Perl doesn't go to this extent either! Until someone
actually complains about this, I'm not sure it's worth "fixing".
If it ever is fixed, test #137 in RETest.txt should be updated.
E_ALNUM
(package private) static final char E_ALNUM
E_BOUND
(package private) static final char E_BOUND
E_DIGIT
(package private) static final char E_DIGIT
E_NALNUM
(package private) static final char E_NALNUM
E_NBOUND
(package private) static final char E_NBOUND
E_NDIGIT
(package private) static final char E_NDIGIT
E_NSPACE
(package private) static final char E_NSPACE
E_SPACE
(package private) static final char E_SPACE
MATCH_CASEINDEPENDENT
public static final int MATCH_CASEINDEPENDENT
Flag to indicate that matching should be case-independent (folded)
MATCH_MULTILINE
public static final int MATCH_MULTILINE
Newlines should match as BOL/EOL (^ and $)
MATCH_NORMAL
public static final int MATCH_NORMAL
Specifies normal, case-sensitive matching behaviour.
MATCH_SINGLELINE
public static final int MATCH_SINGLELINE
Consider all input a single body of text - newlines are matched by .
MAX_PAREN
(package private) static final int MAX_PAREN
OP_ANY
(package private) static final char OP_ANY
OP_ANYOF
(package private) static final char OP_ANYOF
OP_ATOM
(package private) static final char OP_ATOM
OP_BACKREF
(package private) static final char OP_BACKREF
OP_BOL
(package private) static final char OP_BOL
OP_BRANCH
(package private) static final char OP_BRANCH
OP_CLOSE
(package private) static final char OP_CLOSE
OP_CLOSE_CLUSTER
(package private) static final char OP_CLOSE_CLUSTER
OP_END
(package private) static final char OP_END
The format of a node in a program is:
[ OPCODE ] [ OPDATA ] [ OPNEXT ] [ OPERAND ]
char OPCODE - instruction
char OPDATA - modifying data
char OPNEXT - next node (relative offset)
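As a rough illustration of this layout, the sketch below reads the three header fields of a node from a char array. The offsets 0, 1 and 2 are an assumption taken from the field order shown; treating the relative offset as a signed 16-bit value is also an assumption (suggested by the 0xfff6 word in the "a*b" dump earlier). NodeLayoutSketch and its methods are hypothetical names.

```java
final class NodeLayoutSketch {
    // Assumed offsets, following the field order in the node format.
    static final int OFFSET_OPCODE = 0;
    static final int OFFSET_OPDATA = 1;
    static final int OFFSET_NEXT = 2;

    static char opcode(char[] program, int node) {
        return program[node + OFFSET_OPCODE];
    }

    static int opdata(char[] program, int node) {
        return program[node + OFFSET_OPDATA];
    }

    // Assumption: the offset is signed, so the 16-bit word is read as a short.
    static int next(char[] program, int node) {
        return (short) program[node + OFFSET_NEXT];
    }

    public static void main(String[] args) {
        // First three words of the "a*b" dump shown earlier.
        char[] first = {0x007c, 0x0000, 0x001a};
        System.out.println(opcode(first, 0) + " " + opdata(first, 0) + " " + next(first, 0));
        // prints | 0 26
    }
}
```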
OP_EOL
(package private) static final char OP_EOL
OP_ESCAPE
(package private) static final char OP_ESCAPE
OP_GOTO
(package private) static final char OP_GOTO
OP_MAYBE
(package private) static final char OP_MAYBE
OP_NOTHING
(package private) static final char OP_NOTHING
OP_OPEN
(package private) static final char OP_OPEN
OP_OPEN_CLUSTER
(package private) static final char OP_OPEN_CLUSTER
OP_PLUS
(package private) static final char OP_PLUS
OP_POSIXCLASS
(package private) static final char OP_POSIXCLASS
OP_RELUCTANTMAYBE
(package private) static final char OP_RELUCTANTMAYBE
OP_RELUCTANTPLUS
(package private) static final char OP_RELUCTANTPLUS
OP_RELUCTANTSTAR
(package private) static final char OP_RELUCTANTSTAR
OP_STAR
(package private) static final char OP_STAR
POSIX_CLASS_ALNUM
(package private) static final char POSIX_CLASS_ALNUM
POSIX_CLASS_ALPHA
(package private) static final char POSIX_CLASS_ALPHA
POSIX_CLASS_BLANK
(package private) static final char POSIX_CLASS_BLANK
POSIX_CLASS_CNTRL
(package private) static final char POSIX_CLASS_CNTRL
POSIX_CLASS_DIGIT
(package private) static final char POSIX_CLASS_DIGIT
POSIX_CLASS_GRAPH
(package private) static final char POSIX_CLASS_GRAPH
POSIX_CLASS_JPART
(package private) static final char POSIX_CLASS_JPART
POSIX_CLASS_JSTART
(package private) static final char POSIX_CLASS_JSTART
POSIX_CLASS_LOWER
(package private) static final char POSIX_CLASS_LOWER
POSIX_CLASS_PRINT
(package private) static final char POSIX_CLASS_PRINT
POSIX_CLASS_PUNCT
(package private) static final char POSIX_CLASS_PUNCT
POSIX_CLASS_SPACE
(package private) static final char POSIX_CLASS_SPACE
POSIX_CLASS_UPPER
(package private) static final char POSIX_CLASS_UPPER
POSIX_CLASS_XDIGIT
(package private) static final char POSIX_CLASS_XDIGIT
REPLACE_ALL
public static final int REPLACE_ALL
Flag bit that indicates that subst should replace all occurrences of this
regular expression.
REPLACE_BACKREFERENCES
public static final int REPLACE_BACKREFERENCES
Flag bit that indicates that subst should replace backreferences
REPLACE_FIRSTONLY
public static final int REPLACE_FIRSTONLY
Flag bit that indicates that subst should only replace the first occurrence
of this regular expression.
end0
(package private) int end0
end1
(package private) int end1
end2
(package private) int end2
endBackref
(package private) int[] endBackref
endn
(package private) int[] endn
matchFlags
(package private) int matchFlags
maxNode
(package private) static final int maxNode
maxParen
(package private) int maxParen
nodeSize
(package private) static final int nodeSize
offsetNext
(package private) static final int offsetNext
offsetOpcode
(package private) static final int offsetOpcode
offsetOpdata
(package private) static final int offsetOpdata
parenCount
(package private) int parenCount
start0
(package private) int start0
start1
(package private) int start1
start2
(package private) int start2
startBackref
(package private) int[] startBackref
startn
(package private) int[] startn
allocParens
private final void allocParens()
Performs lazy allocation of subexpression arrays
compareChars
private int compareChars(char c1,
char c2,
boolean caseIndependent)
Compares two characters.
c1
- first character to compare.
c2
- second character to compare.
caseIndependent
- whether comparison is case insensitive or not.
- negative, 0, or positive integer as the first character is
less than, equal to, or greater than the second.
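A comparison with this contract could look like the sketch below. The folding rule (comparing lower-case forms) is an assumption; the text does not say how RE folds case. CompareCharsSketch is a hypothetical class name.

```java
final class CompareCharsSketch {
    // Returns negative, 0, or positive as c1 is less than, equal to, or greater than c2.
    static int compareChars(char c1, char c2, boolean caseIndependent) {
        if (caseIndependent) {
            // Assumed folding rule: compare the lower-case forms.
            c1 = Character.toLowerCase(c1);
            c2 = Character.toLowerCase(c2);
        }
        return c1 - c2;
    }

    public static void main(String[] args) {
        System.out.println(compareChars('A', 'a', true));  // prints 0
        System.out.println(compareChars('A', 'a', false)); // prints -32
    }
}
```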
getMatchFlags
public int getMatchFlags()
Returns the current match behaviour flags.
- Current match behaviour flags (RE.MATCH_*).
MATCH_NORMAL // Normal (case-sensitive) matching
MATCH_CASEINDEPENDENT // Case folded comparisons
MATCH_MULTILINE // Newline matches as BOL/EOL
getParen
public String getParen(int which)
Gets the contents of a parenthesized subexpression after a successful match.
which
- Nesting level of subexpression
getParenCount
public int getParenCount()
Returns the number of parenthesized subexpressions available after a successful match.
- Number of available parenthesized subexpressions
getParenEnd
public final int getParenEnd(int which)
Returns the end index of a given paren level.
which
- Nesting level of subexpression
getParenLength
public final int getParenLength(int which)
Returns the length of a given paren level.
which
- Nesting level of subexpression
- Number of characters in the parenthesized subexpression
getParenStart
public final int getParenStart(int which)
Returns the start index of a given paren level.
which
- Nesting level of subexpression
getProgram
public REProgram getProgram()
Returns the current regular expression program in use by this matcher object.
- Regular expression program
grep
public String[] grep(Object[] search)
Returns an array of Strings, whose toString representation matches a regular
expression. This method works like the Perl function of the same name. Given
a regular expression of "a*b" and an array of String objects of [foo, aab, zzz,
aaaab], the array of Strings returned by grep would be [aab, aaaab].
search
- Array of Objects to search
- Array of Strings whose toString() value matches this regular expression.
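The grep behaviour described above can be sketched as follows, assuming java.util.regex as a runnable stand-in for RE. GrepSketch and its two-argument grep method are hypothetical and are not the signature documented here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

final class GrepSketch {
    // Keeps the string form of every element that contains a match of regex.
    static String[] grep(String regex, Object[] search) {
        Pattern p = Pattern.compile(regex);
        List<String> hits = new ArrayList<>();
        for (Object o : search) {
            String s = String.valueOf(o); // toString(); a null element becomes "null"
            if (p.matcher(s).find()) {
                hits.add(s);
            }
        }
        return hits.toArray(new String[0]);
    }

    public static void main(String[] args) {
        Object[] input = {"foo", "aab", "zzz", "aaaab"};
        System.out.println(Arrays.toString(grep("a*b", input))); // prints [aab, aaaab]
    }
}
```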
internalError
protected void internalError(String s)
throws Error
Throws an Error representing an internal error condition probably resulting
from a bug in the regular expression compiler (or possibly data corruption).
In practice, this should be very rare.
isNewline
private boolean isNewline(int i)
- true if character at i-th position in the search string is a newline
match
public boolean match(String search)
Matches the current regular expression program against a String.
search
- String to match against
match
public boolean match(String search,
int i)
Matches the current regular expression program against a String,
starting at a given index.
search
- String to match against
i
- Index to start searching at
match
public boolean match(CharacterIterator search,
int i)
Matches the current regular expression program against a character iterator,
starting at a given index.
search
- CharacterIterator to match against
i
- Index to start searching at
matchAt
protected boolean matchAt(int i)
Match the current regular expression program against the current
input string, starting at index i of the input string. This method
is only meant for internal use.
i
- The input string index to start matching at
- True if the input matched the expression
matchNodes
protected int matchNodes(int firstNode,
int lastNode,
int idxStart)
Try to match a string against a subset of nodes in the program
firstNode
- Node to start at in program
lastNode
- Last valid node (used for matching a subexpression without
matching the rest of the program as well).
idxStart
- Starting position in character array
- Final input array index if match succeeded. -1 if not.
setMatchFlags
public void setMatchFlags(int matchFlags)
Sets match behaviour flags which alter the way RE does matching.
matchFlags
- One or more of the RE match behaviour flags (RE.MATCH_*):
MATCH_NORMAL // Normal (case-sensitive) matching
MATCH_CASEINDEPENDENT // Case folded comparisons
MATCH_MULTILINE // Newline matches as BOL/EOL
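Because the flags are described as combinable, they behave like bit flags joined with bitwise OR. The sketch below shows that idea with the JDK's java.util.regex flags (Pattern.CASE_INSENSITIVE and Pattern.MULTILINE) as an analogue; the numeric values of the RE.MATCH_* constants are not given here and are not assumed. MatchFlagsSketch and found are hypothetical names.

```java
import java.util.regex.Pattern;

final class MatchFlagsSketch {
    // Compiles regex with the given flag bits and reports whether input contains a match.
    static boolean found(String regex, int flags, String input) {
        return Pattern.compile(regex, flags).matcher(input).find();
    }

    public static void main(String[] args) {
        String text = "a\nB\nc";
        // No flags: ^ and $ apply to the whole input, and case must match.
        System.out.println(found("^b$", 0, text)); // prints false
        // Both flags combined with bitwise OR.
        int flags = Pattern.CASE_INSENSITIVE | Pattern.MULTILINE;
        System.out.println(found("^b$", flags, text)); // prints true
    }
}
```

Either flag alone is not enough for this input: the match needs both the per-line anchors and the case folding.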
setParenEnd
protected final void setParenEnd(int which,
int i)
Sets the end of a paren level
which
- Which paren level
i
- Index in input array
setParenStart
protected final void setParenStart(int which,
int i)
Sets the start of a paren level
which
- Which paren level
i
- Index in input array
setProgram
public void setProgram(REProgram program)
Sets the current regular expression program used by this matcher object.
program
- Regular expression program compiled by RECompiler.
simplePatternToFullRegularExpression
public static String simplePatternToFullRegularExpression(String pattern)
Converts a 'simplified' regular expression to a full regular expression
pattern
- The pattern to convert
- The full regular expression
split
public String[] split(String s)
Splits a string into an array of strings on regular expression boundaries.
This function works the same way as the Perl function of the same name.
Given a regular expression of "[ab]+" and a string to split of
"xyzzyababbayyzabbbab123", the result would be the array of Strings
"[xyzzy, yyz, 123]".
Please note that the first string in the resulting array may be an empty
string. This happens when the very first character of input string is
matched by the pattern.
s
- String to split on this regular expression
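The split example and the leading-empty-string note can be reproduced with the sketch below. It assumes java.util.regex as a stand-in (Pattern.split on Java 8 or later, which also reports a leading empty string for a positive-width match at the start); it is not RE's implementation, and SplitSketch is a hypothetical name.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

final class SplitSketch {
    // Splits s around every match of regex.
    static String[] split(String regex, String s) {
        return Pattern.compile(regex).split(s);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(split("[ab]+", "xyzzyababbayyzabbbab123")));
        // prints [xyzzy, yyz, 123]
        System.out.println(Arrays.toString(split("[ab]+", "abx")));
        // prints [, x]  (the first element is the empty string)
    }
}
```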
subst
public String subst(String substituteIn,
String substitution)
Substitutes a string for this regular expression in another string.
This method works like the Perl function of the same name.
Given a regular expression of "a*b", a String to substituteIn of
"aaaabfooaaabgarplyaaabwackyb" and the substitution String "-", the
resulting String returned by subst would be "-foo-garply-wacky-".
substituteIn
- String to substitute within
substitution
- String to substitute for all matches of this regular expression.
- The string substituteIn with zero or more occurrences of the current
regular expression replaced with the substitution String (if this regular
expression object doesn't match at any position, the original String is returned
unchanged).
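The substitution example above can be checked with the following sketch, which assumes java.util.regex as a runnable stand-in for RE. SubstSketch and substAll are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class SubstSketch {
    // Replaces every match of regex in substituteIn with the literal substitution.
    static String substAll(String regex, String substituteIn, String substitution) {
        // quoteReplacement keeps '$' and '\' in the substitution literal.
        return Pattern.compile(regex).matcher(substituteIn)
                .replaceAll(Matcher.quoteReplacement(substitution));
    }

    public static void main(String[] args) {
        System.out.println(substAll("a*b", "aaaabfooaaabgarplyaaabwackyb", "-"));
        // prints -foo-garply-wacky-
        System.out.println(substAll("a*b", "xyz", "-"));
        // prints xyz  (no match, input returned unchanged)
    }
}
```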
subst
public String subst(String substituteIn,
String substitution,
int flags)
Substitutes a string for this regular expression in another string.
This method works like the Perl function of the same name.
Given a regular expression of "a*b", a String to substituteIn of
"aaaabfooaaabgarplyaaabwackyb" and the substitution String "-", the
resulting String returned by subst would be "-foo-garply-wacky-".
It is also possible to reference the contents of a parenthesized expression
with $0, $1, ... $9. Given a regular expression of "http://[\\.\\w\\-\\?/~_@&=%]+",
a String to substituteIn of "visit us: http://www.apache.org!" and the
substitution String "<a href=\"$0\">$0</a>", the resulting String
returned by subst would be
"visit us: <a href=\"http://www.apache.org\">http://www.apache.org</a>!".
Note: $0 represents the whole match.
substituteIn
- String to substitute within
substitution
- String to substitute for matches of this regular expression
flags
- One or more bitwise flags from REPLACE_*. If the REPLACE_FIRSTONLY
flag bit is set, only the first occurrence of this regular expression is replaced.
If the bit is not set (REPLACE_ALL), all occurrences of this pattern will be
replaced. If the flag REPLACE_BACKREFERENCES is set, all backreferences will
be processed.
- The string substituteIn with zero or more occurrences of the current
regular expression replaced with the substitution String (if this regular
expression object doesn't match at any position, the original String is returned
unchanged).
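The effect of the three REPLACE_* options can be sketched as below. This assumes java.util.regex as a stand-in and models the flags as two hypothetical boolean parameters (firstOnly, backrefs) rather than RE's bit flags, whose values are not given here. SubstFlagsSketch is a hypothetical name.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class SubstFlagsSketch {
    static String subst(String regex, String substituteIn, String substitution,
                        boolean firstOnly, boolean backrefs) {
        Matcher m = Pattern.compile(regex).matcher(substituteIn);
        // Without backreference processing, '$' in the substitution stays literal.
        String repl = backrefs ? substitution : Matcher.quoteReplacement(substitution);
        return firstOnly ? m.replaceFirst(repl) : m.replaceAll(repl);
    }

    public static void main(String[] args) {
        String url = "http://[\\.\\w\\-\\?/~_@&=%]+";
        System.out.println(subst(url, "visit us: http://www.apache.org!",
                "<a href=\"$0\">$0</a>", false, true));
        // prints visit us: <a href="http://www.apache.org">http://www.apache.org</a>!
        System.out.println(subst("a*b", "aabxb", "-", true, false));
        // prints -xb  (only the first occurrence is replaced)
    }
}
```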