This document provides an introduction to regular expressions. It explains that regular expressions are a shorthand for patterns that can be used to match text strings. The document gives examples of regular expressions to find specific words, validate phone numbers and zip codes, and extract named groups from matches. It also covers special characters, lookarounds, capturing groups, backreferences, and making expressions greedy or lazy.
2. What are Regular Expressions?
Regular expressions are an extension of wildcards (e.g. *.doc).
Code that manipulates text needs to locate strings that match
complex patterns.
A regular expression is a shorthand for a pattern.
\w+ is a concise way to say "match any non-empty string of alphanumeric
characters."
3. Finding Nemo
nemo
Find nemo
When ignoring case, will match Nemo, NEMO, or nEmO.
Will also match characters 9-12 of Johnny Mnemonic, or Finding Nemo 2.
\bnemo\b
Find nemo as a whole word
\b is a code that says "match the position at the beginning or end of any word."
When ignoring case, will only match complete words spelled nemo, in any
combination of upper and lowercase letters.
\bnemo\b.*\b2\b Find text with nemo followed by 2
The special characters that give regular expressions their power are already making
them hard for humans to read.
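The three nemo patterns above can be tried out with a short sketch. It uses Python's re module rather than .NET (an assumption made for illustration), and the sample sentence and variable names are invented for this example.

```python
import re

# Made-up sample text for illustration.
text = "Finding Nemo 2 stars nemo; Johnny Mnemonic does not."

# Plain pattern, ignoring case: also hits the "nemo" inside "Mnemonic".
anywhere = re.findall(r"nemo", text, flags=re.IGNORECASE)

# \b on both sides restricts the match to whole words.
whole_word = re.findall(r"\bnemo\b", text, flags=re.IGNORECASE)

# Whole word nemo, then any characters, then a standalone 2.
followed_by_2 = re.search(r"\bnemo\b.*\b2\b", text, flags=re.IGNORECASE)

print(anywhere)                 # ['Nemo', 'nemo', 'nemo']
print(whole_word)               # ['Nemo', 'nemo']
print(followed_by_2.group(0))   # Nemo 2
```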
4. Determining the Validity of Phone Numbers
\b\d\d\d-\d\d\d-\d\d\d\d    Find ten-digit US phone number
    \d     Matches any single digit.
    -      Literal hyphen (has no special meaning).
\b\d{3}-\d{3}-\d{4}         Better way to find the number
    {3}    Follows \d to mean repeat the preceding character three times.
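As a quick check that the long and short spellings are equivalent, here is a minimal sketch in Python's re module (chosen for illustration; the slides target .NET). The sample text and names are hypothetical.

```python
import re

# The same pattern written two ways: spelled out, and with {n} repetition.
long_form = re.compile(r"\b\d\d\d-\d\d\d-\d\d\d\d")
short_form = re.compile(r"\b\d{3}-\d{3}-\d{4}")

# Made-up sample text.
text = "Call 480-555-1212 or 602-555-0199 before noon."

found_long = long_form.findall(text)
found_short = short_form.findall(text)

print(found_short)                 # ['480-555-1212', '602-555-0199']
print(found_long == found_short)   # True
```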
5. Special Characters
\ba\w*\b    Find words that start with the letter a
    \b     The beginning of a word.
    a      The letter a.
    \w*    Any number of repetitions of alphanumeric characters.
    \b     The end of a word.
\d+         Find repeated strings of digits
    +      Similar to *, but requires at least one repetition.
6. Special Characters, continued
\b\w{6}\b   Find six letter words
.     Match any character except newline
\w    Match any alphanumeric character
\s    Match any whitespace character
\d    Match any digit
\b    Match the beginning or end of a word
^     Match the beginning of the string
$     Match the end of the string
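The special characters from the last two slides can be combined as in the sketch below. It assumes Python's re module, whose \w, \d, and \b behave like the codes described here for plain ASCII text; the sample sentence is invented.

```python
import re

# Made-up sample text.
text = "an apple and 42 banana crates cost 1000 dollars"

# \ba\w*\b : words that start with the letter a.
a_words = re.findall(r"\ba\w*\b", text)

# \d+ : runs of one or more digits.
numbers = re.findall(r"\d+", text)

# \b\w{6}\b : words of exactly six alphanumeric characters.
six_letter = re.findall(r"\b\w{6}\b", text)

print(a_words)      # ['an', 'apple', 'and']
print(numbers)      # ['42', '1000']
print(six_letter)   # ['banana', 'crates']
```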
7. Beginnings and Endings
^\d{3}-\d{3}-\d{4}$    Validate an entire string as a phone number
    ^       The beginning of the string.
    $       The end of the string.
In .NET, use RegexOptions.Multiline to match the beginning and end of a line.
^\$1000$               Find $1000 as the entire string
    ^       The beginning of the string.
    \$      Escaped $ (literal $).
    1000    Literal 1000.
    $       The end of the string.
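A rough sketch of anchored validation follows. It is written against Python's re module, where re.MULTILINE plays the role that RegexOptions.Multiline plays in .NET; the helper name is_phone and the sample strings are hypothetical.

```python
import re

PHONE = re.compile(r"^\d{3}-\d{3}-\d{4}$")

def is_phone(candidate: str) -> bool:
    """Check a string against the anchored phone pattern (hypothetical helper)."""
    return PHONE.search(candidate) is not None

# Made-up three-line input.
lines = "480-555-1212\nnot a number\n602-555-0199"

# Without MULTILINE, ^ and $ refer to the whole three-line string.
whole_string = re.findall(r"^\d{3}-\d{3}-\d{4}$", lines)

# With MULTILINE, ^ and $ also match at the start and end of each line.
per_line = re.findall(r"^\d{3}-\d{3}-\d{4}$", lines, flags=re.MULTILINE)

# \$ is a literal dollar sign; the trailing $ is the end-of-string anchor.
exact_price = re.search(r"^\$1000$", "$1000") is not None
longer_price = re.search(r"^\$1000$", "$10000") is not None

print(is_phone("480-555-1212"))        # True
print(is_phone("Call 480-555-1212"))   # False
print(whole_string)                    # []
print(per_line)                        # ['480-555-1212', '602-555-0199']
print(exact_price, longer_price)       # True False
```

One Python-specific detail: $ also matches just before a single trailing newline, so re.fullmatch is the stricter choice when that matters.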
8. Wash, Rinse, Repeat
*        Repeat any number of times
+        Repeat one or more times
?        Repeat zero or one time
{n}      Repeat n times
{n,m}    Repeat at least n, but not more than m times
{n,}     Repeat at least n times
9. Wash, Rinse, Repeat, continued
\b\w{5,6}\b    Find all five and six letter words
    \w{5,6}    Word with at least 5, but not more than 6, characters.
\+\d{1,3}\s\d{3}-\d{3}-\d{4}
Find phone numbers formatted for intl calling
    \s         White space
\d{3}-\d{2}-\d{4} Find social security numbers
^\w*
Find first word in string
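The quantifier examples can be exercised with a short sketch like the one below, again assuming Python's re module for illustration. The sentence and the sample number are invented.

```python
import re

# Made-up sample sentence.
sentence = "Regex makes short work of tricky patterns"

# {5,6}: at least five, but not more than six, word characters.
five_or_six = re.findall(r"\b\w{5,6}\b", sentence)

# ^\w* : the run of word characters at the very start of the string.
first_word = re.search(r"^\w*", sentence).group(0)

# {3}, {2}, {4}: exact repetition counts (the number below is fictitious).
ssn_like = re.findall(r"\d{3}-\d{2}-\d{4}", "Phone 480-555-1212, SSN 123-45-6789")

print(five_or_six)   # ['Regex', 'makes', 'short', 'tricky']
print(first_word)    # Regex
print(ssn_like)      # ['123-45-6789']
```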
10. Character Classes
[aeiou]    Matches any vowel
[.?!]      Matches punctuation at the end of a sentence
    .       Literal ., losing its special meaning because it's inside brackets
    ?       Literal ?
\(?\d{3}[) ]\s?\d{3}[- ]\d{4}
Matches a 10-digit phone number
    \(?     Zero or one left parenthesis.
    [) ]    A right parenthesis or a space.
Will also match 480) 555-1212.
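To see which phone formats the bracketed pattern accepts, consider this minimal sketch. It assumes Python's re module, and the helper name matches_phone is hypothetical.

```python
import re

PHONE = re.compile(r"\(?\d{3}[) ]\s?\d{3}[- ]\d{4}")

def matches_phone(text: str) -> bool:
    """True when the character-class phone pattern is found in text."""
    return PHONE.search(text) is not None

# Inside brackets, . and ? are literal characters.
sentence_ends = re.findall(r"[.?!]", "Really? Yes. Great!")
vowels = re.findall(r"[aeiou]", "regex")

print(matches_phone("(480) 555-1212"))   # True
print(matches_phone("480 555 1212"))     # True
print(matches_phone("480) 555-1212"))    # True (unbalanced, but accepted)
print(matches_phone("480-555-1212"))     # False ([) ] rejects the hyphen)
print(sentence_ends)                     # ['?', '.', '!']
print(vowels)                            # ['e', 'e']
```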
11. Negation
\W          Match any character that is NOT alphanumeric
\S          Match any character that is NOT whitespace
\D          Match any character that is NOT a digit
\B          Match a position that is NOT a word boundary
[^x]        Match any character that is NOT x
[^aeiou]    Match any character that is NOT one of the chars aeiou
\S+         All strings that do not contain whitespace characters
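A small sketch of the negated codes, assuming Python's re module (the slides describe .NET, but these particular codes behave the same way for ASCII input). All sample strings are made up.

```python
import re

# \S+ : maximal runs of non-whitespace characters.
tokens = re.findall(r"\S+", "split  on\twhitespace")

# [^aeiou] : any single character that is not one of a, e, i, o, u.
consonants = re.findall(r"[^aeiou]", "regex")

# \D : any non-digit; removing them leaves only the digits.
digits_only = re.sub(r"\D", "", "(480) 555-1212")

# \B : a position that is not a word boundary, so "emo" must sit inside a word.
inner = re.findall(r"\Bemo\B", "nemo Mnemonic")

print(tokens)        # ['split', 'on', 'whitespace']
print(consonants)    # ['r', 'g', 'x']
print(digits_only)   # 4805551212
print(inner)         # ['emo']
```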
12. Alternatives
|    Pipe symbol separates alternatives
\b\d{5}-\d{4}\b|\b\d{5}\b
Five and nine digit Zip Codes
    \b\d{5}-\d{4}\b    Leftmost alternative first: nine digit Zip Codes.
    \b\d{5}\b          Second: five digit Zip Codes.
\b\d{5}\b|\b\d{5}-\d{4}\b
Only matches five digit Zip Codes: the leftmost alternative wins on the first
five digits.
(\(\d{3}\)|\d{3})\s?\d{3}[- ]\d{4}
Ten digit phone numbers
    (\(\d{3}\)|\d{3})  Matches (480) or 480.
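The effect of alternative order is easy to demonstrate. The sketch below assumes Python's re module, which also tries alternatives left to right; the Zip Codes and phone numbers are sample values.

```python
import re

nine_first = r"\b\d{5}-\d{4}\b|\b\d{5}\b"
five_first = r"\b\d{5}\b|\b\d{5}-\d{4}\b"

# Made-up sample text.
text = "85001 and 85281-1234"

with_nine_first = re.findall(nine_first, text)
with_five_first = re.findall(five_first, text)

# Alternation inside a group: area code with or without parentheses.
phone = re.compile(r"(\(\d{3}\)|\d{3})\s?\d{3}[- ]\d{4}")
area_a = phone.search("(480) 555-1212").group(1)
area_b = phone.search("480 555-1212").group(1)

print(with_nine_first)   # ['85001', '85281-1234']
print(with_five_first)   # ['85001', '85281']
print(area_a, area_b)    # (480) 480
```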
13. Grouping
Parentheses delimit a subexpression to allow repetition or special
treatment.
(\d{1,3}\.){3}\d{1,3}
A simple IP address finder
    (\d{1,3}\.)    A one to three digit number followed by a literal period.
    {3}            Repeats the preceding group three times.
Also matches invalid IP addresses like 999.999.999.999.
((2[0-4]\d|25[0-5]|[01]?\d\d?)\.){3}(2[0-4]\d|25[0-5]|[01]?\d\d?)
A better IP address finder
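The two IP address patterns can be compared side by side. This is a sketch under a few assumptions: it uses Python's re module, and it uses fullmatch so the whole string must be an address (a choice made here, not something the slides specify). The helper names are hypothetical.

```python
import re

SIMPLE = re.compile(r"(\d{1,3}\.){3}\d{1,3}")
BETTER = re.compile(
    r"((2[0-4]\d|25[0-5]|[01]?\d\d?)\.){3}(2[0-4]\d|25[0-5]|[01]?\d\d?)"
)

def simple_ok(candidate: str) -> bool:
    """Whole string has the shape of four dot-separated 1-3 digit numbers."""
    return SIMPLE.fullmatch(candidate) is not None

def better_ok(candidate: str) -> bool:
    """Whole string is four dot-separated numbers, each limited to 0-255."""
    return BETTER.fullmatch(candidate) is not None

for address in ["192.168.0.1", "255.255.255.255", "999.999.999.999", "256.1.1.1"]:
    print(address, simple_ok(address), better_ok(address))
# 192.168.0.1 True True
# 255.255.255.255 True True
# 999.999.999.999 True False
# 256.1.1.1 True False
```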
14. Backreferences
Backreferences search for a recurrence of previously matched text that
has been captured by a group.
\b(\w+)\b\s*\1\b    Find repeated words
    (\w+)    Finds a string of at least one character within group 1.
    \s*      Finds any amount of whitespace.
    \1       Finds a repetition of the captured text.
15. Backreferences, continued
Automatic numbering of groups can be overridden by specifying an
explicit name or number.
\b(?<Word>\w+)\b\s*\k<Word>\b
Capture repeated word in a named group
    (?<Word>\w+)    Names this capture group Word.
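Both repeated-word patterns can be sketched in Python, with one caveat: Python's re module spells named groups (?P<Word>...) and named backreferences (?P=Word), so the named pattern below is a translation of the .NET syntax shown above rather than a copy of it. The sample sentence is invented.

```python
import re

# Made-up sample text with two doubled words.
text = "Paris in the the spring is is lovely"

# Numbered backreference: \1 repeats whatever group 1 captured.
numbered = re.compile(r"\b(\w+)\b\s*\1\b")

# Named group, Python spelling (the .NET form is (?<Word>\w+) ... \k<Word>).
named = re.compile(r"\b(?P<Word>\w+)\b\s*(?P=Word)\b")

repeated_numbered = [m.group(1) for m in numbered.finditer(text)]
repeated_named = [m.group("Word") for m in named.finditer(text)]
full_matches = [m.group(0) for m in numbered.finditer(text)]

print(repeated_numbered)   # ['the', 'is']
print(repeated_named)      # ['the', 'is']
print(full_matches)        # ['the the', 'is is']
```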
16. Captures and Lookarounds
Captures
(exp)           Match exp and capture it in an automatically numbered group.
(?<name>exp)    Match exp and capture it in a group named name.
(?:exp)         Match exp, but do not capture it.
Lookarounds
Match a position, like ^ or \b, and never match any text.
(?=exp)         Match any position preceding a suffix exp.
(?<=exp)        Match any position following a prefix exp.
(?!exp)         Match any position after which the suffix exp isn't found.
(?<!exp)        Match any position before which the prefix exp isn't found.
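The difference between capturing, non-capturing, and zero-width constructs shows up in what a match object reports. A minimal sketch, assuming Python's re module (which writes the named group as (?P<name>exp)); the group name line and the sample number are hypothetical.

```python
import re

# One numbered group, one non-capturing group, one named group.
m = re.search(r"(\d{3})-(?:\d{3})-(?P<line>\d{4})", "480-555-1212")

captured = m.groups()          # the (?:...) group contributes nothing
line_number = m.group("line")

# A lookahead matches a position only: the match is empty.
position = re.search(r"(?=555)", "480-555-1212")

print(captured)                           # ('480', '1212')
print(line_number)                        # 1212
print(repr(position.group(0)))            # ''
print(position.start(), position.end())   # 4 4
```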
17. Positive Lookaround
\b\w+(?=ing\b)    The beginning of words ending with ing
    (?=ing)      Zero-width positive lookahead assertion.
                 Matches a position that precedes a given suffix.
(?<=\bre)\w+\b    The end of words starting with re
    (?<=\bre)    Zero-width positive lookbehind assertion.
                 Matches a position following a prefix.
(?<=\d)\d{3}\b    3 digits at the end of a word, preceded by a digit
(?<=\s)\w+(?=\s)  Alphanumeric strings bounded by whitespace
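The four positive lookaround patterns can be tried as follows. The sketch assumes Python's re module, which requires lookbehind patterns to have a fixed width; all four patterns here satisfy that. The sample strings are made up.

```python
import re

# Lookahead: keep the stem, leave the "ing" suffix out of the match.
stems = re.findall(r"\b\w+(?=ing\b)", "singing and running")

# Lookbehind: keep what follows the "re" prefix, leave the prefix out.
tails = re.findall(r"(?<=\bre)\w+\b", "rewrite the report")

# Three digits at the end of a word, preceded by another digit.
last_three = re.findall(r"(?<=\d)\d{3}\b", "1234 567")

# Words with whitespace on both sides (the first and last word have only one).
bounded = re.findall(r"(?<=\s)\w+(?=\s)", "one two three")

print(stems)        # ['sing', 'runn']
print(tails)        # ['write', 'port']
print(last_three)   # ['234']
print(bounded)      # ['two']
```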
18. Negative Lookaround
\b\w*q[^u]\w*\b     Words with q followed by NOT u
    [^u]           Always matches a character, so Iraq at the end of the
                   text does not match.
\b\w*q(?!u)\w*\b    Search for words with q not followed by u
    (?!u)          Zero-width negative lookahead assertion.
                   Succeeds when u does not exist. Iraq matches.
(?<![a-z ])\w{7}    7 alphanumerics not preceded by a letter or space
    (?<![a-z ])    Zero-width negative lookbehind assertion
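The Iraq example is worth running, since the two q patterns look almost identical. A short sketch, assuming Python's re module; the word list and the seven-character sample are invented.

```python
import re

with_class = re.compile(r"\b\w*q[^u]\w*\b")
with_lookahead = re.compile(r"\b\w*q(?!u)\w*\b", flags=re.IGNORECASE)

# [^u] must consume a character, and nothing follows the q here.
class_on_iraq = with_class.search("Iraq")

# (?!u) only checks the position, so a q at the end of the word is fine.
q_words = with_lookahead.findall("Iraq quit Qatar")

# Seven word characters not preceded by a lowercase letter or a space.
sevens = re.findall(r"(?<![a-z ])\w{7}", "-abcdefg abcdefg")

print(class_on_iraq)   # None
print(q_words)         # ['Iraq', 'Qatar']
print(sevens)          # ['abcdefg']
```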
19. Greedy and Lazy
By default, regular expressions are greedy. This means that when a
quantifier can accept a range of repetitions, as many characters as
possible will be matched.
a.*b
The longest string starting with a and ending with b
An input of aabab will match the entire string.
Quantifiers can be made lazy by adding a question mark.
a.*?b
The shortest string starting with an a and ending with a b
An input of aabab will match aab and then ab.
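The aabab example can be reproduced directly. This sketch assumes Python's re module, which uses the same ? suffix to make a quantifier lazy.

```python
import re

# Greedy: .* grabs as much as it can, then backs off to the last b.
greedy = re.findall(r"a.*b", "aabab")

# Lazy: .*? grabs as little as it can, stopping at the first b each time.
lazy = re.findall(r"a.*?b", "aabab")

print(greedy)   # ['aabab']
print(lazy)     # ['aab', 'ab']
```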
20. Greedy and Lazy, continued
*?        Repeat any number of times, but as few as possible.
+?        Repeat one or more times, but as few as possible.
??        Repeat zero or one time, but as few as possible.
{n,m}?    Repeat at least n, but no more than m, as few as possible.
{n,}?     Repeat at least n times, but as few as possible.