The document discusses regular expressions (regex), which are symbols used to match patterns in text. It provides examples of regex for identifiers, passwords, emails, IPv4 addresses, and IPv6 addresses. It also covers the basics of regex syntax, including characters, quantifiers, anchors, character classes, groups and backreferences. The document notes that regex are used for text matching, searching and replacing. It lists several programming languages and tools that support regex and provides areas where regex are commonly applied, such as validation and pattern matching.
2. Motivation For REGEX
System Administrators
Developers
QA Engineers
Support Engineers
Users
^[A-Za-z][A-Za-z0-9_]*$ for identifiers
Pseudo-code
state = FIRSTCHAR
for char in all_chars_in(string):
    if state == FIRSTCHAR:
        if char is not in the set "A-Z" or "a-z":
            error "Invalid first character"
        state = SUBSEQUENTCHARS
        next char
    if state == SUBSEQUENTCHARS:
        if char is not in the set "A-Z" or "a-z" or "0-9" or "_":
            error "Invalid subsequent character"
        next char
if state == FIRSTCHAR:
    error "Empty string"
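For comparison with the pseudo-code above, here is a minimal Python sketch of the same check using the identifier pattern from the slide. The function name `is_identifier` is an assumption made for illustration.

```python
import re

# Pattern taken from the slide: one letter, then letters, digits or underscores.
IDENTIFIER = re.compile(r"^[A-Za-z][A-Za-z0-9_]*$")


def is_identifier(text: str) -> bool:
    """Return True if text is a valid identifier under the slide's rule."""
    # fullmatch is used because in Python "$" also matches just before a
    # trailing newline; fullmatch requires the whole string to be consumed.
    return IDENTIFIER.fullmatch(text) is not None


if __name__ == "__main__":
    for candidate in ["total_1", "x", "1abc", "_tmp", ""]:
        print(repr(candidate), is_identifier(candidate))
```

The whole state machine collapses into one pattern, which is the motivation the slide is pointing at.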
3. What is regex
Called by different names :-
Regular Expression, Regexp, Regex, ASCII puke.
/^Reg(exp?|ular expression)$/i
Symbols representing a text pattern.
Formal language interpreted by a regular expression processor.
Used for matching, searching, and replacing text.
Not a programming language.
Formal definition: a Type 3 grammar in the Chomsky classification of languages.
Recognized by a finite state automaton (FSA).
4. Type 3 Grammar
Languages defined by Type-3 grammars
are accepted by finite state automata
Rules are of the form:
A → ε
A → α
A → αB
where
A, B ∈ N and α ∈ Σ
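To make the link between these rule forms and a finite state automaton concrete, the sketch below walks a small right-linear grammar one input character at a time. The grammar (S → aB | ε, B → bS, generating the language (ab)*), the rule table layout, and the names `RULES`, `FINAL` and `accepts` are all assumptions chosen for illustration.

```python
# Each nonterminal maps to a list of (terminal, next_nonterminal) pairs:
#   ("a", "B")  encodes A -> aB
#   ("a", None) encodes A -> a   (terminal only)
#   ("", None)  encodes A -> epsilon
RULES = {
    "S": [("a", "B"), ("", None)],  # S -> aB | epsilon
    "B": [("b", "S")],              # B -> bS
}
START = "S"
FINAL = "<final>"  # hypothetical sink state reached by terminal-only rules


def step(state, ch):
    """Yield every state reachable from `state` by consuming `ch`."""
    for symbol, nxt in RULES.get(state, []):
        if symbol == ch:
            yield nxt if nxt is not None else FINAL


def accepts(text):
    """Simulate the grammar as a nondeterministic finite automaton."""
    states = {START}
    for ch in text:
        states = {nxt for s in states for nxt in step(s, ch)}
    # Accept if some current state is final or can derive the empty string.
    return any(s == FINAL or ("", None) in RULES.get(s, []) for s in states)


if __name__ == "__main__":
    for sample in ["", "ab", "abab", "a", "ba"]:
        print(repr(sample), accepts(sample))
```

Each nonterminal plays the role of an automaton state, which is why no stack or extra memory is needed to recognize such a language.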
5. Examples
List of all PDF files such as abc.pdf, 123.pdf, _12abc.pdf
/^\w+\.pdf$/
A password of 6+ characters with at least one number, one letter and one symbol -: abc@1236
/^(?=.*\d)(?=.*[a-z])(?=.*[\W_]).{6,}$/i
Email format such as -: Aaavv_ahchj@bbb.cccc
/^[a-zA-Z0-9_]+@[a-zA-Z0-9]+\.[a-zA-Z]{2,}$/
IPv4 validation
Format is 0.0.0.0 to 255.255.255.255
^((25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)$
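The four example patterns can be tried out with Python's `re` module, as in the sketch below. The patterns are written with their backslash escapes (`\w`, `\d`, `\W`, `\.`); the constant names and the `check` helper are assumptions for illustration, and these patterns are simplified validators rather than complete ones.

```python
import re

PDF = re.compile(r"^\w+\.pdf$")
PASSWORD = re.compile(r"^(?=.*\d)(?=.*[a-z])(?=.*[\W_]).{6,}$", re.IGNORECASE)
EMAIL = re.compile(r"^[a-zA-Z0-9_]+@[a-zA-Z0-9]+\.[a-zA-Z]{2,}$")
IPV4 = re.compile(
    r"^((25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}"
    r"(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)$"
)


def check(pattern, text):
    """Return True if the pattern matches the text."""
    return pattern.search(text) is not None


if __name__ == "__main__":
    print(check(PDF, "_12abc.pdf"))             # file name from the slide
    print(check(PASSWORD, "abc@1236"))          # password from the slide
    print(check(EMAIL, "Aaavv_ahchj@bbb.cccc")) # address from the slide
    print(check(IPV4, "255.255.255.255"))
    print(check(IPV4, "256.0.0.1"))             # octet out of range
```

Without the escape, a bare `.` would match any character, so `abcXpdf` would wrongly pass the PDF check.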
6. Regular expression engines
C/C++
Java
JavaScript/ActionScript(ECMAScript)
.NET
Perl
PHP(PCRE)
Python
Ruby
Unix(POSIX BRE, POSIX ERE)
Apache (v1: POSIX ERE, v2: PCRE)
MySQL etc.
7. Syntax Of Regular expression
g/Regular expression/p
It is case sensitive by default unless specified by the i flag.
Regular expression symbols include :-
Character
Symbols
Non-characters
Metacharacters
Some Rules
8. Symbols and their meanings
abc...     Letters
123...     Digits
\d         Any digit
\D         Any non-digit character
.          Any character
\.         Period
[abc]      Only a, b or c
[^abc]     Not a, b nor c
[0-9]      Digits 0-9
\w         Any alphanumeric character (or underscore)
\b         Boundary between a word and a non-word character
\B         Not a word boundary (between word & word, or non-word & non-word)
9. Symbols and their meanings (continued)
\W         Any non-alphanumeric character
{m}        m repetitions
{m,n}      m to n repetitions
*          Zero or more repetitions
+          One or more repetitions
?          Optional character (zero or one)
\s         Any whitespace
\S         Any non-whitespace character
^...$      Start and end
(.)        Capture group
(a(bc))    Capture sub-group
(.*)       Capture all
(abc|def)  Matches abc or def
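A few of the symbols in the table can be seen in action with Python's `re` module. This is only a sketch: the `demo` function and the sample strings are made up for illustration.

```python
import re


def demo():
    """Return a dict of small results, one per symbol being illustrated."""
    return {
        # \d+ : runs of digits
        "digits": re.findall(r"\d+", "room 42, floor 7"),
        # [^abc] : any character except a, b or c
        "negated_class": re.findall(r"[^abc]", "abcd"),
        # \b : word boundary, so "cat" inside "concat" is skipped
        "word_boundary": re.findall(r"\bcat\b", "cat concat cat."),
        # ? : the preceding character is optional
        "optional": re.findall(r"colou?r", "color colour"),
        # {m,n} : between m and n repetitions
        "bounded": re.fullmatch(r"a{2,3}", "aaa") is not None,
        # \s+ : one or more whitespace characters
        "whitespace_split": re.split(r"\s+", "a  b\tc"),
        # (a(bc)) : outer group 1 and nested group 2
        "groups": re.match(r"(a(bc))", "abc").groups(),
    }


if __name__ == "__main__":
    for name, value in demo().items():
        print(name, value)
```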
10. Some Key Points
., .*, .+, .?, .{1,2} - Arbitrary characters and repetitions
^, $ - Start and end of subject (or line in multiline mode)
foo|bar - Logical OR
(foo)(bar) - Subpattern grouping
/(foo|bar)baz\1/ - Backreferences
[a-zA-Z] - Character classes.
Negative and positive lookahead assertions are possible -: a(?!x) , a(?=x)
foo not followed by another foo -: /foo(?!foo)/
foo followed by another foo -: /foo(?=foo)/
If condition - /(?(condition)yes-pattern|no-pattern)/
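The sketch below exercises backreferences, lookahead assertions and a conditional pattern in Python, whose `re` module supports all three. The conditional example (an optional `<` that, if present, requires a closing `>`) and the constant names are assumptions chosen for illustration, not patterns from the slides.

```python
import re

# \1 must repeat whatever group 1 actually matched.
BACKREF = re.compile(r"(foo|bar)baz\1")

# Lookaheads test what follows without consuming it.
NOT_FOLLOWED = re.compile(r"foo(?!foo)")
FOLLOWED = re.compile(r"foo(?=foo)")

# (?(1)yes|no): if group 1 took part in the match, require the yes-pattern.
# Here the no-pattern is left empty.
CONDITIONAL = re.compile(r"^(<)?\w+(?(1)>)$")


def matches(pattern, text):
    """Return True if the pattern is found anywhere in the text."""
    return pattern.search(text) is not None


if __name__ == "__main__":
    print(matches(BACKREF, "foobazfoo"), matches(BACKREF, "foobazbar"))
    # In "foofoo" only the second foo is not followed by another foo.
    print(NOT_FOLLOWED.search("foofoo").start())
    print(FOLLOWED.search("foofoo").start())
    print(matches(CONDITIONAL, "<tag>"), matches(CONDITIONAL, "<tag"))
```

Conditional patterns are not available in every engine, so portability is worth checking before relying on them.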
12. IPv6 validation
(?:^|(?<=\s))(([0-9a-fA-F]{1,4}:){7,7}[0-9a-fA-F]{1,4}|([0-9a-fA-F]{1,4}:){1,7}:|([0-9a-fA-F]{1,4}:){1,6}:[0-9a-fA-F]{1,4}|([0-9a-fA-F]{1,4}:){1,5}(:[0-9a-fA-F]{1,4}){1,2}|([0-9a-fA-F]{1,4}:){1,4}(:[0-9a-fA-F]{1,4}){1,3}|([0-9a-fA-F]{1,4}:){1,3}(:[0-9a-fA-F]{1,4}){1,4}|([0-9a-fA-F]{1,4}:){1,2}(:[0-9a-fA-F]{1,4}){1,5}|[0-9a-fA-F]{1,4}:((:[0-9a-fA-F]{1,4}){1,6})|:((:[0-9a-fA-F]{1,4}){1,7}|:)|fe80:(:[0-9a-fA-F]{0,4}){0,4}%[0-9a-zA-Z]{1,}|::(ffff(:0{1,4}){0,1}:){0,1}((25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9])\.){3,3}(25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9])|([0-9a-fA-F]{1,4}:){1,4}:((25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9])\.){3,3}(25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9]))(?=\s|$)
Sometimes you need to know where to stop
Allow some false positives rather than some false negatives
Shorthand IPv6 regex (covers only the compressed form that contains ::)
^((?:[0-9A-Fa-f]{1,4}(?::[0-9A-Fa-f]{1,4})*)?)::((?:[0-9A-Fa-f]{1,4}(?::[0-9A-Fa-f]{1,4})*)?)$
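The trade-off of the shorthand pattern can be checked against Python's standard `ipaddress` module, used here as a strict reference. This is a sketch; the names `SHORTHAND` and `strictly_valid` are assumptions for illustration.

```python
import ipaddress
import re

# Shorthand pattern from the slide: hex groups, "::", hex groups.
SHORTHAND = re.compile(
    r"^((?:[0-9A-Fa-f]{1,4}(?::[0-9A-Fa-f]{1,4})*)?)::"
    r"((?:[0-9A-Fa-f]{1,4}(?::[0-9A-Fa-f]{1,4})*)?)$"
)


def strictly_valid(text):
    """Return True if the standard library accepts text as an IPv6 address."""
    try:
        ipaddress.IPv6Address(text)
        return True
    except ValueError:  # AddressValueError is a subclass of ValueError
        return False


if __name__ == "__main__":
    samples = ["2001:db8::1", "::", "1:2:3:4:5:6:7:8", "1:2:3:4:5:6:7:8::9"]
    for sample in samples:
        print(sample, SHORTHAND.match(sample) is not None, strictly_valid(sample))
```

The last sample has too many groups, yet the shorthand pattern accepts it because it does not count groups; that is the kind of false positive the slide is willing to tolerate in exchange for a readable pattern. The full eight-group form is not matched at all, since the pattern requires `::`.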
13. For Performance
Avoid nested greedy quantifiers :- /(x+x)+y/ backtracks exponentially on xxxxxxxxxxxxxxxxxxxx (no y to match)
Don't forget anchors (^ and $)
Be as specific as possible
Prefer non-capturing groups (?:...)
Minimize backtracking
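The backtracking problem can be illustrated by timing the nested pattern against a flat rewrite. In the sketch below, `FLAT` is a hypothetical replacement that accepts the same strings (two or more x followed by y); the helper name `time_search` and the input sizes are assumptions, and the timings will vary by machine.

```python
import re
import time

NESTED = re.compile(r"(x+x)+y")  # pattern from the slide
FLAT = re.compile(r"xx+y")       # hypothetical rewrite, same language


def time_search(pattern, text):
    """Return (matched, seconds) for a single search of text."""
    start = time.perf_counter()
    matched = pattern.search(text) is not None
    return matched, time.perf_counter() - start


if __name__ == "__main__":
    # No trailing "y", so every way of splitting the x's must be tried
    # before the nested pattern can give up.
    for n in (10, 14, 18):
        text = "x" * n
        _, nested_time = time_search(NESTED, text)
        _, flat_time = time_search(FLAT, text)
        print(n, f"nested={nested_time:.4f}s", f"flat={flat_time:.6f}s")
```

The nested form gives the engine many ways to divide the same run of x characters between the inner and outer quantifiers, and a failing match forces it to try them all. The flat form has only one way to consume the run, so it fails quickly.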