REGEX: The Black Magic Of Programming
Submitted By
Brij Raj Kishore
Motivation For REGEX
 System Administrators
 Developers
 QA Engineers
 Support Engineers
 User
 ^[A-Za-z][A-Za-z0-9_]*$ for identifiers
 Pseudo-code
     state = FIRSTCHAR
     for char in all_chars_in(string):
         if state == FIRSTCHAR:
             if char is not in the set "A-Z" or "a-z":
                 error "Invalid first character"
             state = SUBSEQUENTCHARS
             next char
         if state == SUBSEQUENTCHARS:
             if char is not in the set "A-Z" or "a-z" or "0-9" or "_":
                 error "Invalid subsequent character"
             state = SUBSEQUENTCHARS
             next char
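The pseudo-code and the one-line pattern can be compared side by side. A minimal Python sketch, assuming Python's re module as the engine; the function names is_identifier_manual and is_identifier_regex are hypothetical, chosen only for illustration:

```python
import re
import string

# The identifier pattern from the slide.
IDENTIFIER = re.compile(r"^[A-Za-z][A-Za-z0-9_]*$")

FIRST_CHARS = string.ascii_letters
SUBSEQUENT_CHARS = string.ascii_letters + string.digits + "_"


def is_identifier_manual(text):
    """Hand-written state machine, following the pseudo-code."""
    if not text:
        return False
    state = "FIRSTCHAR"
    for char in text:
        if state == "FIRSTCHAR":
            if char not in FIRST_CHARS:
                return False  # invalid first character
            state = "SUBSEQUENTCHARS"
        elif char not in SUBSEQUENT_CHARS:
            return False  # invalid subsequent character
    return True


def is_identifier_regex(text):
    """Same check with the regex; fullmatch also rejects a trailing newline."""
    return IDENTIFIER.fullmatch(text) is not None
```

Both functions are meant to agree on every input; the regex version states the same rule in one line.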
What is regex
 Called by different names:
 Regular Expression, Regexp, Regex, ASCII puke.
 /^Reg(exp?|ular expression)$/i
 Symbols representing a text pattern.
 Formal language interpreted by a regular expression processor.
 Used for matching, searching, and replacing text.
 Not a programming language.
 Formal definition: Type 3 grammar in the Chomsky classification of languages.
 Recognized by a finite state automaton (FSA).
Type 3 Grammar
 Languages defined by Type-3 grammars are accepted by finite state automata
 Rules are of the form:
 A → ε
 A → α
 A → αB
 where
 A, B ∈ N (non-terminals) and α ∈ Σ (terminals)
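The three rule forms can be simulated directly as a finite automaton. A minimal sketch, assuming a hypothetical grammar S → 0S | 1S | 1 (binary strings ending in 1) and a hypothetical helper named accepts; neither comes from the slides:

```python
import re

# Hypothetical right-linear grammar: S -> 0S | 1S | 1
# Each rule is (head, terminal, tail).
#   A -> aB  is (A, "a", "B")
#   A -> a   is (A, "a", None)
#   A -> ε   is (A, "", None)
RULES = [
    ("S", "0", "S"),
    ("S", "1", "S"),
    ("S", "1", None),
]


def accepts(rules, start, text):
    """Run the grammar as a nondeterministic finite automaton."""
    states = {start}  # non-terminals that may still be expanded
    for char in text:
        states = {
            tail
            for head, terminal, tail in rules
            if head in states and terminal == char
        }
    # None means a derivation ended with A -> a on the last character;
    # otherwise some live non-terminal must be erasable through A -> ε.
    return None in states or any(
        head in states and terminal == ""
        for head, terminal, _ in rules
    )


# The same language written as a regular expression.
EQUIVALENT_REGEX = re.compile(r"[01]*1")
```

For example, accepts(RULES, "S", "0101") and EQUIVALENT_REGEX.fullmatch("0101") both succeed, while "0110" is rejected by both.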
Examples
 List of all PDF files: abc.pdf, 123.pdf, _12abc.pdf
 /^\w+\.pdf$/
 A password of 6+ characters with at least one number, one letter and one symbol, e.g. abc@1236
 /^(?=.*\d)(?=.*[a-z])(?=.*[\W_]).{6,}$/i
 Email format such as Aaavv_ahchj@bbb.cccc
 /^[a-zA-Z0-9_]+@[a-zA-Z0-9]+\.[a-zA-Z]{2,}$/
 IPv4 validation
 Format is 0.0.0.0 to 255.255.255.255
 ^((25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)$
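The four example patterns can be tried out as follows. A minimal sketch, assuming Python's re module (the /…/i flag becomes re.IGNORECASE); the constant names and the is_match helper are hypothetical:

```python
import re

PDF = re.compile(r"^\w+\.pdf$")
PASSWORD = re.compile(r"^(?=.*\d)(?=.*[a-z])(?=.*[\W_]).{6,}$", re.IGNORECASE)
EMAIL = re.compile(r"^[a-zA-Z0-9_]+@[a-zA-Z0-9]+\.[a-zA-Z]{2,}$")
IPV4 = re.compile(
    r"^((25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}"
    r"(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)$"
)


def is_match(pattern, text):
    """True when the anchored pattern matches the text."""
    return pattern.search(text) is not None


if __name__ == "__main__":
    for name in ("abc.pdf", "123.pdf", "_12abc.pdf", "abcXpdf"):
        print(name, is_match(PDF, name))
    print("abc@1236", is_match(PASSWORD, "abc@1236"))
    print("Aaavv_ahchj@bbb.cccc", is_match(EMAIL, "Aaavv_ahchj@bbb.cccc"))
    print("256.1.1.1", is_match(IPV4, "256.1.1.1"))
```

The escaped dot matters: with an unescaped ".", a name such as abcXpdf would also be accepted as a PDF file name.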
Regular expression engines
 C/C++
 Java
 JavaScript/ActionScript (ECMAScript)
 .NET
 Perl
 PHP (PCRE)
 Python
 Ruby
 Unix (POSIX BRE, POSIX ERE)
 Apache (v1: POSIX ERE, v2: PCRE)
 MySQL etc.
Syntax Of Regular Expressions
 g/Regular expression/p (global / regular expression / print: the ed command that grep is named after)
 Matching is case-sensitive by default unless the i flag is specified.
 Regular expression symbols include:
 Character
 Symbols
 Non-characters
 Metacharacters
 Some Rules
Symbols and their meanings
 abc..       Letters
 123         Digits
 \d          Any digit
 \D          Any non-digit character
 .           Any character
 \.          Period
 [abc]       Only a, b or c
 [^abc]      Not a, b nor c
 [0-9]       Digits 0 to 9
 \w          Any alphanumeric character (or underscore)
 \b          Boundary between a word and a non-word character
 \B          Position between word & word or non-word & non-word characters (not a boundary)
 \W          Any non-alphanumeric character
 {m}         m repetitions
 {m,n}       m to n repetitions
 *           Zero or more repetitions
 +           One or more repetitions
 ?           Optional character
 \s          Any whitespace
 \S          Any non-whitespace character
 ^..$        Start and end
 (.)         Capture group
 (a(bc))     Capture sub-group
 (.*)        Capture all
 (abc|def)   Matches abc or def
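A few of the symbols above in action. A minimal sketch, assuming Python's re module; the sample strings and variable names are made up for illustration:

```python
import re

text = "Order 66 shipped to bay 7 on 2024-01-15"

# \d+ : runs of digits
digits = re.findall(r"\d+", text)

# \b marks a word boundary, \B a position that is not one
whole_words = re.findall(r"\bcat\b", "cat concat cats cat.")
inside_words = re.findall(r"\Bcat", "cat concat")

# [^abc] : any character except a, b or c
others = re.findall(r"[^abc]", "aXbYc")

# {m,n} : between m and n repetitions
two_or_three = [
    s for s in ("a", "aa", "aaa", "aaaa") if re.fullmatch(r"a{2,3}", s)
]

# (a(bc)) : groups are numbered by the position of their opening parenthesis
outer, inner = re.match(r"(a(bc))", "abc").groups()

# (abc|def) : alternation inside a capture group
choice = re.fullmatch(r"(abc|def)", "def").group(1)

# \S+ : runs of non-whitespace, separated by \s characters
words = re.findall(r"\S+", "tab\tand  spaces")
```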
Some Key Points
 ., .*, .+, .?, .{1,2} - Arbitrary characters and repetitions
 ^, $ - Start and end of subject (or line in multiline mode)
 foo|bar - Logical OR
 (foo)(bar) - Subpattern grouping
 /(foo|bar)baz\1/ - Backreferences
 [a-zA-Z] - Character classes
 Negative and positive assertions are possible: a(?!x), a(?=x)
 foo not followed by another foo: /foo(?!foo)/
 foo followed by another foo: /foo(?=foo)/
 If condition - /(?(condition)yes-pattern|no-pattern)/
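Backreferences, lookahead assertions and the conditional can be checked with a short script. A minimal sketch, assuming Python's re module, where the condition of a conditional must be a group number or name; the angle-bracket pattern and the starts helper are hypothetical examples:

```python
import re

# \1 must repeat whatever group 1 captured.
backref = re.compile(r"(foo|bar)baz\1")

# Lookaheads test what follows without consuming it.
not_followed = re.compile(r"foo(?!foo)")
followed = re.compile(r"foo(?=foo)")

# Conditional: require a closing ">" only if an opening "<" was captured.
conditional = re.compile(r"^(<)?\w+(?(1)>)$")


def starts(pattern, text):
    """Start offsets of every non-overlapping match."""
    return [m.start() for m in pattern.finditer(text)]


if __name__ == "__main__":
    print(bool(backref.fullmatch("foobazfoo")))
    print(bool(backref.fullmatch("foobazbar")))
    print(starts(not_followed, "foofoobar"))
    print(starts(followed, "foofoobar"))
    print([bool(conditional.match(s)) for s in ("<tag>", "tag", "<tag", "tag>")])
```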
Application Areas
 Validation Framework
 Pattern Matching
 Translation Program
 Digital Circuits
 Protocols
 IPv6 validation
 (?:^|(?<=\s))(([0-9a-fA-F]{1,4}:){7,7}[0-9a-fA-F]{1,4}|([0-9a-fA-F]{1,4}:){1,7}:|([0-9a-fA-F]{1,4}:){1,6}:[0-9a-fA-F]{1,4}|([0-9a-fA-F]{1,4}:){1,5}(:[0-9a-fA-F]{1,4}){1,2}|([0-9a-fA-F]{1,4}:){1,4}(:[0-9a-fA-F]{1,4}){1,3}|([0-9a-fA-F]{1,4}:){1,3}(:[0-9a-fA-F]{1,4}){1,4}|([0-9a-fA-F]{1,4}:){1,2}(:[0-9a-fA-F]{1,4}){1,5}|[0-9a-fA-F]{1,4}:((:[0-9a-fA-F]{1,4}){1,6})|:((:[0-9a-fA-F]{1,4}){1,7}|:)|fe80:(:[0-9a-fA-F]{0,4}){0,4}%[0-9a-zA-Z]{1,}|::(ffff(:0{1,4}){0,1}:){0,1}((25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9])\.){3,3}(25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9])|([0-9a-fA-F]{1,4}:){1,4}:((25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9])\.){3,3}(25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9]))(?=\s|$)
 Sometimes you need to know where to stop
 Allow some false positives rather than some false negatives
 Shorthand IPv6 regex (covers the compressed "::" form)
 ^((?:[0-9A-Fa-f]{1,4}(?::[0-9A-Fa-f]{1,4})*)?)::((?:[0-9A-Fa-f]{1,4}(?::[0-9A-Fa-f]{1,4})*)?)$
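What the shorthand pattern trades away can be seen by comparing it with a full parser. A minimal sketch, assuming Python and using the standard library's ipaddress module as the reference check (a swapped-in technique, not something the slides use); the function names are hypothetical:

```python
import ipaddress
import re

# The shorthand pattern from the slide, unchanged.
SHORT_IPV6 = re.compile(
    r"^((?:[0-9A-Fa-f]{1,4}(?::[0-9A-Fa-f]{1,4})*)?)::"
    r"((?:[0-9A-Fa-f]{1,4}(?::[0-9A-Fa-f]{1,4})*)?)$"
)


def regex_says_ipv6(text):
    """Cheap structural check with the shorthand regex."""
    return SHORT_IPV6.match(text) is not None


def parser_says_ipv6(text):
    """Reference check: let the standard library parse the address."""
    try:
        ipaddress.IPv6Address(text)
        return True
    except ValueError:
        return False


if __name__ == "__main__":
    for candidate in ("::1", "fe80::1", "1:2:3:4:5:6:7:8::9", "1:2:3:4:5:6:7:8"):
        print(candidate, regex_says_ipv6(candidate), parser_says_ipv6(candidate))
```

In this sketch the shorthand accepts 1:2:3:4:5:6:7:8::9 (too many groups, a false positive) and, because it requires "::", does not match the uncompressed 1:2:3:4:5:6:7:8. When exact validation matters, a dedicated parser may be the better place to stop.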
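For Performance
 Avoid greedy (especially nested) quantifiers: /(x+x)+y/ tested against xxxxxxxxxxxxxxxxxxxx has no y to find, so the engine tries every way of splitting the x's
 Don't forget anchors (^ and $)
 Be as specific as possible
 Prefer non-capturing groups (?:...)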
 Minimize backtracking
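One way to minimize backtracking is to rewrite a nested quantifier as a flat one. A minimal sketch, assuming Python's backtracking re engine; the rewrite xx+y is a hypothetical alternative, not taken from the slides, and the samples are deliberately short:

```python
import re

NESTED = re.compile(r"(x+x)+y")  # pattern from the slide
FLAT = re.compile(r"xx+y")       # hypothetical rewrite without nesting


def same_result(text):
    """True when both patterns find the same match span (or both find none)."""
    a, b = NESTED.search(text), FLAT.search(text)
    return (a.span() if a else None) == (b.span() if b else None)


# Kept short on purpose: on a failing input such as "x" * 20, the nested
# pattern may explore a number of splits that grows exponentially with
# the length of the input.
samples = ["x" * n + tail for n in range(8) for tail in ("", "y")]

if __name__ == "__main__":
    print(all(same_result(s) for s in samples))
```

Both patterns describe two or more x's followed by y, so the flat version should be a drop-in replacement here while leaving the engine only one way to split the run of x's.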
References
 Websites
 https://regexone.com/
 https://github.com
 http://stackoverflow.com/questions/5284147/validating-ipv4-addresses-with-regexp
 https://www.youtube.com/watch?v=EkluES9Rvak
 https://www.youtube.com/watch?v=Ju8EybDDjBk
 Regular Expression (Lynda)
 Books:
 Mastering Regular Expressions, 3rd Edition (O'Reilly, Aug 2006) - Jeffrey E.F. Friedl
 Regular Expression Pocket Reference, 2nd Edition - O'Reilly Media
Thank You!
Questions??
