Regular Expressions
      Redux
Scope

 medium to advanced
 30 minutes
 performance / backtracking irrelevant
 no compatibility charts (yet)
TOC

 basic matching, quantifiers
 character classes, types, properties, anchors
 groups, options, replace string
 look-ahead/behind
 subexpressions
RE overview

              match foo           replace with bar
Perl          /foo/ (on $_)       s/foo/bar/ (on $_)
Javascript    /foo/               "foolish".replace(/foo/, "bar")
Vi            /foo/               :s/foo/bar/
TextMate      Cmd-F, Find: foo    Cmd-F, Find: foo, Replace: bar
Quantifiers
 classic greedy: ?, *, +
 specific: {1,5}, {,5}
     ? == {0,1}

     * == {0,}

     + == {1,}

 non-greedy: ??, *?, +?, {5,7}?
Example
This reveals that plain text is in fact the
technical user's way to regard a file or a
sequence of bytes. In this sense, there is no
plain text.

              /reveal(.*)plain/
             /reveal(.*?)plain/
                  /t.{2,3}t/
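
The three patterns on this slide can be tried out directly. The following is a minimal JavaScript sketch; it assumes the sample text is a single line (the slide only wraps it for display), and the variable names are made up for illustration.

```javascript
// Sample text from the slide, joined into one line (an assumption: the
// slide wraps it, and `.` would not cross real newlines).
const text =
  "This reveals that plain text is in fact the technical user's way to " +
  "regard a file or a sequence of bytes. In this sense, there is no plain text.";

// Greedy: (.*) runs as far as it can, so it stops before the LAST "plain".
const greedy = text.match(/reveal(.*)plain/)[1];

// Non-greedy: (.*?) gives up as early as it can, so it stops before the
// FIRST "plain".
const lazy = text.match(/reveal(.*?)plain/)[1];

// Counted quantifier: "t", then two or three of any character, then "t".
const counted = text.match(/t.{2,3}t/g);

console.log(JSON.stringify(lazy));        // "s that "
console.log(greedy.length > lazy.length); // true
console.log(counted);                     // [ 'that', 'text', 'the t', 'text' ]
```

Note that `{2,3}` is greedy as well: "the t" is found because three characters are tried before two.
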
Character Classes /
      Properties
 [0-9a-z]       (classes)
     \+420[0-9]{9} = simplified Czech phone nr.

     don't: [A-z0-]

 [a-z&&[^j-n]] == [a-io-z]
 \p{Upper} (properties)
     works great on Unicode text (Latin, Katakana)

 [:alnum:], [:^space:] (POSIX bracket)
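
A few of these points can be checked in JavaScript. This is only a sketch under some assumptions: JavaScript spells the uppercase property `\p{Lu}` rather than `\p{Upper}` and needs the `u` flag for it, and the `&&` intersection and POSIX bracket syntax from the slide are left out because plain JavaScript classes do not offer them. The variable name is illustrative.

```javascript
// Simplified Czech phone number: literal "+420" followed by nine digits.
const czechPhone = /^\+420[0-9]{9}$/;
console.log(czechPhone.test("+420123456789")); // true
console.log(czechPhone.test("420123456789"));  // false, the "+" is required

// Why [A-z] is a trap: the range also covers the six ASCII characters
// that sit between "Z" and "a", namely [ \ ] ^ _ `
console.log(/^[A-z]$/.test("_"));    // true, probably not what was meant
console.log(/^[A-Za-z]$/.test("_")); // false

// Unicode properties: \p{Lu} is the general category "uppercase letter",
// \p{Script=Katakana} selects a script. Both need the `u` flag.
console.log(/^\p{Lu}$/u.test("Ž"));                     // true
console.log(/^\p{Script=Katakana}+$/u.test("カタカナ")); // true
```
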
Character Types
 . == anything (apart from newline)
 \s == space == [\t\n\v\f\r ]
     more in unicode

 \w == word char == approx. [0-9a-zA-Z_]
     is complicated in unicode

 \d == digit == [0-9]
     \h == hexadecimal digit == [0-9a-fA-F] (Oniguruma/TextMate; in Perl \h is horizontal whitespace)

 \S \W \D == [^\s] [^\w] [^\d]
Example
This reveals that plain text is in fact the
technical user's way to regard a file or a
sequence of bytes. In this sense, there is no
plain text.

           /\b[\w&&[^aA]]+\b/
              /\W{2,}\w+\b/
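
Both patterns can be approximated in JavaScript. Since the `&&` intersection is not available in ordinary JavaScript classes, this sketch substitutes the equivalent negated class `[^\WaA]`; that rewrite, the single-line sample text, and the variable names are assumptions of the sketch rather than part of the slide.

```javascript
const text =
  "This reveals that plain text is in fact the technical user's way to " +
  "regard a file or a sequence of bytes. In this sense, there is no plain text.";

// The slide's [\w&&[^aA]] (word characters except a/A) is written here as
// [^\WaA]: "neither a non-word character nor a nor A".
const wordsWithoutA = text.match(/\b[^\WaA]+\b/g);

// Two or more non-word characters followed by a whole word.
const afterPunctuation = text.match(/\W{2,}\w+\b/g);

console.log(wordsWithoutA.slice(0, 5)); // [ 'This', 'text', 'is', 'in', 'the' ]
console.log(afterPunctuation);          // [ '. In', ', there' ]
```

The apostrophe in "user's" is a non-word character, so "user" and "s" come out as two separate words.
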
Anchors

 ^ - beginning (line, string)
 $ - end (line, string)
 \b - word boundary ~ \w\W (almost)
     \b.{5}\b != \W\w{5}\W

 zero width!
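
The "zero width" point is easiest to see by comparing what ends up in the match. A small JavaScript sketch; it uses `\w{5}` on both sides so that only the boundary handling differs, and the names are illustrative.

```javascript
// \b is zero width: the match contains only the five letters.
const withBoundary = " hello ".match(/\b\w{5}\b/)[0];

// \W consumes one character on each side, so both end up in the match.
const withNonWord = " hello ".match(/\W\w{5}\W/)[0];

console.log(JSON.stringify(withBoundary)); // "hello"
console.log(JSON.stringify(withNonWord));  // " hello "

// At the edges of the string there is no character for \W to consume,
// but \b still matches there.
console.log(/\b\w{5}\b/.test("hello")); // true
console.log(/\W\w{5}\W/.test("hello")); // false

// ^ and $ are zero width too: replacing them inserts text, removes nothing.
console.log("abc".replace(/^/, ">").replace(/$/, "<")); // ">abc<"
```
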
Options
 /foo/imsx
     i - case insensitive

     m - multiline (^, $ match at the start/end of every line, not only of the string/file)

     s - single line (. matches newlines)

     x - extended! (whitespace and # comments are ignored)

     g - global

 can be written inline
     (?imsx-imsx)

     (?imsx-imsx:...)

 example:
     (?x-i)
         #this is cool
     (
         foo #my important value
       | #don't forget the alternative
         bar
     ) # result equals to (foo|bar)
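
The effect of each flag can be seen by running the same pattern with and without it. A JavaScript sketch covering i, m, s and g; the x flag is left out because JavaScript does not have it, and the sample strings are made up.

```javascript
const lines = "foo\nBar";

// i: case insensitive
console.log(/bar/.test(lines));  // false
console.log(/bar/i.test(lines)); // true

// m: ^ and $ also match at line breaks inside the string
console.log(/^Bar$/.test(lines));  // false
console.log(/^Bar$/m.test(lines)); // true

// s: . also matches newlines
console.log(/foo.Bar/.test(lines));  // false
console.log(/foo.Bar/s.test(lines)); // true

// g: replace every match instead of only the first one
console.log("a-b-c".replace(/-/, "+"));  // "a+b-c"
console.log("a-b-c".replace(/-/g, "+")); // "a+b+c"
```
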
Groups/Replacing
 (...) - capturing group
 $1 - $9
     alternatively \1 - \9 (not recommended)

 nested groups ordered by left bracket
 (?:...) - non-capturing group
     useful for (?:foo)+ or (?:foo|bar)
Example
"foobarman".replace(
  /(?:f)((o)+)(bar)|(baz|man)/g,
  '$1, $2, $3, $4, $5')

     foobar                        man
          1 -- oo                       1 --
          2 -- o                        2 --
          3 -- bar                      3 --
          4 --                          4 -- man
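
The group numbering above can be reproduced in JavaScript. This sketch uses a slightly different replacement string than the slide, `[$1|$2|$3|$4]`, purely so that the empty groups are easy to see; the variable names are illustrative.

```javascript
const re = /(?:f)((o)+)(bar)|(baz|man)/g;

// Groups are numbered by their opening bracket; (?:f) gets no number.
// A group that did not take part in a match is reported as undefined.
const groups = [..."foobarman".matchAll(re)].map((m) => m.slice(1));
console.log(groups);

// In a replacement string, such a group expands to the empty string.
const replaced = "foobarman".replace(re, "[$1|$2|$3|$4]");
console.log(replaced); // "[oo|o|bar|][|||man]"
```

Group 2 sits inside a repeated group, so it keeps only the last "o" it matched.
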
Look-ahead/behind
 defines custom zero-width anchors
                   positive   negative

          ahead     (?=...)   (?!...)

          behind   (?<=...)   (?<!...)
Example

zdenek@gooddata.com
   /.*?@gooddata/


zdenek@gooddata.com
 /.*?(?=@gooddata)/
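
The difference between the two patterns is what ends up inside the match. A JavaScript sketch; the look-behind and negative look-ahead lines are extra illustrations that are not on the slide, and the names are made up.

```javascript
const email = "zdenek@gooddata.com";

// Without look-ahead, "@gooddata" is consumed and becomes part of the match.
const consumed = email.match(/.*?@gooddata/)[0];

// With look-ahead, "@gooddata" is only checked, not consumed.
const lookahead = email.match(/.*?(?=@gooddata)/)[0];

// Look-behind works the same way on the left side of the match.
const lookbehind = email.match(/(?<=@)\w+/)[0];

// Negative look-ahead: "gooddata" not followed by ".org".
const notOrg = /gooddata(?!\.org)/.test(email);

console.log(consumed);   // "zdenek@gooddata"
console.log(lookahead);  // "zdenek"
console.log(lookbehind); // "gooddata"
console.log(notOrg);     // true
```
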
Recursive RE

 very important!
    quote & bracket matching

    technically not part of regular grammar

 two styles
    \g<name> or \g<n> - TextMate

    (?R) - Perl
Example
(?x:
  \(                  # match the initial opening parenthesis

  # Now make a named group 'balanced' which
  # matches a balanced substring.
  (?<balanced>
    [^()]             # A balanced substring is either something
                      # that is not a parenthesis:
  |                   # or a parenthesised string:
    \(                # A parenthesised string begins with an opening parenthesis
      \g<balanced>*   # followed by a sequence of balanced substrings
    \)                # and ends with a closing parenthesis
  )*                  # Look for a sequence of balanced substrings

  \)                  # Finally, the outer closing parenthesis
)

or: \(([^()]|(?R))*\)
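
JavaScript's RegExp has neither `\g<name>` nor `(?R)`, so the recursive patterns above cannot be run there as written. For comparison, the sketch below finds the same kind of balanced substring with a plain depth counter instead of a regular expression; it is a swapped-in technique, and `firstBalanced` is a hypothetical helper name.

```javascript
// Hypothetical helper: returns the first balanced "(...)" substring of
// `text`, or null if there is none. Uses a depth counter, not recursion.
function firstBalanced(text) {
  // Try each "(" in turn as a possible starting point, as a regex engine
  // would try each position in the string.
  for (
    let start = text.indexOf("(");
    start !== -1;
    start = text.indexOf("(", start + 1)
  ) {
    let depth = 0;
    for (let i = start; i < text.length; i++) {
      if (text[i] === "(") depth++;
      else if (text[i] === ")") depth--;
      // Depth back at zero: the parenthesis opened at `start` is closed.
      if (depth === 0) return text.slice(start, i + 1);
    }
  }
  return null;
}

console.log(firstBalanced("f(a(b)c)d")); // "(a(b)c)"
console.log(firstBalanced("((a)"));      // "(a)"
console.log(firstBalanced("no parens")); // null
```
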

Editor's Notes

  1. escaping???
  2. examples! possessive (?+, *+, ++)
  3. unicode compat table!
  4. notice the space at the end, capital reverses
  5. how about /g??