ºÝºÝߣ

ºÝºÝߣShare a Scribd company logo
Perl 6 rules
Radek Kotowicz
2nd Polish Perl Workshop,
Pozna¨½, May 2014
Perl 5 & 6 regex complexity
Perl 5 & 6 regex complexity
What drives complexity of Perl 5 regex?
? Basic regular expressions (describing regular
languages) are easy to match: O(|T|) /
O(|r|*|T|)
? Backreferences:
¨C Exponential time O(2^|T|)
¨C NP-complete problem (3SAT reduction)
¨C implicate the use of backtracking algorithms
¡­
? Look-behind assertions add even more
complexity (input is not consumed): O(2|T|+|r|)
? Experimental features of code assertions
introduced undecidability:
a=~/<?{do 1 while 2;}>/
Apocalypse 5
Perl 5 REGEXes were:
? too compact and 'cute¡®
? had too much reliance on too few
metacharacters
? little support for named captures
? little support for grammars
? poor integration with the 'real' language
What¡¯s new in Perl 6 rules?
? Named captures (backported to 5.10)
? Simpler notation for non-capturing groups
? White spaces ignored unless backspaced
? Simplified code assertions (experimental in
5.18)
? New character classes and set operations on
char classes
Now the big things
? Rules are part of the ¡®real¡¯ language (same
lexer/parser)
? Support for Parsing Expression Grammars (PEGs)
? Backtracking/matching control
? Code assertions can fail or succeed unlike in Perl
5
$perl -e "'abrakadabra'=~/(ab)(?{0}).*(?{print 'OK'})/¡°
OK
$perl6 -e "'abrakadabra'~~/(ab)<?{0}>.*<?{say 'OK'}>/¡°
PEGs
? Similar to CFG but unambiguous in terms of parse
trees
? CFG rules explain how to produce words ¨C PEGs
explain how to parse them:
¨C CGF for {an bn : n >=1}:
? S -> 'a' S 'b¡®
? S -> ¦Å
¨C PEG
? S ¡û 'a' S? 'b'
? PEG: negative/positive look-ahead assertions
¡­
? With PEGs it¡¯s possible to describe some non-
context-free grammars {an bn cn : n >=1}
? Recognizing PEG expressions can be done in
linear time if there¡¯s enough memory
? Rakudo comes with a JSON parser
implementation written in Perl 6 rules!
? Naturaly, Perl6 grammar is also expressed in
PEG
Exploratory parsing
grammar BookGrammar {
rule TOP {<book>+}
regex book {<author><whitespace><title> | <title><whitespace><author>}
regex author { (<name>|<initial>) [<whitespace>[<name>|<initial>]]?
<whitespace> <surename> }
rule name {(<word>)<?{ %firstNames{$0}:exists}>}
rule surename {<word>}
rule title { '"'<word>[<whitespace><word>]*'"'}
token word { <alpha>+ }
token initial { <alpha>.? }
token whitespace {s+}
}
¡­
sub MAIN() {
my $match = BookGrammar.parse('Joseph Conrad "Lord Jim"');
say $match;
}
¡­
Joseph Conrad "Lord Jim""
book => "Joseph Conrad "Lord Jim""
author => "Joseph Conrad"
0 => "Joseph"
name => "Joseph"
0 => "Joseph"
word => "Joseph"
alpha => "J"
alpha => "o"
alpha => "s"
alpha => "e"
alpha => "p"
alpha => "h"
whitespace => " "
surename => "Conrad"
word => "Conrad"
alpha => "C"
alpha => "o"
alpha => "n"
alpha => "r"
alpha => "a"
alpha => "d"
whitespace => " "
title => ""Lord Jim""
word => "Lord"
alpha => "L"
alpha => "o"
alpha => "r"
alpha => "d"
whitespace => " "
word => "Jim"
alpha => "J"
alpha => "i"
alpha => "m"
use Grammar::Tracer
C:UsersI079489projectsplpw2014>c:rakudobinperl6.exe parse_books.pl
¡û[1mTOP¡û[0m
| ¡û[1mbook¡û[0m
| | ¡û[1mauthor¡û[0m
| | | ¡û[1mname¡û[0m
| | | | ¡û[1mword¡û[0m
| | | | | ¡û[1malpha¡û[0m
| | | | | * ¡û[37;42mMATCH¡û[0m¡û[37m "J"¡û[0m
| | | | | ¡û[1malpha¡û[0m
| | | | | * ¡û[37;42mMATCH¡û[0m¡û[37m "o"¡û[0m
| | | | | ¡û[1malpha¡û[0m
| | | | | * ¡û[37;42mMATCH¡û[0m¡û[37m "s"¡û[0m
| | | | | ¡û[1malpha¡û[0m
| | | | | * ¡û[37;42mMATCH¡û[0m¡û[37m "e"¡û[0m
| | | | | ¡û[1malpha¡û[0m
| | | | | * ¡û[37;42mMATCH¡û[0m¡û[37m "p"¡û[0m
| | | | | ¡û[1malpha¡û[0m
| | | | | * ¡û[37;42mMATCH¡û[0m¡û[37m "h"¡û[0m
| | | | | ¡û[1malpha¡û[0m
| | | | | * ¡û[37;41mFAIL¡û[0m
| | | | * ¡û[37;42mMATCH¡û[0m¡û[37m "Joseph"¡û[0m
| | | * ¡û[37;42mMATCH¡û[0m¡û[37m "Joseph"¡û[0m
| | | ¡û[1mwhitespace¡û[0m
| | | * ¡û[37;42mMATCH¡û[0m¡û[37m " "¡û[0m
| | | ¡û[1mname¡û[0m
| | | | ¡û[1mword¡û[0m
| | | | | ¡û[1malpha¡û[0m
| | | | | * ¡û[37;42mMATCH¡û[0m¡û[37m "C"¡û[0m
| | | | | ¡û[1malpha¡û[0m
| | | | | * ¡û[37;42mMATCH¡û[0m¡û[37m "o"¡û[0m
| | | | | ¡û[1malpha¡û[0m
| | | | | * ¡û[37;42mMATCH¡û[0m¡û[37m "n"¡û[0m
| | | | | ¡û[1malpha¡û[0m
| | | | | * ¡û[37;42mMATCH¡û[0m¡û[37m "r"¡û[0m
That¡¯s it
Thank you!

More Related Content

Perl 5 & 6 regex complexity

  • 1. Perl 6 rules Radek Kotowicz 2nd Polish Perl Workshop, Pozna¨½, May 2014
  • 4. What drives complexity of Perl 5 regex? ? Basic regular expressions (describing regular languages) are easy to match: O(|T|) / O(|r|*|T|) ? Backreferences: ¨C Exponential time O(2^|T|) ¨C NP-complete problem (3SAT reduction) ¨C implicate the use of backtracking algorithms
  • 5. ¡­ ? Look-behind assertions add even more complexity (input is not consumed): O(2|T|+|r|) ? Experimental features of code assertions introduced undecidability: a=~/<?{do 1 while 2;}>/
  • 6. Apocalypse 5 Perl 5 REGEXes were: ? too compact and 'cute¡® ? had too much reliance on too few metacharacters ? little support for named captures ? little support for grammars ? poor integration with the 'real' language
  • 7. What¡¯s new in Perl 6 rules? ? Named captures (backported to 5.10) ? Simpler notation for non-capturing groups ? White spaces ignored unless backspaced ? Simplified code assertions (experimental in 5.18) ? New character classes and set operations on char classes
  • 8. Now the big things ? Rules are part of the ¡®real¡¯ language (same lexer/parser) ? Support for Parsing Expression Grammars (PEGs) ? Backtracking/matching control ? Code assertions can fail or succeed unlike in Perl 5 $perl -e "'abrakadabra'=~/(ab)(?{0}).*(?{print 'OK'})/¡° OK $perl6 -e "'abrakadabra'~~/(ab)<?{0}>.*<?{say 'OK'}>/¡°
  • 9. PEGs ? Similar to CFG but unambiguous in terms of parse trees ? CFG rules explain how to produce words ¨C PEGs explain how to parse them: ¨C CGF for {an bn : n >=1}: ? S -> 'a' S 'b¡® ? S -> ¦Å ¨C PEG ? S ¡û 'a' S? 'b' ? PEG: negative/positive look-ahead assertions
  • 10. ¡­ ? With PEGs it¡¯s possible to describe some non- context-free grammars {an bn cn : n >=1} ? Recognizing PEG expressions can be done in linear time if there¡¯s enough memory ? Rakudo comes with a JSON parser implementation written in Perl 6 rules! ? Naturaly, Perl6 grammar is also expressed in PEG
  • 11. Exploratory parsing grammar BookGrammar { rule TOP {<book>+} regex book {<author><whitespace><title> | <title><whitespace><author>} regex author { (<name>|<initial>) [<whitespace>[<name>|<initial>]]? <whitespace> <surename> } rule name {(<word>)<?{ %firstNames{$0}:exists}>} rule surename {<word>} rule title { '"'<word>[<whitespace><word>]*'"'} token word { <alpha>+ } token initial { <alpha>.? } token whitespace {s+} }
  • 12. ¡­ sub MAIN() { my $match = BookGrammar.parse('Joseph Conrad "Lord Jim"'); say $match; }
  • 13. ¡­ Joseph Conrad "Lord Jim"" book => "Joseph Conrad "Lord Jim"" author => "Joseph Conrad" 0 => "Joseph" name => "Joseph" 0 => "Joseph" word => "Joseph" alpha => "J" alpha => "o" alpha => "s" alpha => "e" alpha => "p" alpha => "h" whitespace => " " surename => "Conrad" word => "Conrad" alpha => "C" alpha => "o" alpha => "n" alpha => "r" alpha => "a" alpha => "d" whitespace => " " title => ""Lord Jim"" word => "Lord" alpha => "L" alpha => "o" alpha => "r" alpha => "d" whitespace => " " word => "Jim" alpha => "J" alpha => "i" alpha => "m"
  • 14. use Grammar::Tracer C:UsersI079489projectsplpw2014>c:rakudobinperl6.exe parse_books.pl ¡û[1mTOP¡û[0m | ¡û[1mbook¡û[0m | | ¡û[1mauthor¡û[0m | | | ¡û[1mname¡û[0m | | | | ¡û[1mword¡û[0m | | | | | ¡û[1malpha¡û[0m | | | | | * ¡û[37;42mMATCH¡û[0m¡û[37m "J"¡û[0m | | | | | ¡û[1malpha¡û[0m | | | | | * ¡û[37;42mMATCH¡û[0m¡û[37m "o"¡û[0m | | | | | ¡û[1malpha¡û[0m | | | | | * ¡û[37;42mMATCH¡û[0m¡û[37m "s"¡û[0m | | | | | ¡û[1malpha¡û[0m | | | | | * ¡û[37;42mMATCH¡û[0m¡û[37m "e"¡û[0m | | | | | ¡û[1malpha¡û[0m | | | | | * ¡û[37;42mMATCH¡û[0m¡û[37m "p"¡û[0m | | | | | ¡û[1malpha¡û[0m | | | | | * ¡û[37;42mMATCH¡û[0m¡û[37m "h"¡û[0m | | | | | ¡û[1malpha¡û[0m | | | | | * ¡û[37;41mFAIL¡û[0m | | | | * ¡û[37;42mMATCH¡û[0m¡û[37m "Joseph"¡û[0m | | | * ¡û[37;42mMATCH¡û[0m¡û[37m "Joseph"¡û[0m | | | ¡û[1mwhitespace¡û[0m | | | * ¡û[37;42mMATCH¡û[0m¡û[37m " "¡û[0m | | | ¡û[1mname¡û[0m | | | | ¡û[1mword¡û[0m | | | | | ¡û[1malpha¡û[0m | | | | | * ¡û[37;42mMATCH¡û[0m¡û[37m "C"¡û[0m | | | | | ¡û[1malpha¡û[0m | | | | | * ¡û[37;42mMATCH¡û[0m¡û[37m "o"¡û[0m | | | | | ¡û[1malpha¡û[0m | | | | | * ¡û[37;42mMATCH¡û[0m¡û[37m "n"¡û[0m | | | | | ¡û[1malpha¡û[0m | | | | | * ¡û[37;42mMATCH¡û[0m¡û[37m "r"¡û[0m