Friday, 4 April 2014

PS_07_Perl - Pattern Matching

Pattern Matching

Regular Expressions
Regular expressions are patterns to be matched against a string. The two basic operations performed using patterns are matching and substitution:

Matching /pattern/
Substitution s/pattern/newstring/

The simplest kind of regular expression is a literal string. More complicated expressions include metacharacters to represent other characters or combinations of them. The […] construct is used to list a set of characters (a character class) of which one will match. Ranges of characters are denoted with a hyphen (-), and a negation is denoted with a circumflex (^). Examples of character classes are shown below:

[a-zA-Z] Any single letter
[0-9] Any digit
[^0-9] Any character not a digit
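
A minimal sketch of the three classes above in use. The sample strings and the %class_hits hash are made up for illustration; each test is unanchored, so it asks whether the string contains at least one character from the class.

```perl
use strict;
use warnings;

# Hypothetical sample strings, chosen only for illustration.
my @samples = ('abc', '42', 'x9');

my %class_hits;
for my $s (@samples) {
    $class_hits{$s} = {
        letter   => ($s =~ /[a-zA-Z]/ ? 1 : 0),  # contains a letter
        digit    => ($s =~ /[0-9]/    ? 1 : 0),  # contains a digit
        nondigit => ($s =~ /[^0-9]/   ? 1 : 0),  # contains a non-digit
    };
}

for my $s (@samples) {
    my $h = $class_hits{$s};
    print "$s: letter=$h->{letter} digit=$h->{digit} nondigit=$h->{nondigit}\n";
}
```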

Some common character classes have their own predefined symbols:
Code Matches
. Any character except newline (any character at all under the s modifier)
\d A digit, same as [0-9]
\D A nondigit, same as [^0-9]
\w A word character (alphanumeric) [a-zA-Z_0-9]
\W A nonword character [^a-zA-Z_0-9]
\s A whitespace character [ \t\n\r\f]
\S A non-whitespace character [^ \t\n\r\f]
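
The predefined symbols can be tried out on a short string. This is only a sketch: the input line is invented, and the g modifier in list context is used to collect every match.

```perl
use strict;
use warnings;

my $line = "user_42\tlogged in\n";   # hypothetical input

my @digits    = $line =~ /\d/g;      # every single digit
my @words     = $line =~ /\w+/g;     # runs of word characters
my $ws_count  = () = $line =~ /\s/g; # number of whitespace characters
my @short_set = $line =~ /[0-9]/g;   # long form of \d, for comparison

print "digits: @digits\n";
print "words: ", join('|', @words), "\n";
print "whitespace characters: $ws_count\n";
```

Note that the underscore counts as a word character, so "user_42" comes back as a single word.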

Regular expressions also allow for the use of both variable interpolation and backslashed representations of certain characters:
Code Matches
\n Newline
\r Carriage return
\t Tab
\f Formfeed
\/ Literal forward slash
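
A small sketch of interpolation and backslashed characters together, assuming an invented path string and a hypothetical variable $dir whose value is inserted into the pattern before matching:

```perl
use strict;
use warnings;

my $dir  = 'usr';                     # hypothetical value to interpolate
my $path = "/usr/local\tbin";         # contains a real tab character

my $has_dir = ($path =~ /\/$dir\//)   ? 1 : 0;  # \/ is a literal slash
my $has_tab = ($path =~ /local\tbin/) ? 1 : 0;  # \t is a tab

print "has /$dir/: $has_dir, has tab: $has_tab\n";
```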

Anchors don’t match any characters; they match places within a string.
Assertion Meaning
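^ Matches at the beginning of string (or of any line under the m modifier)
$ Matches at the end of string or before a final newline (or at the end of any line under the m modifier)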
\b Matches on word boundary
\B Matches except at word boundary
\A Matches at the beginning of string
\Z Matches at the end of string or before a final newline
\z Matches only at the end of string
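
The difference between the anchors is easiest to see on a string that ends in a newline. The following is a sketch with an invented input; the variable names are illustrative only.

```perl
use strict;
use warnings;

my $text = "cat concatenate\n";   # hypothetical input ending in a newline

my $at_start   = ($text =~ /^cat/)          ? 1 : 0;
my $whole_word = () = $text =~ /\bcat\b/g;  # "cat" as a separate word
my $inside     = () = $text =~ /\Bcat\B/g;  # "cat" buried inside a word
my $cap_Z      = ($text =~ /concatenate\Z/) ? 1 : 0;  # trailing newline tolerated
my $low_z      = ($text =~ /concatenate\z/) ? 1 : 0;  # trailing newline not tolerated

print "start=$at_start word=$whole_word inside=$inside Z=$cap_Z z=$low_z\n";
```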

Quantifiers are used to specify how many instances of the previous element can match.
Maximal Minimal Allowed Range
{n,m} {n,m}? Must occur at least n times, but no more than m times
{n,} {n,}? Must occur at least n times
{n} {n}? Must match exactly n times
* *? 0 or more times (same as {0,})
+ +? 1 or more times (same as {1,})
? ?? 0 or 1 time (same as {0,1})

It is important to note that quantifiers are greedy by nature. If two quantified patterns are represented in the same regular expression, the leftmost is greediest. To force your quantifiers to be non-greedy, append a question mark.
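
Greedy and non-greedy matching can be compared side by side. This sketch uses an invented HTML-like string; the second example illustrates the "leftmost is greediest" rule.

```perl
use strict;
use warnings;

my $html = '<b>bold</b> and <i>italic</i>';   # hypothetical input

my ($greedy) = $html =~ /<(.+)>/;    # .+ grabs as much as it can
my ($lazy)   = $html =~ /<(.+?)>/;   # .+? stops at the first ">"

# Two greedy quantifiers: the leftmost takes all it can,
# leaving the second only the single "a" it needs.
my ($left, $right) = 'aaaa' =~ /(a+)(a+)/;

print "greedy: $greedy\n";
print "lazy:   $lazy\n";
print "left=$left right=$right\n";
```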

If you are looking for two possible patterns in a string, you can use the alternation operator (|). For example,
/you|me|him|her/;
will match against any one of these four words. You may also use parentheses to provide boundaries for alternation:
/And(y|rew)/;
will match either “Andy” or “Andrew”.
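
As a quick check of both patterns, here is a sketch run over invented sample data. The anchored variant with ^ and $ is an addition for comparison: without anchors, the pattern also matches longer strings that merely contain "Andy" or "Andrew".

```perl
use strict;
use warnings;

my @names = ('Andy', 'Andrew', 'Anderson', 'Andrews');   # hypothetical data

my @loose = grep { /And(y|rew)/ }   @names;   # match anywhere in the name
my @exact = grep { /^And(y|rew)$/ } @names;   # the whole name must match

my @pronouns = 'give me her book' =~ /you|me|him|her/g;

print "loose: @loose\n";
print "exact: @exact\n";
print "pronouns: @pronouns\n";
```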

Parentheses are used to group characters and expressions. They also have the effect of “remembering” parts of a matched pattern for further processing. To recall the “memorized” portion of the string, include a backslash followed by an integer representing the location of the parentheses in the expression:
/fred(.)barney\1/;

Outside of the expression, these “memorized” portions are accessible as the special variables $1, $2, $3, etc. Other special variables are as follows:
$& Part of string matching regexp
$` Part of string before the match
$' Part of string after the match
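
A short sketch of the backreference and the special variables, using an invented input string. The variables are copied right after the match because the next successful match overwrites them. On older versions of Perl, using $&, $` and $' may slow down other regular expressions in the program, so treat this as an illustration rather than a recommendation.

```perl
use strict;
use warnings;

my $str = 'say fredXbarneyX now';   # hypothetical input

my ($sep, $before, $matched, $after);
if ($str =~ /fred(.)barney\1/) {
    $sep     = $1;    # the character remembered by (.)
    $before  = $`;    # text before the match
    $matched = $&;    # the matched text itself
    $after   = $';    # text after the match
}

# \1 must repeat the same character, so this one fails.
my $mismatch = ('fredXbarneyY' =~ /fred(.)barney\1/) ? 1 : 0;

print "sep=[$sep] before=[$before] matched=[$matched] after=[$after]\n";
```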

Regular expression grouping precedence, from highest to lowest:
Name Representation
Parentheses ( ) (?: )
Quantifiers ? + * {m,n} ?? +? *?
Sequence and anchoring abc ^ $ \A \Z (?= ) (?! )
Alternation |
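
Two consequences of this ordering, sketched with invented strings: alternation binds more loosely than anchors, and a quantifier applies only to the single element before it unless parentheses group more.

```perl
use strict;
use warnings;

# /^you|me$/ reads as (^you) or (me$), so "your" matches.
my $loose  = ('your' =~ /^you|me$/)     ? 1 : 0;
# Grouping makes both anchors apply to each alternative.
my $strict = ('your' =~ /^(?:you|me)$/) ? 1 : 0;

# + applies to "b" alone...
my $one_char  = ('abbb'   =~ /^ab+$/)     ? 1 : 0;
my $no_repeat = ('ababab' =~ /^ab+$/)     ? 1 : 0;
# ...unless the sequence is grouped first.
my $grouped   = ('ababab' =~ /^(?:ab)+$/) ? 1 : 0;

print "loose=$loose strict=$strict one_char=$one_char ",
      "no_repeat=$no_repeat grouped=$grouped\n";
```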

To select a target for matching/substitution other than the default variable ($_), use the =~ operator:
$var =~ /pattern/;
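
A minimal sketch contrasting the default variable with an explicit target. The strings are invented, and the negated form !~ is an addition not covered above; it is true when the pattern does not match.

```perl
use strict;
use warnings;

my $var = 'hello world';   # hypothetical data

# Explicit target chosen with =~
my $bound = ($var =~ /world/) ? 1 : 0;

# No target given: the pattern is tested against $_,
# which the for loop sets to each list element in turn.
my $default;
for ('hello perl') {
    $default = /perl/ ? 1 : 0;
}

# !~ negates the test
my $negated = ($var !~ /perl/) ? 1 : 0;

print "bound=$bound default=$default negated=$negated\n";
```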

Operators

m/pattern/gimosx
The “match” operator searches a string for a pattern match. The leading “m” is usually omitted when the delimiters are slashes. The trailing modifiers are as follows:
Modifier Meaning
g Match globally; find all occurrences
i Do case-insensitive matching
m Treat string as multiple lines (^ and $ match at each line)
o Only compile pattern once
s Treat string as a single line (. also matches newline)
x Use extended regular expressions (whitespace and # comments in the pattern are ignored)
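
The modifiers can be combined freely. The following sketch uses an invented two-line string to show g, i, m, s and x in turn; the o modifier is left out because it only affects patterns that interpolate variables.

```perl
use strict;
use warnings;

my $text = "Apple pie\napple tart\n";   # hypothetical two-line input

my @any_case   = $text =~ /apple/gi;       # g: all matches, i: ignore case
my @line_heads = $text =~ /^(\w+)/gm;      # m: ^ matches at each line start
my ($joined)   = $text =~ /pie(.)apple/s;  # s: . may match the newline
my $without_s  = ($text =~ /pie.apple/) ? 1 : 0;   # no s: . stops at newline

# x: whitespace and comments inside the pattern are ignored
my ($year, $month) = '2014-04' =~ /
    (\d{4})   # four-digit year
    -         # literal hyphen
    (\d{2})   # two-digit month
/x;

print "any_case: @any_case\n";
print "line_heads: @line_heads\n";
print "year=$year month=$month\n";
```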

s/pattern/replacement/egimosx
Searches a string for a pattern and replaces the first match with replacement, or every match when the “g” modifier is given. The trailing modifiers are the same as for the match operator, with the addition of “e”, which evaluates the right-hand side as an expression. The substitution operator works on the default variable ($_), unless the =~ operator changes the target to another variable.
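
A sketch of substitution with and without the g and e modifiers, on an invented string. Each result is written into a copy, using the (my $copy = $text) =~ s/.../.../ idiom, so the original stays intact.

```perl
use strict;
use warnings;

my $text = 'tea 2, cake 5';   # hypothetical input

(my $first_only = $text) =~ s/\d+/N/;            # first match only
(my $every      = $text) =~ s/\d+/N/g;           # g: every match
(my $doubled    = $text) =~ s/(\d+)/$1 * 2/ge;   # e: replacement is code

# Without =~ the operator works on $_ (here aliased to $copy).
my $copy = $text;
for ($copy) {
    s/tea/coffee/;
}

print "$first_only\n$every\n$doubled\n$copy\n";
```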

tr/pattern1/pattern2/cds
This operator scans a string and, character by character, replaces each character found in pattern1 with the character at the same position in pattern2. Despite the names, pattern1 and pattern2 are character lists (ranges such as a-z are allowed), not regular expressions. Trailing modifiers are:
Modifier Meaning
c Complement pattern1
d Delete found but unreplaced characters
s Squash duplicated replaced characters
For example, tr can force letters to all uppercase:
tr/a-z/A-Z/;
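
A few more uses of tr, sketched with invented strings. When the second list is empty and d is not given, the characters are left unchanged, which makes the return value a convenient character count.

```perl
use strict;
use warnings;

# Uppercase a copy of the string
(my $upper = 'hello world') =~ tr/a-z/A-Z/;

# Empty replacement list: nothing changes, tr returns the count
my $phrase = 'hello world';
my $vowels = ($phrase =~ tr/aeiou//);

# s: squash runs of the same character into one
(my $squashed = 'bookkeeper') =~ tr/a-z//s;

# c and d together: delete everything that is NOT a digit
(my $digits = 'tel: 555-1234') =~ tr/0-9//cd;

print "$upper\n$vowels vowels\n$squashed\n$digits\n";
```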

@fields = split(pattern,$input);
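Split looks for occurrences of a regular expression and breaks the input string at those points. Without any arguments, split breaks on the whitespace in $_ and ignores leading whitespace:
@words = split; is equivalent to
@words = split(' ',$_);
The pattern /\s+/ behaves the same way, except that leading whitespace produces an empty first field.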
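
A sketch of split on a delimiter and on whitespace, with invented input. It also checks one detail of whitespace splitting: when the string starts with whitespace, split with no arguments drops the leading empty field, while an explicit /\s+/ keeps it.

```perl
use strict;
use warnings;

my $input  = 'alice:x:1000';          # hypothetical colon-separated record
my @fields = split(/:/, $input);

my (@words, @regex_words);
for ("  the quick  fox") {            # note the leading spaces
    @words       = split;             # works on $_, skips leading whitespace
    @regex_words = split(/\s+/, $_);  # leading whitespace yields an empty field
}

print scalar(@fields), " fields: @fields\n";
print scalar(@words), " words, ", scalar(@regex_words), " with /\\s+/\n";
```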

$output = join($delimiter,@inlist);
Join, the complement of split, takes a list of values and glues them together with the provided delimiting string.
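
A minimal sketch of join, including a round trip through split. The list contents and delimiters are invented for illustration.

```perl
use strict;
use warnings;

my @inlist    = ('alice', 'x', '1000');   # hypothetical values
my $delimiter = ':';

my $output = join($delimiter, @inlist);   # glue the list together

# split undoes the join, and the list can be rejoined differently
my @back     = split(/:/, $output);
my $readable = join(', ', @back);

print "$output\n$readable\n";
```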
