Pattern Matching
Regular Expressions
Regular expressions are patterns to be matched against a string. The
two basic operations performed using patterns are matching and
substitution:
Matching
/pattern/
Substitution
s/pattern/newstring/
The simplest kind of regular expression is a literal string. More
complicated expressions include metacharacters to represent other
characters or combinations of them. The [...] construct is used to
list a set of characters (a character class) of which one will match.
Ranges of characters are denoted with a hyphen (-), and a negation is
denoted with a circumflex (^). Examples of character classes are
shown below:
[a-zA-Z]
Any single letter
[0-9]
Any digit
[^0-9]
Any character not a digit
Some common character classes have their own predefined symbols:
Code | Matches |
. | Any character except newline |
\d | A digit, same as [0-9] |
\D | A nondigit, same as [^0-9] |
\w | A word character (alphanumeric or underscore), same as [a-zA-Z_0-9] |
\W | A nonword character, same as [^a-zA-Z_0-9] |
\s | A whitespace character, same as [ \t\n\r\f] |
\S | A non-whitespace character, same as [^ \t\n\r\f] |
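As a quick illustration of the classes above, the following sketch tests a few strings against them. The sample strings and variable names are made up for the example.

```perl
use strict;
use warnings;

# Sample inputs are hypothetical, chosen only to exercise each class.
my $is_letter  = ('q'    =~ /[a-zA-Z]/) ? 1 : 0;   # one letter: matches
my $has_digit  = ('abc'  =~ /\d/)       ? 1 : 0;   # no digit: fails
my $has_nondig = ('123'  =~ /[^0-9]/)   ? 1 : 0;   # digits only: fails
my $has_space  = ("a\tb" =~ /\s/)       ? 1 : 0;   # a tab is whitespace

# Count the word characters; the hyphen is not one of them.
my $word_chars = () = 'foo_bar-42' =~ /\w/g;

print "$is_letter $has_digit $has_nondig $has_space $word_chars\n";
# prints "1 0 0 1 9"
```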
Regular expressions also allow for the use of both variable
interpolation and backslashed representations of certain characters:
Code | Matches |
\n | Newline |
\r | Carriage return |
\t | Tab |
\f | Formfeed |
\/ | Literal forward slash |
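A small sketch of interpolation and backslashed characters inside a pattern follows. The variable $word and the sample strings are assumptions made for the example.

```perl
use strict;
use warnings;

# $word is a hypothetical variable; its value is interpolated into the pattern.
my $word = 'barney';
my $line = "fred\tbarney\n";

my $found_word = ($line =~ /$word/)         ? 1 : 0;  # interpolated variable
my $found_tab  = ($line =~ /fred\t$word/)   ? 1 : 0;  # \t matches the tab
my $found_path = ('/usr/bin' =~ /usr\/bin/) ? 1 : 0;  # \/ is a literal slash

print "$found_word $found_tab $found_path\n";   # prints "1 1 1"
```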
Anchors don’t match any characters; they match places within a string.
Assertion | Meaning |
^ | Matches at the beginning of string (or of each line with the m modifier) |
$ | Matches at the end of string or before a final newline (or at the end of each line with the m modifier) |
\b | Matches on word boundary |
\B | Matches except at word boundary |
\A | Matches only at the beginning of string |
\Z | Matches at the end of string or before a final newline |
\z | Matches only at the end of string |
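To make the differences between the anchors concrete, here is a minimal sketch run against one hypothetical string that ends in a newline.

```perl
use strict;
use warnings;

my $text = "cat concat\n";   # sample string, made up for the example

my $starts = ($text =~ /^cat/)     ? 1 : 0;   # "cat" begins the string
my $ends_Z = ($text =~ /concat\Z/) ? 1 : 0;   # \Z tolerates the final newline
my $ends_z = ($text =~ /concat\z/) ? 1 : 0;   # \z does not

my $whole  = () = $text =~ /\bcat\b/g;   # "cat" as a whole word
my $inside = () = $text =~ /\Bcat\b/g;   # "cat" at the end of a longer word

print "$starts $ends_Z $ends_z $whole $inside\n";   # prints "1 1 0 1 1"
```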
Quantifiers are used to specify how many instances of the previous
element can match.
Maximal | Minimal | Allowed Range |
{n,m} | {n,m}? | Must occur at least n times, but no more than m times |
{n,} | {n,}? | Must occur at least n times |
{n} | {n}? | Must match exactly n times |
* | *? | 0 or more times (same as {0,}) |
+ | +? | 1 or more times (same as {1,}) |
? | ?? | 0 or 1 time (same as {0,1}) |
Quantifiers are greedy by default: each one matches as much as it can
while still allowing the rest of the pattern to match. If two
quantified patterns appear in the same regular expression, the
leftmost is greediest. To make a quantifier non-greedy (minimal),
append a question mark to it.
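The difference between maximal and minimal matching is easiest to see side by side. The sketch below uses a made-up HTML-like string; the variable names are illustrative only.

```perl
use strict;
use warnings;

my $html = '<b>bold</b> and <i>italic</i>';   # hypothetical sample text

my ($greedy) = $html =~ /<(.+)>/;    # .+ runs to the last ">" in the string
my ($lazy)   = $html =~ /<(.+?)>/;   # .+? stops at the first ">"

# With two greedy quantifiers, the leftmost one takes everything it can.
my ($left, $right) = 'aaa' =~ /(a*)(a*)/;

print "[$greedy]\n";          # prints "[b>bold</b> and <i>italic</i]"
print "[$lazy]\n";            # prints "[b]"
print "[$left] [$right]\n";   # prints "[aaa] []"
```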
If you are looking for two possible patterns in a string, you can use
the alternation operator (|). For example,
/you|me|him|her/;
will match against any one of these four words. You may also use
parentheses to provide boundaries for alternation:
/And(y|rew)/;
will match either “Andy” or “Andrew”.
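A short sketch of how parentheses limit the reach of alternation follows. The list of names and the ^ and $ anchors are additions for the example, not part of the patterns above.

```perl
use strict;
use warnings;

my @names = ('Andy', 'Andrew', 'Anders');   # hypothetical input list

# Parentheses confine the alternation to the suffix after "And".
my @matched = grep { /^And(y|rew)$/ } @names;

# Without parentheses, | splits the entire pattern into "Andy" or "rew".
my $loose = ('rew' =~ /Andy|rew/) ? 1 : 0;

print "@matched\n";   # prints "Andy Andrew"
print "$loose\n";     # prints "1"
```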
Parentheses are used to group characters and expressions. They also
have the effect of “remembering” parts of a matched pattern for
further processing. To recall the “memorized” portion of the string
inside the pattern itself, include a backslash followed by an integer
representing the location of the parentheses in the expression:
/fred(.)barney\1/;
Outside of the expression, these “memorized” portions are accessible
as the special variables $1, $2, $3, etc. Other special variables are
as follows:
$&
Part of string matching regexp
$`
Part of string before the match
$'
Part of string after the match
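The following sketch runs the fred/barney pattern against a made-up string and copies the special variables into ordinary ones right after the match, since a later match would overwrite them.

```perl
use strict;
use warnings;

my $str = 'say fredXbarneyX now';   # hypothetical sample string
my ($captured, $matched, $before, $after);

# \1 requires the same character after "barney" as after "fred".
if ($str =~ /fred(.)barney\1/) {
    ($captured, $matched, $before, $after) = ($1, $&, $`, $');
}

print "[$captured] [$matched] [$before] [$after]\n";
# prints "[X] [fredXbarneyX] [say ] [ now]"
```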
Regular expression grouping precedence (highest to lowest)
Parentheses
() (?: )
Quantifiers
? + * {m,n} ?? +? *?
Sequence and anchoring
abc ^ $ \A \Z (?= ) (?! )
Alternation
|
To select a target for matching/substitution other than the default
variable ($_), use the =~ operator:
$var =~ /pattern/;
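A minimal sketch contrasting a match against $_ with a match bound to another variable; the values assigned here are invented for the example.

```perl
use strict;
use warnings;

my $var = 'hello world';   # hypothetical target variable
$_ = 'goodbye';            # the default variable

my $in_default = /world/           ? 1 : 0;   # no =~, so this tests $_
my $in_var     = ($var =~ /world/) ? 1 : 0;   # =~ redirects the match to $var

print "$in_default $in_var\n";   # prints "0 1"
```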
Operators
m/pattern/gimosx
The “match” operator searches a string for a pattern match. The
preceding “m” is usually omitted. The trailing modifiers are as
follows:
Modifier | Meaning |
g | Match globally; find all occurrences |
i | Do case-insensitive matching |
m | Treat string as multiple lines (^ and $ match at line boundaries) |
o | Only compile pattern once |
s | Treat string as a single line (. also matches newline) |
x | Use extended regular expressions (whitespace and comments are allowed in the pattern) |
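The modifiers are easier to tell apart when applied to the same input. This sketch uses a made-up two-line string; the variable names are illustrative.

```perl
use strict;
use warnings;

my $text = "Perl is fun.\nperl is terse.\n";   # hypothetical two-line input

my @all    = $text =~ /perl/gi;         # g: every occurrence, i: ignore case
my @starts = $text =~ /^(\w+)/gm;       # m: ^ matches at the start of each line
my ($span) = $text =~ /fun(.*)perl/s;   # s: . also matches the newline

# x: whitespace and comments inside the pattern are ignored
my ($first) = $text =~ /
    (\w+)    # capture a word
    \s+ is   # followed by whitespace and the literal text is
/x;

print scalar(@all), " ", scalar(@starts), " $first\n";   # prints "2 2 Perl"
```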
s/pattern/replacement/egimosx
Searches a string for a pattern, and replaces any match with
replacement. The trailing modifiers are the same as for the match
operator, with the addition of “e”, which evaluates the right-hand
side as an expression. The substitution operator works on the default
variable ($_), unless the =~ operator changes the target to another
variable.
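Here is a brief sketch of substitution on $_, on a bound variable, and with the “e” modifier. The strings are invented for the example.

```perl
use strict;
use warnings;

$_ = 'one two two';
s/two/2/;                      # no =~, so $_ is changed; first match only
my $default_result = $_;

my $copy  = 'one two two';
my $count = ($copy =~ s/two/2/g);   # g replaces all; returns how many

# e: the replacement is evaluated as Perl code for each match.
my $price = 'cost: 5 apples, 12 pears';
(my $doubled = $price) =~ s/(\d+)/$1 * 2/ge;

print "$default_result\n";   # prints "one 2 two"
print "$count $copy\n";      # prints "2 one 2 2"
print "$doubled\n";          # prints "cost: 10 apples, 24 pears"
```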
tr/pattern1/pattern2/cds
This operator scans a string and, character by character, replaces
any characters found in pattern1 with the corresponding characters
from pattern2. Despite the names, pattern1 and pattern2 are lists of
characters, not regular expressions. Trailing modifiers are:
Modifier | Meaning |
c | Complement pattern1 |
d | Delete found but unreplaced characters |
s | Squash duplicated replaced characters |
For example, this forces all lowercase letters in $_ to uppercase:
tr/a-z/A-Z/;
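The sketch below applies tr and its three modifiers to a few made-up strings. It relies on two documented behaviors: tr returns the number of characters it matched, and an empty second list without the d modifier leaves the matched characters unchanged.

```perl
use strict;
use warnings;

my $s = 'hello world';   # hypothetical sample string

(my $upper = $s) =~ tr/a-z/A-Z/;       # translate a copy to uppercase

# Empty second list and no d: nothing changes, but the count is returned.
my $vowels = ($s =~ tr/aeiou//);

(my $squashed = 'aaabbbccc') =~ tr/a-c//s;        # s: collapse repeated runs
(my $digits = 'tel: 555-1234') =~ tr/0-9//cd;     # c + d: delete all non-digits

print "$upper $vowels $squashed $digits\n";
# prints "HELLO WORLD 3 abc 5551234"
```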
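@fields = split(pattern,$input);
Split looks for occurrences of a regular expression and breaks the
input string at those points. Without any arguments, split breaks on
the whitespace in $_, skipping any leading whitespace:
@words = split;
is equivalent to
@words = split(' ',$_);
A pattern of /\s+/ behaves the same way except that leading
whitespace produces an empty first field.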
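A short sketch of split with an explicit pattern and with no arguments follows. It also compares the no-argument form with an explicit /\s+/ pattern, which differs when the string starts with whitespace. All input strings are made up.

```perl
use strict;
use warnings;

my $input  = 'alice:x:1000';          # hypothetical colon-separated record
my @fields = split(/:/, $input);

$_ = "  the quick  fox\n";
my @words = split;                    # leading whitespace is skipped
my @raw   = split(/\s+/, $_);         # leading whitespace yields an empty field

print scalar(@fields), " ", scalar(@words), " ", scalar(@raw), "\n";
# prints "3 3 4"
```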
$output = join($delimiter,@inlist);
Join, the complement of split, takes a list of values and glues them
together with the provided delimiting string.
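As a minimal sketch, join can be paired with split to change the delimiter of a string. The list contents and delimiters are assumptions for the example.

```perl
use strict;
use warnings;

my @inlist    = ('usr', 'local', 'bin');   # hypothetical list of values
my $delimiter = '/';

my $output = join($delimiter, @inlist);

# Split on the slash again, then rejoin with a different delimiter.
my $recoded = join(':', split(/\//, $output));

print "$output $recoded\n";   # prints "usr/local/bin usr:local:bin"
```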