There are many ways to do pattern matching, and many domains in which to use it. Snobol4 uses a very expressive pattern matching language over strings; Prolog uses a fairly simple pattern matching language over relations (trees). In both Snobol4 and Prolog, pattern matching is expensive, because it can imply backtracking; both languages provide imperative mechanisms for the programmer to override the natural declarative nature of the pattern matching so as to avoid backtracking.
It would be nice if there were a model of pattern matching that would allow us to use a pattern matching language that was as powerful as possible and yet not so powerful as to overcomplicate the mechanism and lead to inefficiency. Consider Unix file globbing patterns: they seem to be simple, but are they? Do they ever imply backtracking? If not, are they as powerful as they could be?
In fact, Unix file globbing patterns form an ad hoc pattern matching language, which, while efficient (non-backtracking), is not as powerful as it could be.
We can find a good model for pattern matching in the field of automata theory.
This raises several interesting questions:
A whole theory of computability and computation exists, based on a classification of pattern matching machines. These machines are models of the basic kinds of computers.
Sometimes programming languages are implemented by compilation or translation in an effort to gain efficiency by skipping a level of interpretation, but usually the levels of interpretation continue up to higher levels: a Lisp interpreter is written in C, for example; an object-oriented language like CLOS is written in Lisp; a CLOS programmer designs an interpreter for an application-oriented language in CLOS; if the application is extensible, a user of the appliation may design yet another interpreter in that language.
Each of these interpreters, except for the bottom-most layer, is a software machine. Pattern matchers are likewise software machines.
How do we prove this? We prove it by constructing an interpreter for the lower level machine on the higher level one: if we can construct such an interpreter, then clearly the higher level machine is capable of computing anything the lower level machine can compute: it can simply compute it on the interpreter!
The levels of the Chomsky hierarchy are as follows:
MACHINE CLASS LANGUAGE CLASS ------------- -------------- finite automata regular languages pushdown automata context-free languages linear bounded automata context-sensitive languages Turing machines recursively enumerable languagesThe simplest kind of machine is the finite automaton, while the most complex machine is the Turing machine.
Each of these language classes are of interest to computer programmers:
LANGUAGE CLASS EXAMPLE -------------- ------- regular languages pattern matching languages context-free languages simpler programming languages context-sensitive languages most programming languages recursively enumerable languages natural languages
Languages are usually defined by grammars, so there are also classes of grammars that correspond to these language classes:
LANGUAGE CLASS GRAMMAR CLASS -------------- ------------- regular languages regular expressions context-free languages Backus-Naur forms context-sensitive languages Van Wijngaarden (two-level) grammars recursively enumerable languages ???
Parsing technology for programming language implementation is based on algorithms that are known to be able to parse particular classes of languages based on particular grammars. Regular expressions are used in compilers to implement the lexical level of language definition. Because of the simplicity and efficiency of the finite automata that interpret regular expressions, they are also used for pattern matching in tools and other languages.
To run the automaton on a given input string, we simply iteratively execute the state transition function, calling it with the current state (starting with the initial state) and the current input symbol. Each time we call the function, we advance to the next input symbol, and use that with the new state returned by the function as the input to the next iteration. We terminate when we've exhausted the input string. If the process terminates with the function returning a state that's a member of the set of accepting states, we have recognized the input string; otherwise, we have rejected it.
Finite automata can be modelled with a state transition diagram, a digraph with the states represented by the nodes and the transitions represented by the arcs (labelled with symbols). For any regular language, there are an infinite number of possible finite automata that can be constructed to recognize that language.
Every finite automaton can be represented by a regular expression; regular expressions are the grammars of the regular languages. A regular expression (regexp) is defined as follows:
It's very easy to write an efficient interpreter for a given finite automaton in any programming language. All that's required is to represent that state transition function as a function or an array. It's also easy to automate the construction of efficient finite automata from regular expressions.
emacs
, vi
); pagers (less
,
more
); file searching tools (grep
,
egrep
, sed
); programming languages (Tcl,
Perl, C, Awk, Lex, etc).
grep
-like tools do this); you can move the cursor
in a text editor or pager to the next occurrence of a matching regular
expression; you can substitute new text for matching occurrences,
allowing you to change or delete based on regular expressions; and you
can extract subparts of a match.
Real regular expression notations differ from the simple notation
above in order to provide shorthands that are easier to write. I will
describe Tcl's regular expression notation (which is that of
egrep
) and explain the shorthands in terms of the basic
notation above.
A B 7 & @ \*
B
concatenated:
AB
|
operator; here are some
sums of regexps:
A|B 0|1|2|3|4|5|6|7|8|9
*
; here
are some iterated regexps:
A* 9* \**Iteration means zero or more occurrences of the preceding regexp.
(ABC) (ABC)* (0|1|2|3|4|5|6|7|8|9)(0|1|2|3|4|5|6|7|8|9)*
0|1|2|3|4|5|6|7|8|9 [0123456789] [11987654321111011]There is a special shorthand for use in a character class: a range of characters can be expressed by separating the lower and upper inclusive bounds with a hyphen; in this case, order matters because ASCII collating order is used to resolve the range. These three regexps are equivalent:
0|1|2|3|4|5|6|7|8|9 [0123456789] [0-9]
This shorthand makes the hyphen a metacharacter within a character class. How do we get a literal hyphen in a character class? Like this (all three are equivalent):
[-A-Z] [A-Z-] A|B|C|D|E|F|G|H|I|J|K|L|M|N|O|P|Q|R|S|T|U|V|W|X|Y|Z|-
There is also a notation for the complement of a character class: if
the first character of the character class is a carat
(^
), the character class is complemented. So
[^a-zA-Z]
is the regexp that matches any non-alphabetic
ASCII character (including all control characters). To use a literal
carat in a character class, simply place it anywhere except in the
first position.
To include a literal ]
in the sequence, make it the
first character (following a possible ^
).
?
) operator makes the
preceding regexp optional (i.e, it can occur zero or one times). This
shorthand is equivalent to a common use of summing. These two regexps
are equivalent:
a(foo)? afoo|a
+
) is just like the
*
operator except that it specifies one or more
repetitions of the previous regexp. These two regexps are equivalent:
[0-9]+ [0-9][0-9]*
.
) metacharacter matches any single ASCII
character. It's a very important shorthand for the sum of all the
ASCII characters.
^
) will
only match if it occurs at the beginning of the input string. This is
the behavior we described for the basic regular expressions above.
Without a carat, Tcl regular expressions can actually begin anywhere
in the input string: we're not testing whether or not the entire
string matches, but whether or not a match occurs as a substring of
the input. It turns out that this is more useful, so it's the
default. If the anchored approach were the default, then most Tcl
regexps would have to start with .*
. Carat only has this
special interpretation if it's the first character of the regexp.
A regular expression that ends with dollar ($
) only
matches at the end of the input string. Combining these two anchors
allows us to constrain a regexp to match the entire input string, thus
simulating the more strict basic regexps described earlier.
a*b*
when matched against
the string aabaaabb
, matches aab
, the first
three letters of the input string.
regexp
Commandregsub
Command