source: trunk/Mars/datacenter/db/regexp.html@ 18066

Last change on this file since 18066 was 17386, checked in by tbretz, 11 years ago
Removed svn:executable property, these are no executables.
File size: 19.5 KB
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">

<html>
<head>
  <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
  <meta name="Author" content="Thomas Bretz">
  <title>MARS: Magic Analysis and Reconstruction Software</title>
  <link rel="stylesheet" type="text/css" href="mars.css">
</head>

<body background="background.gif" text="#000000" bgcolor="#000099" link="#1122FF" vlink="#8888FF" alink="#FF0000">
&nbsp;

<center>
<table class="Main" CELLPADDING=0>

<tr>
<td class="Edge"><img SRC="ecke.gif" ALT=""></td>
<td class="Header">
<B>M A R S</B><BR><B>M</B>agic <B>A</B>nalysis and <B>R</B>econstruction <B>S</B>oftware
</td>
</tr>

<tr>
<td COLSPAN=2 BGCOLOR="#FFFFFF">
<hr SIZE=1 NOSHADE WIDTH="80%">
<center><table class="Inner" CELLPADDING=15>

<tr class="Block">
<td><b><u><A NAME="OVERVIEW">MySQL Regular Expressions</A>:</u></b>
<P>
A <B>regular expression (regex)</B> is a powerful way of specifying a complex search. <P>

 MySQL uses Henry Spencer's implementation of regular expressions, which aims at conformance with POSIX
 1003.2. MySQL uses the extended version. <P>

 This is a simplified reference that skips the details. For more exact information, see
 Henry Spencer's <A HREF="#REGEX">regex(7)</A> manual page.<P>

 A regular expression describes a set of strings. The simplest regexp is one that has no special characters in it. For
 example, the regexp <B>hello</B> matches <B>hello</B> and nothing else. <P>

 Non-trivial regular expressions use certain special constructs so that they can match more than one string. For
 example, the regexp <B>hello|word</B> matches either the string <B>hello</B> or the string <B>word</B>. <P>

 As a more complex example, the regexp <B>B[an]*s</B> matches any of the strings <B>Bananas</B>, <B>Baaaaas</B>, <B>Bs</B>, and any
 other string starting with a <B>B</B>, ending with an <B>s</B>, and containing any number of <B>a</B> or <B>n</B> characters in between. <P>

 A regular expression may use any of the following special characters and constructs: <P>
<pre>
 ^       Match the beginning of a string.
         mysql> SELECT "fo\nfo" REGEXP "^fo$";           -> 0
         mysql> SELECT "fofo" REGEXP "^fo";              -> 1

 $       Match the end of a string.
         mysql> SELECT "fo\no" REGEXP "^fo\no$";         -> 1
         mysql> SELECT "fo\no" REGEXP "^fo$";            -> 0

 .       Match any character (including newline).
         mysql> SELECT "fofo" REGEXP "^f.*";             -> 1
         mysql> SELECT "fo\nfo" REGEXP "^f.*";           -> 1

 a*      Match any sequence of zero or more a characters.
         mysql> SELECT "Ban" REGEXP "^Ba*n";             -> 1
         mysql> SELECT "Baaan" REGEXP "^Ba*n";           -> 1
         mysql> SELECT "Bn" REGEXP "^Ba*n";              -> 1

 a+      Match any sequence of one or more a characters.
         mysql> SELECT "Ban" REGEXP "^Ba+n";             -> 1
         mysql> SELECT "Bn" REGEXP "^Ba+n";              -> 0

 a?      Match either zero or one a character.
         mysql> SELECT "Bn" REGEXP "^Ba?n";              -> 1
         mysql> SELECT "Ban" REGEXP "^Ba?n";             -> 1
         mysql> SELECT "Baan" REGEXP "^Ba?n";            -> 0

 de|abc  Match either of the sequences de or abc.
         mysql> SELECT "pi" REGEXP "pi|apa";             -> 1
         mysql> SELECT "axe" REGEXP "pi|apa";            -> 0
         mysql> SELECT "apa" REGEXP "pi|apa";            -> 1
         mysql> SELECT "apa" REGEXP "^(pi|apa)$";        -> 1
         mysql> SELECT "pi" REGEXP "^(pi|apa)$";         -> 1
         mysql> SELECT "pix" REGEXP "^(pi|apa)$";        -> 0

 (abc)*  Match zero or more instances of the sequence abc.
         mysql> SELECT "pi" REGEXP "^(pi)*$";            -> 1
         mysql> SELECT "pip" REGEXP "^(pi)*$";           -> 0
         mysql> SELECT "pipi" REGEXP "^(pi)*$";          -> 1

 {1}     This is a more general way of writing regexps that match many
 {2,3}   occurrences of the previous atom.
           a*      Can be written as a{0,}.
           a+      Can be written as a{1,}.
           a?      Can be written as a{0,1}.

         To be more precise, an atom followed by a bound containing one
         integer i and no comma matches a sequence of exactly i matches
         of the atom. An atom followed by a bound containing one integer i
         and a comma matches a sequence of i or more matches of the atom.
         An atom followed by a bound containing two integers i and j matches
         a sequence of i through j (inclusive) matches of the atom.

         Both arguments must be in the range from 0 to RE_DUP_MAX (default 255),
         inclusive. If there are two arguments, the second must be greater
         than or equal to the first.
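
         For illustration (a sketch added to this reference, assuming the
         bound semantics described above; the test strings are made up):
         mysql> SELECT "abcde" REGEXP "a[bcd]{2}e";      -> 0
         mysql> SELECT "abcde" REGEXP "a[bcd]{3}e";      -> 1
         mysql> SELECT "abcde" REGEXP "a[bcd]{1,10}e";   -> 1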

 [a-dX]  Matches any character which is (or is not, if ^ is used) either a, b, c,
 [^a-dX] d or X. To include a literal ] character, it must immediately follow
         the opening bracket [. To include a literal - character, it must be
         written first or last. For example, [0-9] matches any decimal digit.
         Any character that does not have a defined meaning inside a [] pair
         has no special meaning and matches only itself.
         mysql> SELECT "aXbc" REGEXP "[a-dXYZ]";         -> 1
         mysql> SELECT "aXbc" REGEXP "^[a-dXYZ]$";       -> 0
         mysql> SELECT "aXbc" REGEXP "^[a-dXYZ]+$";      -> 1
         mysql> SELECT "aXbc" REGEXP "^[^a-dXYZ]+$";     -> 0
         mysql> SELECT "gheis" REGEXP "^[^a-dXYZ]+$";    -> 1
         mysql> SELECT "gheisa" REGEXP "^[^a-dXYZ]+$";   -> 0

 [[.characters.]]
         Within a bracket expression, matches the sequence of characters of
         that collating element. characters is either a single character or
         a character name like newline. You can find the full list of
         character names in 'regexp/cname.h'.
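
         For illustration (a sketch added to this reference; it assumes that
         tilde is one of the character names listed in 'regexp/cname.h'):
         mysql> SELECT "~" REGEXP "[[.~.]]";             -> 1
         mysql> SELECT "~" REGEXP "[[.tilde.]]";         -> 1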

 [=character_class=]
         An equivalence class, standing for the sequences of characters of all
         collating elements equivalent to that one, including itself.

         For example, if o and (+) are the members of an equivalence class,
         then [[=o=]], [[=(+)=]], and [o(+)] are all synonymous. An equivalence
         class may not be an endpoint of a range.

 [:character_class:]
         Within a bracket expression, the name of a character class enclosed
         in [: and :] stands for the list of all characters belonging to that
         class. Standard character class names are:

           alnum   digit   punct
           alpha   graph   space
           blank   lower   upper
           cntrl   print   xdigit

         These stand for the character classes defined in the ctype(3) manual
         page. A locale may provide others. A character class may not be used
         as an endpoint of a range.
         mysql> SELECT "justalnums" REGEXP "[[:alnum:]]+";   -> 1
         mysql> SELECT "!!" REGEXP "[[:alnum:]]+";           -> 0

 [[:&lt;:]] These match the null string at the beginning and end of a word
 [[:&gt;:]] respectively. A word is defined as a sequence of word characters
         which is neither preceded nor followed by word characters. A word
         character is an alnum character (as defined by ctype(3)) or an
         underscore (_).
         mysql> SELECT "a word a" REGEXP "[[:&lt;:]]word[[:&gt;:]]";    -> 1
         mysql> SELECT "a xword a" REGEXP "[[:&lt;:]]word[[:&gt;:]]";   -> 0

         mysql> SELECT "weeknights" REGEXP "^(wee|week)(knights|nights)$";   -> 1
</pre>
</td></tr>
<tr class="Block">
<td>
<center><h3>--- <A NAME="REGEX"><U>REGEX</U></A>(7) ---</h3></center>
<B>NAME</B><BR>
 regex - POSIX 1003.2 regular expressions<P>

<B>DESCRIPTION</B><BR>
 Regular expressions (``RE''s), as defined in POSIX 1003.2,
 come in two forms: modern REs (roughly those of egrep;
 1003.2 calls these ``extended'' REs) and obsolete REs
 (roughly those of ed; 1003.2 ``basic'' REs). Obsolete REs
 mostly exist for backward compatibility in some old
 programs; they will be discussed at the end. 1003.2 leaves
 some aspects of RE syntax and semantics open; `' marks
 decisions on these aspects that may not be fully portable
 to other 1003.2 implementations.<P>

 A (modern) RE is one or more non-empty branches, separated
 by `|'. It matches anything that matches one of the
 branches.<P>

 A branch is one or more pieces, concatenated. It matches
 a match for the first, followed by a match for the second,
 etc.<P>

 A piece is an atom possibly followed by a single `*', `+',
 `?', or bound. An atom followed by `*' matches a sequence
 of 0 or more matches of the atom. An atom followed by `+'
 matches a sequence of 1 or more matches of the atom. An
 atom followed by `?' matches a sequence of 0 or 1 matches
 of the atom.<P>

 A bound is `{' followed by an unsigned decimal integer,
 possibly followed by `,' possibly followed by another
 unsigned decimal integer, always followed by `}'. The
 integers must lie between 0 and RE_DUP_MAX (255)
 inclusive, and if there are two of them, the first may not
 exceed the second. An atom followed by a bound containing
 one integer i and no comma matches a sequence of exactly i
 matches of the atom. An atom followed by a bound
 containing one integer i and a comma matches a sequence of i or
 more matches of the atom. An atom followed by a bound
 containing two integers i and j matches a sequence of i
 through j (inclusive) matches of the atom.<P>

 An atom is a regular expression enclosed in `()' (matching
 a match for the regular expression), an empty set of `()'
 (matching the null string), a bracket expression (see
 below), `.' (matching any single character), `^'
 (matching the null string at the beginning of a line), `$'
 (matching the null string at the end of a line), a `\'
 followed by one of the characters `^.[$()|*+?{\' (matching
 that character taken as an ordinary character), a `\'
 followed by any other character (matching that character
 taken as an ordinary character, as if the `\' had not been
 present), or a single character with no other significance
 (matching that character). A `{' followed by a character
 other than a digit is an ordinary character, not the
 beginning of a bound. It is illegal to end an RE with
 `\'.<P>
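
 As an illustration of the escaping rule above (a sketch added to this page,
 not part of Henry Spencer's text): inside a MySQL string literal `\' is
 itself an escape character, so it has to be doubled for a single `\' to
 reach the regex library. The example assumes the server's default string
 escaping, that is, the NO_BACKSLASH_ESCAPES SQL mode is not set.
<pre>
 mysql> SELECT "1+2" REGEXP "1+2";               -> 0
 mysql> SELECT "1+2" REGEXP "1\+2";              -> 0
 mysql> SELECT "1+2" REGEXP "1\\+2";             -> 1
</pre>
 In the first two statements the library sees the RE `1+2' (one or more `1'
 followed by `2'); only the third passes `1\+2', in which `+' is an ordinary
 character.<P>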

 A bracket expression is a list of characters enclosed in
 `[]'. It normally matches any single character from the
 list (but see below). If the list begins with `^', it
 matches any single character (but see below) not from the
 rest of the list. If two characters in the list are
 separated by `-', this is shorthand for the full range of
 characters between those two (inclusive) in the collating
 sequence, e.g. `[0-9]' in ASCII matches any decimal digit.
 It is illegal for two ranges to share an endpoint, e.g.
 `a-c-e'. Ranges are very collating-sequence-dependent,
 and portable programs should avoid relying on them.<P>

 To include a literal `]' in the list, make it the first
 character (following a possible `^'). To include a
 literal `-', make it the first or last character, or the
 second endpoint of a range. To use a literal `-' as the
 first endpoint of a range, enclose it in `[.' and `.]' to
 make it a collating element (see below). With the
 exception of these and some combinations using `[' (see next
 paragraphs), all other special characters, including `\',
 lose their special significance within a bracket
 expression.<P>
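
 As an illustration of the placement rules for `]' and `-' (a sketch added
 to this page, not part of Henry Spencer's text; the test strings are made
 up and the results assume the bracket-expression rules described above):
<pre>
 mysql> SELECT "]" REGEXP "^[]x]$";              -> 1
 mysql> SELECT "b" REGEXP "^[^]a]$";             -> 1
 mysql> SELECT "-" REGEXP "^[x-]$";              -> 1
 mysql> SELECT "y" REGEXP "^[x-]$";              -> 0
</pre>
<P>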

 Within a bracket expression, a collating element (a
 character, a multi-character sequence that collates as if it
 were a single character, or a collating-sequence name for
 either) enclosed in `[.' and `.]' stands for the sequence
 of characters of that collating element. The sequence is
 a single element of the bracket expression's list. A
 bracket expression containing a multi-character collating
 element can thus match more than one character, e.g. if
 the collating sequence includes a `ch' collating element,
 then the RE `[[.ch.]]*c' matches the first five characters
 of `chchcc'.<P>

 Within a bracket expression, a collating element enclosed
 in `[=' and `=]' is an equivalence class, standing for the
 sequences of characters of all collating elements
 equivalent to that one, including itself. (If there are no
 other equivalent collating elements, the treatment is as
 if the enclosing delimiters were `[.' and `.]'.) For
 example, if o and ^ are the members of an equivalence
 class, then `[[=o=]]', `[[=^=]]', and `[o^]' are all
 synonymous. An equivalence class may not be an endpoint of a
 range.<P>

 Within a bracket expression, the name of a character class
 enclosed in `[:' and `:]' stands for the list of all
 characters belonging to that class. Standard character class
 names are:<P>
<table>
<tr><td>alnum</td><td>digit</td><td>punct</td></tr>
<tr><td>alpha</td><td>graph</td><td>space</td></tr>
<tr><td>blank</td><td>lower</td><td>upper</td></tr>
<tr><td>cntrl</td><td>print</td><td>xdigit</td></tr>
</table>
<P>
 These stand for the character classes defined in ctype(3).
 A locale may provide others. A character class may not be
 used as an endpoint of a range.<P>

 There are two special cases of bracket expressions: the
 bracket expressions `[[:&lt;:]]' and `[[:&gt;:]]' match the null
 string at the beginning and end of a word respectively. A
 word is defined as a sequence of word characters which is
 neither preceded nor followed by word characters. A word
 character is an alnum character (as defined by ctype(3))
 or an underscore. This is an extension, compatible with
 but not specified by POSIX 1003.2, and should be used with
 caution in software intended to be portable to other
 systems.<P>

 In the event that an RE could match more than one
 substring of a given string, the RE matches the one starting
 earliest in the string. If the RE could match more than
 one substring starting at that point, it matches the
 longest. Subexpressions also match the longest possible
 substrings, subject to the constraint that the whole match
 be as long as possible, with subexpressions starting
 earlier in the RE taking priority over ones starting later.
 Note that higher-level subexpressions thus take priority
 over their lower-level component subexpressions.<P>

 Match lengths are measured in characters, not collating
 elements. A null string is considered longer than no
 match at all. For example, `bb*' matches the three middle
 characters of `abbbc', `(wee|week)(knights|nights)'
 matches all ten characters of `weeknights', when `(.*).*'
 is matched against `abc' the parenthesized subexpression
 matches all three characters, and when `(a*)*' is matched
 against `bc' both the whole RE and the parenthesized
 subexpression match the null string.<P>

 If case-independent matching is specified, the effect is
 much as if all case distinctions had vanished from the
 alphabet. When an alphabetic that exists in multiple
 cases appears as an ordinary character outside a bracket
 expression, it is effectively transformed into a bracket
 expression containing both cases, e.g. `x' becomes `[xX]'.
 When it appears inside a bracket expression, all case
 counterparts of it are added to the bracket expression, so
 that (e.g.) `[x]' becomes `[xX]' and `[^x]' becomes
 `[^xX]'.<P>
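
 As an illustration (a sketch added to this page, not part of Henry
 Spencer's text): MySQL's REGEXP matches case-independently for ordinary
 (non-binary) strings and case-sensitively when an operand is a binary
 string. The example assumes a case-insensitive default collation.
<pre>
 mysql> SELECT "a" REGEXP "A";                   -> 1
 mysql> SELECT "a" REGEXP BINARY "A";            -> 0
</pre>
<P>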

 No particular limit is imposed on the length of REs.
 Programs intended to be portable should not employ REs longer
 than 256 bytes, as an implementation can refuse to accept
 such REs and remain POSIX-compliant.<P>

 Obsolete (``basic'') regular expressions differ in several
 respects. `|', `+', and `?' are ordinary characters and
 there is no equivalent for their functionality. The
 delimiters for bounds are `\{' and `\}', with `{' and `}'
 by themselves ordinary characters. The parentheses for
 nested subexpressions are `\(' and `\)', with `(' and `)'
 by themselves ordinary characters. `^' is an ordinary
 character except at the beginning of the RE or the
 beginning of a parenthesized subexpression, `$' is an ordinary
 character except at the end of the RE or the end of a
 parenthesized subexpression, and `*' is an ordinary
 character if it appears at the beginning of the RE or the
 beginning of a parenthesized subexpression (after a
 possible leading `^'). Finally, there is one new type of atom,
 a back reference: `\' followed by a non-zero decimal digit
 d matches the same sequence of characters matched by the
 dth parenthesized subexpression (numbering subexpressions
 by the positions of their opening parentheses, left to
 right), so that (e.g.) `\([bc]\)\1' matches `bb' or `cc'
 but not `bc'.<P>

<B>SEE ALSO</B><BR>
 POSIX 1003.2, section 2.8 (Regular Expression Notation).<P>

<B>BUGS</B><BR>
 Having two kinds of REs is a botch.<P>

 The current 1003.2 spec says that `)' is an ordinary
 character in the absence of an unmatched `('; this was an
 unintentional result of a wording error, and change is
 likely. Avoid relying on it.<P>

 Back references are a dreadful botch, posing major
 problems for efficient implementations. They are also
 somewhat vaguely defined (does `a\(\(b\)*\2\)*d' match
 `abbbd'?). Avoid using them.<P>

 1003.2's specification of case-independent matching is
 vague. The ``one case implies all cases'' definition
 given above is current consensus among implementors as to
 the right interpretation.<P>

 The syntax for word boundaries is incredibly ugly.<P>

<B>AUTHOR</B><BR>
 This page was taken from Henry Spencer's regex package.
</td>
</tr>

</table></center>

<center>
<hr NOSHADE WIDTH="80%"><i><font color="#000099"><font size=-1>This Web Site is
hosted by Apache for OS/2 and done by <a href="mailto:tbretz@astro.uni-wuerzburg.de">Thomas&nbsp;Bretz</a>.</font></font></i><BR>
&nbsp;<BR>
<a href="http://validator.w3.org/check/referer"><img border="0"
 src="../../valid-html40.png" alt="Valid HTML 4.0!" height="20" width="66"></a>
</center>&nbsp;
</tr>
</table>

</center>

</body>
</html>