1 | <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
|
---|
2 |
|
---|
3 | <html>
|
---|
4 | <head>
|
---|
5 | <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
|
---|
6 | <meta name="Author" content="Thomas Bretz">
|
---|
7 | <title>MARS: Magic Analysis and Reconstruction Software</title>
|
---|
8 | <link rel="stylesheet" type="text/css" href="../mars.css">
|
---|
9 | </head>
|
---|
10 |
|
---|
11 | <body background="background.gif" text="#000000" bgcolor="#000099" link="#1122FF" vlink="#8888FF" alink="#FF0000">
|
---|
12 |
|
---|
13 |
|
---|
14 | <center>
|
---|
15 | <table class="Main" CELLPADDING=0>
|
---|
16 |
|
---|
17 | <tr>
|
---|
18 | <td class="Edge"><img SRC="../ecke.gif" ALT=""></td>
|
---|
19 | <td class="Header">
|
---|
20 | <B>M A R S</B><BR><B>M</B>agic <B>A</B>nalysis and <B>R</B>econstruction <B>S</B>oftware
|
---|
21 | </td>
|
---|
22 | </tr>
|
---|
23 |
|
---|
24 | <tr>
|
---|
25 | <td COLSPAN=2 BGCOLOR="#FFFFFF">
|
---|
26 | <hr SIZE=1 NOSHADE WIDTH="80%">
|
---|
27 | <center><table class="Inner" CELLPADDING=15>
|
---|
28 |
|
---|
29 | <tr class="Block">
|
---|
30 | <td><b><u><A NAME="OVERVIEW">MySQL Regular Expressions</A>:</u></b>
|
---|
31 | <P>
|
---|
32 | A <B>regular expression (regex)</B> is a powerful way of specifying a complex search. <P>
|
---|
33 |
|
---|
34 | MySQL uses Henry Spencer's implementation of regular expressions, which is aimed at conformance with POSIX
|
---|
35 | 1003.2. MySQL uses the extended version. <P>
|
---|
36 |
|
---|
37 | This is a simplistic reference that skips the details. To get more exact information, see
|
---|
38 | Henry Spencer's <A HREF="#REGEX">regex(7)</A> manual page, reproduced below.<P>
|
---|
39 |
|
---|
40 | A regular expression describes a set of strings. The simplest regexp is one that has no special characters in it. For
|
---|
41 | example, the regexp <b>hello</B> matches <B>hello</B> and nothing else. <P>
|
---|
42 |
|
---|
43 | Non-trivial regular expressions use certain special constructs so that they can match more than one string. For
|
---|
44 | example, the regexp hello|word matches either the string hello or the string word. <P>
|
---|
45 |
|
---|
46 | As a more complex example, the regexp B[an]*s matches any of the strings Bananas, Baaaaas, Bs, and any
|
---|
47 | other string starting with a B, ending with an s, and containing any number of a or n characters in between. <P>
|
---|
48 |
|
---|
49 | A regular expression may use any of the following special characters/constructs: <P>
|
---|
50 | <pre>
|
---|
51 | ^ Match the beginning of a string.
|
---|
52 | mysql> SELECT "fo\nfo" REGEXP "^fo$"; -> 0
|
---|
53 | mysql> SELECT "fofo" REGEXP "^fo"; -> 1
|
---|
54 |
|
---|
55 | $ Match the end of a string.
|
---|
56 | mysql> SELECT "fo\no" REGEXP "^fo\no$"; -> 1
|
---|
57 | mysql> SELECT "fo\no" REGEXP "^fo$"; -> 0
|
---|
58 |
|
---|
59 | . Match any character (including newline).
|
---|
60 | mysql> SELECT "fofo" REGEXP "^f.*"; -> 1
|
---|
61 | mysql> SELECT "fo\nfo" REGEXP "^f.*"; -> 1
|
---|
62 |
|
---|
63 | a* Match any sequence of zero or more a characters.
|
---|
64 | mysql> SELECT "Ban" REGEXP "^Ba*n"; -> 1
|
---|
65 | mysql> SELECT "Baaan" REGEXP "^Ba*n"; -> 1
|
---|
66 | mysql> SELECT "Bn" REGEXP "^Ba*n"; -> 1
|
---|
67 |
|
---|
68 | a+ Match any sequence of one or more a characters.
|
---|
69 | mysql> SELECT "Ban" REGEXP "^Ba+n"; -> 1
|
---|
70 | mysql> SELECT "Bn" REGEXP "^Ba+n"; -> 0
|
---|
71 |
|
---|
72 | a? Match either zero or one a character.
|
---|
73 | mysql> SELECT "Bn" REGEXP "^Ba?n"; -> 1
|
---|
74 | mysql> SELECT "Ban" REGEXP "^Ba?n"; -> 1
|
---|
75 | mysql> SELECT "Baan" REGEXP "^Ba?n"; -> 0
|
---|
76 |
|
---|
77 | de|abc Match either of the sequences de or abc.
|
---|
78 | mysql> SELECT "pi" REGEXP "pi|apa"; -> 1
|
---|
79 | mysql> SELECT "axe" REGEXP "pi|apa"; -> 0
|
---|
80 | mysql> SELECT "apa" REGEXP "pi|apa"; -> 1
|
---|
81 | mysql> SELECT "apa" REGEXP "^(pi|apa)$"; -> 1
|
---|
82 | mysql> SELECT "pi" REGEXP "^(pi|apa)$"; -> 1
|
---|
83 | mysql> SELECT "pix" REGEXP "^(pi|apa)$"; -> 0
|
---|
84 |
|
---|
85 | (abc)* Match zero or more instances of the sequence abc.
|
---|
86 | mysql> SELECT "pi" REGEXP "^(pi)*$"; -> 1
|
---|
87 | mysql> SELECT "pip" REGEXP "^(pi)*$"; -> 0
|
---|
88 | mysql> SELECT "pipi" REGEXP "^(pi)*$"; -> 1
|
---|
89 |
|
---|
90 | {1} This is a more general way of writing regexps that match many
|
---|
91 | {2,3} occurrences of the previous atom.
|
---|
92 | a* Can be written as a{0,}.
|
---|
93 | a+ Can be written as a{1,}.
|
---|
94 | a? Can be written as a{0,1}.
|
---|
95 |
|
---|
96 | To be more precise, an atom followed by a bound containing one
|
---|
97 | integer i and no comma matches a sequence of exactly i matches
|
---|
98 | of the atom. An atom followed by a bound containing one integer i
|
---|
99 | and a comma matches a sequence of i or more matches of the atom.
|
---|
100 | An atom followed by a bound containing two integers i and j matches
|
---|
101 | a sequence of i through j (inclusive) matches of the atom.
|
---|
102 |
|
---|
103 | Both arguments must be in the range from 0 to RE_DUP_MAX (default 255),
|
---|
104 | inclusive. If there are two arguments, the second must be greater
|
---|
105 | than or equal to the first.
|
---|
106 |
|
---|
107 | [a-dX] Matches any character which is (or is not, if ^ is used) either a, b, c,
|
---|
108 | [^a-dX] d or X. To include a literal ] character, it must immediately follow
|
---|
109 | the opening bracket [. To include a literal - character, it must be
|
---|
110 | written first or last. So [0-9] matches any decimal digit. Any character
|
---|
111 | that does not have a defined meaning inside a [] pair has no special
|
---|
112 | meaning and matches only itself.
|
---|
113 | mysql> SELECT "aXbc" REGEXP "[a-dXYZ]"; -> 1
|
---|
114 | mysql> SELECT "aXbc" REGEXP "^[a-dXYZ]$"; -> 0
|
---|
115 | mysql> SELECT "aXbc" REGEXP "^[a-dXYZ]+$"; -> 1
|
---|
116 | mysql> SELECT "aXbc" REGEXP "^[^a-dXYZ]+$"; -> 0
|
---|
117 | mysql> SELECT "gheis" REGEXP "^[^a-dXYZ]+$"; -> 1
|
---|
118 | mysql> SELECT "gheisa" REGEXP "^[^a-dXYZ]+$"; -> 0
|
---|
119 |
|
---|
120 | [[.characters.]]
|
---|
121 | The sequence of characters of that collating element. characters is
|
---|
122 | either a single character or a character name like newline. You can
|
---|
123 | find the full list of character names in 'regexp/cname.h'.
|
---|
124 |
|
---|
125 | [=character_class=]
|
---|
126 | An equivalence class, standing for the sequences of characters of all
|
---|
127 | collating elements equivalent to that one, including itself.
|
---|
128 |
|
---|
129 | For example, if o and (+) are the members of an equivalence class,
|
---|
130 | then [[=o=]], [[=(+)=]], and [o(+)] are all synonymous. An equivalence
|
---|
131 | class may not be an endpoint of a range.
|
---|
132 |
|
---|
133 | [:character_class:]
|
---|
134 | Within a bracket expression, the name of a character class enclosed
|
---|
135 | in [: and :] stands for the list of all characters belonging to that
|
---|
136 | class. Standard character class names are:
|
---|
137 | alnum, alpha, blank, cntrl, digit, graph, lower, print, punct, space, upper, xdigit
|
---|
138 | These stand for the character classes defined in the ctype(3) manual
|
---|
139 | page. A locale may provide others. A character class may not be used
|
---|
140 | as an endpoint of a range.
|
---|
141 | mysql> SELECT "justalnums" REGEXP "[[:alnum:]]+"; -> 1
|
---|
142 | mysql> SELECT "!!" REGEXP "[[:alnum:]]+"; -> 0
|
---|
143 |
|
---|
144 | [[:<:]] These match the null string at the beginning and end of a word
|
---|
145 | [[:>:]] respectively. A word is defined as a sequence of word characters
|
---|
146 | which is neither preceded nor followed by word characters. A word
|
---|
147 | character is an alnum character (as defined by ctype(3)) or an
|
---|
148 | underscore (_).
|
---|
149 | mysql> SELECT "a word a" REGEXP "[[:<:]]word[[:>:]]"; -> 1
|
---|
150 | mysql> SELECT "a xword a" REGEXP "[[:<:]]word[[:>:]]"; -> 0
|
---|
151 |
|
---|
152 | mysql> SELECT "weeknights" REGEXP "^(wee|week)(knights|nights)$"; -> 1
|
---|
153 | </pre>
|
---|
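The examples above can be approximated outside MySQL. The sketch below assumes Python's re module as a stand-in for Henry Spencer's library; it is a different engine, so only the simple constructs shown here are expected to agree. The helper name regexp is hypothetical. Like the REGEXP operator in the examples, it reports 1 when the pattern matches anywhere in the string and 0 otherwise.<P>

```python
import re

def regexp(text, pattern):
    """Hypothetical stand-in for `text REGEXP pattern`: 1 on a match anywhere, else 0."""
    return 1 if re.search(pattern, text) else 0

# Anchors: ^ and $ tie the match to the ends of the whole string.
print(regexp("fofo", "^fo"))        # 1
print(regexp("fo\nfo", "^fo$"))     # 0

# Alternation, with and without anchoring parentheses.
print(regexp("apa", "pi|apa"))      # 1
print(regexp("pix", "^(pi|apa)$"))  # 0

# Repetition of a parenthesized group.
print(regexp("pipi", "^(pi)*$"))    # 1
print(regexp("pip", "^(pi)*$"))     # 0
```

One known difference: in Python, . does not match a newline unless re.DOTALL is given, whereas the table above states that . in MySQL does. The character class names and the word boundary markers are also not available in Python's re module.<P>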
154 | </td></tr>
|
---|
155 | <tr class="Block">
|
---|
156 | <td>
|
---|
157 | <center><h3>--- <A NAME="REGEX"><U>REGEX</U></A>(7) ---</h3></center>
|
---|
158 | <B>NAME</B><BR>
|
---|
159 | regex - POSIX 1003.2 regular expressions<P>
|
---|
160 |
|
---|
161 | <B>DESCRIPTION</B><BR>
|
---|
162 | Regular expressions (``RE''s), as defined in POSIX 1003.2,
|
---|
163 | come in two forms: modern REs (roughly those of egrep;
|
---|
164 | 1003.2 calls these ``extended'' REs) and obsolete REs
|
---|
165 | (roughly those of ed; 1003.2 ``basic'' REs). Obsolete REs
|
---|
166 | mostly exist for backward compatibility in some old
|
---|
167 | programs; they will be discussed at the end. 1003.2 leaves
|
---|
168 | some aspects of RE syntax and semantics open; `' marks
|
---|
169 | decisions on these aspects that may not be fully portable
|
---|
170 | to other 1003.2 implementations.<P>
|
---|
171 |
|
---|
172 | A (modern) RE is one or more non-empty branches, separated
|
---|
173 | by `|'. It matches anything that matches one of the
|
---|
174 | branches.<P>
|
---|
175 |
|
---|
176 | A branch is one or more pieces, concatenated. It matches
|
---|
177 | a match for the first, followed by a match for the second,
|
---|
178 | etc.<P>
|
---|
179 |
|
---|
180 | A piece is an atom possibly followed by a single `*', `+',
|
---|
181 | `?', or bound. An atom followed by `*' matches a sequence
|
---|
182 | of 0 or more matches of the atom. An atom followed by `+'
|
---|
183 | matches a sequence of 1 or more matches of the atom. An
|
---|
184 | atom followed by `?' matches a sequence of 0 or 1 matches
|
---|
185 | of the atom.<P>
|
---|
186 |
|
---|
187 | A bound is `{' followed by an unsigned decimal integer,
|
---|
188 | possibly followed by `,' possibly followed by another
|
---|
189 | unsigned decimal integer, always followed by `}'. The
|
---|
190 | integers must lie between 0 and RE_DUP_MAX (255)
|
---|
191 | inclusive, and if there are two of them, the first may not
|
---|
192 | exceed the second. An atom followed by a bound containing
|
---|
193 | one integer i and no comma matches a sequence of exactly i
|
---|
194 | matches of the atom. An atom followed by a bound
|
---|
195 | containing one integer i and a comma matches a sequence of i or
|
---|
196 | more matches of the atom. An atom followed by a bound
|
---|
197 | containing two integers i and j matches a sequence of i
|
---|
198 | through j (inclusive) matches of the atom.<P>
|
---|
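The relationship between `*', `+', `?' and bounds can be checked mechanically. This is a minimal sketch that assumes Python's re module, which accepts the same bound syntax for these cases; the helper name whole and the sample strings are illustrative only.<P>

```python
import re

def whole(pattern, text):
    """True when pattern matches the entire text."""
    return re.fullmatch(pattern, text) is not None

samples = ["", "a", "aa", "aaa", "aaaa"]

# Each shorthand accepts exactly the same samples as its bound form.
for short, bound in [("a*", "a{0,}"), ("a+", "a{1,}"), ("a?", "a{0,1}")]:
    same = all(whole(short, s) == whole(bound, s) for s in samples)
    print(short, bound, same)

exactly_two = [s for s in samples if whole("a{2}", s)]
two_to_three = [s for s in samples if whole("a{2,3}", s)]
at_least_two = [s for s in samples if whole("a{2,}", s)]
print(exactly_two)    # ['aa']
print(two_to_three)   # ['aa', 'aaa']
print(at_least_two)   # ['aa', 'aaa', 'aaaa']
```

Python applies its own limit to bound values rather than RE_DUP_MAX, so the ceiling of 255 is not demonstrated here.<P>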
199 |
|
---|
200 | An atom is a regular expression enclosed in `()' (matching
|
---|
201 | a match for the regular expression), an empty set of `()'
|
---|
202 | (matching the null string), a bracket expression (see
|
---|
203 | below), `.' (matching any single character), `^'
|
---|
204 | (matching the null string at the beginning of a line), `$'
|
---|
205 | (matching the null string at the end of a line), a `\'
|
---|
206 | followed by one of the characters `^.[$()|*+?{\' (matching
|
---|
207 | that character taken as an ordinary character), a `\'
|
---|
208 | followed by any other character (matching that character
|
---|
209 | taken as an ordinary character, as if the `\' had not been
|
---|
210 | present), or a single character with no other significance
|
---|
211 | (matching that character). A `{' followed by a character
|
---|
212 | other than a digit is an ordinary character, not the
|
---|
213 | beginning of a bound. It is illegal to end an RE with
|
---|
214 | `\'.<P>
|
---|
215 |
|
---|
216 | A bracket expression is a list of characters enclosed in
|
---|
217 | `[]'. It normally matches any single character from the
|
---|
218 | list (but see below). If the list begins with `^', it
|
---|
219 | matches any single character (but see below) not from the
|
---|
220 | rest of the list. If two characters in the list are
|
---|
221 | separated by `-', this is shorthand for the full range of
|
---|
222 | characters between those two (inclusive) in the collating
|
---|
223 | sequence, e.g. `[0-9]' in ASCII matches any decimal digit.
|
---|
224 | It is illegal for two ranges to share an endpoint, e.g.
|
---|
225 | `a-c-e'. Ranges are very collating-sequence-dependent,
|
---|
226 | and portable programs should avoid relying on them.<P>
|
---|
227 |
|
---|
228 | To include a literal `]' in the list, make it the first
|
---|
229 | character (following a possible `^'). To include a
|
---|
230 | literal `-', make it the first or last character, or the
|
---|
231 | second endpoint of a range. To use a literal `-' as the
|
---|
232 | first endpoint of a range, enclose it in `[.' and `.]' to
|
---|
233 | make it a collating element (see below). With the
|
---|
234 | exception of these and some combinations using `[' (see next
|
---|
235 | paragraphs), all other special characters, including `\',
|
---|
236 | lose their special significance within a bracket
|
---|
237 | expression.<P>
|
---|
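The placement rules for `^', `]' and `-' inside a bracket expression can be tried out as follows. This is a sketch that assumes Python's re module, which follows the same placement rules for these three characters; the helper name found is hypothetical.<P>

```python
import re

def found(pattern, text):
    """True when pattern matches somewhere in text."""
    return re.search(pattern, text) is not None

# A range plus single characters, and its negation.
print(found("^[a-dXYZ]+$", "aXbc"))     # True
print(found("^[^a-dXYZ]+$", "gheis"))   # True
print(found("^[^a-dXYZ]+$", "gheisa"))  # False

# A literal ] placed first, and a literal - placed last.
print(found("^[]a]+$", "a]]a"))         # True
print(found("^[a-]+$", "a-a"))          # True
print(found("^[a-]+$", "b"))            # False
```

Python differs on one point made above: it still treats a backslash as an escape inside brackets, so the sketch avoids backslashes in the lists.<P>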
238 |
|
---|
239 | Within a bracket expression, a collating element (a
|
---|
240 | character, a multi-character sequence that collates as if it
|
---|
241 | were a single character, or a collating-sequence name for
|
---|
242 | either) enclosed in `[.' and `.]' stands for the sequence
|
---|
243 | of characters of that collating element. The sequence is
|
---|
244 | a single element of the bracket expression's list. A
|
---|
245 | bracket expression containing a multi-character collating
|
---|
246 | element can thus match more than one character, e.g. if
|
---|
247 | the collating sequence includes a `ch' collating element,
|
---|
248 | then the RE `[[.ch.]]*c' matches the first five characters
|
---|
249 | of `chchcc'.<P>
|
---|
250 |
|
---|
251 | Within a bracket expression, a collating element enclosed
|
---|
252 | in `[=' and `=]' is an equivalence class, standing for the
|
---|
253 | sequences of characters of all collating elements
|
---|
254 | equivalent to that one, including itself. (If there are no
|
---|
255 | other equivalent collating elements, the treatment is as
|
---|
256 | if the enclosing delimiters were `[.' and `.]'.) For
|
---|
257 | example, if o and ^ are the members of an equivalence
|
---|
258 | class, then `[[=o=]]', `[[=^=]]', and `[o^]' are all
|
---|
259 | synonymous. An equivalence class may not be an endpoint of a
|
---|
260 | range.<P>
|
---|
261 |
|
---|
262 | Within a bracket expression, the name of a character class
|
---|
263 | enclosed in `[:' and `:]' stands for the list of all
|
---|
264 | characters belonging to that class. Standard character class
|
---|
265 | names are:<P>
|
---|
266 | <table>
|
---|
267 | <tr><td>alnum</TD><td>digit</td><td>punct</td></tr>
|
---|
268 | <tr><td>alpha</TD><td>graph</TD><td>space</td></tr>
|
---|
269 | <tr><td>blank</TD><td>lower</TD><td>upper</td></tr>
|
---|
270 | <tr><td>cntrl</TD><td>print</TD><td>xdigit</td></tr>
|
---|
271 | </table>
|
---|
272 | <P>
|
---|
273 | These stand for the character classes defined in ctype(3).
|
---|
274 | A locale may provide others. A character class may not be
|
---|
275 | used as an endpoint of a range.<P>
|
---|
276 |
|
---|
277 | There are two special cases of bracket expressions: the
|
---|
278 | bracket expressions `[[:<:]]' and `[[:>:]]' match the null
|
---|
279 | string at the beginning and end of a word respectively. A
|
---|
280 | word is defined as a sequence of word characters which is
|
---|
281 | neither preceded nor followed by word characters. A word
|
---|
282 | character is an alnum character (as defined by ctype(3))
|
---|
283 | or an underscore. This is an extension, compatible with
|
---|
284 | but not specified by POSIX 1003.2, and should be used with
|
---|
285 | caution in software intended to be portable to other
|
---|
286 | systems.<P>
|
---|
287 |
|
---|
288 | In the event that an RE could match more than one
|
---|
289 | substring of a given string, the RE matches the one starting
|
---|
290 | earliest in the string. If the RE could match more than
|
---|
291 | one substring starting at that point, it matches the
|
---|
292 | longest. Subexpressions also match the longest possible
|
---|
293 | substrings, subject to the constraint that the whole match
|
---|
294 | be as long as possible, with subexpressions starting
|
---|
295 | earlier in the RE taking priority over ones starting later.
|
---|
296 | Note that higher-level subexpressions thus take priority
|
---|
297 | over their lower-level component subexpressions.<P>
|
---|
298 |
|
---|
299 | Match lengths are measured in characters, not collating
|
---|
300 | elements. A null string is considered longer than no
|
---|
301 | match at all. For example, `bb*' matches the three middle
|
---|
302 | characters of `abbbc', `(wee|week)(knights|nights)'
|
---|
303 | matches all ten characters of `weeknights', when `(.*).*'
|
---|
304 | is matched against `abc' the parenthesized subexpression
|
---|
305 | matches all three characters, and when `(a*)*' is matched
|
---|
306 | against `bc' both the whole RE and the parenthesized
|
---|
307 | subexpression match the null string.<P>
|
---|
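Some of these matching rules can be observed directly. The sketch below assumes Python's re module, which is a backtracking engine rather than a POSIX leftmost-longest matcher; the overall matches agree for these inputs, but subexpression boundaries may not.<P>

```python
import re

# Earliest start wins: the match begins at the first b, not later.
first = re.search("bb*", "abbbc")
print(first.span())                     # (1, 4)

# All ten characters of weeknights are matched.
whole = re.fullmatch("(wee|week)(knights|nights)", "weeknights")
print(whole.group(0))                   # weeknights
print(whole.group(1), whole.group(2))   # wee knights

# The earlier subexpression takes as much as it can.
greedy = re.fullmatch("(.*).*", "abc")
print(greedy.group(1))                  # abc
```

Under the rule quoted above, the first subexpression would be expected to take week and leave nights to the second. Python tries alternatives in order and settles on wee and knights instead, so it should not be used to predict POSIX subexpression boundaries.<P>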
308 |
|
---|
309 | If case-independent matching is specified, the effect is
|
---|
310 | much as if all case distinctions had vanished from the
|
---|
311 | alphabet. When an alphabetic that exists in multiple
|
---|
312 | cases appears as an ordinary character outside a bracket
|
---|
313 | expression, it is effectively transformed into a bracket
|
---|
314 | expression containing both cases, e.g. `x' becomes `[xX]'.
|
---|
315 | When it appears inside a bracket expression, all case
|
---|
316 | counterparts of it are added to the bracket expression, so
|
---|
317 | that (e.g.) `[x]' becomes `[xX]' and `[^x]' becomes
|
---|
318 | `[^xX]'.<P>
|
---|
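The ``one case implies all cases'' behaviour can be illustrated with a short sketch. It assumes Python's re module with the re.IGNORECASE flag as the way of requesting case-independent matching; the helper name blind is hypothetical.<P>

```python
import re

def blind(pattern, text):
    """True when pattern matches all of text, ignoring case."""
    return re.fullmatch(pattern, text, re.IGNORECASE) is not None

# An ordinary x behaves like the bracket expression [xX].
print(blind("x", "X"))                           # True
print(re.fullmatch("[xX]", "X") is not None)     # True

# Inside brackets, [x] gains its case counterpart as well.
print(blind("[x]", "X"))                         # True

# A negated list excludes both cases, as [^xX] would.
print(blind("[^x]", "X"))                        # False
print(blind("[^x]", "y"))                        # True
```
<P>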
319 |
|
---|
320 | No particular limit is imposed on the length of REs.
|
---|
321 | Programs intended to be portable should not employ REs longer
|
---|
322 | than 256 bytes, as an implementation can refuse to accept
|
---|
323 | such REs and remain POSIX-compliant.<P>
|
---|
324 |
|
---|
325 | Obsolete (``basic'') regular expressions differ in several
|
---|
326 | respects. `|', `+', and `?' are ordinary characters and
|
---|
327 | there is no equivalent for their functionality. The
|
---|
328 | delimiters for bounds are `\{' and `\}', with `{' and `}'
|
---|
329 | by themselves ordinary characters. The parentheses for
|
---|
330 | nested subexpressions are `\(' and `\)', with `(' and `)'
|
---|
331 | by themselves ordinary characters. `^' is an ordinary
|
---|
332 | character except at the beginning of the RE or the
|
---|
333 | beginning of a parenthesized subexpression, `$' is an ordinary
|
---|
334 | character except at the end of the RE or the end of a
|
---|
335 | parenthesized subexpression, and `*' is an ordinary
|
---|
336 | character if it appears at the beginning of the RE or the
|
---|
337 | beginning of a parenthesized subexpression (after a
|
---|
338 | possible leading `^'). Finally, there is one new type of atom,
|
---|
339 | a back reference: `\' followed by a non-zero decimal digit
|
---|
340 | d matches the same sequence of characters matched by the
|
---|
341 | dth parenthesized subexpression (numbering subexpressions
|
---|
342 | by the positions of their opening parentheses, left to
|
---|
343 | right), so that (e.g.) `\([bc]\)\1' matches `bb' or `cc'
|
---|
344 | but not `bc'.<P>
|
---|
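The back reference example can be reproduced as a sketch in Python's re module. Python is assumed here only for illustration: it writes groups with plain parentheses, as modern REs do, yet still accepts a backslash followed by a digit as a back reference.<P>

```python
import re

# ([bc]) captures one character; \1 must then repeat that same character.
doubled = re.compile(r"([bc])\1")

results = {text: doubled.fullmatch(text) is not None
           for text in ["bb", "cc", "bc"]}
for text, ok in results.items():
    print(text, ok)
# bb True
# cc True
# bc False
```
<P>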
345 |
|
---|
346 | <B>SEE ALSO</B><BR>
|
---|
347 | POSIX 1003.2, section 2.8 (Regular Expression Notation).<P>
|
---|
348 |
|
---|
349 | <B>BUGS</B><BR>
|
---|
350 | Having two kinds of REs is a botch.<P>
|
---|
351 |
|
---|
352 | The current 1003.2 spec says that `)' is an ordinary
|
---|
353 | character in the absence of an unmatched `('; this was an
|
---|
354 | unintentional result of a wording error, and change is
|
---|
355 | likely. Avoid relying on it.<P>
|
---|
356 |
|
---|
357 | Back references are a dreadful botch, posing major
|
---|
358 | problems for efficient implementations. They are also
|
---|
359 | somewhat vaguely defined (does `a\(\(b\)*\2\)*d' match
|
---|
360 | `abbbd'?). Avoid using them.<P>
|
---|
361 |
|
---|
362 | 1003.2's specification of case-independent matching is
|
---|
363 | vague. The ``one case implies all cases'' definition
|
---|
364 | given above is current consensus among implementors as to
|
---|
365 | the right interpretation.<P>
|
---|
366 |
|
---|
367 | The syntax for word boundaries is incredibly ugly.<P>
|
---|
368 |
|
---|
369 | <B>AUTHOR</B><BR>
|
---|
370 | This page was taken from Henry Spencer's regex package.
|
---|
371 | </td>
|
---|
372 | </tr>
|
---|
373 |
|
---|
374 | </table></center>
|
---|
375 |
|
---|
376 | <center>
|
---|
377 | <hr NOSHADE WIDTH="80%"><i><font color="#000099"><font size=-1>This Web Site is
|
---|
378 | hosted by Apache for OS/2 and done by <a href="mailto:tbretz@astro.uni-wuerzburg.de">Thomas Bretz</a>.</font></font></i><BR>
|
---|
379 | <BR>
|
---|
380 | <a href="http://validator.w3.org/check/referer"><img border="0"
|
---|
381 | src="../../valid-html40.png" alt="Valid HTML 4.0!" height="20" width="66"></a>
|
---|
382 | </center>
|
---|
383 | </tr>
|
---|
384 | </table>
|
---|
385 |
|
---|
386 | </center>
|
---|
387 |
|
---|
388 | </body>
|
---|
389 | </html>
|
---|