regexp.html@ 18066

Last change on this file since 18066 was 17386, checked in by tbretz, 11 years ago
Removed svn:executable property, these are no executables.
File size: 19.5 KB

1	<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
2
3	<html>
4	<head>
5	<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
6	<meta name="Author" content="Thomas Bretz">
7	<title>MARS: Magic Analysis and Reconstruction Software</title>
8	<link rel="stylesheet" type="text/css" href="mars.css">
9	</head>
10
11	<body background="background.gif" text="#000000" bgcolor="#000099" link="#1122FF" vlink="#8888FF" alink="#FF0000">
12
13
14	<center>
15	<table class="Main" CELLPADDING=0>
16
17	<tr>
18	<td class="Edge"><img SRC="ecke.gif" ALT=""></td>
19	<td class="Header">
20	<B>M A R S</B><BR><B>M</B>agic <B>A</B>nalysis and <B>R</B>econstruction <B>S</B>oftware
21	</td>
22	</tr>
23
24	<tr>
25	<td COLSPAN=2 BGCOLOR="#FFFFFF">
26	<hr SIZE=1 NOSHADE WIDTH="80%">
27	<center><table class="Inner" CELLPADDING=15>
28
29	<tr class="Block">
30	<td><b><u><A NAME="OVERVIEW">MySQL Regular Expressions</A>:</u></b>
31	<P>
32	A <B>regular expression (regex)</B> is a powerful way of specifying a complex search. <P>
33
34	MySQL uses Henry Spencer's implementation of regular expressions, which is aimed at conformance with POSIX
35	1003.2. MySQL uses the extended version. <P>
36
37	This is a simplistic reference that skips the details. To get more exact information, see
38	Henry Spencer's <A HREF="#REGEX">regex(7)</A>.<P>
39
40	A regular expression describes a set of strings. The simplest regexp is one that has no special characters in it. For
41	example, the regexp <b>hello</B> matches <B>hello</B> and nothing else. <P>
42
43	Non-trivial regular expressions use certain special constructs so that they can match more than one string. For
44	example, the regexp hello|word matches either the string hello or the string word. <P>
45
46	As a more complex example, the regexp B[an]*s matches any of the strings Bananas, Baaaaas, Bs, and any
47	other string starting with a B, ending with an s, and containing any number of a or n characters in between. <P>
48
49	A regular expression may use any of the following special characters/constructs: <P>
50	<pre>
51	^ Match the beginning of a string.
52	mysql> SELECT "fo\nfo" REGEXP "^fo$"; -> 0
53	mysql> SELECT "fofo" REGEXP "^fo"; -> 1
54
55	$ Match the end of a string.
56	mysql> SELECT "fo\no" REGEXP "^fo\no$"; -> 1
57	mysql> SELECT "fo\no" REGEXP "^fo$"; -> 0
58
59	. Match any character (including newline).
60	mysql> SELECT "fofo" REGEXP "^f.*"; -> 1
61	mysql> SELECT "fo\nfo" REGEXP "^f.*"; -> 1
62
63	a* Match any sequence of zero or more a characters.
64	mysql> SELECT "Ban" REGEXP "^Ba*n"; -> 1
65	mysql> SELECT "Baaan" REGEXP "^Ba*n"; -> 1
66	mysql> SELECT "Bn" REGEXP "^Ba*n"; -> 1
67
68	a+ Match any sequence of one or more a characters.
69	mysql> SELECT "Ban" REGEXP "^Ba+n"; -> 1
70	mysql> SELECT "Bn" REGEXP "^Ba+n"; -> 0
71
72	a? Match either zero or one a character.
73	mysql> SELECT "Bn" REGEXP "^Ba?n"; -> 1
74	mysql> SELECT "Ban" REGEXP "^Ba?n"; -> 1
75	mysql> SELECT "Baan" REGEXP "^Ba?n"; -> 0
76
77	de|abc Match either of the sequences de or abc.
78	mysql> SELECT "pi" REGEXP "pi|apa"; -> 1
79	mysql> SELECT "axe" REGEXP "pi|apa"; -> 0
80	mysql> SELECT "apa" REGEXP "pi|apa"; -> 1
81	mysql> SELECT "apa" REGEXP "^(pi|apa)$"; -> 1
82	mysql> SELECT "pi" REGEXP "^(pi|apa)$"; -> 1
83	mysql> SELECT "pix" REGEXP "^(pi|apa)$"; -> 0
84
85	(abc)* Match zero or more instances of the sequence abc.
86	mysql> SELECT "pi" REGEXP "^(pi)*$"; -> 1
87	mysql> SELECT "pip" REGEXP "^(pi)*$"; -> 0
88	mysql> SELECT "pipi" REGEXP "^(pi)*$"; -> 1
89
90	{1} This is a more general way of writing regexps that match many
91	{2,3} occurrences of the previous atom.
92	a* Can be written as a{0,}.
93	a+ Can be written as a{1,}.
94	a? Can be written as a{0,1}.
95
96	To be more precise, an atom followed by a bound containing one
97	integer i and no comma matches a sequence of exactly i matches
98	of the atom. An atom followed by a bound containing one integer i
99	and a comma matches a sequence of i or more matches of the atom.
100	An atom followed by a bound containing two integers i and j matches
101	a sequence of i through j (inclusive) matches of the atom.
102
103	Both arguments must be in the range from 0 to RE_DUP_MAX (default 255),
104	inclusive. If there are two arguments, the second must be greater
105	than or equal to the first.
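
	A few illustrative queries for bounds (a sketch in the style of the
	other examples here; the test strings are chosen for illustration):
	mysql> SELECT "abcde" REGEXP "a[bcd]{2}e"; -> 0
	mysql> SELECT "abcde" REGEXP "a[bcd]{3}e"; -> 1
	mysql> SELECT "abcde" REGEXP "a[bcd]{1,10}e"; -> 1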
106
107	[a-dX] Matches any character which is (or is not, if ^ is used) either a, b, c,
108	[^a-dX] d or X. To include a literal ] character, it must immediately follow
109	the opening bracket [. To include a literal - character, it must be
110	written first or last. So [0-9] matches any decimal digit. Any character
111	that does not have a defined meaning inside a [] pair has no special
112	meaning and matches only itself.
113	mysql> SELECT "aXbc" REGEXP "[a-dXYZ]"; -> 1
114	mysql> SELECT "aXbc" REGEXP "^[a-dXYZ]$"; -> 0
115	mysql> SELECT "aXbc" REGEXP "^[a-dXYZ]+$"; -> 1
116	mysql> SELECT "aXbc" REGEXP "^[^a-dXYZ]+$"; -> 0
117	mysql> SELECT "gheis" REGEXP "^[^a-dXYZ]+$"; -> 1
118	mysql> SELECT "gheisa" REGEXP "^[^a-dXYZ]+$"; -> 0
119
120	[[.characters.]]
121	The sequence of characters of that collating element. characters is
122	either a single character or a character name like newline. You can
123	find the full list of character names in 'regexp/cname.h'.
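
	For instance (a sketch in the style of the other examples here; it
	assumes the name tilde is listed in cname.h for the ~ character):
	mysql> SELECT "~" REGEXP "[[.~.]]"; -> 1
	mysql> SELECT "~" REGEXP "[[.tilde.]]"; -> 1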
124
125	[=character_class=]
126	An equivalence class, standing for the sequences of characters of all
127	collating elements equivalent to that one, including itself.
128
129	For example, if o and (+) are the members of an equivalence class,
130	then [[=o=]], [[=(+)=]], and [o(+)] are all synonymous. An equivalence
131	class may not be an endpoint of a range.
132
133	[:character_class:]
134	Within a bracket expression, the name of a character class enclosed
135	in [: and :] stands for the list of all characters belonging to that
136	class. Standard character class names are: alnum, alpha, blank,
	cntrl, digit, graph, lower, print, punct, space, upper and xdigit.
137
138	These stand for the character classes defined in the ctype(3) manual
139	page. A locale may provide others. A character class may not be used
140	as an endpoint of a range.
141	mysql> SELECT "justalnums" REGEXP "[[:alnum:]]+"; -> 1
142	mysql> SELECT "!!" REGEXP "[[:alnum:]]+"; -> 0
143
144	[[:<:]] These match the null string at the beginning and end of a word
145	[[:>:]] respectively. A word is defined as a sequence of word characters
146	which is neither preceded nor followed by word characters. A word
147	character is an alnum character (as defined by ctype(3)) or an
148	underscore (_).
149	mysql> SELECT "a word a" REGEXP "[[:<:]]word[[:>:]]"; -> 1
150	mysql> SELECT "a xword a" REGEXP "[[:<:]]word[[:>:]]"; -> 0
151
152	mysql> SELECT "weeknights" REGEXP "^(wee|week)(knights|nights)$"; -> 1
153	</pre>
154	</td></tr>
155	<tr class="Block">
156	<td>
157	<center><h3>--- <A NAME="REGEX"><U>REGEX</U></A>(7) ---</h3></center>
158	<B>NAME</B><BR>
159	regex - POSIX 1003.2 regular expressions<P>
160
161	<B>DESCRIPTION</B><BR>
162	Regular expressions (``RE''s), as defined in POSIX 1003.2,
163	come in two forms: modern REs (roughly those of egrep;
164	1003.2 calls these ``extended'' REs) and obsolete REs
165	(roughly those of ed; 1003.2 ``basic'' REs). Obsolete REs
166	mostly exist for backward compatibility in some old programs;
167	they will be discussed at the end. 1003.2 leaves
168	some aspects of RE syntax and semantics open; `' marks
169	decisions on these aspects that may not be fully portable
170	to other 1003.2 implementations.<P>
171
172	A (modern) RE is one or more non-empty branches, separated
173	by `|'. It matches anything that matches one of the
174	branches.<P>
175
176	A branch is one or more pieces, concatenated. It matches
177	a match for the first, followed by a match for the second,
178	etc.<P>
179
180	A piece is an atom possibly followed by a single `*', `+',
181	`?', or bound. An atom followed by `*' matches a sequence
182	of 0 or more matches of the atom. An atom followed by `+'
183	matches a sequence of 1 or more matches of the atom. An
184	atom followed by `?' matches a sequence of 0 or 1 matches
185	of the atom.<P>
186
187	A bound is `{' followed by an unsigned decimal integer,
188	possibly followed by `,' possibly followed by another
189	unsigned decimal integer, always followed by `}'. The
190	integers must lie between 0 and RE_DUP_MAX (255) inclusive,
191	and if there are two of them, the first may not
192	exceed the second. An atom followed by a bound containing
193	one integer i and no comma matches a sequence of exactly i
194	matches of the atom. An atom followed by a bound containing
195	one integer i and a comma matches a sequence of i or
196	more matches of the atom. An atom followed by a bound
197	containing two integers i and j matches a sequence of i
198	through j (inclusive) matches of the atom.<P>
199
200	An atom is a regular expression enclosed in `()' (matching
201	a match for the regular expression), an empty set of `()'
202	(matching the null string), a bracket expression (see
203	below), `.' (matching any single character), `^' (matching
204	the null string at the beginning of a line), `$'
205	(matching the null string at the end of a line), a `\'
206	followed by one of the characters `^.[$()|*+?{\' (matching
207	that character taken as an ordinary character), a `\' followed
208	by any other character (matching that character
209	taken as an ordinary character, as if the `\' had not been
210	present), or a single character with no other significance
211	(matching that character). A `{' followed by a character
212	other than a digit is an ordinary character, not the
213	beginning of a bound. It is illegal to end an RE with
214	`\'.<P>
215
216	A bracket expression is a list of characters enclosed in
217	`[]'. It normally matches any single character from the
218	list (but see below). If the list begins with `^', it
219	matches any single character (but see below) not from the
220	rest of the list. If two characters in the list are separated
221	by `-', this is shorthand for the full range of
222	characters between those two (inclusive) in the collating
223	sequence, e.g. `[0-9]' in ASCII matches any decimal digit.
224	It is illegal for two ranges to share an endpoint, e.g.
225	`a-c-e'. Ranges are very collating-sequence-dependent,
226	and portable programs should avoid relying on them.<P>
227
228	To include a literal `]' in the list, make it the first
229	character (following a possible `^'). To include a literal
230	`-', make it the first or last character, or the second
231	endpoint of a range. To use a literal `-' as the
232	first endpoint of a range, enclose it in `[.' and `.]' to
233	make it a collating element (see below). With the exception
234	of these and some combinations using `[' (see next
235	paragraphs), all other special characters, including `\',
236	lose their special significance within a bracket
237	expression.<P>
238
239	Within a bracket expression, a collating element (a character,
240	a multi-character sequence that collates as if it
241	were a single character, or a collating-sequence name for
242	either) enclosed in `[.' and `.]' stands for the sequence
243	of characters of that collating element. The sequence is
244	a single element of the bracket expression's list. A
245	bracket expression containing a multi-character collating
246	element can thus match more than one character, e.g. if
247	the collating sequence includes a `ch' collating element,
248	then the RE `[[.ch.]]*c' matches the first five characters
249	of `chchcc'.<P>
250
251	Within a bracket expression, a collating element enclosed
252	in `[=' and `=]' is an equivalence class, standing for the
253	sequences of characters of all collating elements equivalent
254	to that one, including itself. (If there are no
255	other equivalent collating elements, the treatment is as
256	if the enclosing delimiters were `[.' and `.]'.) For
257	example, if o and &ocirc; are the members of an equivalence
258	class, then `[[=o=]]', `[[=&ocirc;=]]', and `[o&ocirc;]' are all
259	synonymous. An equivalence class may not be an endpoint of a
260	range.<P>
261
262	Within a bracket expression, the name of a character class
263	enclosed in `[:' and `:]' stands for the list of all characters
264	belonging to that class. Standard character class
265	names are:<P>
266	<table>
267	<tr><td>alnum</TD><td>digit</td><td>punct</td></tr>
268	<tr><td>alpha</TD><td>graph</TD><td>space</td></tr>
269	<tr><td>blank</TD><td>lower</TD><td>upper</td></tr>
270	<tr><td>cntrl</TD><td>print</TD><td>xdigit</td></tr>
271	</table>
272	<P>
273	These stand for the character classes defined in ctype(3).
274	A locale may provide others. A character class may not be
275	used as an endpoint of a range.<P>
276
277	There are two special cases of bracket expressions: the
278	bracket expressions `[[:<:]]' and `[[:>:]]' match the null
279	string at the beginning and end of a word respectively. A
280	word is defined as a sequence of word characters which is
281	neither preceded nor followed by word characters. A word
282	character is an alnum character (as defined by ctype(3))
283	or an underscore. This is an extension, compatible with
284	but not specified by POSIX 1003.2, and should be used with
285	caution in software intended to be portable to other
286	systems.<P>
287
288	In the event that an RE could match more than one substring
289	of a given string, the RE matches the one starting
290	earliest in the string. If the RE could match more than
291	one substring starting at that point, it matches the
292	longest. Subexpressions also match the longest possible
293	substrings, subject to the constraint that the whole match
294	be as long as possible, with subexpressions starting earlier
295	in the RE taking priority over ones starting later.
296	Note that higher-level subexpressions thus take priority
297	over their lower-level component subexpressions.<P>
298
299	Match lengths are measured in characters, not collating
300	elements. A null string is considered longer than no
301	match at all. For example, `bb*' matches the three middle
302	characters of `abbbc', `(wee|week)(knights|nights)'
303	matches all ten characters of `weeknights', when `(.*).*'
304	is matched against `abc' the parenthesized subexpression
305	matches all three characters, and when `(a*)*' is matched
306	against `bc' both the whole RE and the parenthesized
307	subexpression match the null string.<P>
308
309	If case-independent matching is specified, the effect is
310	much as if all case distinctions had vanished from the
311	alphabet. When an alphabetic that exists in multiple
312	cases appears as an ordinary character outside a bracket
313	expression, it is effectively transformed into a bracket
314	expression containing both cases, e.g. `x' becomes `[xX]'.
315	When it appears inside a bracket expression, all case
316	counterparts of it are added to the bracket expression, so
317	that (e.g.) `[x]' becomes `[xX]' and `[^x]' becomes
318	`[^xX]'.<P>
319
320	No particular limit is imposed on the length of REs. Programs
321	intended to be portable should not employ REs longer
322	than 256 bytes, as an implementation can refuse to accept
323	such REs and remain POSIX-compliant.<P>
324
325	Obsolete (``basic'') regular expressions differ in several
326	respects. `|', `+', and `?' are ordinary characters and
327	there is no equivalent for their functionality. The
328	delimiters for bounds are `\{' and `\}', with `{' and `}'
329	by themselves ordinary characters. The parentheses for
330	nested subexpressions are `\(' and `\)', with `(' and `)'
331	by themselves ordinary characters. `^' is an ordinary
332	character except at the beginning of the RE or the beginning
333	of a parenthesized subexpression, `$' is an ordinary
334	character except at the end of the RE or the end of a
335	parenthesized subexpression, and `*' is an ordinary character
336	if it appears at the beginning of the RE or the
337	beginning of a parenthesized subexpression (after a possible
338	leading `^'). Finally, there is one new type of atom,
339	a back reference: `\' followed by a non-zero decimal digit
340	d matches the same sequence of characters matched by the
341	dth parenthesized subexpression (numbering subexpressions
342	by the positions of their opening parentheses, left to
343	right), so that (e.g.) `\([bc]\)\1' matches `bb' or `cc'
344	but not `bc'.<P>
345
346	<B>SEE ALSO</B><BR>
347	POSIX 1003.2, section 2.8 (Regular Expression Notation).<P>
348
349	<B>BUGS</B><BR>
350	Having two kinds of REs is a botch.<P>
351
352	The current 1003.2 spec says that `)' is an ordinary character
353	in the absence of an unmatched `('; this was an
354	unintentional result of a wording error, and change is
355	likely. Avoid relying on it.<P>
356
357	Back references are a dreadful botch, posing major problems
358	for efficient implementations. They are also somewhat
359	vaguely defined (does `a\(\(b\)*\2\)*d' match
360	`abbbd'?). Avoid using them.<P>
361
362	1003.2's specification of case-independent matching is
363	vague. The ``one case implies all cases'' definition
364	given above is current consensus among implementors as to
365	the right interpretation.<P>
366
367	The syntax for word boundaries is incredibly ugly.<P>
368
369	<B>AUTHOR</B><BR>
370	This page was taken from Henry Spencer's regex package.
371	</td>
372	</tr>
373
374	</table></center>
375
376	<center>
377	<hr NOSHADE WIDTH="80%"><i><font color="#000099"><font size=-1>This Web Site is
378	hosted by Apache for OS/2 and done by <a href="mailto:tbretz@astro.uni-wuerzburg.de">Thomas Bretz</a>.</font></font></i><BR>
379	<BR>
380	<a href="http://validator.w3.org/check/referer"><img border="0"
381	src="../../valid-html40.png" alt="Valid HTML 4.0!" height="20" width="66"></a>
382	</center>
383	</tr>
384	</table>
385
386	</center>
387
388	</body>
389	</html>

source: trunk/Mars/datacenter/db/regexp.html@ 18066
