Oracle's Regular Expression in SQL and PL/SQL by examples

RegExp in SQL and PL/SQL
Stas Belkov

Stas Belkov

The author of the article. A well-known specialist in the IT world. Oracle Products and Solutions Consultant. Practicing programmer and database administrator. More about.

 Oracle's regular expression support manifests itself in the form of three SQL functions and one predicate that you can use to search and manipulate text in any of Oracle's supported text datatypes: VARCHAR2, CHAR, NVARCHAR2, NCHAR, CLOB, and NCLOB.


 Regular expression support does not extend to LONG, because LONG is supported only for backward compatibility with existing code.




 

Regular Expression Functions

Following are the four functions you'll use to work with regular expressions in Oracle:

REGEXP_LIKE

Determines whether a specific column, variable, or text literal contains text matching a regular expression.

REGEXP_INSTR

Locates, by character position, an occurrence of text matching a regular expression.

REGEXP_REPLACE

Replaces text matching a regular expression with new text that you specify. Your replacement text can include backreferences to values in the regular expression.

REGEXP_SUBSTR

Extracts text matching a regular expression from a character column, variable, or text literal.

Of these, you've already seen REGEXP_LIKE in quite a few examples. REGEXP_LIKE is documented in the "Conditions" chapter of the Oracle Database 10g SQL Reference because in SQL it can only be used as a predicate in the WHERE and HAVING clauses of a query or DML statement. In PL/SQL, however, you can use REGEXP_LIKE as you would any other Boolean function:

DECLARE
   x Boolean;
BEGIN
   x := REGEXP_LIKE(
        'Does this string mention Oracle?',
        'Oracle');
END;
/

The remaining three functions work identically in SQL and PL/SQL.

 

Regular Expression Locale Support

Oracle is notable for its Globalization Support in that it supports an exceedingly wide variety of character sets, languages, territories, and linguistic sorts. Regular expressions are no exception. The combination of character set, language, and territory is known as a locale. Oracle's regular expression engine respects locale, and is configurable via NLS (National Language Support) parameter settings.

Following are some notable examples of the way in which regular expression locale support affects you:

  • The regular expression engine is character-based. The period (.) will always match a single character or, more strictly speaking, a single codepoint, regardless of how many bytes are used to represent that character in the underlying character set.
  • Character classes are sensitive to the underlying character set. For example, if you're using one of the Unicode character sets, the class [:digit:] will include not only 0, 1, 2, through 9, but also the Arabic-Indic , , through , the Bengali , , through , and so forth.
  • NLS_SORT can affect how comparisons are performed. If NLS_SORT considers two characters to be equivalent, then so does the regular expression engine. For example, using the default sort of BINARY, the expression 'resume' will not match the text 'Résumé'. Change NLS_SORT to GENERIC_BASELETTER, and the expression does match, because that sort treats 'e' and 'é' as the same letter and also ignores case.
  • Bracket expressions such as [A-z] are affected by the underlying character set and the sort order. For example:
    • [a-z] includes A when using the case-insensitive sort GERMAN_CI, but not when using GERMAN.
    • Given an ASCII-based character set and the BINARY sort order, [A-z] encompasses all letters, upper- and lowercase. Given an EBCDIC character set and the BINARY sort order, [A-z] fails to be a valid expression, even failing to compile, because in EBCDIC the binary representation of the letter A comes after that of the letter z.
  • If a regular expression is in one character set, and the text to be searched is in another, the regular expression will be converted to the character set of the text to be searched.
  • Your NLS_SORT setting affects whether case-sensitive matching is done by default. A sort such as SPANISH yields case-sensitive sorting. You can add the suffix _CI, as in SPANISH_CI, to linguistic sorts to get a case-insensitive sort. Use the suffix _AI for an accent-insensitive sort.
  • NLS_SORT also affects which accented and unaccented characters are considered to be of the same class. For example, the expression 'na[[=i=]]ve' will match both 'naive' and 'naïve' when NLS_SORT is set to BINARY (the default sort for the AMERICAN language), but not when NLS_SORT is set to GREEK.
  • NLS_SORT affects which collation elements are considered valid. For example, [.ch.] is recognized by Spanish sorting rules (when NLS_SORT equals XSPANISH), but not by American sorting rules.

 

Regular Expression Matching Options

Each of Oracle's regular expression functions takes an optional match_parameter, which is a character string that you can fill with one-character flags. This string gives you control over the following aspects of regular expression behavior:

Whether matching is case-sensitive

NLS_SORT controls whether matching is case-sensitive by default, which it usually will be. You can override the default on a per-call basis.

Whether the period (.) matches newline characters

By default, periods do not match newline characters (occurrences of CHR(10) on Unix systems) in the source text. You can specify that periods match newlines.

The definition of "line"

By default, the source string that you are searching is considered one long line, and the caret (^) and dollar sign ($) match only the beginning and ending of the entire string.

You can specify that the source value is to be treated as many lines delimited by newline characters. If you do so, then the ^ and $ match the beginning and end of each line respectively.

The following example demonstrates the use of the match_parameter by performing a case-insensitive search for doubled words. The match_parameter value in this case is 'i'. The two 1 parameters preceding 'i' in REGEXP_SUBSTR supply the default values for starting position and occurrence. Those parameters need to be specified in order to reach the match_parameter.

SELECT park_name, REGEXP_SUBSTR(

         description,

         '(^|[[:space:][:punct:]]+)([[:alpha:]]+)'

         || '([[:space:][:punct:]])+\2'

         || '([[:space:][:punct:]]+|$)',

         1,1,'i') duplicates

FROM park

WHERE REGEXP_LIKE(description,

         '(^|[[:space:][:punct:]]+)([[:alpha:]]+)'

         || '([[:space:][:punct:]])+\2'

         || '([[:space:][:punct:]]+|$)', 'i');

To specify multiple parameters, simply list them in one string. For example, to request case-insensitive matching with periods matching newline characters, specify 'in' or 'ni' as your match_parameter.

If you specify contradictory parameters, Oracle uses the last value in the string. For example, 'ic' is contradictory because 'i' asks for case-insensitivity, while 'c' asks for the opposite. Oracle resolves this by taking the last value in the string, in this case the 'c'.

If you specify parameters that are undefined, Oracle will return an ORA-01760: illegal argument for function error.

 

Standards Compliance

Oracle's regular expression engine is of the traditional nondeterministic finite automata (traditional NFA) variety, the same type used in Perl, the .NET environment, and Java. With one exception, Oracle's engine implements the syntax and behavior for extended regular expressions (EREs) as described in the POSIX standard. In addition, Oracle adds support for backreferences.

The regular expression syntax and behavior documented in the Open Group Base Specifications Issue 6, IEEE Standard 1003.1, 2003 Edition is the same as that for POSIX. You can view the Open Group specifications at http://www.opengroup.org/onlinepubs/007904975/basedefs/xbd_chap09.html

SELECT REGEXP_SUBSTR('bbb','b|bb') FROM dual;

The one exception that stands between Oracle and full POSIX compliance is that Oracle does not attempt to determine the longest possible match for a pattern containing variations, as the standard requires. The following example demonstrates this very well:

b

SELECT REGEXP_SUBSTR('bbb','bb|b') FROM dual; 

bb

These two statements differ only by the order in which the alternatives are specified in the regular expression: b|bb versus bb|b. The longest possible match in either case is 'bb', and that's the match POSIX requires for both cases. However, Oracle's regular expression engine takes the first match it finds, which can be either 'b' or 'bb', depending on the order in which the alternatives are specified.

Do not confuse finding the longest possible match out of several alternations with greediness.

Like many regular expression engines, Oracle ignores the "longest possible match" rule, because the overhead of computing all possible permutations and then determining which is the longest can be excessive.

Differences Between Perl and Oracle

Perl has done a lot to popularize the use of regular expressions, and many regular expression engines (e.g., Java and PHP) follow Perl's implementation closely. Many readers may have learned regular expressions using Perl or a Perl-like engine, so this brief section highlights the key differences between Perl's and Oracle's support for regular expressions.

This section is based a comparison of Perl Version 5.8 with Oracle Database 10g.


String literal issues 

Regular expressions are often written as string literals. When you move string literals from one language to another, you may encounter issues with the way that each language handles such literals.

For example, Perl enables you to to use \x followed by two hexadecimal digits to embed arbitrary byte codes within a string. Perl also supports character sequences such as \n for the newline (linefeed on Unix) character. Thus, in Perl, you can write the following regular expression to search for either a linefeed or a space:

/[\n|\x20]/

The issue is that this isn't a regular expression per se—it's a Perl string. The backslash sequences \n and \x20 have no meaning to Perl's regular expression engine, which, in fact, never sees them. Those sequences are interpreted by Perl itself. By the time the string gets to Perl's regular expression engine, \n and \x20 have been replaced by the appropriate byte codes.

Another issue you may encounter is Perl's use of the dollar sign ($) to dereference a variable within a string. In Perl, the expression /a$b/ searches for the letter 'a' followed by the contents of the Perl variable named b. Perl's regular expression never sees the '$b', because Perl substitutes the value of the variable before it passes the string to the engine.

Neither SQL nor PL/SQL support the use of \ and $ in the way that Perl does. Because Perl and Oracle differ in their handling of string literals, you may not be able to take a regular expression developed for Perl and simply drop it into Oracle. Before attempting to move an expression in the form of a string literal from Perl to Oracle, make sure that the "expression" doesn't contain any characters that Perl itself interprets.

 

NULL versus empty strings

Unlike Perl and many database products, Oracle treats an empty string as a NULL value. Thus, the following query, which attempts to match an empty string, brings back no data:

SELECT * FROM park
WHERE REGEXP_LIKE(description,'');

In Oracle, the regular expression engine does not see an empty string; rather, it sees a NULL, or the complete absence of an expression with which to do any matching.

 

Perl-specific syntax

Oracle's regular expression syntax is POSIX-compliant. Perl's engine supports a number of operators, character classes, and so forth that are not defined as part of the POSIX standard. These are described in Table 1-2. Where possible, we also specify a POSIX equivalent that's usable in Oracle.

The POSIX equivalents shown in Table 1-2 should work for the default locale (American_America.US7ASCII, with a BINARY sort). However, we have not yet been able to run exhaustive tests.

 

Table 1-2. Perl's nonstandard regular expression operators

Perl operator

Description / Oracle equivalent

[[:ascii:]]

Matches any ASCII character. In Oracle, possibly use: '[' || CHR(00) || '-' || CHR(127) || ']'.

[[:word:]]

A word character, defined as any alphanumeric character, including underscore: [[:alnum:]_]

\C

Embeds arbitrary bytes in a regular expression. In Oracle, use the CHR function, but be aware that Oracle requires an expression to be composed of valid characters as defined by the underlying character set.

\d

Digits: [[:digit:]]

\D

Non-digits: [^[:digit:]]

\pP

Named properties, no POSIX equivalent

\PP

Negated named properties, no POSIX equivalent

\s

Whitespace: [[:space:]], except that[[:space:]] includes vertical tab (\x0B ), and \s does not.

\S

Non-whitespace: [^[:space:]]

\w

Alphanumeric characters: [[:alnum:]_]

\W

Non-alphanumeric characters: [^[:alnum:]_]

\X

Followed by a code point value, \X embeds a Unicode combining character sequence into a regular expression. In Oracle, use the COMPOSE function to generate Unicode combining characters from code points.

\b \B \A \Z \z \G

Perl supports a number of zero-width assertions. None are recognized by POSIX.

 

 

Syntax Perl does not support

Perl does not support the POSIX-standard [= =] notation for defining an equivalence class. In addition, Perl does not support the use of [. .] to specify a collation element.

 

Negating character classes

Both Perl and Oracle support the POSIX-compliant caret (^) as the first character within a bracket expression to mean all characters except those listed within the expression. For example, you can write: [^A-Z] to match on any character but the uppercase letters.

Perl also supports the use of a caret in conjunction with a character class name. For example, Perl allows you to write [[:^digit:]] to match on any character except for one in the [:digit:] class. You can get the same effect in Oracle using the form: [^[:digit:]].

 

Lazy quantifiers (non-greediness)

Quantifiers in a regular expression will match as many characters as possible. For example, given a source string of '123456', the expression [0-9]+ will match the entire string of six digits.

Perl supports the addition of a question mark (?) to the end of a quantifier to make it non-greedy, or lazy, in which case the quantifier matches the minimum number of characters possible. For example, the expression [0-9]+? matches only the first digit of the string '123456'.

The complete list of lazy quantifiers supported by Perl is:

*?, +?, ??, and {}?

POSIX, and by extension Oracle, does not support these quantifiers.

 

Experimental features

Perl supports a mechanism for adding experimental regular expression features. Such features always take the form (?...), in which the ellipses represent the feature-specific syntax. Comments within expressions are one of the so-called experimental features, and you can embed a comment in a Perl regular expression as follows:

(?#area code)([[:digit:]]{3}[-\.]|\([[:digit:]]{3}\))

(?#local number)[[:digit:]]{3}[-\.][[:digit:]]{4}

Oracle does not support Perl's experimental feature syntax.

 

Backreferences

In a replacement string such as one you might use with REGEXP_REPLACE, Perl supports the use of a dollar sign ($) to indicate a backreference. For example, you can use $1 to refer to the first subexpression. Oracle supports only the backslash syntax \1, \2, and so forth.

 

Backslash differences

POSIX and Perl differ somewhat in how they handle backslash (\) characters:

\ in a bracket-list

In Perl, a \ in a bracket-list is treated as a metacharacter. In Oracle, a \ in a bracket-list represents itself.

\ as the last character of an expression

Use \ as the last character of a regular expression in Perl, and you get an error. Do the same thing in Oracle, and the trailing \ is silently ignored.

Вас заинтересует / Intresting for you:

Understanding SQL and Database...
Understanding SQL and Database... 2376 views Ирина Светлова Thu, 17 May 2018, 18:18:17
Spatial Information Management...
Spatial Information Management... 5981 views Дэн Sat, 28 Apr 2018, 15:53:08
Relational Model and Why It Do...
Relational Model and Why It Do... 5231 views Александров Попков Tue, 21 Jul 2020, 19:36:45
Choosing a database for your a...
Choosing a database for your a... 2127 views Валерий Павлюков Mon, 27 Aug 2018, 14:12:09
Comments (0)
There are no comments posted here yet
Leave your comments
Posting as Guest
×
Suggested Locations