Class

RegEx


Description

Performs search and replace operations using Perl-compatible regular expressions. The RegEx class uses version 8.33 of the PCRE library.

Properties

| Name | Type | Read-Only | Shared |
|---|---|---|---|
| Options | RegExOptions | | |
| ReplacementPattern | String | | |
| SearchPattern | String | | |
| SearchStartPosition | Integer | | |

Methods

| Name | Parameters | Returns | Shared |
|---|---|---|---|
| Replace | | String | |
| Replace | targetString As String, [searchStartPosition As Integer] | String | |
| Search | | RegExMatch | |
| Search | targetString As String, [searchStartPosition As Integer] | RegExMatch | |

Property descriptions


RegEx.Options

Options As RegExOptions

These options are various states which you can set for the Regular Expressions engine. See the RegExOptions class.


RegEx.ReplacementPattern

ReplacementPattern As String

This is the replacement string, which can include references to substrings matched previously, via the standard '\1' or '$1' notation common in regular expressions.

This pattern is used either with the Replace method or passed to the RegExMatch class when Search returns, and subsequently used with Replace if no parameters are specified.


RegEx.SearchPattern

SearchPattern As String

This is the pattern you are currently searching for.


RegEx.SearchStartPosition

SearchStartPosition As Integer

The byte offset at which you want to start the search if the optional TargetString parameter to Replace is not specified.

Keep in mind: If you set it, it will only be used if you don't specify a TargetString, since setting a new TargetString resets the value.

The offset is zero-based: to start at the beginning of the string, use 0.

If you want to set a start value past the first character, and if the string uses an encoding where not every character is exactly one byte long, such as UTF-8, you need to convert the string's character position into its byte offset.

Here's a way to convert a (1-based) character position into the (0-based) byte position:

aRegEx.SearchStartPosition = theString.Left(characterPosition-1).Bytes
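As a minimal sketch of the conversion above, the following code skips the first occurrence of a word that contains a multi-byte character. The sample text and variable names are illustrative, and it assumes the searchStartPosition parameter of Search is a byte offset in the same way as the SearchStartPosition property:

```xojo
Var source As String = "café au lait, café noir" ' "é" occupies two bytes in UTF-8

Var re As New RegEx
re.SearchPattern = "café"

' Start at character 5 (1-based), just past the first "café".
Var characterPosition As Integer = 5
Var byteOffset As Integer = source.Left(characterPosition - 1).Bytes

' Assumption: the offset passed here is interpreted as a zero-based byte offset.
Var match As RegExMatch = re.Search(source, byteOffset)

If match <> Nil Then
  MessageBox(match.SubExpressionString(0))
End If
```

Using the character count (4) instead of the byte count (5) would start the search in the middle of the "é", which is why the conversion matters for non-ASCII text.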

Method descriptions


RegEx.Replace

Replace As String

Finds SearchPattern in the last used targetString, starting at the last SearchStartPosition, and replaces the matched text with ReplacementPattern. Returns the resulting String.

Replace(targetString As String, [searchStartPosition As Integer]) As String

Finds SearchPattern in targetString, starting at searchStartPosition if provided, and replaces the matched text with ReplacementPattern. Returns the resulting String.

This code does a simple remove of HTML tags from source HTML:

Var re As New RegEx
re.SearchPattern = "<[^<>]+>"
re.ReplacementPattern = ""
re.Options.ReplaceAllMatches = True

Var html As String = "<p>Hello.</p>"
Var plain As String = re.Replace(html)

MessageBox(plain) ' "Hello."

This code finds all occurrences of the word "a" and replaces them all with "the":

Var re As New RegEx
re.SearchPattern = "\ba\b"
re.ReplacementPattern = "the"
re.Options.ReplaceAllMatches = True

Var origText As String = "a bus drove on a street in a town"
Var newText As String = re.Replace(origText)

MessageBox(newText) ' "the bus drove on the street in the town"

This code replaces the second occurrence only:

Var re As New RegEx
re.SearchPattern = "\ba\b"
re.ReplacementPattern = "the"

Var sampleText As String = "a bus drove on a street in a town"
Var match As RegExMatch = re.Search(sampleText) ' Find the first SearchPattern in the text

If match <> Nil Then
  sampleText = re.Replace ' Find the second SearchPattern in the text and replace it
End If

MessageBox(sampleText) ' "a bus drove on the street in a town"

This code uses the same RegEx on several strings:

Var sources() As String = Array("<b>this</b>", "<i>that</i>", "<strong>the other</strong>")

Var re As New RegEx
re.SearchPattern = "<[^<>]+>"
re.ReplacementPattern = ""
re.Options.ReplaceAllMatches = True

For sourceIndex As Integer = 0 To sources.LastIndex
  sources(sourceIndex) = re.Replace(sources(sourceIndex))
Next sourceIndex

' sources now contains
' {"this", "that", "the other"}

RegEx.Search

Search As RegExMatch

Resumes searching in the previously provided TargetString (see the Notes).

Search(targetString As String, [searchStartPosition As Integer]) As RegExMatch

Finds SearchPattern in targetString, beginning at searchStartPosition if provided.

If it succeeds, it returns a RegExMatch; otherwise it returns Nil. Both parameters are optional. If targetString is omitted, Search assumes the previous targetString, so you will want to pass a targetString the first time you call Search. If you call Search with a targetString and omit searchStartPosition, zero is assumed. If you call Search with no parameters after initially passing a targetString, it assumes the previous targetString and begins the search where it left off in the previous call. This is the easiest way to find the next occurrence of SearchPattern in targetString.

The RegExMatch will remember the ReplacementPattern specified at the time of the search.
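The following sketch illustrates that behavior. It assumes that RegExMatch.Replace, called with no parameters, applies the ReplacementPattern that was set when Search ran; the sample text is illustrative, and the exact string returned is best confirmed against the RegExMatch class documentation:

```xojo
Var re As New RegEx
re.SearchPattern = "(\w+)@(\w+)\.com"
re.ReplacementPattern = "\1 at \2"

Var match As RegExMatch = re.Search("Contact: jane@example.com")

' Changing the pattern now should not affect the match already returned.
re.ReplacementPattern = "changed"

If match <> Nil Then
  ' Assumption: with no parameters, Replace uses the pattern remembered at search time.
  Var rewritten As String = match.Replace
  MessageBox(rewritten)
End If
```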

This example finds occurrences of "software" in the supplied string. Note that RegEx searches are case-insensitive by default so "software" is found twice:

Var re As New RegEx
Var match As RegExMatch

re.SearchPattern = "software"
match = re.Search("How much software can a Software Developer make?")

Var result As String

Do
  If match <> Nil Then
    result = match.SubExpressionString(0)
    MessageBox(result)
  End If

  match = re.Search
Loop Until match Is Nil

Notes

This section describes the syntax of regular expressions.

| Pattern | Description |
|---|---|
| `.` | Matches any character except newline. |
| `[a-z0-9]` | Matches any single character of set. |
| `[^a-z0-9]` | Matches any single character not in set. |
| `\d` | Matches a digit. Same as `[0-9]`. |
| `\D` | Matches a non-digit. Same as `[^0-9]`. |
| `\w` | Matches an alphanumeric (word) character: `[a-zA-Z0-9_]`. |
| `\W` | Matches a non-word character: `[^a-zA-Z0-9_]`. |
| `\s` | Matches a whitespace character (space, tab, newline, etc.). |
| `\S` | Matches a non-whitespace character. |
| `\n` | Matches a newline (line feed). |
| `\r` | Matches a return. |
| `\t` | Matches a tab. |
| `\f` | Matches a formfeed. |
| `\b` | Matches a word boundary. Use `[\b]` to match a backspace. |
| `\0` | Matches a null character. |
| `\000` | Also matches a null character because of the following: |
| `\nnn` | Matches an ASCII character of that octal value. |
| `\xnn` | Matches an ASCII character of that hexadecimal value. |
| `\cX` | Matches an ASCII control character. |
| `\metachar` | Matches the meta-character (e.g., `\\`, `\.`). |
| `(abc)` | Used to create subexpressions. Remembers the match for later backreferences. Referenced by replacement patterns that use `\1`, `\2`, etc. |
| `\1`, `\2`, … | Matches whatever the first (second, and so on) set of parentheses matched. |
| `x?` | Matches 0 or 1 x's, where x is any of the above. |
| `x*` | Matches 0 or more x's. |
| `x+` | Matches 1 or more x's. |
| `x{m,n}` | Matches at least m x's, but no more than n. Also `x{m}` (matches exactly m occurrences) and `x{m,}` (matches at least m occurrences). |
| `abc` | Matches all of a, b, and c in order. |
| `a\|b\|c` | Matches one of a, b, or c. |
| `\b` | Matches a word boundary (outside [] only). |
| `\B` | Matches a non-word boundary. |
| `^` | Anchors match to the beginning of a line or string. |
| `$` | Anchors match to the end of a line or string. |
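To show several of these elements working together, here is a small sketch that finds an accidentally doubled word. The pattern and sample sentence are illustrative only; it combines a word boundary, a subexpression, whitespace, and a backreference:

```xojo
Var re As New RegEx
' \b word boundary, (\w+) captured word, \s+ whitespace, \1 the same word again
re.SearchPattern = "\b(\w+)\s+\1\b"

Var match As RegExMatch = re.Search("This is is a test")

If match <> Nil Then
  ' Subexpression 1 holds the word that was repeated ("is" for this input).
  MessageBox(match.SubExpressionString(1))
End If
```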


Replacement patterns

The following expressions apply only to the replacement pattern:

| Pattern | Description |
|---|---|
| `` $` `` | Replaced with the entire target string before the match. |
| `$&` | The entire matched area; this is identical to `\0` and `$0`. |
| `$'` | Replaced with the entire target string following the matched text. |
| `$0`-`$50`, `\0`-`\50` | Replaced with the corresponding subexpression. These evaluate to nothing if the subexpression corresponding to the number doesn't exist. |
| `\xnn` | Replaced with the character represented by nn in hexadecimal (for example, to insert ™). |
| `\nnn` | Replaced with the character represented by nnn in octal. |
| `\cX` | Replaced with the character that is the control version of X, e.g., `\cP` is DLE (data link escape). |


Double-byte systems

If you are working with a double-byte system such as Japanese, RegEx cannot operate on the characters directly. You should first convert all double-byte text to UTF-8 using the built-in Text Converter functions. See the TextConverter class for an example of how to use them.

All text that will be processed by RegEx should be converted. This includes SearchPattern, ReplacementPattern, and TargetString. The result of the Search or Search and Replace will be a UTF-8 string, so you will need to convert it back to its original form using the Text Converter functions. Both Search and Search and Replace operations work on all platforms, provided that this conversion takes place.
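A rough sketch of that round trip follows. Note that it swaps in String.ConvertEncoding for the Text Converter functions named above, as a simpler stand-in; the Shift-JIS input is built artificially for illustration, and the encoding names are assumptions to check against the Encodings module:

```xojo
' Hypothetical legacy input: build a Shift-JIS string for illustration.
Var original As String = "東京 123"
Var legacyText As String = original.ConvertEncoding(Encodings.ShiftJIS)

' Convert to UTF-8 before handing the text to RegEx.
Var utf8Text As String = legacyText.ConvertEncoding(Encodings.UTF8)

Var re As New RegEx
re.SearchPattern = "\d+"
Var match As RegExMatch = re.Search(utf8Text)

If match <> Nil Then
  ' The result is UTF-8; convert it back to the original encoding.
  Var found As String = match.SubExpressionString(0).ConvertEncoding(Encodings.ShiftJIS)
  MessageBox(found)
End If
```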


Regular expression examples

The basic idea of regular expressions is that they enable you to find and replace text that matches a set of conditions you specify. They extend normal Search and Replace with pattern searching.


Wildcards

Some special characters are used to match a class of characters:

| Wildcard | Matches |
|---|---|
| `.` | Any single character except a line break, including a space. |

If you use "." as the search pattern, the first search selects the first character in the target string and each repeated search finds the next character, except for Return characters.

The following wildcards match by position in a line:

| Wildcard | Matches | Example |
|---|---|---|
| `^` | Beginning of a line (unless used in a character class; see below) | `^Phone:` finds lines that begin with "Phone:". |
| `$` | End of a line (unless used in a character class) | `$` matches the position at the end of the current line. |
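As an illustrative sketch of both anchors, the code below collects every line that begins with "Phone:" from a multi-line string. The sample data is made up, and it assumes the default options, under which ^ and $ match at line boundaries rather than only at the ends of the whole string:

```xojo
Var contacts As String = "Name: Ann" + EndOfLine.UNIX + _
  "Phone: 555-0100" + EndOfLine.UNIX + _
  "Name: Bob" + EndOfLine.UNIX + _
  "Phone: 555-0199"

Var re As New RegEx
re.SearchPattern = "^Phone:.*$"

Var phoneLines() As String
Var match As RegExMatch = re.Search(contacts)

While match <> Nil
  phoneLines.Add(match.SubExpressionString(0))
  match = re.Search ' Continue from where the previous search stopped
Wend

MessageBox(String.FromArray(phoneLines, EndOfLine))
```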


Character classes

A character class allows you to specify a set or range of characters. You can choose to either match or exclude the character class. The set of characters is enclosed in brackets. If you want to exclude the character class instead of matching it, place a caret (^) immediately after the opening bracket. Here are some examples:

| Character Class | Matches |
|---|---|
| `[aeiou]` | Any one of the characters a, e, i, o, u. |
| `[^aeiou]` | Any character except a, e, i, o, u. |
| `[a-e]` | Any character in the range a-e, inclusive. |
| `[a-zA-Z0-9]` | Any alphanumeric character. Note: Case-sensitivity is controlled by the CaseSensitive property of the RegExOptions class. |
| `[[]` | Finds a [. |
| `[]]` | Finds a ]. To find a closing bracket, place it immediately after the opening bracket. |
| `[a-e^]` | Finds a character in the range a-e or the caret character. To find the caret character, place it anywhere except as the first character after the opening bracket. |
| `[a-c-]` | Finds a character in the range a-c or the - sign. To match a -, place it at the beginning or end of the set. |
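For instance, a character class can be combined with ReplaceAllMatches to mask every vowel in a string. This is only a small sketch with an illustrative input:

```xojo
Var re As New RegEx
re.SearchPattern = "[aeiou]" ' Any one of the listed vowels
re.ReplacementPattern = "*"
re.Options.ReplaceAllMatches = True

Var masked As String = re.Replace("regular expression")

MessageBox(masked) ' "r*g*l*r *xpr*ss**n"
```

Changing the pattern to "[^aeiou]" would instead replace every character that is not a vowel, including the space.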


Non-printing characters

You can use the following notation to find non-printing characters:

| Special Character | Matches |
|---|---|
| `\r` | Line break (return) |
| `\n` | Newline (line feed) |
| `\t` | Tab |
| `\f` | Formfeed (page break) |
| `\xNN` | Hex code NN. |


Other special characters

The following patterns are wildcards for classes of characters:

| Special Character | Matches |
|---|---|
| `\s` | Any whitespace character (space, tab, return, linefeed, form feed) |
| `\S` | Any non-whitespace character. |
| `\w` | Any "word" character (a-z, A-Z, 0-9, and _) |
| `\W` | Any "non-word" character (all characters not included by `\w`). |
| `\d` | Any digit `[0-9]`. |
| `\D` | Any non-digit character. |


Repetition characters

Repetition characters are modifiers that allow you to repeat a specified pattern.

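| Repetition Character | Matches | Examples |
|---|---|---|
| `*` | Zero or more occurrences of the preceding character. | `d*` finds no characters, or one or more consecutive "d"s. |
| `+` | One or more occurrences of the preceding character. | `d+` finds one or more consecutive "d"s. |
| `?` | Zero or one occurrence of the preceding character. | `d?` finds no characters or one "d". |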

Please note that, since * and ? match zero instances of the pattern, they always succeed but may not select any text. You can use them to specify an optional character, as in the examples in the following section.
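As a brief sketch of an optional character, the pattern below accepts both spellings of a word by making one letter optional. The sample words are illustrative:

```xojo
Var re As New RegEx
re.SearchPattern = "colou?r" ' The "u" may appear zero times or once

Var samples() As String = Array("color", "colour", "colr")

For Each sample As String In samples
  If re.Search(sample) <> Nil Then
    MessageBox(sample + " matches")
  Else
    MessageBox(sample + " does not match")
  End If
Next
```

The first two words match; "colr" does not, because only the "u" is optional.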


"greediness"

The "?" also serves as a "greediness" modifier for a subpattern in a regular expression. By default, greediness is controlled by the Greedy property of the RegExOptions class, but you can override it by placing a "?" directly after a * or +, which reverses the "greediness" setting. That is, if Greedy is True, a ? after a * or + causes it to match the minimum number of times possible; if Greedy is False, it causes it to match the maximum. For example:

| Target String | Greedy | Regular Expression | Result |
|---|---|---|---|
| `aaaa` | True | `(a+?)(a+)` | $1=a, $2=aaa |
| `aaaa` | False | `(a+?)(a+)` | $1=aaa, $2=a |
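The difference is easiest to see on markup. The following sketch, using an illustrative HTML fragment, runs the same search with and without the "?" modifier while Greedy is True:

```xojo
Var html As String = "<b>bold</b>"

Var re As New RegEx
re.Options.Greedy = True

' Greedy: .+ takes as much as it can, so the match runs to the last ">".
re.SearchPattern = "<.+>"
Var greedyResult As String
Var match As RegExMatch = re.Search(html)
If match <> Nil Then greedyResult = match.SubExpressionString(0)

' The trailing ? reverses the setting, so .+? stops at the first ">".
re.SearchPattern = "<.+?>"
Var lazyResult As String
match = re.Search(html)
If match <> Nil Then lazyResult = match.SubExpressionString(0)

' With Greedy = True this is expected to show "<b>bold</b>" and then "<b>".
MessageBox(greedyResult + EndOfLine + lazyResult)
```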


Extension mechanism

We also support the regular expression extension mechanism used in Perl. For instance:

| Pattern | Description |
|---|---|
| `(?#text)` | Comment. |
| `(?:pattern)` | For grouping without creating backreferences. |
| `(?=pattern)` | A zero-width positive look-ahead assertion. For example, `\w+(?=\t)` matches a word followed by a tab, without including the tab in `$&`. |
| `(?!pattern)` | A zero-width negative look-ahead assertion. For example, `foo(?!bar)` matches any occurrence of "foo" that isn't followed by "bar". |
| `(?<=pattern)` | A zero-width positive look-behind assertion. For example, `(?<=\t)\w+` matches a word that follows a tab, without including the tab in `$&`. Works only for fixed-width look-behind. |
| `(?<!pattern)` | A zero-width negative look-behind assertion. For example, `(?<!bar)foo` matches any occurrence of "foo" that does not follow "bar". Works only for fixed-width look-behind. |
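As a small sketch of a negative look-ahead in a replacement, the code below changes "foo" only where it is not followed by "bar". The input string is illustrative:

```xojo
Var re As New RegEx
re.SearchPattern = "foo(?!bar)" ' "foo" not followed by "bar"
re.ReplacementPattern = "FOO"
re.Options.ReplaceAllMatches = True

Var result As String = re.Replace("foobar foobaz")

MessageBox(result) ' "foobar FOObaz"
```

Because the assertion is zero-width, the text after "foo" is checked but never consumed or replaced.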


Subexpressions

You can use parentheses within your search patterns to isolate portions of the matched string. You do this when you need to refer to subsections of the matched text in your replacement string. For example, you would do this if you need to replace only a portion of the matched string or insert other text into the matched string.

Here is an example. If you want to match any date followed by the letters "B.C.", you can use the pattern "\d+\sB\.C\." (any number of digits followed by a space character, followed by the letters "B.C."). This will match dates such as 33 B.C., 1742 B.C., etc. However, if you want your replacement pattern to leave the year alone but replace the letters with something else, you would use parentheses. The search pattern "(\d+)\s(B\.C\.)" does this.

When you write your replacement pattern, you can refer to the year with the variable \1 and to the letters with \2.

If you write "(\d+)\s(B\.C\.|A\.D\.|BC|AD)", then \2 would contain whichever set of letters matched.
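To make this concrete, here is a short sketch that uses the two subexpressions to swap the year and the era designation. The sample sentence and the replacement pattern are illustrative choices:

```xojo
Var re As New RegEx
re.SearchPattern = "(\d+)\s(B\.C\.|A\.D\.|BC|AD)"
re.ReplacementPattern = "\2 \1" ' Era first, then the year

Var result As String = re.Replace("Rome was founded in 753 B.C.")

MessageBox(result) ' "Rome was founded in B.C. 753"
```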


Combining patterns

Much of the power of regular expressions comes from combining these elementary patterns to make up complex searches. Here are some examples:

| Pattern | Matches |
|---|---|
| `\$?[0-9,]+\.?\d*` | Matches dollar amounts with an optional dollar sign. |
| `\d+\sB\.C\.` | One or more digits followed by a space, followed by "B.C." |
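The dollar-amount pattern can be tried with a sketch like the following, where the invoice text is made up for illustration:

```xojo
Var re As New RegEx
' Optional "$", digits and commas, optional decimal point, optional decimals
re.SearchPattern = "\$?[0-9,]+\.?\d*"

Var match As RegExMatch = re.Search("Total: $1,250.75 due")

If match <> Nil Then
  MessageBox(match.SubExpressionString(0)) ' "$1,250.75"
End If
```

Note that the pattern is deliberately loose: because `[0-9,]+` also accepts commas, a lone comma in the text would count as a match too.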


The alternation operator

The alternation operator (|) allows you to match any of a number of patterns using the logical "or" operator. Place it between two existing patterns to match either pattern. You can use more than one alternation operator in a pattern:

`\she\s|\sshe\s` matches " he " or " she ".

`cat|dog|possum` matches "cat", "dog", or "possum".

`([0-9,]+\sB\.C\.)|([0-9,]+\sA\.D\.)` or `[0-9,]+\s((B\.C\.)|(A\.D\.))` matches years of the form "yearNum B.C." or "yearNum A.D.", e.g., "2,175 B.C." or "215 A.D."


Search and replace

You use special patterns to represent the matched pattern. Using replacement patterns, you can append or prepend the matched pattern with other text.

| Pattern | Description |
|---|---|
| `$&` | Contains the entire matched pattern. |
| `\1`, `\2`, etc. | Contains the matched subpatterns, defined by use of parentheses in the search string. |
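As a minimal sketch of prepending and appending text to a match, the replacement pattern below wraps every number in square brackets. The input string is illustrative:

```xojo
Var re As New RegEx
re.SearchPattern = "\d+"
re.ReplacementPattern = "[$&]" ' $& stands for the matched text itself
re.Options.ReplaceAllMatches = True

Var result As String = re.Replace("Order 66 shipped 3 items")

MessageBox(result) ' "Order [66] shipped [3] items"
```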


Credits

Xojo uses a modified version of the PCRE library package, which is open source software, written by Philip Hazel, and copyright by the University of Cambridge, England.

The source code for this library is publicly available.

Sample code

The following Button's Pressed event handler allows you to search the text in TextArea1 using the search pattern entered into TextField1 and display the results of the search in a Label:

Var rg As New RegEx
Var myMatch As RegExMatch

rg.SearchPattern = TextField1.Text
myMatch = rg.Search(TextArea1.Text)

If myMatch <> Nil Then
  Label1.Text = myMatch.SubExpressionString(0)
Else
  Label1.Text = "Text not found!"
End If

Exception err As RegExException
  MessageBox(err.Message)

Compatibility

Project Types: All

Operating Systems: All

See also

Object parent class; RegExMatch, RegExOptions classes; RegExException Error.