POSIX Regular Expressions in PHP

Regular expressions are the basic tool for pattern matching. PHP offers two sets of regular expression functions: POSIX style and Perl style. Each has its own syntax, and this post gives a basic overview of the POSIX one.

A regular expression (regex) is nothing more than a sequence of characters (called a pattern) that is compared against the text being searched. Patterns combine metacharacters and literals. Metacharacters (also called operators) define how the literals (also called constants) are treated when the pattern is evaluated against the text. For example, the POSIX pattern [a-z0-9], which matches a single lowercase letter or a digit from 0 to 9, has two metacharacters (the opening and closing square brackets) and two literal ranges (a-z and 0-9, also called classes). In other words, a literal stands for the character itself, whilst a metacharacter is a control character.

Why is it so important to distinguish between metacharacters and literals? Because a metacharacter that is meant as a literal in a pattern must be preceded by \ (backslash); it is commonly said that it must be escaped. For example, to put a dot into a pattern without its control meaning of “any character”, escape it with a backslash (see the \ entry in the metacharacter overview below for an example).
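The difference between a dot used as a metacharacter and an escaped dot can be sketched as follows. This is only an illustration under one assumption: regex_matches() is a hypothetical helper, not part of PHP, and it runs the pattern through preg_match() because ereg() is not available in PHP 7 and later. The two patterns shown mean the same thing in both syntaxes.

```php
<?php
// Hypothetical helper (an assumption of this sketch): wraps the pattern in
// "~" delimiters and runs it through preg_match() as a stand-in for ereg().
function regex_matches($pattern, $string)
{
    return preg_match('~' . $pattern . '~', $string) === 1;
}

// Unescaped dot: a metacharacter that matches any one character.
var_dump(regex_matches('^a.c$', 'abc'));   // bool(true)
var_dump(regex_matches('^a.c$', 'a.c'));   // bool(true)

// Escaped dot: a literal that matches only "." itself.
var_dump(regex_matches('^a\.c$', 'abc'));  // bool(false)
var_dump(regex_matches('^a\.c$', 'a.c'));  // bool(true)
```

The patterns are written in single-quoted PHP strings so that the backslash reaches the regex engine unchanged.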

The following list describes each POSIX metacharacter and gives an example:

  • ^ : matches the starting position within the string. Example: ^(([A-Za-z0-9_-]+)…
  • . : matches any one character. Example: a.c matches “abc”
  • * : matches the preceding element zero or more times. Examples: ab*c matches “ac”, “abc”, “abbbc”; [xyz]* matches “”, “x”, “y”, “z”, “zx”, “zyx”, “xyzzy”
  • + : matches the preceding element one or more times. Example: ba+ matches “ba”, “baa”, “baaa”
  • ? : matches the preceding element zero or one time. Example: ba? matches “b” or “ba”
  • {m,n} : matches the preceding element at least m and not more than n times. Example: a{3,5} matches only “aaa”, “aaaa”, and “aaaaa”
  • () : defines a marked subexpression. Example: ^(([A-Za-z0-9_-]+)[.]([A-Za-z0-9_-]+))+$
  • [] : defines a class of characters. Examples: [0-9] matches any one digit (range class); [a.c] matches only “a” or “.” or “c” (list class)
  • [^] : matches a single character that is not contained within the brackets. Examples: [^abc] matches any character other than “a”, “b”, or “c”; [^a-z] matches any single character that is not a lowercase letter from “a” to “z”
  • $ : matches the ending position of the string or the position just before a string-ending newline. Example: …[.]([A-Za-z0-9_-]+))+$
  • | : matches either the expression before or the expression after the operator. Example: abc|def matches “abc” or “def”
  • \ : changes a metacharacter to a literal. Examples: (.+) matches any expression containing at least one arbitrary character; (\.+) matches any expression containing at least one dot character
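A few of the metacharacters above can be tried out with a short script. This is a sketch under stated assumptions: regex_matches() is a hypothetical helper that uses preg_match() as a stand-in for ereg() (which PHP 7 and later no longer provide), and the patterns are anchored with ^ and $ so that the whole string has to match, which the unanchored examples above do not require.

```php
<?php
// Hypothetical helper (an assumption of this sketch): preg_match() stands in
// for ereg(); "~" is the delimiter and does not occur in the patterns below.
function regex_matches($pattern, $string)
{
    return preg_match('~' . $pattern . '~', $string) === 1;
}

// Each row: pattern, test string, expected result.
$checks = array(
    array('^ab*c$',      'ac',     true),   // * allows zero "b"
    array('^ab*c$',      'abbbc',  true),   // * allows many "b"
    array('^ba+$',       'b',      false),  // + needs at least one "a"
    array('^ba+$',       'baaa',   true),
    array('^ba?$',       'ba',     true),   // ? allows zero or one "a"
    array('^ba?$',       'baa',    false),
    array('^a{3,5}$',    'aa',     false),  // fewer than 3
    array('^a{3,5}$',    'aaaa',   true),
    array('^a{3,5}$',    'aaaaaa', false),  // more than 5
    array('^(abc|def)$', 'def',    true),   // alternation inside a subexpression
    array('^[^abc]$',    'd',      true),   // negated class
    array('^[^abc]$',    'a',      false),
);

foreach ($checks as $check) {
    list($pattern, $string, $expected) = $check;
    $actual = regex_matches($pattern, $string);
    printf("%-14s %-8s %s\n", $pattern, $string, $actual ? 'match' : 'no match');
}
```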

The following list gives the POSIX character classes, which make patterns more comfortable to write, together with the equivalent range where one exists. A class name is used inside a bracket expression, for example [[:digit:]]+ for one or more digits:

  • [:alpha:] : uppercase and lowercase letters, same as [A-Za-z]
  • [:alnum:] : uppercase and lowercase letters and digits, same as [A-Za-z0-9]
  • [:cntrl:] : control characters like TAB, ESC or Backspace
  • [:digit:] : digits from zero to nine, same as [0-9]
  • [:graph:] : printable ASCII characters (33-126)
  • [:lower:] : lowercase letters, same as [a-z]
  • [:punct:] : punctuation characters such as ~`!@#$%^&*()-_+={}[]:;'<>,.?/
  • [:upper:] : uppercase letters, same as [A-Z]
  • [:space:] : whitespace characters like space, newline, carriage return
  • [:xdigit:] : hexadecimal digits, same as [a-fA-F0-9]
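The character classes can be exercised with a sketch like the one below. It assumes a hypothetical helper, regex_matches(), built on preg_match() in place of ereg() (not available in PHP 7 and later); PCRE understands the same [:name:] classes inside a bracket expression.

```php
<?php
// Hypothetical helper (an assumption of this sketch): preg_match() stands in
// for ereg(); "~" is the delimiter and does not occur in the patterns below.
function regex_matches($pattern, $string)
{
    return preg_match('~' . $pattern . '~', $string) === 1;
}

// A class name goes inside a bracket expression, hence the double brackets.
var_dump(regex_matches('^[[:digit:]]+$', '2024'));              // bool(true)
var_dump(regex_matches('^[[:digit:]]+$', '20a4'));              // bool(false)
var_dump(regex_matches('^[[:xdigit:]]+$', 'c0ffee'));           // bool(true)
var_dump(regex_matches('^[[:upper:]][[:lower:]]+$', 'Regex'));  // bool(true)

// A class can share one bracket expression with ordinary list members.
var_dump(regex_matches('^[[:alnum:]_-]+$', 'user_name-01'));    // bool(true)

// A space is not a letter, so [:alpha:] alone rejects this string.
var_dump(regex_matches('^[[:alpha:]]+$', 'two words'));         // bool(false)
```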

The following list gives the PHP POSIX regex functions. Note that this function family was deprecated in PHP 5.3.0 and removed in PHP 7.0.0:

  • int ereg (string $pattern, string $string [, array &$regs]) : searches a string for matches to the regular expression given in pattern, in a case-sensitive way.
  • int eregi (string $pattern, string $string [, array &$regs]) : identical to ereg() except that it ignores case distinction when matching alphabetic characters.
  • string ereg_replace (string $pattern, string $replacement, string $string) : scans string for matches to pattern, then replaces the matched text with replacement.
  • string eregi_replace (string $pattern, string $replacement, string $string) : identical to ereg_replace() except that it ignores case distinction when matching alphabetic characters.
  • array split (string $pattern, string $string [, int $limit]) : splits a string into an array by regular expression.
  • array spliti (string $pattern, string $string [, int $limit]) : identical to split() except that it ignores case distinction when matching alphabetic characters.
  • string sql_regcase (string $string) : creates a regular expression for a case-insensitive match.
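Because the ereg family is missing from PHP 7 and later, the sketch below does not call it. It uses the PCRE functions preg_match(), preg_replace() and preg_split() as stand-ins to show the same three operations: matching with captured subexpressions, replacing, and splitting. The sample strings are made up for illustration.

```php
<?php
$date = '2009-03-15';

// Stand-in for ereg($pattern, $string, $regs): subexpressions land in $regs.
if (preg_match('~^([0-9]{4})-([0-9]{2})-([0-9]{2})$~', $date, $regs)) {
    echo "year={$regs[1]} month={$regs[2]} day={$regs[3]}\n";
}

// Stand-in for eregi(): the "i" modifier ignores case.
var_dump(preg_match('~^php$~i', 'PHP') === 1);  // bool(true)

// Stand-in for ereg_replace(): replace every run of digits with "#".
$masked = preg_replace('~[0-9]+~', '#', 'room 101, floor 3');
echo $masked, "\n";  // room #, floor #

// Stand-in for split(): split on a comma or semicolon plus optional spaces.
$parts = preg_split('~[,;] *~', 'red, green;blue');
print_r($parts);
```

The main visible difference is that PCRE patterns need delimiters (here "~"), while ereg() took the bare pattern.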

Regular expressions are very useful for checking user input. If your site has a contact form with a mandatory e-mail address field, how would you check whether the submitted string has a valid e-mail format? Use a regular expression match! Here are some examples for better understanding:

  • ^(([A-Za-z0-9_-]+)[.]([A-Za-z0-9_-]+))+$ : matches a hostname expression (hostname.example.com)
  • ^([0-9]{1,3})\.([0-9]{1,3})[.]([0-9]{1,3})\.([0-9]{1,3})$ : matches an IP address (192.168.10.122)
  • ^([A-Za-z0-9._-]+)@([A-Za-z0-9._-]+)[.]([a-z]{2,4})$ : matches an e-mail address (mailbox@example.com)
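The three patterns above can be checked against a handful of sample strings. This is a sketch under stated assumptions: regex_matches() is a hypothetical helper that uses preg_match() as a stand-in for ereg() (not available in PHP 7 and later), and the sample strings other than the three from the list are made up.

```php
<?php
// Hypothetical helper (an assumption of this sketch): preg_match() stands in
// for ereg(); "~" is the delimiter and does not occur in the patterns below.
function regex_matches($pattern, $string)
{
    return preg_match('~' . $pattern . '~', $string) === 1;
}

$patterns = array(
    'hostname' => '^(([A-Za-z0-9_-]+)[.]([A-Za-z0-9_-]+))+$',
    'ip'       => '^([0-9]{1,3})\.([0-9]{1,3})[.]([0-9]{1,3})\.([0-9]{1,3})$',
    'email'    => '^([A-Za-z0-9._-]+)@([A-Za-z0-9._-]+)[.]([a-z]{2,4})$',
);

// Each row: pattern name, test string.
$samples = array(
    array('hostname', 'hostname.example.com'),
    array('hostname', 'localhost'),          // no dot, so no match
    array('ip',       '192.168.10.122'),
    array('ip',       '192.168.10'),         // only three groups
    array('ip',       '999.999.999.999'),    // matches: only the shape is checked
    array('email',    'mailbox@example.com'),
    array('email',    'mailbox@example'),    // no top-level domain
);

$results = array();
foreach ($samples as $sample) {
    list($name, $string) = $sample;
    $results[] = regex_matches($patterns[$name], $string);
    printf("%-8s %-22s %s\n", $name, $string, end($results) ? 'OK' : 'WRONG');
}
```

Note that the IP pattern accepts 999.999.999.999: it validates the format, not the numeric range of each group, so a range check would have to be done separately.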

You may have noticed that there is sometimes a choice in how to write a pattern. In the first and third examples above the dot is expressed as a member of a list class, [.], whilst in the second example (the IP address regexp) it is written in some places as an escaped metacharacter, \. (this was done for demonstration purposes).

Another important detail: inside a range class or list class the hyphen is itself a metacharacter, because it forms ranges. To use it as a literal, place it at the end of the class contents, right before the closing square bracket: [... _-].
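The effect of the hyphen's position can be seen in a small sketch. As an assumption, it uses a hypothetical helper, regex_matches(), built on preg_match() in place of ereg() (not available in PHP 7 and later); the hyphen behaves the same way inside a class in both syntaxes.

```php
<?php
// Hypothetical helper (an assumption of this sketch): preg_match() stands in
// for ereg(); "~" is the delimiter and does not occur in the patterns below.
function regex_matches($pattern, $string)
{
    return preg_match('~' . $pattern . '~', $string) === 1;
}

// Hyphen last: a literal member of the list, next to "_".
var_dump(regex_matches('^[a-z_-]+$', 'my-file_name'));  // bool(true)

// Hyphen between two characters: a range. [+-9] means "+" through "9"
// in ASCII order, which covers + , - . / and the digits 0 to 9.
var_dump(regex_matches('^[+-9]+$', '3,5'));             // bool(true)
var_dump(regex_matches('^[+-9]+$', '3*5'));             // bool(false)

// Same three characters with the hyphen moved to the end: now the class
// contains only "+", "9" and "-".
var_dump(regex_matches('^[+9-]+$', '3,5'));             // bool(false)
var_dump(regex_matches('^[+9-]+$', '9-9+'));            // bool(true)
```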

You can play with the examples above by pasting the following code into a regexp.php file and running it in a browser:



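<html>
<head><title>POSIX Regexp Tester</title></head>
<body>
<form method="post">
  Enter String: <input type="text" name="string">
  Select Pattern:
  <select name="pattern">
    <option value="hostname">Hostname</option>
    <option value="ip">IP Address</option>
    <option value="email">Email Address</option>
  </select>
  <input type="submit" value="Test">
</form>
<?php
// The markup and the field names ("string", "pattern") are reconstructed;
// only the visible labels and the final PHP lines survive from the original.
$patterns = array(
    'hostname' => '^(([A-Za-z0-9_-]+)[.]([A-Za-z0-9_-]+))+$',
    'ip'       => '^([0-9]{1,3})\.([0-9]{1,3})[.]([0-9]{1,3})\.([0-9]{1,3})$',
    'email'    => '^([A-Za-z0-9._-]+)@([A-Za-z0-9._-]+)[.]([a-z]{2,4})$',
);
$string = isset($_POST['string']) ? $_POST['string'] : '';
$key = (isset($_POST['pattern']) && isset($patterns[$_POST['pattern']])) ? $_POST['pattern'] : 'hostname';
$pattern = $patterns[$key];

// Escape both values before echoing them back into the page.
echo 'Pattern: ' . htmlspecialchars($pattern) . '<br>';
echo 'String: ' . htmlspecialchars($string) . '<br><br>';

echo 'Match: ';
// ereg() is only available in PHP versions before 7.0.
if (ereg($pattern, $string))
    echo 'OK';
else
    echo 'WRONG';
?>
</body>
</html>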



I hope this post gave you at least a basic overview of POSIX regular expressions and their use in PHP. A future article will take a look at Perl-style regular expressions.



