PHP Regular Expression Functions (Perl-Compatible)

piaoling 2011-05-18 16:26:15

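Excerpted from the PHP Manual

Introduction

The pattern syntax used by these functions closely resembles Perl's. The expression should be enclosed in delimiters, for example a forward slash (/). Any character that is not a letter, a digit, or a backslash (\) can be used as a delimiter. If the delimiter character has to appear inside the expression itself, it must be escaped with a backslash. Since PHP 4.0.4, the Perl-style matching delimiters (), {}, [] and <> can also be used.

The ending delimiter may be followed by various modifiers that affect how matching is performed.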
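The delimiter rules can be illustrated with a small sketch. The subject string and variable names below are made up for this illustration; the same pattern is written with three different delimiters, and only the slash-delimited form needs to escape the slash.

```php
<?php
// Hypothetical subject: a slash followed by digits.
$subject = "page/42";

// Slash delimiter: the literal slash inside the pattern must be escaped.
$a = preg_match('/\/(\d+)/', $subject, $m1);

// Hash delimiter: the slash needs no escaping.
$b = preg_match('#/(\d+)#', $subject, $m2);

// Bracket-style delimiters, with the "i" modifier after the closing one.
$c = preg_match('{/(\d+)}i', $subject, $m3);

echo $m1[1], " ", $m2[1], " ", $m3[1], "\n"; // prints "42 42 42"
```

Choosing a delimiter that does not occur in the pattern usually keeps the expression easier to read.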

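Predefined Constants

These constants are defined by this extension, and are only available when the extension has either been compiled into PHP or dynamically loaded at runtime.

Table 1. PREG constants

Constant	Description
PREG_PATTERN_ORDER	Orders results so that $matches[0] is an array of full pattern matches, $matches[1] is an array of strings matched by the first parenthesized subpattern, and so on. This flag is only used with preg_match_all().
PREG_SET_ORDER	Orders results so that $matches[0] is an array of the first set of matches, $matches[1] is an array of the second set of matches, and so on. This flag is only used with preg_match_all().
PREG_OFFSET_CAPTURE	See the description of PREG_SPLIT_OFFSET_CAPTURE. Available since PHP 4.3.0.
PREG_SPLIT_NO_EMPTY	Makes preg_split() return only non-empty pieces.
PREG_SPLIT_DELIM_CAPTURE	Makes preg_split() also capture parenthesized expressions in the delimiter pattern. Available since PHP 4.0.5.
PREG_SPLIT_OFFSET_CAPTURE	If set, the string offset of each match is also returned. Note that this changes the returned array so that every element is itself an array, with the matched string at index 0 and its offset at index 1. Available since PHP 4.3.0 and only used with preg_split().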

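Examples

Example 1. Examples of valid patterns

/<\/\w+>/
|(\d{3})-\d+|Sm
/^(?i)php[34]/
{^\s+(\s+)?$}

Example 2. Examples of invalid patterns

/href='(.*)' - missing ending delimiter
/\w+\s*\w+/J - unknown modifier 'J'
1-\d3-\d3-\d4| - missing starting delimiter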

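preg_grep

(PHP 4 )

preg_grep -- Return array entries that match the pattern

Description

array preg_grep ( string pattern, array input)

preg_grep() returns an array consisting of the elements of the input array that match the given pattern.

Since PHP 4.0.4, the array returned by preg_grep() is indexed using the keys from the input array. If this behavior is not wanted, use array_values() on the result of preg_grep() to reindex it.

Example 1. preg_grep() example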

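<?php
// sample input (assumed here so that the example runs on its own)
$array = array("1.5", "abc", ".25", "42");

// return all array elements
// containing floating point numbers
$fl_array = preg_grep ("/^(\d+)?\.\d+$/", $array);
?>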
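The key-preserving behavior described above can be seen in a short sketch. The input array is hypothetical sample data, and the variable names are chosen for this illustration only.

```php
<?php
// Hypothetical sample data: a mix of numeric and non-numeric strings.
$input = array("1.5", "abc", ".25", "42", "3.0");

// Keep only elements that look like floating point numbers.
$floats = preg_grep('/^(\d+)?\.\d+$/', $input);
print_r($floats);      // the original keys 0, 2 and 4 are preserved

// Re-index from zero when the original keys are not wanted.
$reindexed = array_values($floats);
print_r($reindexed);   // keys are now 0, 1 and 2
```

Keeping the original keys is handy when the result has to be related back to the input array; array_values() is the simpler choice when only the values matter.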

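preg_match_all

(PHP 3>= 3.0.9, PHP 4 )

preg_match_all -- Perform a global regular expression match

Description

int preg_match_all ( string pattern, string subject, array matches [, int flags])

Searches subject for all matches to the regular expression given in pattern and puts them in matches in the order specified by flags.

After the first match is found, subsequent searches continue from the end of the last match.

flags can be a combination of the following flags (note that it makes no sense to use PREG_PATTERN_ORDER together with PREG_SET_ORDER):

PREG_PATTERN_ORDER

Orders results so that $matches[0] is an array of full pattern matches, $matches[1] is an array of strings matched by the first parenthesized subpattern, and so on.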

<?php
preg_match_all ("|<[^>]+>(.*)</[^>]+>|U",
    "<b>example: </b><div align=left>this is a test</div>",
    $out, PREG_PATTERN_ORDER);
print $out[0][0].", ".$out[0][1]."\n";
print $out[1][0].", ".$out[1][1]."\n";
?>

This example will output:

<b>example: </b>, <div align=left>this is a test</div>
example: , this is a test

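So, $out[0] contains the strings that matched the full pattern, and $out[1] contains the strings enclosed by a pair of HTML tags.

PREG_SET_ORDER

Orders results so that $matches[0] is an array of the first set of matches, $matches[1] is an array of the second set of matches, and so on.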

<?php
preg_match_all ("|<[^>]+>(.*)</[^>]+>|U",
    "<b>example: </b><div align=left>this is a test</div>",
    $out, PREG_SET_ORDER);
print $out[0][0].", ".$out[0][1]."\n";
print $out[1][0].", ".$out[1][1]."\n";
?>

This example will output:

<b>example: </b>, example:
<div align=left>this is a test</div>, this is a test

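In this example, $out[0] is the first set of matches: $out[0][0] holds the text matched by the full pattern, $out[0][1] holds the text matched by the first subpattern, and so on. Similarly, $out[1] is the second set of matches, etc.

PREG_OFFSET_CAPTURE

If this flag is set, the string offset of each match is also returned. Note that this changes the returned array so that every element is itself an array, with the matched string at index 0 and its offset into subject at index 1. This flag is available since PHP 4.3.0.

If no flag is given, PREG_PATTERN_ORDER is assumed.

Returns the number of full pattern matches (which might be zero), or FALSE if an error occurred.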
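As a rough illustration of combining flags and checking the return value, the sketch below pairs PREG_SET_ORDER with PREG_OFFSET_CAPTURE. The subject string and variable names are invented for this example.

```php
<?php
// Hypothetical subject containing two "id=<number>" pairs.
$subject = "id=7, id=42";

// Combine an ordering flag with PREG_OFFSET_CAPTURE using bitwise OR.
$count = preg_match_all('/id=(\d+)/', $subject, $matches,
                        PREG_SET_ORDER | PREG_OFFSET_CAPTURE);

// FALSE signals an error; 0 simply means "no match".
if ($count === false) {
    die("preg_match_all() reported an error\n");
}

foreach ($matches as $set) {
    // Each entry is array(matched text, byte offset in $subject).
    list($text, $offset) = $set[1];
    echo "value $text at offset $offset\n";
}
// prints:
// value 7 at offset 3
// value 42 at offset 9
```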

Example 1. Getting all phone numbers out of some text

<?php
preg_match_all ("/\(?  (\d{3})?  \)?  (?(1)  [\-\s] ) \d{3}-\d{4}/x",
                "Call 555-1212 or 1-800-555-1212", $phones);
?>

Example 2. Find matching HTML tags (greedy)

<?php
// The \\2 is an example of backreferencing. In PCRE it means that the
// contents of the second set of parentheses in the regular expression
// itself must be matched again, which is ([\w]+) in this case. The extra
// backslash is required because the string is in double quotes.
$html = "<b>bold text</b><a href=howdy.html>click me</a>";

preg_match_all ("/(<([\w]+)[^>]*>)(.*)(<\/\\2>)/", $html, $matches);

for ($i=0; $i< count($matches[0]); $i++) {
  echo "matched: ".$matches[0][$i]."\n";
  echo "part 1: ".$matches[1][$i]."\n";
  echo "part 2: ".$matches[3][$i]."\n";
  echo "part 3: ".$matches[4][$i]."\n\n";
}
?>

This example will output:

matched: <b>bold text</b>
part 1: <b>
part 2: bold text
part 3: </b>

matched: <a href=howdy.html>click me</a>
part 1: <a href=howdy.html>
part 2: click me
part 3: </a>

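preg_match

(PHP 3>= 3.0.9, PHP 4 )

preg_match -- Perform a regular expression match

Description

int preg_match ( string pattern, string subject [, array matches [, int flags]])

Searches subject for a match to the regular expression given in pattern.

If matches is provided, it is filled with the results of the search. $matches[0] will contain the text that matched the full pattern, $matches[1] will contain the text that matched the first captured parenthesized subpattern, and so on.

flags can be the following flag:

PREG_OFFSET_CAPTURE: If this flag is set, the string offset of each match is also returned. Note that this changes the returned array so that every element is itself an array, with the matched string at index 0 and its offset at index 1. This flag is available since PHP 4.3.0.

The flags parameter is available since PHP 4.3.0.

preg_match() returns the number of times pattern matches. That will be either 0 times (no match) or 1 time, because preg_match() stops searching after the first match. preg_match_all(), by contrast, continues until it reaches the end of subject. preg_match() returns FALSE if an error occurred.

Tip: Do not use preg_match() if you only want to check whether one string is contained in another string. Use strpos() or strstr() instead, as they are much faster.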
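The return values and the tip above can be sketched as follows. The subject string is taken from the examples in this section; the variable names are assumptions made for this illustration.

```php
<?php
$subject = "PHP is the web scripting language of choice.";

// Plain substring test: strpos() is enough. Compare with !== false,
// because a match at offset 0 would otherwise be read as "not found".
$hasWeb = (strpos($subject, "web") !== false);

// Pattern test: preg_match() returns 1, 0, or FALSE on error,
// so the error case is checked with the identity operator.
$result = preg_match('/\bweb\b/i', $subject, $matches, PREG_OFFSET_CAPTURE);

if ($result === false) {
    echo "pattern error\n";
} elseif ($result === 1) {
    // With PREG_OFFSET_CAPTURE each entry is array(text, offset).
    echo "found '{$matches[0][0]}' at offset {$matches[0][1]}\n";
} else {
    echo "no match\n";
}
// prints: found 'web' at offset 11
```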

Example 1. Find the string of text "php"

<?php
// The "i" after the pattern delimiter indicates a case-insensitive search
if (preg_match ("/php/i", "PHP is the web scripting language of choice.")) {
    print "A match was found.";
} else {
    print "A match was not found.";
}
?>

Example 2. Find the word "web"

<?php
/* The \b in the pattern indicates a word boundary, so only the distinct
 * word "web" is matched, and not part of a word such as "webbing" or "cobweb" */
if (preg_match ("/\bweb\b/i", "PHP is the web scripting language of choice.")) {
    print "A match was found.";
} else {
    print "A match was not found.";
}

if (preg_match ("/\bweb\b/i", "PHP is the website scripting language of choice.")) {
    print "A match was found.";
} else {
    print "A match was not found.";
}
?>

Example 3. Getting the domain name out of a URL

<?php
// get the host name from the URL
preg_match("/^(http:\/\/)?([^\/]+)/i",
    "http://www.php.net/index.html", $matches);
$host = $matches[2];

// get the last two segments of the host name
preg_match("/[^\.\/]+\.[^\.\/]+$/", $host, $matches);
echo "domain name is: {$matches[0]}\n";
?>

This example will output:

domain name is: php.net

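preg_quote

(PHP 3>= 3.0.9, PHP 4 )

preg_quote -- Quote regular expression characters

Description

string preg_quote ( string str [, string delimiter])

preg_quote() takes str and puts a backslash in front of every character that is part of the regular expression syntax. This is useful if you have a string generated at run time that you need to use as a pattern, and the string may contain special regular expression characters.

If the optional delimiter is specified, that character is escaped as well. This is useful for escaping the delimiter required by the PCRE functions; the slash (/) is the most commonly used delimiter.

The special regular expression characters are: . \ + * ? [ ^ ] $ ( ) { } = ! < > | :

Example 1. preg_quote() example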

<?php
$keywords = "$40 for a g3/400";
$keywords = preg_quote ($keywords, "/");
echo $keywords; // returns \$40 for a g3\/400
?>

Example 2. Italicizing a word within some text

<?php
// In this example, preg_quote($word) is used to keep the
// asterisks from having special meaning to the regular expression.

$textbody = "This book is *very* difficult to find.";
$word = "*very*";
$textbody = preg_replace ("/".preg_quote($word)."/",
                          "<i>".$word."</i>",
                          $textbody);
?>
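When the quoted string may contain the delimiter itself, passing the delimiter as the second argument matters. The following is a minimal sketch with a hypothetical search term; the variable names are not from the manual.

```php
<?php
// Hypothetical user-supplied search term containing the delimiter "/"
// and several regular expression metacharacters.
$term = "g3/400 (new)";

// Passing "/" as the second argument escapes the delimiter as well.
$pattern = "/" . preg_quote($term, "/") . "/i";
echo $pattern, "\n";   // prints /g3\/400 \(new\)/i

// The quoted term now matches only its literal text (case-insensitively).
$found = preg_match($pattern, "Offer: G3/400 (NEW) for $40");
```

Without the second argument the slash in the term would end the pattern early and PHP would report an unknown modifier.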

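preg_replace_callback

(PHP 4 >= 4.0.5)

preg_replace_callback -- Perform a regular expression search and replace using a callback

Description

mixed preg_replace_callback ( mixed pattern, callback callback, mixed subject [, int limit])

The behavior of this function is almost identical to preg_replace(), except that instead of a replacement parameter, one specifies a callback function. That function is passed an array of the matched elements in the subject string and should return the replacement string.

Example 1. preg_replace_callback() example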

<?php
  // this text was written for 2002
  // and should now be brought up to date for 2003
  $text = "April fools day is 04/01/2002\n";
  $text.= "Last christmas was 12/24/2001\n";

  // the callback function
  function next_year($matches) {
    // as usual: $matches[0] is the complete match
    // $matches[1] is the match for the first parenthesized subpattern
    // and so on
    return $matches[1].($matches[2]+1);
  }

  echo preg_replace_callback(
              "|(\d{2}/\d{2}/)(\d{4})|",
              "next_year",
              $text);

  // result is:
  // April fools day is 04/01/2003
  // Last christmas was 12/24/2002
?>

You'll often need the callback function for a preg_replace_callback() in just one place. In this case you can use create_function() to declare an anonymous function as the callback within the call to preg_replace_callback(). By doing it this way you have all the information for the call in one place and do not clutter the function namespace with a callback function's name that is not used anywhere else.

Example 2. preg_replace_callback() and create_function()

<?php
  /* a unix-style command line filter that converts the uppercase
   * letter at the beginning of each paragraph to lowercase */

  $fp = fopen("php://stdin", "r") or die("can't read stdin");
  while (!feof($fp)) {
      $line = fgets($fp);
      $line = preg_replace_callback(
          '|<p>\s*\w|',
          create_function(
              // single quotes are essential here,
              // or alternatively escape all $ as \$
              '$matches',
              'return strtolower($matches[0]);'
          ),
          $line
      );
      echo $line;
  }
  fclose($fp);
?>
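Note that create_function() was deprecated in PHP 7.2 and removed in PHP 8.0. On PHP 5.3 and later an anonymous function serves the same purpose. The sketch below assumes a fixed sample string instead of reading from stdin, so that it is easy to try out.

```php
<?php
// Same idea as the create_function() example: lower-case the first
// letter of each paragraph, here on a hypothetical sample string.
$line = "<p>Hello world</p><p> Second paragraph</p>";

$line = preg_replace_callback(
    '|<p>\s*\w|',
    function ($matches) {
        // $matches[0] is the whole match, e.g. "<p>H"
        return strtolower($matches[0]);
    },
    $line
);

echo $line, "\n";   // prints <p>hello world</p><p> second paragraph</p>
```

The closure is written as ordinary PHP code rather than as a string, so the quoting concern from the create_function() version does not arise.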

preg_replace

(PHP 3>= 3.0.9, PHP 4 )

preg_replace -- Perform a regular expression search and replace

Description

mixed preg_replace ( mixed pattern, mixed replacement, mixed subject [, int limit])

Searches subject for matches to pattern and replaces them with replacement. If limit is specified, only limit matches are replaced; if limit is omitted or is -1, all matches are replaced.
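A minimal sketch of the limit parameter, using a made-up subject string:

```php
<?php
$subject = "a-b-c-d";

// No limit (or -1): every match is replaced.
$all = preg_replace('/-/', '+', $subject);      // "a+b+c+d"

// limit = 2: only the first two matches are replaced.
$two = preg_replace('/-/', '+', $subject, 2);   // "a+b+c-d"

echo $all, "\n", $two, "\n";
```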

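replacement may contain references of the form \n (a backslash followed by the number n) or, since PHP 4.0.4, $n, with the latter form being the preferred one. Every such reference is replaced by the text captured by the n-th parenthesized subpattern. n can be from 0 to 99, and \0 or $0 refers to the text matched by the whole pattern. Opening parentheses are counted from left to right (starting from 1) to obtain the number of the capturing subpattern.

When a backreference in the replacement is immediately followed by another digit (that is, a literal digit placed right after a matched subpattern), the familiar \1 notation cannot be used. \11, for example, would leave preg_replace() unable to tell whether you want the \1 backreference followed by a literal 1, or the \11 backreference. The solution in this case is to use ${1}1, which creates an isolated $1 backreference and leaves the 1 as a literal.

Example 1. Using backreferences followed by numeric literals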

<?php
$string = "April 15, 2003";
$pattern = "/(\w+) (\d+), (\d+)/i";
$replacement = "\${1}1,\$3";
print preg_replace($pattern, $replacement, $string);

/* Output
   ======

April1,2003

*/
?>

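If matches are found, the new subject is returned; otherwise subject is returned unchanged.

Every parameter to preg_replace() (except limit) can be an array. When both pattern and replacement are arrays, they are processed in the order in which their keys appear in the array. This is not necessarily the same as the numerical index order. If you use indexes to identify which pattern should be replaced by which replacement, call ksort() on each array before calling preg_replace().

Example 2. Using indexed arrays with preg_replace()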

<?php
$string = "The quick brown fox jumped over the lazy dog.";

$patterns[0] = "/quick/";
$patterns[1] = "/brown/";
$patterns[2] = "/fox/";

$replacements[2] = "bear";
$replacements[1] = "black";
$replacements[0] = "slow";

print preg_replace($patterns, $replacements, $string);

/* Output
   ======

The bear black slow jumped over the lazy dog.

*/

/* By ksorting patterns and replacements,
   we should get what we wanted. */

ksort($patterns);
ksort($replacements);

print preg_replace($patterns, $replacements, $string);

/* Output
   ======

The slow black bear jumped over the lazy dog.

*/
?>

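If subject is an array, the search and replace is performed on every entry of subject, and the return value is an array as well.

If pattern and replacement are both arrays, preg_replace() takes a value from each array in turn and uses them to search and replace on subject. If replacement has fewer values than pattern, an empty string is used for the remaining replacement values. If pattern is an array and replacement is a string, that string is used for every value of pattern. The converse would not make sense, though.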
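The array rules above can be sketched with a few made-up inputs; the strings and variable names are illustrative only.

```php
<?php
$subject = "red green blue";

// Fewer replacements than patterns: the missing ones act as "".
$r1 = preg_replace(array('/red/', '/green/', '/blue/'),
                   array('R', 'G'),
                   $subject);                                   // "R G "

// Array of patterns, single replacement string: used for every pattern.
$r2 = preg_replace(array('/red/', '/blue/'), 'X', $subject);    // "X green X"

// Array subject: each element is processed and an array is returned.
$r3 = preg_replace('/e+/', '_', array("red", "green"));         // "r_d", "gr_n"

var_dump($r1, $r2, $r3);
```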

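The /e modifier makes preg_replace() treat the replacement parameter as PHP code after the appropriate backreferences have been substituted. Tip: make sure that replacement constitutes a valid PHP code string, otherwise PHP will report a parse error at the line containing preg_replace().

Example 3. Replacing several values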

<?php
$patterns = array ("/(19|20)(\d{2})-(\d{1,2})-(\d{1,2})/",
                   "/^\s*{(\w+)}\s*=/");
$replace = array ("\\3/\\4/\\1\\2", "$\\1 =");
print preg_replace ($patterns, $replace, "{startDate} = 1999-5-27");
?>

This example will output:

$startDate = 5/27/1999

Example 4. Using the /e modifier

<?php
preg_replace ("/(<\/?)(\w+)([^>]*>)/e",
              "'\\1'.strtoupper('\\2').'\\3'",
              $html_body);
?>

This would capitalize all HTML tags in the input text.
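The /e modifier was deprecated in PHP 5.5 and removed in PHP 7.0. On current PHP versions the same tag-capitalizing effect can be sketched with preg_replace_callback(); the sample $html_body below is an assumption made for this illustration.

```php
<?php
// Hypothetical input; the manual example leaves $html_body undefined.
$html_body = "<p>Some <b>bold</b> text</p>";

$upper = preg_replace_callback(
    '/(<\/?)(\w+)([^>]*>)/',
    function ($m) {
        // $m[1] is "<" or "</", $m[2] the tag name, $m[3] the rest.
        return $m[1] . strtoupper($m[2]) . $m[3];
    },
    $html_body
);

echo $upper, "\n";   // prints <P>Some <B>bold</B> text</P>
```

Because the callback receives the matches as data, nothing from the subject is ever evaluated as PHP code.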

Example 5. Convert HTML to text

<?php
// $document should contain an HTML document.
// This will remove HTML tags, javascript sections
// and white space. It will also convert some
// common HTML entities to their text equivalent.

$search = array ("'<script[^>]*?>.*?</script>'si",  // strip out javascript
                 "'<[\/\!]*?[^<>]*?>'si",           // strip out HTML tags
                 "'([\r\n])[\s]+'",                 // strip out white space
                 "'&(quot|#34);'i",                 // replace HTML entities
                 "'&(amp|#38);'i",
                 "'&(lt|#60);'i",
                 "'&(gt|#62);'i",
                 "'&(nbsp|#160);'i",
                 "'&(iexcl|#161);'i",
                 "'&(cent|#162);'i",
                 "'&(pound|#163);'i",
                 "'&(copy|#169);'i",
                 "'&#(\d+);'e");                    // evaluate as PHP code

$replace = array ("",
                  "",
                  "\\1",
                  "\"",
                  "&",
                  "<",
                  ">",
                  " ",
                  chr(161),
                  chr(162),
                  chr(163),
                  chr(169),
                  "chr(\\1)");

$text = preg_replace ($search, $replace, $document);
?>

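Note: The limit parameter was added after PHP 4.0.1pl2.

preg_split

(PHP 3>= 3.0.9, PHP 4 )

preg_split -- Split a string by a regular expression

Description

array preg_split ( string pattern, string subject [, int limit [, int flags]])

Returns an array containing the substrings of subject split along the boundaries matched by pattern.

If limit is specified, at most limit substrings are returned. A limit of -1 means "no limit", which is useful when you need to specify the optional flags parameter.

flags can be any combination of the following flags (combined with the bitwise | operator):

PREG_SPLIT_NO_EMPTY

If this flag is set, preg_split() returns only non-empty pieces.

PREG_SPLIT_DELIM_CAPTURE

If this flag is set, parenthesized expressions in the delimiter pattern are captured and returned as well. This flag was added in PHP 4.0.5.

PREG_SPLIT_OFFSET_CAPTURE

If this flag is set, the string offset of each match is also returned. Note that this changes the returned array so that every element is itself an array, with the matched string at index 0 and its offset into subject at index 1. This flag is available since PHP 4.3.0.

Example 1. preg_split() example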

<?php
// split the phrase by any number of commas or space characters,
// which include " ", \r, \t, \n and \f
$keywords = preg_split ("/[\s,]+/", "hypertext language, programming");
?>

Example 2. Splitting a string into component characters

<?php
$str = 'string';
$chars = preg_split('//', $str, -1, PREG_SPLIT_NO_EMPTY);
print_r($chars);
?>

Example 3. Splitting a string into matches and their offsets

<?php
$str = 'hypertext language programming';
$chars = preg_split('/ /', $str, -1, PREG_SPLIT_OFFSET_CAPTURE);
print_r($chars);
?>

This example will output:
Array
(
    [0] => Array
        (
            [0] => hypertext
            [1] => 0
        )

    [1] => Array
        (
            [0] => language
            [1] => 10
        )

    [2] => Array
        (
            [0] => programming
            [1] => 19
        )

)
Note: The flags parameter was added in PHP 4 Beta 3.
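The limit parameter and the remaining flags can be compared side by side in a small sketch. The input string and variable names are hypothetical.

```php
<?php
// Hypothetical input with an empty field between two commas.
$str = "a, b,,c";

// No flags: the empty piece between the two commas is kept.
$plain = preg_split('/\s*,\s*/', $str);
// array("a", "b", "", "c")

// PREG_SPLIT_NO_EMPTY drops the empty piece; -1 means "no limit".
$noEmpty = preg_split('/\s*,\s*/', $str, -1, PREG_SPLIT_NO_EMPTY);
// array("a", "b", "c")

// PREG_SPLIT_DELIM_CAPTURE also returns the parenthesized delimiter part.
$withDelims = preg_split('/\s*(,)\s*/', $str, -1,
                         PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE);
// array("a", ",", "b", ",", ",", "c")

// limit = 2: at most two pieces, the last one holding the remainder.
$firstTwo = preg_split('/\s*,\s*/', $str, 2);
// array("a", "b,,c")

print_r($withDelims);
```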
