考虑匹配圆括号内字符串的问题,允许无限嵌套括号。如果不使用递归, 最好的方式是使用一个模式匹配固定深度的嵌套。它不能处理任意深度的嵌套。 perl 5.6 提供了一个实验性的功能允许正则表达式递归。 特殊项 (?R) 提供了递归的这种特殊用法。 这个PCRE模式解决了圆括号问题(假设 PCRE_EXTENDED 选项被设置了, 因此空白字符被忽略): \( ( (?>[^()]+) | (?R) )* \)。
首先,它匹配一个左括号。 然后它匹配任意数量的非括号字符序列或一个模式自身的递归匹配(比如, 一个正确的括号子串),最终,匹配一个右括号。
这个例子模式包含无限重复的嵌套,因此使用了一次性子组匹配非括号字符, 这在模式应用到模式不匹配的字符串时非常重要。比如, 当它应用到 (aaaaaaaaaaaaaaaaaaaaaaaaaaaaa() 时就会很快的产生”不匹配”结果。 然而,如果不使用一次性子组,这个匹配将会运行很长时间, 因为有很多途径让 + 和 * 重复限定分隔目标字符串, 并且在报告失败之前需要测试所有路径。
所有捕获子组最终被设置的捕获值都是从递归最外层子模式捕获的值。 如果上面的模式匹配 (ab(cd)ef),捕获子组最终被设置的值为 ”ef”, 即顶级得到的最后一个值。如果增加了额外的括号,\( ( ( (?>[^()]+) | (?R) )* ) \),捕获到的字符串就是顶层括号的匹配内容 ”ab(cd)ef”。 如果在模式中有超过 15 个捕获括号, PCRE 在递归期间就会使用 pcre_malloc 分配额外的内存来存储数据, 随后通过 pcre_free 释放他们。如果没有内存可被分配,它就仅保存前 15 个捕获括号, 在递归内部无法给出内存不够用的错误。
从 PHP 4.3.3 开始, (?1)、(?2) 等可以用于递归子组。这同样可以用于命名子组: (?P>name) 或 (?P&name)。
如果递归子组语法在它提到的子组括号外部使用(无论是子组数字序号还是子组名称), 这个操作就相当于程序设计语言中的子程序。 前面一些有一个例子指出模式 (sens|respons)e and \1ibility 匹配 ”sense and responsibility” 和 ”response and responsibility”,但是不匹配 ”sense and responsibility”。如果用模式 (sens|respons)e and (?1)ibility 替代, 它会像匹配那两个字符串一样匹配 ”sense and responsibility”。 这种引用方式意义是紧接着匹配引用的子模式。(译注: 后向引用只匹配引用的子组之前匹配的结果, 这里的递归语法引用是拿引用的子模式重新匹配。)
目标字符串的最大长度是 int 型变量可以存储的最大正整数。然而, PCRE 使用递归处理子组和无限重复。 这就是说对于某些模式可用的栈空间可能会受目标字符串限制。
horvath at webarticum dot hu (2013-04-15 12:09:14)
Below are some reusable patterns. I used comments with the 'x' modifier for human-readability.
You can write also a function, that generates patterns for specified bracket/quote pairs.
<?php
/* normal paretheses */
$simple_pattern = "( (?#root pattern)
( (?#text or expression)
(?>[^\\(\\)]+) (?#text)
|
\\((?-2)\\) (?#expression and recursion)
)*
)";
$simple_okay_text = "5( 2a + (b - c) ) - a * ( 2b - (c * 3(b - (c + a) ) ) )";
$simple_bad_text = "5( 2)a + (b - c) ) - )a * ( ((2b - (c * 3(b - (c + a) ) ) )";
echo "Simple pattern results:\n";
var_dump(preg_match("/^$simple_pattern\$/x",$simple_okay_text));
var_dump(preg_match("/^$simple_pattern\$/x",$simple_bad_text));
echo "\n----------\n";
/* some brackets */
$full_pattern = "( (?#root pattern)
( (?#text or expression)
(?>[^\\(\\)\\{\\}\\[\\]<>]+) (?#text not contains brackets)
|
(
[\\(\\{\\[<] (?#start bracket)
(?(?<=\\()(?-3)\\)| (?#if normal)
(?(?<=\\{)(?-3)\\}| (?#if coursed)
(?(?<=\\[)(?-3)\\]| (?#if squared)
(?1)\\> (?#else so if tag)
))) (?#close nested-but-logically-the-some-level subpatterns)
)
)*
)";
$full_okay_text = "5( 2a + [b - c] ) - a * ( 2b - {c * 3<b - (c + a) > } )";
$full_bad_text = "5[ 2a + [b - c} ) - a * ( 2b - [c * 3(b - c + a) ) ) }";
echo "Full pattern results:\n";
var_dump(preg_match("/^$full_pattern\$/x",$simple_okay_text));
var_dump(preg_match("/^$full_pattern\$/x",$full_okay_text));
var_dump(preg_match("/^$full_pattern\$/x",$full_bad_text));
echo "\n----------\n";
/* some brackets and quotes */
$extrafull_pattern = "( (?#root pattern)
( (?#text or expression)
(?>[^\\(\\)\\{\\}\\[\\]<>'\"]+) (?#text not contains brackets and quotes)
|
(
([\\(\\{\\[<'\"]) (?#start bracket)
(?(?<=\\()(?-4)\\)| (?#if normal)
(?(?<=\\{)(?-4)\\}| (?#if coursed)
(?(?<=\\[)(?-4)\\]| (?#if squared)
(?(?<=\\<)(?-4)\\>| (?#if tag)
['\"] (?#else so if static)
)))) (?#close nested-but-logically-the-some-level subpatterns)
)
)*
)";
$extrafull_okay_text = "5( 2a + ['b' - c] ) - a * ( 2b - \"{c * 3<b - (c + a) > }\" )";
$extrafull_bad_text = "5( 2a + ['b' - c] ) - a * ( 2b - \"{c * 3<b - (c + a) > }\" )";
echo "Extra-full pattern results:\n";
var_dump(preg_match("/^$extrafull_pattern\$/x",$simple_okay_text));
var_dump(preg_match("/^$extrafull_pattern\$/x",$full_okay_text));
var_dump(preg_match("/^$extrafull_pattern\$/x",$extrafull_okay_text));
var_dump(preg_match("/^$extrafull_pattern\$/x",$extrafull_bad_text));
/*
Outputs:
Simple pattern results:
int(0)
int(0)
----------
Full pattern results:
int(0)
int(0)
int(0)
----------
Extra-full pattern results:
int(0)
int(0)
int(0)
int(0)
*/
?>
horvath at webarticum dot hu (2013-04-15 12:07:51)
With the (?R) item you can link only to the full pattern, because it quasi equals to (?0). You can not use anchors, asserts etc., and you can only check that string CONTAINS a valid hierarchy or not.
This is wrong: ^\(((?>[^()]+)|(?R))*\)$
However, you can bracketing the full expression, and replace (?R) to the relative link (?-2). This make it reusable. So you can check complex expressions, for example:
<?php
$bracket_system = "(\\(((?>[^()]+)|(?-2))*\\))"; // (reuseable)
$bracket_systems = "((?>[^()]+)?$bracket_system)*(?>[^()]+)?"; // (reuseable)
$equation = "$bracket_systems=$bracket_systems"; // Both side of the equation must be contain valid bracket systems
var_dump(preg_match("/^$equation\$/","a*(a-(2a+2))=4(a+3)-2(a-(a-2))")); // Outputs 'int(1)'
var_dump(preg_match("/^$equation\$/","a*(a-(2a+2)=4(a+3)-2(a-(a-2)))")); // Outputs 'int(0)'
?>
You can also catch multibyte quotes with the 'u' modifier (if you use UTF-8), eg:
<?php
$quoted = "(?((?>[^??]+)|(?-2))*?)"; // (reuseable)
$prompt = "[\\w ]+: $quoted";
var_dump(preg_match("/^$prompt\$/u","Your name: ?write here?")); // Outputs 'int(1)'
?>
Daniel Klein (2012-06-06 04:18:40)
The order of non-mutually exclusive alternatives within a recursed sub-pattern is important.
<?php
$pattern = '/^(?<octet>[01]?\d?\d|2[0-4]\d|25[0-5])(?:\.(?P>octet)){3}$/';
?>
You might expect that this pattern will match any IP address in dotted-decimal notation (e.g. '123.45.67.89'). The pattern is intended to match four octets in the following ranges: 0-9, 00-99 & 000-255, each separated by a single dot. However, only the first octet can include values from 200-255; the remainder can only have values less than 200. The reason for this is that if the rest of the pattern fails, recursion is not back-tracked into to find an alternative match. The first part of the sub-pattern will match the first two digits of any octet from 200-255. The rest of the pattern will then fail because the third digit in the octet does not match either '\.' or '$'.
<?php
var_export(preg_match($pattern, '255.123.45.67')); // 1 (true)
var_export(preg_match($pattern, '255.200.45.67')); // 0 (false)
var_export(preg_match($pattern, '255.123.45.200')); // 0 (false)
?>
The correct pattern is:
<?php
$pattern = '/^(?<octet>25[0-5]|2[0-4]\d|[01]?\d?\d)(?:\.(?P>octet)){3}$/';
?>
Note that the first two alternatives are mutually exclusive so their order is unimportant. The third alternative, however, is not mutually exclusive but it will now only match when the first two fail.
<?php
var_export(preg_match($pattern, '255.123.45.67')); // 1 (true)
var_export(preg_match($pattern, '255.200.45.67')); // 1 (true)
var_export(preg_match($pattern, '255.123.45.200')); // 1 (true)
?>
Onyxagargaryll (2011-03-03 04:49:59)
Here's an approach to create a multidimensional array according to a string's delimiters, i.e. we want to analyze...
"some text (aaa(b(c1)(c2)d)e)(test) more text"
... as multidimensional layers.
<?php
$string = "some text (aaa(b(c1)(c2)d)e)(test) more text";
/*
* Analyses the string multidimensionally by its opening and closing delimiters
*/
function recursiveSplit($string, $layer) {
preg_match_all("/\((([^()]*|(?R))*)\)/",$string,$matches);
// iterate thru matches and continue recursive split
if (count($matches) > 1) {
for ($i = 0; $i < count($matches[1]); $i++) {
if (is_string($matches[1][$i])) {
if (strlen($matches[1][$i]) > 0) {
echo "<pre>Layer ".$layer.": ".$matches[1][$i]."</pre><br />";
recursiveSplit($matches[1][$i], $layer + 1);
}
}
}
}
}
recursiveSplit($string, 0);
/*
Output:
Layer 0: aaa(b(c1)(c2)d)e
Layer 1: b(c1)(c2)d
Layer 2: c1
Layer 2: c2
Layer 0: test
*/
?>
jonah at nucleussystems dot com (2010-12-21 18:18:33)
An unexpected behavior came up that introduced a very hard-to-track bug in some code I was working on. It has to do with the preg_match_all PREG_OFFSET_CAPTURE flag. When you capture the offset of a sub-match, it's offset is given _relative_ to it's parent. For example, if you extract the value between < and > recursively in this string:
<this is a <string>>
You will get an array that looks like this:
Array
(
[0] => Array
(
[0] => Array
(
[0] => <this is a <string>>
[1] => 0
)
[1] => Array
(
[0] => this is a <string>
[1] => 1
)
)
[1] => Array
(
[0] => Array
(
[0] => <string>
[1] => 0
)
[1] => Array
(
[0] => string
[1] => 1
)
)
)
Notice that the offset in the last index is one, not the twelve we expected. The best way to solve this problem is to run over the results with a recursive function, adding the parent's offset.
emanueledelgrande at email dot it (2010-01-09 15:47:35)
The recursion in regular expressions is the only way to allow the parsing of HTML code with nested tags of indefinite depth.
It seems it's not yet a spreaded practice; not so much contents are available on the web regarding regexp recursion, and until now no user contribute notes have been published on this manual page.
I made several tests with complex patterns to get tags with specific attributes or namespaces, studying the recursion of a subpattern only instead of the full pattern.
Here's an example that may power a fast LL parser with recursive descent (http://en.wikipedia.org/wiki/Recursive_descent_parser):
$pattern = "/<([\w]+)([^>]*?) (([\s]*\/>)| (>((([^<]*?|<\!\-\-.*?\-\->)| (?R))*)<\/\\1[\s]*>))/xsm";
The performances of a preg_match or preg_match_all function call over an avarage (x)HTML document are quite fast and may drive you to chose this way instead of classic DOM object methods, which have a lot of limits and are usually poor in performance with their workarounds, too.
I post a sample application in a brief function (easy to be turned into OOP), which returns an array of objects:
<?php
// test function:
function parse($html) {
// I have split the pattern in two lines not to have long lines alerts by the PHP.net form:
$pattern = "/<([\w]+)([^>]*?)(([\s]*\/>)|".
"(>((([^<]*?|<\!\-\-.*?\-\->)|(?R))*)<\/\\1[\s]*>))/sm";
preg_match_all($pattern, $html, $matches, PREG_OFFSET_CAPTURE);
$elements = array();
foreach ($matches[0] as $key => $match) {
$elements[] = (object)array(
'node' => $match[0],
'offset' => $match[1],
'tagname' => $matches[1][$key][0],
'attributes' => isset($matches[2][$key][0]) ? $matches[2][$key][0] : '',
'omittag' => ($matches[4][$key][1] > -1), // boolean
'inner_html' => isset($matches[6][$key][0]) ? $matches[6][$key][0] : ''
);
}
return $elements;
}
// random html nodes as example:
$html = <<<EOD
<div id="airport">
<div geo:position="1.234324,3.455546" class="index">
<!-- comment test:
<div class="index_top" />
-->
<div class="element decorator">
<ul class="lister">
<li onclick="javascript:item.showAttribute('desc');">
<h3 class="outline">
<a href="http://php.net/manual/en/regexp.reference.recursive.php" onclick="openPopup()">Link</a>
</h3>
<div class="description">Sample description</div>
</li>
</ul>
</div>
<div class="clean-line"></div>
</div>
</div>
<div id="omittag_test" rel="rootChild" />
EOD;
// application:
$elements = parse($html);
if (count($elements) > 0) {
echo "Elements found: <b>".count($elements)."</b><br />";
foreach ($elements as $element) {
echo "<p>Tpl node: <pre>".htmlentities($element->node)."</pre>
Tagname: <tt>".$element->tagname."</tt><br />
Attributes: <tt>".$element->attributes."</tt><br />
Omittag: <tt>".($element->omittag ? 'true' : 'false')."</tt><br />
Inner HTML: <pre>".htmlentities($element->inner_html)."</pre></p>";
}
}
?>