php regex markdown preg-replace-callback

Using text styling to create a list with sublists

I am trying to write markup that formats an ordered list. Here is the markup style:

$strings = "1. dog
1. cat
1. fish
 1. horse
 1. monkey
1. pig
";

In that list, horse and monkey should be part of a sublist, since they have one space before the number. Here is the code that I am using:

function blq($match){
    $str = preg_replace("/^1\. (.+?)$/m", "<li>$1</li>", $match[0]);
    $str = preg_replace_callback("/(?:^1\. .+(\n|$))+/m", 'blq', $str);
    return "<ol>$str</ol>";
}

$string = preg_replace_callback("/(?:^ ?1\. .+(\n|$))+/m", 'blq', $strings);

echo $string;

That code is creating this output:

<ol><li>dog
</li>
<li>cat
</li>
<li>fish
</li>
 1. horse
 1. monkey
<li>pig
</li>
</ol>

The horse and monkey lines were not turned into a sublist; they were simply ignored. I feel that I am getting close to the answer, but I am not sure what to do next.

Note: I would like to allow an unlimited number of sublists.

Solution

<?php

$text = "1. dog
1. cat
1. fish
 1. horse
  1. duck
   1. goose
  1. swan
 1. monkey
  1. chimpanzee
  1. orangutan
  1. whale
1. pig
";

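// Turns one run of list lines into an <ol>. $match[0] is the whole run and
// $match[2] is the indentation (spaces) of its first line.
function callback($match) {
    // First convert any runs indented by at least one more space into nested lists.
    $out = preg_replace_callback("/(^($match[2] +)1\. .+(\\n|$))(?1)*/m", 'callback', $match[0]);
    // Then wrap the lines at exactly this level's indentation in <li></li>.
    $out = preg_replace("/^$match[2]1\. (.+)$/m", "<li>$1</li>", $out);
    return "<ol>\n$out</ol>\n";
}

// Match each run of consecutive list lines; group 2 captures the first line's indentation.
$html = preg_replace_callback("/(^( *)1\. .+(\\n|$))(?1)*/m", 'callback', $text);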

echo $html;

?>

Here's an ideone demo.

That's a pretty neat idea you had, using preg_replace_callback recursively. You're also right that sequences like $1 and $) are not interpolated within double quotes, because PHP only interpolates a $ that is followed by a valid variable name; I always forget that. You were right to use /m, since you want ^ to match the beginning of each line (not just the beginning of the entire string). And you were right to use (\n|$) even though $ already matches at the end of each line in /m mode: $ is zero-width and does not consume the \n, so without it the + quantifier could never continue onto the next line. I didn't notice these points when I first read your question.
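To make those two points concrete, here is a minimal sketch (the variable names are mine, not from the question). It checks that "$1" survives double quotes untouched, and compares a pattern that ends each line with a bare $ against one that ends with (\n|$).

```php
<?php
// "$1" stays literal in double quotes: a variable name cannot start with a digit.
$replacement = "<li>$1</li>";
var_dump($replacement === '<li>$1</li>'); // bool(true)

$list = "1. dog\n1. cat\n1. fish\n";

// A bare $ is zero-width, so the group cannot repeat across the newline:
// every line becomes its own match.
preg_match_all("/(?:^1\. .+$)+/m", $list, $perLine);

// (\n|$) consumes the newline, so + can run across consecutive lines:
// the whole list is a single match.
preg_match_all("/(?:^1\. .+(\n|$))+/m", $list, $wholeRun);

echo count($perLine[0]), "\n";  // 3
echo count($wholeRun[0]), "\n"; // 1
```

The second pattern is the one that lets a callback see the whole list at once, which is what wrapping it in a single <ol> requires.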

Now, let's start with the first expression:

/(^( *)1\. .+(\\n|$))(?1)*/m

Actually, the subpattern call (?1), which simply re-runs the pattern of group 1, isn't necessary except as shorthand. Let's expand it:

/(^( *)1\. .+(\\n|$))(^( *)1\. .+(\\n|$))*/m
 |                  ||                  |
 +------------------++------------------+

So we have two identical halves. Why not just use + as you did? Because I want to capture only the spaces indenting the first line, which get stored in $match[2]. If the whole group were repeated with +, group 2 would instead hold the indentation of the last line matched.
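As a quick check of that equivalence, the sketch below runs both the shorthand and the expanded pattern against a small sample of my own (the variable names are illustrative). Both should match the same text, and in both cases group 2 holds the indentation of the first line only.

```php
<?php
$text = " 1. horse\n  1. duck\n 1. monkey\n";

$short    = "/(^( *)1\. .+(\\n|$))(?1)*/m";
$expanded = "/(^( *)1\. .+(\\n|$))(^( *)1\. .+(\\n|$))*/m";

preg_match($short, $text, $a);
preg_match($expanded, $text, $b);

var_dump($a[0] === $b[0]); // bool(true): both match all three lines
var_dump($a[2]);           // string(1) " "  (indentation of the first line)
var_dump($b[2]);           // string(1) " "
```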

Within the callback, we bring those spaces back, plus one or more spaces:

/(^($match[2] +)1\. .+(\\n|$))(?1)*/m
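For illustration, here is what that interpolated pattern looks like at the first sublevel. The $indent variable is a stand-in of mine for $match[2], and the sample block is made up.

```php
<?php
$indent = " "; // stands in for $match[2] when the current level has one space

$deeper = "/(^($indent +)1\. .+(\\n|$))(?1)*/m";
echo $deeper, "\n"; // /(^(  +)1\. .+(\n|$))(?1)*/m  (requires two or more spaces)

$block = " 1. horse\n  1. duck\n   1. goose\n 1. monkey\n";
preg_match($deeper, $block, $m);

var_dump($m[0]); // the run "  1. duck\n   1. goose\n"; horse and monkey are skipped
var_dump($m[2]); // string(2) "  "  (becomes $match[2] one level down)
```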

That way, we only ever look at levels beneath the current level of indentation (more spaces), on each level of preg_replace_callback recursion. And as the recursions unwind, only the lines indented by exactly that level's number of spaces, $match[2], are wrapped in <li></li>,

/^$match[2]1\. (.+)$/m

before returning the whole wrapped in <ol></ol>.
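Putting the pieces together on a shorter, three-level input makes the unwinding easier to follow. This is the same logic as callback() above, repeated under the hypothetical name nest_list() so the snippet runs on its own; the sample list is mine.

```php
<?php
function nest_list($match) {
    // Deeper runs first, then this level's own lines.
    $out = preg_replace_callback("/(^($match[2] +)1\. .+(\\n|$))(?1)*/m", 'nest_list', $match[0]);
    $out = preg_replace("/^$match[2]1\. (.+)$/m", "<li>$1</li>", $out);
    return "<ol>\n$out</ol>\n";
}

$text = "1. dog\n 1. horse\n  1. duck\n 1. monkey\n1. pig\n";
$html = preg_replace_callback("/(^( *)1\. .+(\\n|$))(?1)*/m", 'nest_list', $text);

echo $html;
// <ol>
// <li>dog</li>
// <ol>
// <li>horse</li>
// <ol>
// <li>duck</li>
// </ol>
// <li>monkey</li>
// </ol>
// <li>pig</li>
// </ol>
```

Note that each nested <ol> comes out as a sibling of the surrounding <li> elements rather than inside one. Strictly valid HTML expects the nested list inside an <li>, so you may want to adjust the wrapping if that matters for your use.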