php regex markdown preg-replace-callback

Using text styling to create a list with sublists


I am trying to write markup that formats an ordered list. Here is the markup style:

$strings = "1. dog
1. cat
1. fish
 1. horse
 1. monkey
1. pig
";

The horse and monkey items should form a sublist, since they have one space before the number. Here is the code that I am using:

function blq($match){
    $str = preg_replace("/^1\. (.+?)$/m", "<li>$1</li>", $match[0]);
    $str = preg_replace_callback("/(?:^1\. .+(\n|$))+/m", 'blq', $str);
    return "<ol>$str</ol>";
}

$string = preg_replace_callback("/(?:^ ?1\. .+(\n|$))+/m", 'blq', $strings);

echo $string;

That code is creating this output:

<ol><li>dog
</li>
<li>cat
</li>
<li>fish
</li>
 1. horse
 1. monkey
<li>pig
</li>
</ol>

horse and monkey were not turned into a sublist; they were simply left untouched. I feel that I am getting close to the answer, but I am not sure how to get there.

Note: I would like to allow an unlimited number of sublists.


Solution

    <?php
    
    $text = "1. dog
    1. cat
    1. fish
     1. horse
      1. duck
       1. goose
      1. swan
     1. monkey
      1. chimpanzee
      1. orangutan
      1. whale
    1. pig
    ";
    
    function callback($match) {
        $out = preg_replace_callback("/(^($match[2] +)1\. .+(\\n|$))(?1)*/m", 'callback', $match[0]);
        $out = preg_replace("/^$match[2]1\. (.+)$/m", "<li>$1</li>", $out);
        return "<ol>\n$out</ol>\n";
    }
    
    $html = preg_replace_callback("/(^( *)1\. .+(\\n|$))(?1)*/m", 'callback', $text);
    
    echo $html;
    
    ?>
    

    Here's an ideone demo.


    That's a pretty neat idea you had, using preg_replace_callback recursively. Also, you're right that a $ inside a double-quoted string is only interpolated when it starts a valid variable name, so the $ in (\n|$) stays literal; I always forget that. You were right to use /m, since you want ^ to match at the beginning of each line (not just at the beginning of the entire string). And you were right to use (\n|$) even though $ already matches at the end of each line in /m mode: $ is zero-width and never consumes the \n, so without the explicit \n the next repetition could not start at ^ and the + quantifier would stop after one line. I didn't see these facts when I first read your question.
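To see why the explicit \n matters, here is a minimal sketch that compares the two patterns on a two-line list. The variable names ($list, $withoutNewline, $withNewline) are made up for illustration.

```php
<?php
$list = "1. dog\n1. cat\n";

// $ is zero-width: it asserts end-of-line but leaves the "\n" unconsumed,
// so a second repetition cannot start at ^ and the match stops after one line.
preg_match('/(?:^1\. .+$)+/m', $list, $withoutNewline);

// Consuming the "\n" explicitly lets + carry on into the following line.
preg_match('/(?:^1\. .+(\n|$))+/m', $list, $withNewline);

var_dump($withoutNewline[0]); // "1. dog"
var_dump($withNewline[0]);    // "1. dog\n1. cat\n"
```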

    Now, let's start with the first expression:

    /(^( *)1\. .+(\\n|$))(?1)*/m
    

    Actually, the subroutine call (?1), which simply re-runs the pattern of group 1, isn't necessary except as shorthand. Let's expand it:

    /(^( *)1\. .+(\\n|$))(^( *)1\. .+(\\n|$))*/m
     |                  ||                  |
     +------------------++------------------+
    

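    So we have two identical halves. Why not just use + as you did? Because I want to capture only the indentation of the first line. A group repeated with + keeps whatever its last iteration captured, whereas captures set inside a (?1) call are discarded when the call returns, so the spaces in front of the first line stay in $match[2].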
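The difference between + and (?1) can be checked directly. This is a small sketch that relies on PCRE's documented behaviour of restoring captures after a subroutine call; the sample input and the variable names ($plus, $sub) are illustrative only.

```php
<?php
$list = "1. dog\n 1. horse\n";

// Repeating the group with +: the capture is overwritten on every iteration,
// so group 1 ends up holding the indentation of the LAST line (one space).
preg_match('/(?:^( *)1\. .+(?:\n|$))+/m', $list, $plus);

// Subroutine call: (?1) re-runs the pattern of group 1, but captures set during
// the call are restored afterwards, so group 2 keeps the FIRST line's indentation.
preg_match('/(^( *)1\. .+(\n|$))(?1)*/m', $list, $sub);

var_dump($plus[1]); // string(1) " "
var_dump($sub[2]);  // string(0) ""
```

Both patterns match the same two lines; only the remembered indentation differs.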

    Within the callback, we bring those spaces back, plus one or more spaces:

    /(^($match[2] +)1\. .+(\\n|$))(?1)*/m
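As a quick check of what that interpolation produces, assume the outer match captured a single space of indentation. The $match array below is a hand-built stand-in, not real callback input, and $block is a made-up sample.

```php
<?php
// Hand-built stand-in for the callback argument: the outer level was indented by one space.
$match = [2 => ' '];

// Same double-quoted pattern as in the callback; only $match[2] is interpolated.
$pattern = "/(^($match[2] +)1\. .+(\\n|$))(?1)*/m";
echo $pattern, "\n"; // /(^(  +)1\. .+(\n|$))(?1)*/m

$block = " 1. horse\n  1. duck\n  1. swan\n 1. monkey\n";
preg_match($pattern, $block, $deeper);
var_dump($deeper[0]); // only the two-space lines: "  1. duck\n  1. swan\n"
```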
    

    That way, we only ever look at levels beneath the current level of indentation (more spaces), on each level of preg_replace_callback recursion. And as the recursions unwind, only the lines indented by exactly that level's number of spaces, $match[2], are wrapped in <li></li>,

    /^$match[2]1\. (.+)$/m
    

    before the whole block is returned wrapped in <ol></ol>.
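Putting the pieces together, here is the same approach applied to the list from the question. It is only a sketch: the callback is renamed (wrapLevel is an illustrative name, not from the original code), the indentation is copied into a local $indent, and comments are added.

```php
<?php
// wrapLevel is an illustrative name; the logic mirrors the callback above.
function wrapLevel(array $match): string
{
    $indent = $match[2]; // indentation of the first line of this block

    // 1. Recurse into runs of lines indented more deeply than this level.
    $out = preg_replace_callback(
        "/(^({$indent} +)1\. .+(\\n|$))(?1)*/m",
        'wrapLevel',
        $match[0]
    );

    // 2. Wrap the lines that sit exactly at this level.
    $out = preg_replace("/^{$indent}1\. (.+)$/m", '<li>$1</li>', $out);

    return "<ol>\n$out</ol>\n";
}

$strings = "1. dog\n1. cat\n1. fish\n 1. horse\n 1. monkey\n1. pig\n";
$html = preg_replace_callback("/(^( *)1\. .+(\\n|$))(?1)*/m", 'wrapLevel', $strings);
echo $html;
```

For that input the script prints one outer <ol> holding dog, cat, fish and pig, with horse and monkey in a nested <ol>. Note that the nested <ol> is emitted between the fish and pig items rather than inside the fish <li>.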