Convert LaTeX markup to HTML


[UPDATED]

This is my task: converting a bunch of custom-built LaTeX files into InDesign. My current method is to run the .tex files through a PHP script that changes the custom LaTeX codes to more generic TeX codes, then use TeX2Word to convert them to .doc files, and then place those into InDesign.

What I want to do with this preg_replace is convert a few of the TeX tags to HTML-like tags so they won't be touched by TeX2Word. Then I'll be able to run a script in InDesign that changes the HTML-like tags to InDesign text frames, footnotes, variables and such.

[/UPDATED]

I have some text with LaTeX markup in it:

$newphrase = "\blockquote{\hspace*{.5em}Lorem ipsum dolor sit amet, consectetur
adipiscing elit. Integer posuere erat a ante venenatis dapibus posuere
velit aliquet. Aenean lacinia bibendum nulla sed consectetur. Aenean
eu leo quam. Pellentesque ornare sem lacinia quam venenatis
vestibulum. Sed posuere consectetur est at lobortis. \note{Integer
posuere erat a ante venenatis dapibus posuere velit aliquet.
\textit{Vivamus} sagittis lacus vel augue laoreet rutrum faucibus
dolor auctor.}}";

What I want to do is remove \blockquote{...} and replace it with <div>...</div>

So I've tried a jillion different versions of this:

$regex = "#(blockquote){(.*)(})#";
$replace = "<div>$2</div>";
$newphrase = preg_replace($regex,$replace,$newphrase);

This is the output

\<div>\hspace*{.5em</div>Lorem ipsum dolor sit amet, consectetur
adipiscing elit. Integer posuere erat a ante venenatis dapibus posuere
velit aliquet. Aenean lacinia bibendum nulla sed consectetur. Aenean
eu leo quam. Pellentesque ornare sem lacinia quam venenatis
vestibulum. Sed posuere consectetur est at lobortis. \note{Integer
posuere erat a ante venenatis dapibus posuere velit aliquet.
\textit{Vivamus} sagittis lacus vel augue laoreet rutrum faucibus
dolor auctor.}}

The first problem is that it replaces everything from \blockquote{ to the first }, when I want it to ignore that } if there has been another { after the initial \blockquote{.

The next problem I'm having is with the \: I can't seem to escape it! I've tried \\, /\\/, \\\, /\\\/, [\], [\\]. Nothing works! I'm sure it's because I don't understand how it's really supposed to work.
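For reference, here is a minimal sketch of the two escaping layers involved. The sample `$input` is a shortened, hypothetical stand-in for the text above. PHP string parsing consumes one level of backslashes and the regex engine consumes another, so a single literal backslash needs four in the source. Note also that in a double-quoted string, `\n` and `\t` are PHP escape sequences, so `"\note"` and `"\textit"` would no longer begin with a backslash at all.

```php
<?php
// Single-quoted PHP string: '\\\\' in the source becomes two backslashes in
// the string, which the regex engine then reads as one literal backslash.
$pattern = '/\\\\blockquote\{/';

// Single quotes keep the input verbatim. In double quotes, "\note" would
// start with a newline and "\textit" with a tab.
$input = '\blockquote{Lorem \note{ipsum}}';

$matched = preg_match($pattern, $input, $m);
echo $matched, ' ', $m[0], PHP_EOL; // prints: 1 \blockquote{
```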

So finally, this is what I want to end up with:

<div>\hspace*{.5em}Lorem ipsum dolor sit amet, consectetur
adipiscing elit. Integer posuere erat a ante venenatis dapibus posuere
velit aliquet. Aenean lacinia bibendum nulla sed consectetur. Aenean
eu leo quam. Pellentesque ornare sem lacinia quam venenatis
vestibulum. Sed posuere consectetur est at lobortis. \note{Integer
posuere erat a ante venenatis dapibus posuere velit aliquet.
\textit{Vivamus} sagittis lacus vel augue laoreet rutrum faucibus
dolor auctor.}</div>

I'm planning to make $regex and $replace into arrays, so I can also replace things like \textit{Vivamus} with <em>Vivamus</em>.

Any guidance would be MUCH welcomed and appreciated!


Solution

  • If you still want to do the conversion yourself, you can make multiple passes through the string, replacing the innermost elements first:

    $t = '\blockquote{\hspace*{.5em}Lorem ipsum dolor sit amet, consectetur
    adipiscing elit. Integer posuere erat a ante venenatis dapibus posuere
    velit aliquet. Aenean lacinia bibendum nulla sed consectetur. Aenean
    eu leo quam. Pellentesque ornare sem lacinia quam venenatis
    vestibulum. Sed posuere consectetur est at lobortis. \note{Integer
    posuere erat a ante venenatis dapibus posuere velit aliquet.
    \textit{Vivamus} sagittis lacus vel augue laoreet rutrum faucibus
    dolor auctor.}}';
    
    // Each callback receives the match array; $m[1] is the text between the braces.
    function hspace($m) { return "<br />"; }
    function textit($m) { return "<i>" . $m[1] . "</i>"; }
    function note($m) { return "<b>" . $m[1] . "</b>"; }
    function blockquote($m) { return "<quote>" . $m[1] . "</quote>"; }
    
    // Repeat until a full pass changes nothing. [^{}] only matches brace-free
    // content, so the innermost commands are converted first.
    while (true) {
      $newt = $t;
      $newt = preg_replace_callback("/\\\\hspace\\*\\{([^{}]*?)\\}/", "hspace", $newt);
      $newt = preg_replace_callback("/\\\\textit\\{([^{}]*?)\\}/", "textit", $newt);
      $newt = preg_replace_callback("/\\\\note\\{([^{}]*?)\\}/", "note", $newt);
      $newt = preg_replace_callback("/\\\\blockquote\\{([^{}]*?)\\}/", "blockquote", $newt);
    
      if ($newt === $t) break;
      $t = $newt;
    }
    
    echo $t;
    

    But of course, while this might work for simple examples, you cannot use this method to correctly parse the whole TeX format. It also gets very inefficient for longer inputs.
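
As an alternative to repeated passes, the following is a rough sketch that matches balanced braces in one pattern using PCRE recursion, so `\blockquote{...}` is matched up to its own closing brace rather than the first one. The function name `convert` and the `$tags` map are hypothetical, chosen only for illustration. It assumes braces are balanced and never escaped, and like the loop above it is not a real TeX parser.

```php
<?php
// Hypothetical command-to-tag map, for illustration only.
$tags = ['blockquote' => 'div', 'textit' => 'em'];

function convert(string $text, array $tags): string
{
    $names = implode('|', array_map('preg_quote', array_keys($tags)));
    // Group 1: command name. Group 2: one balanced {...} group, where (?2)
    // recurses into nested braces. Group 3: the content between the braces.
    $pattern = '/\\\\(' . $names . ')(\{((?:[^{}]++|(?2))*)\})/';

    $result = preg_replace_callback(
        $pattern,
        function (array $m) use ($tags): string {
            $html = $tags[$m[1]];
            // Convert any mapped commands nested inside the body as well.
            return "<$html>" . convert($m[3], $tags) . "</$html>";
        },
        $text
    );

    // preg_replace_callback returns null on a regex error; fall back to the input.
    return $result ?? $text;
}

$t = '\blockquote{\hspace*{.5em}Lorem \note{ipsum \textit{Vivamus} dolor.}}';
echo convert($t, $tags), PHP_EOL;
// prints: <div>\hspace*{.5em}Lorem \note{ipsum <em>Vivamus</em> dolor.}</div>
```

Commands that are not in the map, such as `\hspace*` and `\note` here, are left untouched, which matches the output asked for in the question.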