regex csv sed latex multiline

Match multiple regex groups on different lines from a tex to print into csv


I have a Beamer LaTeX file in which some frames have the form

\frame{\frametitle{Title01}
Sub01\\
\begin{tabular}{|p{7cm}|}
\hline
\rowcolor{black}\\
\rowcolor{white}\\
\rowcolor{green}\\
\hline
\end{tabular}
}

I would like to end up with a csv format like

Title01,Sub01,black,white,green
Title02,Sub02,red,white,blue

So far I have managed to get all the titles with

sed -rn 's/^.*frametitle\{(.*)\}/\1,/pm' f.tex

I am failing to match the second group, Sub01 (for now together with the LaTeX line break \\), on the next line. Here is a small selection of what I have tried so far:

sed -rn 's/^.*frametitle\{(.*)\}\n(.*)$/\1,\2/mp' f.tex
sed -rn 's/^.*frametitle\{(.*)\}$^(.*)$/\1,\2/mp' f.tex
sed -rn 's/^.*frametitle\{(.*)(\}\n)(.*)$/\1,\3/mp' f.tex
sed -rn 's/^.*frametitle\{(.*)\}\n(.*)\n/\1,\2/mp' f.tex

All of these match either just the title or nothing at all.
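
For context, sed normally holds a single input line in its pattern space, so a `\n` in the pattern has nothing to match unless a command such as `N` first appends the following line. The sketch below illustrates that idea only; it assumes GNU sed and that the subtitle always sits on the line directly after the title. The file names `title_sub.tex` and `title_sub.csv` are hypothetical.

```shell
# Hypothetical two-line sample in the format shown in the question.
printf '%s\n' '\frame{\frametitle{Title01}' 'Sub01\\' > title_sub.tex

# N appends the next input line to the pattern space, separated by \n,
# so a single substitution can capture the title and the subtitle together.
# [^\]* keeps the subtitle text up to the first backslash.
sed -n '/frametitle{/{N;s/^.*frametitle{\(.*\)}\n\([^\]*\).*/\1,\2/p;}' title_sub.tex > title_sub.csv
cat title_sub.csv
```

With the sample above this prints `Title01,Sub01`. It does not yet collect the row colours.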


Solution

  • This might work for you (GNU sed):

    sed -n '/^\\frame{\\frametitle{\(.*\)}.*/{s//\1/;h;n;s/\([^\]*\).*/\1/;H;:a;n;/^\\rowcolor{\(.*\)}.*/{s//\1/;H};/^}/!ba;g;s/\n/,/gp}' file
    

    This is a filtering job, so use the -n option to only print what you want.

    The required data lies between a line starting with \frame{\frametitle{...} and a line starting with }.

    Using these criteria, copy each required piece into the hold space (h for the title, H for the subtitle and each row colour) and, on reaching the closing line, replace the pattern space with the collected data (g).

    The data will be delimited by newlines, so replace these by commas and print out the result.
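
The steps above may be easier to follow with the one-liner split into a script file, one command per line. This is a sketch under assumptions: it relies on GNU sed, the file names `f.tex`, `frames.sed` and `frames.csv` are only illustrative, and the second frame in the sample input is invented to mirror the format and desired output shown in the question.

```shell
# Sample input: the first frame is from the question, the second is a
# hypothetical frame written in the same format.
cat > f.tex <<'EOF'
\frame{\frametitle{Title01}
Sub01\\
\begin{tabular}{|p{7cm}|}
\hline
\rowcolor{black}\\
\rowcolor{white}\\
\rowcolor{green}\\
\hline
\end{tabular}
}

\frame{\frametitle{Title02}
Sub02\\
\begin{tabular}{|p{7cm}|}
\hline
\rowcolor{red}\\
\rowcolor{white}\\
\rowcolor{blue}\\
\hline
\end{tabular}
}
EOF

# The same commands as the one-liner, with comments.
cat > frames.sed <<'EOF'
/^\\frame{\\frametitle{\(.*\)}.*/{
  # An empty regex reuses the last one matched: keep only the title.
  s//\1/
  # h overwrites the hold space with the title.
  h
  # n reads the next line; keep the text before the first backslash.
  n
  s/\([^\]*\).*/\1/
  # H appends a newline plus the subtitle to the hold space.
  H
  :a
  n
  # Append the colour name of every \rowcolor{...} line.
  /^\\rowcolor{\(.*\)}.*/{
    s//\1/
    H
  }
  # Loop until a line starting with } closes the frame.
  /^}/!ba
  # g copies the hold space back; join the pieces with commas and print.
  g
  s/\n/,/gp
}
EOF

sed -n -f frames.sed f.tex > frames.csv
cat frames.csv
```

With the sample input this prints `Title01,Sub01,black,white,green` followed by `Title02,Sub02,red,white,blue`, matching the desired CSV from the question.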