regex, awk, sed

Bash: Search for consecutive repetitions of a pattern and replace them with a string containing the number of repetitions


I used pandoc to convert a .docx file to .tex. The original file was a fill-in-the-blank document whose blanks were created by repeating the _ character.

In the .tex file, pandoc converted each underscore literally to \_. However, the rendered underscores have small gaps between them, and overall the blanks are too long.

I'd like to find strings like \_\_\_ (three repetitions of \_) and replace them with a TeX command like \rule[-0.1ex]{3em}{0.5pt}. In general, if N is the number of repetitions, the replacement would be \rule[-0.1ex]{N em}{0.5pt}.

Since the blanks vary in size, I need to match runs of any length. I read about groups in sed, but couldn't figure out how to use them here. I'm not proficient at regex at all and am somewhat overwhelmed by the cryptic regex patterns I have found so far...

Adding requested information:

Here is a possible input:

this is some text
\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_

there is some more text

even more text here
\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_.

\hfill\par

text text text
\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_.

\hfill\par

\textbf{Teilmenge}

Some text here: \_\_\_\_\_\_\_\_\_\_, and more text as well    \_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_.

Here is the expected output:

this is some text
\rule[-0.1ex]{42em}{0.5pt}

there is some more text

even more text here
\rule[-0.1ex]{24em}{0.5pt}.

\hfill\par

text text text
\rule[-0.1ex]{37em}{0.5pt}.

\hfill\par

\textbf{Teilmenge}

Some text here: \rule[-0.1ex]{10em}{0.5pt}, and more text as well \rule[-0.1ex]{31em}{0.5pt}.

I don't have a working command, but I'm imagining something along the lines of:

echo "$TEST" | sed 's/([\\\_]+)/\rule[-0.1ex]{length(\1) em}{0.5pt}/'

Solution

  • sed is not the right tool for this, because the replacement text of its s command cannot count how many times the pattern repeated. I suggest using an awk solution like this:

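    awk '{
    for (i=1; i<=NF; ++i)
       if ($i ~ /\\_/) {
          # gsub deletes every \_ from the field and returns how many it deleted
          n = gsub(/\\_/, "", $i)
          # whatever is left in $i (e.g. a trailing "." or ",") follows the rule
          $i = "\\rule[-0.1ex]{" n "em}{0.5pt}" $i
       }
    } 1' file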
    
    this is some text
    \rule[-0.1ex]{42em}{0.5pt}
    
    there is some more text
    
    even more text here
    \rule[-0.1ex]{24em}{0.5pt}.
    
    \hfill\par
    
    text text text
    \rule[-0.1ex]{37em}{0.5pt}.
    
    \hfill\par
    
    \textbf{Teilmenge}
    
    Some text here: \rule[-0.1ex]{10em}{0.5pt}, and more text as well \rule[-0.1ex]{31em}{0.5pt}.
    

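    Note that the gsub function returns the number of replacements it made, and we use that number as the value placed before em in the \rule command. Because gsub also deletes the \_ sequences from the field, anything left over (such as a trailing period or comma) is appended after the rule.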
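
The field-by-field approach above rebuilds every modified line with single spaces between fields. It also assumes that each whitespace-separated field holds at most one blank, with no text in front of it. As a hedged alternative sketch (not part of the original answer), the loop below uses match() to locate each run of \_ in place, so the surrounding text and spacing stay untouched. The function name blanks_to_rules is made up for illustration, and the input is assumed to arrive on stdin.

```shell
# Hypothetical helper (name chosen for this sketch); reads .tex text on stdin.
blanks_to_rules() {
  awk '{
    out = ""
    rest = $0
    # match() finds the leftmost, longest run of \_ and sets RSTART and RLENGTH
    while (match(rest, /(\\_)+/)) {
      pre  = substr(rest, 1, RSTART - 1)      # text before the run
      run  = substr(rest, RSTART, RLENGTH)    # the run of \_ itself
      rest = substr(rest, RSTART + RLENGTH)   # text still to be scanned
      n = gsub(/\\_/, "", run)                # n = number of \_ in this run
      out = out pre "\\rule[-0.1ex]{" n "em}{0.5pt}"
    }
    print out rest
  }'
}

printf '%s\n' 'Some text: \_\_\_, and    more\_\_.' | blanks_to_rules
```

For the sample line this prints `Some text: \rule[-0.1ex]{3em}{0.5pt}, and    more\rule[-0.1ex]{2em}{0.5pt}.`, keeping the four spaces and placing the second rule after "more". To process a file, something like `blanks_to_rules < file.tex > fixed.tex` should work.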