regex perl

Eliminate whitespace around single letters


I frequently receive PDFs that, when converted with pdftotext, contain whitespace between the letters of some arbitrary words:

This i s a n example t e x t that c o n t a i n s strange spaces.

For further automated processing (looking for specific words) I would like to remove all whitespace between "standalone" letters (single-letter words), so the result would look like this:

This isan example text that contains strange spaces.

I tried to achieve this with a simple Perl regex:

s/ (\w) (\w) / $1$2 /g

This of course does not work: each match consumes the space after its second letter, so the next match cannot start at that space, and the gap to the third letter is left in place:

This is a n example te x t that co n ta i ns strange spaces.
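
The failure can be reproduced with a short script. This is only a minimal sketch built from the sample sentence above; the variable names `$text` and `$attempt` are made up for illustration.

```perl
use strict;
use warnings;

# Sample sentence from the question.
my $text = 'This i s a n example t e x t that c o n t a i n s strange spaces.';

# The pattern requires a space on both sides of the letter pair.
# With /g the scan resumes after the previous match, so the trailing
# space of one match is no longer available as the leading space of
# the next one.
(my $attempt = $text) =~ s/ (\w) (\w) / $1$2 /g;

print "$attempt\n";
```

Run on the sample sentence, this prints the partially joined line shown above rather than the desired result.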

So I tried lookahead assertions, but failed to achieve anything (also because I did not find any example that uses them in a substitution).

As usual with PRE, my feeling is that there must be a very simple and elegant solution for this...


Solution

  • Just match a continuous series of single letters separated by spaces, then delete all spaces from the matched text using a nested substitution (the /e modifier evaluates the replacement as Perl code).

    s{\b ((\w\s)+\w) \b}{ my $s = $1; $s =~ s/ //g; $s }xge;
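
For reference, here is a minimal runnable sketch of that substitution applied to the sample sentence from the question. The sub names `join_spaced_letters` and `join_spaced_letters_lookaround` are hypothetical helpers, not part of the answer. The second sub is an assumed alternative that uses the lookaround assertions the question asks about: it deletes a space only when a single letter stands on each side of it.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the substitution from the answer:
# match a run of single letters separated by whitespace, then strip
# the spaces from the matched run inside the /e replacement code.
sub join_spaced_letters {
    my ($text) = @_;
    $text =~ s{\b ((\w\s)+\w) \b}{ my $s = $1; $s =~ s/ //g; $s }xge;
    return $text;
}

# Assumed alternative using lookarounds: the space itself is the only
# thing consumed, so adjacent gaps can all match in one /g pass.
sub join_spaced_letters_lookaround {
    my ($text) = @_;
    $text =~ s/(?<=\b\w) (?=\w\b)//g;
    return $text;
}

my $in = 'This i s a n example t e x t that c o n t a i n s strange spaces.';

print join_spaced_letters($in), "\n";
print join_spaced_letters_lookaround($in), "\n";
```

Both calls should print `This isan example text that contains strange spaces.` for this input. Two caveats apply to either variant: genuine one-letter words such as "a" or "I" are merged into a neighbouring run (hence "isan"), and `\w` also matches digits and underscores. In the first variant, `\s` matches tabs and newlines as well, while the inner `s/ //g` removes only plain spaces.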