python text data-cleaning

Correcting words broken into syllables in a text


I converted a .pdf file to .txt using Python. It is fairly easy to "clean" the text by removing special characters or other characters I don't want. However, I have an interesting problem that I haven't been able to solve other than manually.

The text is in German, and some words are broken into syllables across line breaks (they were probably like that in the original .pdf). So I have stuff like:

Das ist die Belastung eines Grundstücks mit der Haftung für bestimmte, in der Regel wiederkeh-
rende Leistungen des jeweiligen Grundeigentümers.

It is not a good idea to just delete all hyphens, because sometimes they are meaningful, such as in Verkehrs- und Tarifverbund Stuttgart.

Is there any way to avoid doing it manually? It happens in almost every sentence.


Solution

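  • If a word was split because it was too long to fit at the end of a line, you should be able to just remove "-\n" (replace it with the empty string ""). A hyphen that is followed by a space, as in "Verkehrs- und", does not match and stays in place.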

    If your document marks the end of a line with some other character or sequence (for example "\r\n"), replace that instead of "\n".
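
The replacement described above could be sketched roughly as follows. The function name `join_hyphenated_line_breaks` and the sample string are made up for illustration. Tolerating trailing spaces and "\r\n" line endings is an extra assumption, not part of the original suggestion.

```python
import re


def join_hyphenated_line_breaks(text: str) -> str:
    """Remove a hyphen that sits directly before a line break.

    Assumption: a hyphen at the very end of a line marks a word that was
    split for layout reasons. Optional trailing spaces/tabs and Windows
    line endings (\r\n) are tolerated as well.
    """
    return re.sub(r"-[ \t]*\r?\n", "", text)


# Hypothetical sample based on the text from the question.
sample = (
    "in der Regel wiederkeh-\n"
    "rende Leistungen des jeweiligen Grundeigentümers.\n"
    "Verkehrs- und Tarifverbund Stuttgart"
)

cleaned = join_hyphenated_line_breaks(sample)
print(cleaned)
```

Note that this is only a heuristic: if a legitimate hyphen happens to fall at the end of a line (for example "Verkehrs-" with "und" starting the next line), it will be joined as well, so a spot check of the result is probably still worthwhile.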