I converted a .pdf file into .txt using Python. It is fairly easy to "clean" the text by removing special characters or other characters that I don't want, but I have an interesting problem that I haven't managed to solve other than by hand.
The text is in German and some words are broken into syllables at the end of a line (they were probably like that in the original .pdf). So I end up with text like:
Das ist die Belastung eines Grundstücks mit der Haftung für bestimmte, in der Regel wiederkeh-
rende Leistungen des jeweiligen Grundeigentümers.
It is not a good idea to just delete the hyphens, because sometimes they are meaningful, as in "Verkehrs- und Tarifverbund Stuttgart".
Is there any way to avoid doing it manually? It happens in almost every sentence.
If the word was split because it was too long and fell at the end of a line, you should be able to just remove "-\n" (replace it with "").
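A minimal Python sketch of that replacement, using the sample text from the question (the variable names are only illustrative):

```python
# Sample text from the question, with a word broken across a line break.
text = (
    "in der Regel wiederkeh-\n"
    "rende Leistungen des jeweiligen Grundeigentümers.\n"
    "Verkehrs- und Tarifverbund Stuttgart"
)

# Remove every hyphen that is immediately followed by a newline,
# which joins the two halves of the broken word.
joined = text.replace("-\n", "")
print(joined)
```

The hyphen in "Verkehrs- und" is followed by a space rather than a newline, so it is left alone.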
If your document uses some other special character to indicate the end of a line, you need to replace \n with that instead.
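If you are not sure which line ending the file uses, a regular expression can cover the common ones in a single pass. The sketch below is only one possible approach: the helper name dehyphenate is made up for this example, and the rule "only join when the next line starts with a lowercase letter" is an added heuristic, not something the file format guarantees.

```python
import re

# A hyphen, optional trailing spaces or tabs, then a Windows (\r\n),
# old Mac (\r) or Unix (\n) line break. The lookahead requires the next
# line to start with a lowercase letter (assumed heuristic), so that
# compounds such as "Hals-" / "Nasen-Ohren" keep their hyphen.
_BROKEN_WORD = re.compile(r"-[ \t]*(?:\r\n|\r|\n)(?=[a-zäöüß])")

def dehyphenate(text: str) -> str:
    """Join words split across a line break (hypothetical helper)."""
    return _BROKEN_WORD.sub("", text)

sample = "wiederkeh-\r\nrende Leistungen\r\nHals-\r\nNasen-Ohren"
print(dehyphenate(sample))
```

One limitation to keep in mind: if a line happens to end in "Verkehrs-" and the next line starts with "und", this heuristic would join them incorrectly, so such cases still need a manual check or an extra exception for words like "und" and "oder".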