Properly unwrap long text with Python


I'm trying to find a way to properly unwrap long text in Python. What I mean by "properly" is that only single occurrences of \r\n should be replaced by a space, and any sequence of two or more should be retained. For instance:

text = "The 44th Chess Olympiad was an international\r\nteam chess event organised by the International Chess Federation (FIDE)\r\nin Chennai, India from 28 July to 10 August 2022.\r\n\r\nIt consisted of Open and Women's\r\ntournaments, as well as\r\nseveral events to promote chess.\r\n\r\nThe Olympiad was initially supposed to take place\r\nin Khanty-Mansiysk, Russia,\r\nthe host of the Chess World Cup 2019,\r\nin August 2020, but it was later moved to Moscow.\r\n\r\nHowever, it was postponed due to the COVID-19 pandemic\r\nand then relocated to Chennai."

\r\n\r\n should be retained, but \r\n should be replaced by a space.

I'm really bad with regular expressions, which are often the best-performing option, so any hint would be much appreciated!


Solution

  • Python offers other options besides the 're' module, but if you want to use a regular expression, you could try this one:

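    import re

    def unwrap_text(text):
        # Replace each isolated \r\n with a space. The lookbehind and the
        # lookahead together skip any \r\n that is preceded or followed by
        # another \r\n, so runs of two or more are kept unchanged.
        unwrapped_text = re.sub(r'(?<!\r\n)\r\n(?!\r\n)', ' ', text)
        return unwrapped_text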
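
  • As one of the options that does not rely on 're', here is a minimal sketch that uses only str.split and str.join. The function name unwrap_text_no_regex and the newline parameter are hypothetical, chosen for illustration. It assumes the rule from the question: a \r\n that has no other \r\n directly before or after it becomes a space, and every other \r\n is kept.

```python
def unwrap_text_no_regex(text, newline="\r\n"):
    # Splitting on the line break yields empty strings wherever two
    # breaks are directly adjacent, e.g. "a\r\n\r\nb" -> ["a", "", "b"].
    parts = text.split(newline)
    last = len(parts) - 1
    out = [parts[0]]
    for i in range(1, len(parts)):
        # The break between parts[i - 1] and parts[i] touches an earlier
        # break if parts[i - 1] is empty and is not the very first part.
        prev_adjacent = i - 1 > 0 and parts[i - 1] == ""
        # It touches a later break if parts[i] is empty and is not the
        # very last part.
        next_adjacent = i < last and parts[i] == ""
        # Keep breaks that belong to a run; turn isolated ones into a space.
        out.append(newline if (prev_adjacent or next_adjacent) else " ")
        out.append(parts[i])
    return "".join(out)


sample = "first line\r\nsecond line\r\n\r\nnew paragraph\r\ncontinues"
print(repr(unwrap_text_no_regex(sample)))
```

    Whether this is faster or slower than a regular expression depends on the input, so it is worth timing both on real data before choosing.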