Tags: python, web-scraping, nlp, strip

How to remove all kinds of line breaks or formatting from strings in Python


I know the classic way of dealing with line breaks, tabs, etc. is to use .strip() or .replace('\n', ''). But sometimes there are special cases in which these methods fail, e.g.

         'H\xf6cke\n\n:\n\nDie'.strip()

  gives: 'H\xf6cke\n\n:\n\nDie'

How can I catch these rare cases, which would otherwise have to be covered one by one (e.g. by .replace('*', ''))? The above is just one example I came across.


Solution

  • In [1]: import re
    
    In [2]: text = 'H\xf6cke\n\n:\n\nDie'
    
    In [3]: re.sub(r'\s+', '', text)
    Out[3]: 'Höcke:Die'
    

    From the Python re documentation:

    \s:

    Matches Unicode whitespace characters (which includes [ \t\n\r\f\v], and also many other characters, for example the non-breaking spaces mandated by typography rules in many languages). If the ASCII flag is used, only [ \t\n\r\f\v] is matched (but the flag affects the entire regular expression, so in such cases using an explicit [ \t\n\r\f\v] may be a better choice).

    +:

    Causes the resulting RE to match 1 or more repetitions of the preceding RE.
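
As a rough sketch of how this plays out (assuming Python 3; the variable names are only illustrative), the snippet below contrasts .strip(), which only trims the two ends of a string, with re.sub on \s+, and shows the effect of the ASCII flag on a non-breaking space:

```python
import re

text = 'H\xf6cke\n\n:\n\nDie'

# str.strip() only trims leading and trailing whitespace,
# so the newlines in the middle of the string are left alone.
stripped = text.strip()

# Remove every run of whitespace entirely.
removed = re.sub(r'\s+', '', text)

# Alternatively, collapse every run of whitespace to a single space.
collapsed = re.sub(r'\s+', ' ', text).strip()

# Same result without re: split() with no argument splits on whitespace runs.
joined = ' '.join(text.split())

# With re.ASCII, \s no longer matches the non-breaking space U+00A0.
nbsp_text = 'a\u00a0b'
unicode_result = re.sub(r'\s+', '', nbsp_text)
ascii_result = re.sub(r'\s+', '', nbsp_text, flags=re.ASCII)

print(repr(stripped))
print(repr(removed))
print(repr(collapsed))
print(repr(unicode_result), repr(ascii_result))
```

Whether to delete whitespace outright or collapse it to a single space depends on the text: deleting it joins neighbouring words together, which is usually not what you want for scraped prose.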