ascii whitespace multilingual non-ascii-characters

Is there a compilation somewhere of all kinds of whitespace, both ASCII and non-ASCII?


I am working with documents from different sources (and in different languages), and I am having a lot of trouble with differing definitions of whitespace.

For instance, '\xa0' does not belong to this list on Wikipedia's Whitespace page.

I want to replace all of them with ' '. For instance,

text = re.sub(r'\xa0', ' ', text)

Solution

  • U+00A0 is on that Wikipedia page you linked to, in the Unicode list.

    I'd say that Unicode.org has the definitive list: http://unicode.org/cldr/utility/list-unicodeset.jsp?a=%5Cp%7Bwhitespace%7D
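
If it helps, here is a rough sketch of how that list could be turned into a single replacement in Python. The character class is my own transcription of the code points carrying the Unicode White_Space property, so it is worth checking against the Unicode.org page for the Unicode version you care about. The names WHITESPACE_RE and normalize_whitespace are just placeholders I made up.

```python
import re

# Code points with the Unicode White_Space property, transcribed by hand
# (assumption: verify against the Unicode.org list for your Unicode version).
WHITESPACE_RE = re.compile(
    "["
    "\u0009-\u000d"   # tab, line feed, vertical tab, form feed, carriage return
    "\u0020"          # space
    "\u0085"          # next line
    "\u00a0"          # no-break space
    "\u1680"          # ogham space mark
    "\u2000-\u200a"   # en quad through hair space
    "\u2028\u2029"    # line separator, paragraph separator
    "\u202f"          # narrow no-break space
    "\u205f"          # medium mathematical space
    "\u3000"          # ideographic space
    "]"
)


def normalize_whitespace(text):
    """Replace every character matched by WHITESPACE_RE with a plain space."""
    return WHITESPACE_RE.sub(" ", text)


text = normalize_whitespace("a\xa0b\u3000c")
```

Note that this also turns tabs and newlines into spaces, which may or may not be what you want. Characters such as U+200B (zero width space) are not in the White_Space set, so they are left alone and would need handling separately.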