Tags: python-3.x, regex, string, escaping, unicode-escapes

Remove all escape sequences from list of strings


I'm playing around with pokebase, a Python wrapper for PokéAPI, and some of the API responses contain \n, \x0c, etc. In the end I don't need them, but I don't want to loop through every character to remove them, and .replace doesn't seem sustainable either (also because I think it would lead to problems).

This is a sample list of strings: https://pastebin.com/SbhR50br

["The female's horn\ndevelops slowly.\nPrefers physical\x0cattacks such as\nclawing and\nbiting.", 'When resting deep\nin its burrow, its\nthorns always\x0cretract.\nThis is proof that\nit is relaxed.', 'When feeding its\nyoung, it first\nchews and tender\xad\x0cizes the food,\nthen spits it out\nfor the offspring.', 'It has a calm and\ncaring nature.\nBecause its horn\x0cgrows slowly, it\nprefers not to\nfight.', 'It has a docile\nnature. If it is\nthreatened with\x0cattack, it raises\nthe barbs that are\nall over its body.', 'When NIDORINA are with their friends or\nfamily, they keep their barbs tucked\naway to prevent hurting each other.\x0cThis POKéMON appears to become\nnervous if separated from the others.', 'When it is with its friends or\nfamily, its barbs are tucked away to\nprevent injury. It appears to become\nnervous if separated from the others.', 'The female has a gentle temperament.\nIt emits ultrasonic cries that have the\npower to befuddle foes.', 'The female’s horns develop slowly.\nPrefers physical attacks such as clawing\nand biting.', 'When it senses danger, it raises\nall the barbs on its body. These\nbarbs grow slower than NIDORINO’s.', 'When feeding its young, it first\nchews the food into a paste, then\nspits it out for the offspring.', 'It has a calm and caring nature.\nBecause its horn grows slowly, it\nprefers not to fight.', 'When it senses danger, it raises\nall the barbs on its body. These\nbarbs grow slower than Nidorino’s.', 'The female has a gentle temperament.\nIt emits ultrasonic cries that have the power\nto befuddle foes.', 'When feeding its young, it first chews the food into\na paste, then spits it out for the offspring.', 'When Nidorina are with their friends or family, they keep their\nbarbs tucked away to prevent hurting each other.\nThis Pokémon appears to become nervous if separated from\nthe others.', 'When Nidorina are with their friends or family, they keep\ntheir barbs tucked away to prevent hurting each other.\nThis Pokémon appears to become nervous if separated\nfrom the others.']
flavor = random.choice([listofstringshere])
#remove \ stuff from flavor here!
print(flavor)

I think I might be able to do something with regex, but that's just speculation.


Solution

  • You are most likely not facing an encoding problem as such: your original text data contains special Unicode characters that are not printable, which is why they show up as escape sequences in the list.

    For example,

    \xad is a soft hyphen (U+00AD), and I believe it is not needed in your case. Quoting from here:

    These are characters that mark places where a word could be split when fitting lines to a page. The idea is that the soft hyphen is invisible if the word doesn't need to be split, but printed the same as a U+2010 normal hyphen if it does.

    Since you don't care about rendering this text in a book with nicely flowing text, you're never going to hyphenate anything, so you just want to remove these characters.

    \x0c is a form feed (page break).

    \n is a newline. In your case I believe it is also only there to make the text look prettier, so you don't care about it either.

    So a full solution is to use re.sub (substitute/replace) in two steps:

    1. Remove \xad, or \xad followed by \x0c.
    2. Replace \x0c and \n with a space (' ').

    import re
    
    egstrings = ["The female's horn\ndevelops slowly.\nPrefers physical\x0cattacks such as\nclawing and\nbiting.", 
               'When resting deep\nin its burrow, its\nthorns always\x0cretract.\nThis is proof that\nit is relaxed.',
                "When feeding its\nyoung, it first\nchews and tender\xad\x0cizes the food,\nthen spits it out\nfor the offspring."]
    
    for flavor in egstrings:
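        # Remove each soft hyphen, along with any form feed(s) right after it
        flavor = re.sub('\xad\x0c*', '', flavor)
        # Replace each newline or form feed with a space. A character class
        # without '-' is used here, since '[\n-\x0c]' is a range that also
        # matches \x0b (vertical tab).
        print(re.sub('[\n\x0c]', ' ', flavor))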
    

    The female's horn develops slowly. Prefers physical attacks such as clawing and biting.

    When resting deep in its burrow, its thorns always retract. This is proof that it is relaxed.

    When feeding its young, it first chews and tenderizes the food, then spits it out for the offspring.
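
If you want to apply the same two steps to the whole list, they can be wrapped in a small helper. This is only a sketch: the name `clean_flavor` is made up for illustration and is not part of pokebase. It also makes two assumptions that go beyond the answer above: a soft hyphen followed by a newline is treated the same as one followed by a form feed, and runs of whitespace are collapsed into a single space.

```python
import re


def clean_flavor(text: str) -> str:
    """Hypothetical helper: strip layout characters from one flavor text."""
    # Remove each soft hyphen plus any newline/form feed directly after it,
    # so a word split as "tender\xad\x0cizes" is rejoined as "tenderizes".
    # (Also handling "\n" here is an assumption, not in the original answer.)
    text = re.sub(r"\xad[\n\x0c]*", "", text)
    # \s matches "\n" and "\x0c" (among other whitespace); collapse each
    # run of whitespace into a single space and trim the ends.
    return re.sub(r"\s+", " ", text).strip()


flavors = [
    "The female's horn\ndevelops slowly.\nPrefers physical\x0cattacks such as\nclawing and\nbiting.",
    "When feeding its\nyoung, it first\nchews and tender\xad\x0cizes the food,\nthen spits it out\nfor the offspring.",
]

cleaned = [clean_flavor(flavor) for flavor in flavors]
for flavor in cleaned:
    print(flavor)
```

You can then pass `cleaned` to `random.choice` instead of the raw list, so the cleanup happens once rather than on every pick.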