Tags: python, import, python-re

Unique-word dictionary: remove special characters and numbers


I want to build a dictionary of unique words from a book, but I have run into a problem.

import re

with open('vechny.txt', encoding='utf-8') as fname:
    text = fname.read()

# split() only splits on whitespace, so punctuation stays attached to words
lst = list(set(text.split()))
str1 = ' '.join(lst)
with open("1000.txt", "a", encoding='utf-8') as out:
    print(str1, file=out)



with open("1000.txt", "r", encoding='utf-8') as in_file:
    lines = in_file.read().split(' ')

# write one word per line
with open("file.txt", "w", encoding='utf-8') as out_file:
    out_file.write("\n".join(lines))

This script works well, but I also need to remove special characters (such as , . - etc.) from the plain text.

For example, the text contains the word "Hay,". split() treats it as a single word, so the comma is not removed.

How can I transform the text like this?

Input:
Hay, hello,% lost. 15 čas řad
Desired output:
hay hello lost cas rad

Solution

  • What about this?

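    import re
    str1 = '#@-/abcüšščřžý'
    # Match runs of letters and digits that contain at least one letter,
    # so pure numbers and punctuation are skipped.
    # re.UNICODE is already the default for str patterns in Python 3.
    r = re.findall(r'\b\d*[^\W\d_][^\W_]*\b', str1, re.UNICODE)
    str2 = ' '.join(r)
    print(str2)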
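
    The desired output in the question is also lowercased and has its diacritics removed (čas becomes cas), which a regex alone does not do. One possible sketch, not necessarily what the asker ended up using: decompose the text with unicodedata, drop the combining marks, lowercase it, and keep only runs of letters. The helper names strip_diacritics and unique_words are made up for this illustration.

```python
import re
import unicodedata


def strip_diacritics(text):
    """Decompose accented letters and drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def unique_words(text):
    """Return lowercase, accent-free words in order of first appearance."""
    cleaned = strip_diacritics(text).lower()
    # [^\W\d_]+ matches runs of letters only: no digits, underscores or punctuation
    words = re.findall(r"[^\W\d_]+", cleaned)
    # dict.fromkeys removes duplicates while keeping first-seen order
    return list(dict.fromkeys(words))


print(" ".join(unique_words("Hay, hello,% lost. 15 čas řad")))
# prints: hay hello lost cas rad
```

    Letters that have no canonical decomposition (for example ł or ø) pass through unchanged, so they would need an explicit replacement table if the book contains them.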