I have a huge corpus of text (one entry per line), and I want to remove special characters while preserving the spaces and structure of each string. For example:
hello? there A-Z-R_T(,**), world, welcome to python.
this **should? the next line#followed- by@ an#other %million^ %%like $this.
should become
hello there A Z R T world welcome to python
this should be the next line followed by another million like this
You can also do this with a regex that replaces every run of non-alphanumeric characters with a single space:
import re
a = '''hello? there A-Z-R_T(,**), world, welcome to python.
this **should? the next line#followed- by@ an#other %million^ %%like $this.'''
for k in a.split("\n"):
    # Replace each run of non-alphanumeric characters with one space
    print(re.sub(r"[^a-zA-Z0-9]+", ' ', k))
    # Or:
    # final = " ".join(re.findall(r"[a-zA-Z0-9]+", k))
    # print(final)
hello there A Z R T world welcome to python
this should the next line followed by an other million like this
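The two variants above are not quite identical: when a line ends in punctuation, `re.sub` leaves a trailing space, while the `findall`/`join` version does not. A small sketch comparing them on the first sample line (the variable names `substituted` and `joined` are only illustrative):

```python
import re

line = "hello? there A-Z-R_T(,**), world, welcome to python."

# Replace every run of non-alphanumeric characters with one space.
substituted = re.sub(r"[^a-zA-Z0-9]+", " ", line)

# Collect the alphanumeric runs and join them with single spaces.
joined = " ".join(re.findall(r"[a-zA-Z0-9]+", line))

print(repr(substituted))  # 'hello there A Z R T world welcome to python '
print(repr(joined))       # 'hello there A Z R T world welcome to python'
```

If the trailing space matters, calling `.strip()` on the `re.sub` result should give the same string as the `join` version.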
Alternatively, you can collect the cleaned lines in a list:
final = [re.sub(r"[^a-zA-Z0-9]+", ' ', k) for k in a.split("\n")]
['hello there A Z R T world welcome to python ', 'this should the next line followed by an other million like this ']
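Since the question mentions a huge corpus, building one big list may not be ideal. A minimal sketch of a streaming variant follows, assuming the corpus can be read line by line; `clean_lines` and `NON_ALNUM` are hypothetical names, and `io.StringIO` only stands in for a real file object here:

```python
import io
import re

# Compile once; the same pattern is reused for every line of the corpus.
NON_ALNUM = re.compile(r"[^a-zA-Z0-9]+")


def clean_lines(lines):
    """Yield each line with non-alphanumeric runs collapsed to one space."""
    for line in lines:
        # strip() drops the space left by trailing punctuation and the newline.
        yield NON_ALNUM.sub(" ", line).strip()


# Stand-in for a real file, e.g. one opened with open(path, encoding="utf-8").
corpus = io.StringIO(
    "hello? there A-Z-R_T(,**), world, welcome to python.\n"
    "this **should? the next line#followed- by@ an#other %million^ %%like $this.\n"
)

cleaned = list(clean_lines(corpus))
for line in cleaned:
    print(line)
```

Because `clean_lines` is a generator, it can be consumed one line at a time instead of being wrapped in `list(...)`. Note that the character class is ASCII-only, so accented letters would also be replaced by spaces.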