How to Tokenize A Text By Regular Expression In Python


Is there a way to strip whitespace, dots, and commas from a text using only regular expressions, without NLTK?


Solution

  • If I have understood your question correctly, you can try this code:

    import re
    
    text = "Split.this,text in seven.separate,words"
    
    # Split on any single whitespace character, dot, or comma.
    myexp = re.compile(r'[\s.,]')
    
    print(myexp.split(text))
    

    That gives you this output:

    ['Split', 'this', 'text', 'in', 'seven', 'separate', 'words']
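
One caveat with the pattern above: it splits on each single delimiter, so adjacent delimiters (a comma followed by a space, or a trailing dot) produce empty strings in the result. For example, splitting "Hello, world." gives ['Hello', '', 'world', '']. A minimal sketch of one way around this, assuming you only want the non-empty tokens, is to match the tokens themselves with re.findall instead of splitting. The function name tokenize is a hypothetical helper, not part of the original answer.

```python
import re

# Hypothetical helper (name chosen for illustration only).
def tokenize(text):
    # Match each run of characters that are NOT whitespace, dots, or commas.
    # Because we collect matches rather than split, no empty strings appear.
    return re.findall(r"[^\s.,]+", text)

print(tokenize("Hello, world. Split  this"))
# ['Hello', 'world', 'Split', 'this']
```

On the original sample text this returns the same seven words as the split-based version, since that string never has two delimiters in a row.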