python text punctuation

Keeping punctuation as its own unit in Preprocessed Text


What is the code to split a sentence into a list of its constituent words AND punctuation marks? Most text preprocessing tools tend to remove punctuation.

For example, if I enter this:

"Punctuations to be included as its own unit."

The desired output would be:

result = ['Punctuations', 'to', 'be', 'included', 'as', 'its', 'own', 'unit', '.']

Many thanks!


Solution

  • You might want to consider using the Natural Language Toolkit (NLTK). Its word_tokenize function keeps each punctuation mark as a separate token.

    Try this:

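    import nltk

    # word_tokenize needs the Punkt tokenizer data, downloaded once.
    # Older NLTK releases use 'punkt'; newer ones look for 'punkt_tab'.
    for resource in ("punkt", "punkt_tab"):
        nltk.download(resource, quiet=True)

    sentence = "Punctuations to be included as its own unit."
    tokens = nltk.word_tokenize(sentence)  # punctuation becomes its own token
    print(tokens)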
    

    Output: ['Punctuations', 'to', 'be', 'included', 'as', 'its', 'own', 'unit', '.']
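
If installing NLTK is not an option, a regular expression from the standard library can give the same result for simple sentences like the one in the question. This is only a minimal sketch, not what NLTK does internally, and the function name split_words_and_punctuation is made up for illustration.

```python
import re

def split_words_and_punctuation(text):
    """Return words and punctuation marks as separate tokens."""
    # \w+     matches a run of letters, digits, or underscores (a "word").
    # [^\w\s] matches one character that is neither a word character
    #         nor whitespace, i.e. a single punctuation mark.
    return re.findall(r"\w+|[^\w\s]", text)

sentence = "Punctuations to be included as its own unit."
print(split_words_and_punctuation(sentence))
# ['Punctuations', 'to', 'be', 'included', 'as', 'its', 'own', 'unit', '.']
```

Note that this pattern splits on every punctuation character, so a contraction such as "don't" comes out as 'don', "'", 't'. NLTK's tokenizer handles such cases differently, so prefer it when the text contains contractions or abbreviations.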