
Modify python nltk.word_tokenize to exclude "#" as delimiter


I am using Python's NLTK library to tokenize my sentences.

If my code is

text = "C# billion dollars; we don't own an ounce C++"
print nltk.word_tokenize(text)

I get this as my output

['C', '#', 'billion', 'dollars', ';', 'we', 'do', "n't", 'own', 'an', 'ounce', 'C++']

Punctuation symbols such as ;, ., and # are treated as delimiters. Is there a way to remove # from the set of delimiters, the same way + isn't a delimiter and C++ therefore appears as a single token?

I want my output to be

['C#', 'billion', 'dollars', ';', 'we', 'do', "n't", 'own', 'an', 'ounce', 'C++']

I want C# to be considered as one token.


Solution

  • Another idea: instead of altering how the text is tokenized, just loop over the tokens afterwards and join every '#' with the token that precedes it.

    txt = "C# billion dollars; we don't own an ounce C++"
    tokens = word_tokenize(txt)
    
    i_offset = 0
    for i, t in enumerate(tokens):
        i -= i_offset
        if t == '#' and i > 0:
            left = tokens[:i-1]
            joined = [tokens[i - 1] + t]
            right = tokens[i + 1:]
            tokens = left + joined + right
            i_offset += 1
    
    >>> tokens
    ['C#', 'billion', 'dollars', ';', 'we', 'do', "n't", 'own', 'an', 'ounce', 'C++']
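
  • If you need this in more than one place, the same merge can be wrapped in a small helper. This is a minimal sketch of that idea; the function name merge_hash_tokens is my own, not part of NLTK:

    from nltk.tokenize import word_tokenize

    def merge_hash_tokens(tokens):
        """Join every '#' token onto the token that precedes it."""
        merged = []
        for t in tokens:
            if t == '#' and merged:
                merged[-1] += t      # e.g. 'C' + '#' -> 'C#'
            else:
                merged.append(t)
        return merged

    >>> merge_hash_tokens(word_tokenize("C# billion dollars; we don't own an ounce C++"))
    ['C#', 'billion', 'dollars', ';', 'we', 'do', "n't", 'own', 'an', 'ounce', 'C++']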