Tags: python, nlp, spacy, tokenize

Problems using the spaCy tokenizer with special characters


I'm new to spaCy and I'm trying to find some patterns in a text, but I'm having trouble because of the way tokenization works. For example, I created the following pattern to find percentage elements like "0,42%" using the Matcher (it's not exactly what I want, but I'm just practicing for now):

import spacy
from spacy.matcher import Matcher

nlp = spacy.load("pt_core_news_sm")
matcher = Matcher(nlp.vocab)

text = 'total: 1,80%:(comex 1,30% + deriv 0,50%/ativo: 1,17% '

# match a single token that looks like a percentage, e.g. "0,42%"
pattern_test = [{"TEXT": {"REGEX": "[0-9]+[,.]+[0-9]+[%]"}}]

text_ = nlp(text)

matcher.add("pattern test", [pattern_test])
result = matcher(text_)

for id_, beg, end in result:
    print(id_)
    print(text_[beg:end])

The thing is that it returns results like the ones below, because tokenization treats each of these spans as a single token:

9844711491635719110
1,80%:(comex
9844711491635719110
0,50%/ativo
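
Printing the token texts confirms that each of these spans really is a single token (a quick check on the same doc):

print([t.text for t in text_])
# '1,80%:(comex' and '0,50%/ativo' each come out as one token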

I tried using Python's .replace() method to replace the special characters with blank spaces before tokenizing, but now when I print the tokenization result it separates everything like this:

text_adjustment = text.replace(":", " ").replace("(", " ").replace(")", " ").replace("/", " ").replace(";", " ").replace("-", " ").replace("+", " ")

print([token for token in text_adjustment])

['t', 'o', 't', 'a', 'l', ' ', ' ', '1', ',', '8', '0', '%', ' ', ' ', 'c', 'o', 'm', 'e', 'x', ' ', '1', ',', '3', '0', '%', ' ', ' ', ' ', 'd', 'e', 'r', 'i', 'v', ' ', '0', ',', '5', '0', '%', ' ', 'a', 't', 'i', 'v', 'o', ' ', ' ', '1', ',', '1', '7', '%', ' ']

I would like the tokenization result to look like this:

['total', '1,80%', 'comex', '1,30%', 'deriv', '0,50%', 'ativo', '1,17%']

Is there a better way to do this? I'm using the 'pt_core_news_sm' model, but I can switch to another language if needed.

Thanks in advance :)


Solution

  • I suggest using

    import re
    # ...
    text = re.sub(r'(\S)([/:()])', r'\1 \2', text)
    pattern_test = [{"TEXT": {"REGEX": r"^\d+[,.]\d+$"}}, {"ORTH": "%"}]
    

    Here, the (\S)([/:()]) regex matches any non-whitespace character (capturing it into Group 1) followed by a /, :, ( or ) (capturing it into Group 2), and re.sub inserts a space between the two groups.

    The ^\d+[,.]\d+$ regex matches a token whose full text is a float-like value, and the % is matched as the next token (the tokenizer splits the number and the % into separate tokens).
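
    To see what the substitution does to the sample string (a quick check):

    import re
    text = 'total: 1,80%:(comex 1,30% + deriv 0,50%/ativo: 1,17% '
    print(re.sub(r'(\S)([/:()])', r'\1 \2', text))
    # => total : 1,80% :(comex 1,30% + deriv 0,50% /ativo : 1,17%

    (After a single pass the ( in :(comex stays attached to the following word, but spaCy splits leading punctuation like : and ( off as prefixes anyway, so the matches are unaffected.)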

    Full Python code snippet:

    import spacy, re
    from spacy.matcher import Matcher
    
    #nlp = spacy.load("pt_core_news_sm")
    nlp = spacy.load("en_core_web_trf")
    matcher = Matcher(nlp.vocab)
    text = 'total: 1,80%:(comex 1,30% + deriv 0,50%/ativo: 1,17% '
    text = re.sub(r'(\S)([/:()])', r'\1 \2', text)
    pattern_test = [{"TEXT": {"REGEX": r"^\d+[,.]\d+$"}}, {"ORTH": "%"}]
    text_ = nlp(text)

    matcher.add("pattern test", [pattern_test])
    result = matcher(text_)
    
    for id_, beg, end in result:
        print(id_)
        print(text_[beg:end])
    

    Output:

    9844711491635719110
    1,80%
    9844711491635719110
    1,30%
    9844711491635719110
    0,50%
    9844711491635719110
    1,17%
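
    A different route, for completeness, is to customize the tokenizer itself instead of preprocessing the string, by registering these characters as extra infix split points. This is only a sketch built on spaCy's documented tokenizer hooks; exactly where % ends up (fused to the number or split off as its own token) depends on which rules fire for each chunk, so the matcher below allows both forms:

    import spacy
    from spacy.matcher import Matcher
    from spacy.util import compile_infix_regex

    nlp = spacy.load("pt_core_news_sm")
    # add :, /, ( and ) as infix split points on top of the defaults
    infixes = list(nlp.Defaults.infixes) + [r"[:/()]"]
    nlp.tokenizer.infix_finditer = compile_infix_regex(infixes).finditer

    matcher = Matcher(nlp.vocab)
    matcher.add("pattern test", [
        [{"TEXT": {"REGEX": r"^\d+[,.]\d+$"}}, {"ORTH": "%"}],  # e.g. "1,30" + "%"
        [{"TEXT": {"REGEX": r"^\d+[,.]\d+%$"}}],                # e.g. "1,80%" as one token
    ])

    doc = nlp('total: 1,80%:(comex 1,30% + deriv 0,50%/ativo: 1,17% ')
    for match_id, beg, end in matcher(doc):
        print(doc[beg:end])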