python regex nlp text-extraction

Get the indices of numbers in a string and extract words before and after the number (in different languages)


I tried regex and it finds the numbers, but I am not getting the index of the entire number; instead I get the index of only its first character.

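import re

text = "४०० pounds of wheat at $ 3 per pound"
# \d matches any Unicode decimal digit, so the Devanagari number is found too
numero = re.finditer(r"(\d+)", text)
op = re.findall(r"(\d+)", text)

# m.start() is the character offset of the first digit of each match
indices = [m.start() for m in numero]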
OUTPUT

[0, 25]

Expected OUTPUT
[0, 6]

Once I have the correct indices stored in a list, I think it would be easier to extract the neighbouring words. Does that sound right?

Also, the numbers can appear at different positions in the text, so a static approach will not work.


Solution

  • You tagged the question with the nlp tag, and it is an NLP question, so why not use spaCy?

    See a Python demo with spaCy 3.0.1:

    import spacy
    nlp = spacy.load("en_core_web_trf")
    text = "४०० pounds of wheat at $ 3 per pound"
    doc = nlp(text)
    print([(token.text, token.i) for token in doc if token.is_alpha])
    ## => [('pounds', 1), ('of', 2), ('wheat', 3), ('at', 4), ('per', 7), ('pound', 8)]
    print([(token.text, token.i) for token in doc if token.like_num])
    ## => [('४००', 0), ('3', 6)]
    

    Here,

    • nlp is the spaCy pipeline loaded from the English transformer model, en_core_web_trf
    • doc is the spaCy document created from your text variable
    • [(token.text, token.i) for token in doc if token.is_alpha] gets you a list of alphabetic tokens with their text (token.text) and their positions in the document (token.i)
    • [(token.text, token.i) for token in doc if token.like_num] fetches the list of number-like tokens with their positions inside the document. Note that token.i is a token index, not a character offset, which is why 3 is reported at position 6 rather than 25.
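
If installing a spaCy model is not an option, the same word positions can be approximated with the standard library alone. The following is only a minimal sketch, not part of the answer above: it assumes tokens are separated by whitespace (so it will not split punctuation the way spaCy does), and the helper name numbers_with_context is hypothetical. It reports the word index, the full character span of each number, and the words immediately before and after it.

```python
import re

def numbers_with_context(text):
    """Return one record per all-digit token (hypothetical helper)."""
    # Whitespace tokenization, keeping each token's (start, end) character span.
    tokens = [(m.group(), m.span()) for m in re.finditer(r"\S+", text)]
    records = []
    for i, (tok, span) in enumerate(tokens):
        # In Python 3, \d matches any Unicode decimal digit,
        # so Devanagari digits such as ४०० qualify as well.
        if re.fullmatch(r"\d+", tok):
            records.append({
                "number": tok,
                "word_index": i,      # position among whitespace-separated tokens
                "char_span": span,    # start and end offsets of the whole number
                "before": tokens[i - 1][0] if i > 0 else None,
                "after": tokens[i + 1][0] if i + 1 < len(tokens) else None,
            })
    return records

text = "४०० pounds of wheat at $ 3 per pound"
for rec in numbers_with_context(text):
    print(rec)
```

For the sample sentence the word indices come out as 0 and 6, matching the expected output in the question. Because neighbours are looked up relative to each match, the numbers can sit anywhere in the text.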