Tags: machine-learning, nlp, tags, training-data, tagging

Tagging keywords of varying length in sentence order


Hi, I am trying to tag the words in a sentence in order. For example, this is my initial method:

Sentence: Work across a wide range of related areas
Label:    Tag    O    O O    O     O  Tag     Tag

But now I need it to treat a two-word keyword as a single unit and give both words one label, like this:

Sentence: Work across a wide range of related areas
Label:    Tag    O    O O    O     O  Tag     

I have a list of keywords of varying length and their tags. How can I tag a sentence this way while keeping the labels in sentence order?


Solution

  • It looks like you are looking for the BIO tagging scheme (assuming I understood you correctly and you are working with a manually tagged corpus).

    BIO denotes the following: B is the beginning of a chunk, I is inside a chunk, and O is a token outside any chunk.

    Step 1: give each token a BIO tag and, separately, its label.

    Sentence: Work     across  a  wide  range  of  related  areas
    Tag:      B        O       O  O     O      O   B        I
    Label:    Label_1  O       O  O     O      O   Label_2  Label_2
    

    Step 2: prefix each BIO tag to its label.

    Sentence: Work       across  a  wide  range  of  related    areas
    Label:    B-Label_1  O       O  O     O      O   B-Label_2  I-Label_2
    

    Once your corpus is tagged, align the list of tokens (list #1) with the list of combined tag-and-label entries (list #2), where each BIO tag is prefixed to its label, e.g., [..., related, areas] and [..., B-Label_2, I-Label_2]. Because a B tag followed by I tags with the same label marks a single chunk, you can merge [B-Label_2, I-Label_2] into one Label_2 that covers "related areas". At the very end, strip the B- and I- prefixes. Expect some further intermediate steps and post-processing along the way.
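
If the tagging is driven by a keyword list rather than done by hand, the two steps and the final merge could be sketched roughly as follows. This is only an illustration under some assumptions that are not in the question: tokens come from splitting on whitespace, matching is case-insensitive, the longest keyword wins when several start at the same word, and the names `bio_tag`, `merge_bio`, and the sample `keywords` dictionary are hypothetical.

```python
from typing import Dict, List, Optional, Tuple


def bio_tag(tokens: List[str], keywords: Dict[str, str]) -> List[str]:
    """Assign B-/I-/O tags to tokens by matching keyword phrases, longest first."""
    # Assumption: keywords are whitespace-separated phrases, matched case-insensitively.
    phrases = {tuple(k.lower().split()): label for k, label in keywords.items()}
    max_len = max((len(p) for p in phrases), default=0)
    lowered = [t.lower() for t in tokens]

    tags = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        matched = False
        # Try the longest candidate span first so a two-word keyword
        # is preferred over a one-word keyword starting at the same token.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            label = phrases.get(tuple(lowered[i:i + n]))
            if label is not None:
                tags[i] = f"B-{label}"
                for j in range(i + 1, i + n):
                    tags[j] = f"I-{label}"
                i += n
                matched = True
                break
        if not matched:
            i += 1
    return tags


def merge_bio(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Collapse each B-/I- run into one (phrase, label) pair and strip the prefixes."""
    merged: List[Tuple[str, str]] = []
    current_label: Optional[str] = None
    for token, tag in zip(tokens, tags):
        if tag.startswith("I-") and merged and current_label == tag[2:]:
            # Continue the chunk opened by the preceding B- tag.
            phrase, label = merged[-1]
            merged[-1] = (f"{phrase} {token}", label)
        elif tag == "O":
            merged.append((token, "O"))
            current_label = None
        else:
            # A B- tag opens a new chunk; a stray I- tag is treated the same way.
            merged.append((token, tag[2:]))
            current_label = tag[2:]
    return merged


# Hypothetical keyword list for the sentence from the question.
keywords = {"work": "Label_1", "related areas": "Label_2"}
tokens = "Work across a wide range of related areas".split()
tags = bio_tag(tokens, keywords)
print(tags)
print(merge_bio(tokens, tags))
```

The merged output has one entry per chunk, so "related areas" appears once with a single Label_2, which is the layout asked for in the question. Real text would likely need a proper tokenizer and a rule for overlapping keywords.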