
Reset a group of identifiers to a sequence of consecutive serial numbers in a Pandas dataframe column


I have generated three outputs from a dataframe, and I am trying to reset the sentence identifiers (Sentence_ID) so that they start from 1 in each output.

Example output:

Sentence_ID  Mention Tag
6388    Chailland   B-LOCATION
6388    ,   O
6388    Mayenne B-LOCATION

6389    poste   O
6389    de  O
6389    Goumois B-LOCATION
6389    (   I-LOCATION
6389    Doubs   I-LOCATION
6389    )   I-LOCATION
6389    .   O
        
6390    Pichet  B-PERSON
6390    (   O
6390    veuve   O
6390    )   O
6390    ,   O
6390    de  O
6390    Paris   B-LOCATION
6390    .   O
... continue

Expected output:

Sentence_ID  Mention Tag
1   Chailland   B-LOCATION
1   ,   O
1   Mayenne B-LOCATION

2   poste   O
2   de  O
2   Goumois B-LOCATION
2   (   I-LOCATION
2   Doubs   I-LOCATION
2   )   I-LOCATION
2   .   O
        
3   Pichet  B-PERSON
3   (   O
3   veuve   O
3   )   O
3   ,   O
3   de  O
3   Paris   B-LOCATION
3   .   O
... continue

I must be missing something, but I am not sure whether I should apply a counter on the Sentence_ID column (via groupby()) or use reset_index on this specific column to complete this task.

If anyone has a lead, thanks in advance.


Solution

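  • You can use pd.factorize to generate a new set of sequence numbers. It assigns an integer code to each distinct value in order of first appearance, starting from 0, so adding 1 makes the numbering start from 1: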

    df['Sentence_ID'] = pd.factorize(df['Sentence_ID'])[0] + 1
    

    Or, equivalently, use Series.factorize:

    df['Sentence_ID'] = df['Sentence_ID'].factorize()[0] + 1
    

    Result:

    print(df)
    
    
        Sentence_ID    Mention         Tag
    0             1  Chailland  B-LOCATION
    1             1          ,           O
    2             1    Mayenne  B-LOCATION
    3             2      poste           O
    4             2         de           O
    5             2    Goumois  B-LOCATION
    6             2          (  I-LOCATION
    7             2      Doubs  I-LOCATION
    8             2          )  I-LOCATION
    9             2          .           O
    10            3     Pichet    B-PERSON
    11            3          (           O
    12            3      veuve           O
    13            3          )           O
    14            3          ,           O
    15            3         de           O
    16            3      Paris  B-LOCATION
    17            3          .           O
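
For anyone who wants to try this end to end, here is a minimal self-contained sketch. The dataframe is a shortened copy of the sample in the question, and the names `renumbered` and `alt_ids` are only illustrative. Since the question mentions a groupby-based counter, the sketch also shows `groupby(..., sort=False).ngroup()`, which should give the same numbering here; this is an alternative I am assuming fits the use case, not part of the original answer.

```python
import pandas as pd

# Shortened copy of the sample data from the question (illustrative only).
df = pd.DataFrame({
    "Sentence_ID": [6388, 6388, 6388, 6389, 6389, 6390],
    "Mention": ["Chailland", ",", "Mayenne", "poste", "de", "Pichet"],
    "Tag": ["B-LOCATION", "O", "B-LOCATION", "O", "O", "B-PERSON"],
})

# factorize returns (codes, uniques); codes start at 0 and follow the
# order in which each distinct Sentence_ID first appears.
codes, uniques = pd.factorize(df["Sentence_ID"])
renumbered = df.assign(Sentence_ID=codes + 1)

# Alternative: number the groups themselves. sort=False keeps the order of
# first appearance instead of sorting the group keys.
alt_ids = df.groupby("Sentence_ID", sort=False).ngroup() + 1

print(renumbered)
```

To restart the numbering in each of the three outputs, the same line can be applied to each dataframe separately, since the codes are computed only from the values present in the column passed in.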