Search code examples
pythonpython-3.xpandasdataframecategories

How to get categories of words containing unique 3-letter set from the columns of pandas dataframe in Python?


I have a dataframe df which looks as

     Unnamed: 0 Characters Split    A   B   C   D   Set Names
0   FROKDUWJU   [FRO, KDU, WJU] FRO KDU WJU NaN {WJU, KDU, FRO}
1   IDJWPZSUR   [IDJ, WPZ, SUR] IDJ WPZ SUR NaN {SUR, WPZ, IDJ}
2   UCFURKIRODCQ    [UCF, URK, IRO, DCQ]    UCF URK IRO DCQ {UCF, URK, DCQ, IRO}
3   ORI [ORI]   ORI NaN NaN NaN {ORI}
4   PROIRKIQARTIBPO [PRO, IRK, IQA, RTI, BPO]   PRO IRK IQA RTI {IQA, BPO, PRO, IRK, RTI}
5   QAZWREDCQIBR    [QAZ, WRE, DCQ, IBR]    QAZ WRE DCQ IBR {DCQ, QAZ, IBR, WRE}
6   PLPRUFSWURKI    [PLP, RUF, SWU, RKI]    PLP RUF SWU RKI {PLP, SWU, RKI, RUF}
7   FROIEUSKIKIR    [FRO, IEU, SKI, KIR]    FRO IEU SKI KIR {SKI, IEU, KIR, FRO}
8   ORIUWJZSRFRO    [ORI, UWJ, ZSR, FRO]    ORI UWJ ZSR FRO {UWJ, ORI, ZSR, FRO}
9   URKIFJVUR   [URK, IFJ, VUR] URK IFJ VUR NaN {URK, VUR, IFJ}
10  RUFOFR  [RUF, OFR]  RUF OFR NaN NaN {OFR, RUF}
11  IEU [IEU]   IEU NaN NaN NaN {IEU}
12  PIMIEU  [PIM, IEU]  PIM IEU NaN NaN {PIM, IEU}

                                

The first column contains certain names. The Characters Split column contains the name split into every 3 letters in the form of a list. Columns A, B, C, and D contain the breakdown of those 3-letters. Column Set Names have the same 3-letters but in the form of a set.

Some of the 3-letters are common in different names. For example: "FRO" is present in name in index 0, 7 and 8. For these names which have one 3-letter set in common, I'd like to put them into one category, perferably in the form of list. Is it possible to have these categories for each unique 3-letter set? What would be the suitable way to do it?

df.to_dict() is as shown:

{'Unnamed: 0': {0: 'FROKDUWJU',
  1: 'IDJWPZSUR',
  2: 'UCFURKIRODCQ',
  3: 'ORI',
  4: 'PROIRKIQARTIBPO',
  5: 'QAZWREDCQIBR',
  6: 'PLPRUFSWURKI',
  7: 'FROIEUSKIKIR',
  8: 'ORIUWJZSRFRO',
  9: 'URKIFJVUR',
  10: 'RUFOFR',
  11: 'IEU',
  12: 'PIMIEU'},
 'Characters Split': {0: ['FRO', 'KDU', 'WJU'],
  1: ['IDJ', 'WPZ', 'SUR'],
  2: ['UCF', 'URK', 'IRO', 'DCQ'],
  3: ['ORI'],
  4: ['PRO', 'IRK', 'IQA', 'RTI', 'BPO'],
  5: ['QAZ', 'WRE', 'DCQ', 'IBR'],
  6: ['PLP', 'RUF', 'SWU', 'RKI'],
  7: ['FRO', 'IEU', 'SKI', 'KIR'],
  8: ['ORI', 'UWJ', 'ZSR', 'FRO'],
  9: ['URK', 'IFJ', 'VUR'],
  10: ['RUF', 'OFR'],
  11: ['IEU'],
  12: ['PIM', 'IEU']},
 'A': {0: 'FRO',
  1: 'IDJ',
  2: 'UCF',
  3: 'ORI',
  4: 'PRO',
  5: 'QAZ',
  6: 'PLP',
  7: 'FRO',
  8: 'ORI',
  9: 'URK',
  10: 'RUF',
  11: 'IEU',
  12: 'PIM'},
 'B': {0: 'KDU',
  1: 'WPZ',
  2: 'URK',
  3: nan,
  4: 'IRK',
  5: 'WRE',
  6: 'RUF',
  7: 'IEU',
  8: 'UWJ',
  9: 'IFJ',
  10: 'OFR',
  11: nan,
  12: 'IEU'},
 'C': {0: 'WJU',
  1: 'SUR',
  2: 'IRO',
  3: nan,
  4: 'IQA',
  5: 'DCQ',
  6: 'SWU',
  7: 'SKI',
  8: 'ZSR',
  9: 'VUR',
  10: nan,
  11: nan,
  12: nan},
 'D': {0: nan,
  1: nan,
  2: 'DCQ',
  3: nan,
  4: 'RTI',
  5: 'IBR',
  6: 'RKI',
  7: 'KIR',
  8: 'FRO',
  9: nan,
  10: nan,
  11: nan,
  12: nan},
 'Set Names': {0: {'FRO', 'KDU', 'WJU'},
  1: {'IDJ', 'SUR', 'WPZ'},
  2: {'DCQ', 'IRO', 'UCF', 'URK'},
  3: {'ORI'},
  4: {'BPO', 'IQA', 'IRK', 'PRO', 'RTI'},
  5: {'DCQ', 'IBR', 'QAZ', 'WRE'},
  6: {'PLP', 'RKI', 'RUF', 'SWU'},
  7: {'FRO', 'IEU', 'KIR', 'SKI'},
  8: {'FRO', 'ORI', 'UWJ', 'ZSR'},
  9: {'IFJ', 'URK', 'VUR'},
  10: {'OFR', 'RUF'},
  11: {'IEU'},
  12: {'IEU', 'PIM'}}}

Solution

  • You can explode 'Set Names', then groupby the exploded columns and merge the 'Unnamed: 0' into a list per group:

    (df.explode('Set Names')
       .groupby('Set Names')
       ['Unnamed: 0'].apply(list)
    )
    

    output:

    Set Names
    BPO                          [PROIRKIQARTIBPO]
    DCQ               [UCFURKIRODCQ, QAZWREDCQIBR]
    FRO    [FROKDUWJU, FROIEUSKIKIR, ORIUWJZSRFRO]
    IBR                             [QAZWREDCQIBR]
    IDJ                                [IDJWPZSUR]
    ...                                        ...
    WJU                                [FROKDUWJU]
    WPZ                                [IDJWPZSUR]
    WRE                             [QAZWREDCQIBR]
    ZSR                             [ORIUWJZSRFRO]
    

    If you want to filter the output to have a minimal number of items per group (here > 1):

    (df.explode('Set Names')
       .groupby('Set Names')
       ['Unnamed: 0'].apply(lambda g: list(g) if len(g) > 1 else None)
       .dropna()
    )
    

    output:

    Set Names
    DCQ               [UCFURKIRODCQ, QAZWREDCQIBR]
    FRO    [FROKDUWJU, FROIEUSKIKIR, ORIUWJZSRFRO]
    IEU                [FROIEUSKIKIR, IEU, PIMIEU]
    ORI                        [ORI, ORIUWJZSRFRO]
    RUF                     [PLPRUFSWURKI, RUFOFR]
    URK                  [UCFURKIRODCQ, URKIFJVUR]