Tags: python, pandas, dataframe, grouping

How to group strings in one column by the strings in another column that contains NaN values?


Background
I fetched a table from a source on the internet (see Mordred Molecular Descriptors) for a machine learning project.
The code I used to fetch that table is listed below:

from io import StringIO

import requests
import pandas as pd
from bs4 import BeautifulSoup

# Fetch the HTML content of the webpage
url = "https://mordred-descriptor.github.io/documentation/master/descriptors.html"
html = requests.get(url).content

# Parse the HTML content using BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')

# Find the table element in the HTML
table = soup.find('table')

# Convert the table into a Pandas dataframe
# (wrapping the HTML in StringIO avoids the deprecation warning that
# newer pandas versions raise when given a literal HTML string)
df = pd.read_html(StringIO(str(table)))[0]

# Drop the columns that are not needed and print the resulting dataframe
df = df.drop(['#', 'constructor', 'dim', 'description'], axis=1)
print(df)

After running the above code in Python 3, I get this dataframe.

Now I want to group the names in the "name" column by their corresponding modules in the "module" column.

The problem is that the fetched table is already laid out like a pivot table: each module appears only on the first row of its group, and the remaining rows of the "module" column are NaN. Ideally, I would like to generate a dictionary whose keys are the modules and whose values are lists of the grouped names.

Example:
dict_df = {'ABCIndex': ['ABC','ABCGG'], 'AcidBase': ['nAcid', 'nBase'], ..., 'ZagrebIndex': ['Zagreb1', 'Zagreb2', 'mZagreb1', 'mZagreb2']}

I have tried grouping the names by module using .groupby() in pandas. However, the rows where the module is NaN are dropped, so each dictionary value is a list with a single name: the name from the row where the module was not NaN.
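The behaviour described above can be reproduced on a small stand-in table. This is only a sketch: the dataframe `toy` below is a hypothetical miniature of the scraped table (built by hand, not fetched from the Mordred page), and it assumes the columns are called `module` and `name`.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the scraped table: the module is given only on
# the first row of each group, the following rows hold NaN.
toy = pd.DataFrame({
    "module": ["ABCIndex", np.nan, "AcidBase", np.nan],
    "name": ["ABC", "ABCGG", "nAcid", "nBase"],
})

# groupby() drops rows whose key is NaN by default (dropna=True),
# so only the first name of each module ends up in the result.
naive = toy.groupby("module")["name"].agg(list).to_dict()
print(naive)  # {'ABCIndex': ['ABC'], 'AcidBase': ['nAcid']}
```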

Thank you for your time and assistance.


Solution

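  • If I understand correctly, you want something like this: forward-fill the module column with ffill, then group by it and aggregate the names into lists with agg(list).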

    df.groupby(df['module'].ffill())['name'].agg(list)
    

    Output:

    module
    ABCIndex                                                           [ABC, ABCGG]
    AcidBase                                                         [nAcid, nBase]
    AdjacencyMatrix               [SpAbs_A, SpMax_A, SpDiam_A, SpAD_A, SpMAD_A, ...
    Aromatic                                                 [nAromAtom, nAromBond]
    AtomCount                     [nAtom, nHeavyAtom, nSpiro, nBridgehead, nHete...
    Autocorrelation               [ATS0dv, ATS1dv, ATS2dv, ATS3dv, ATS4dv, ATS5d...
    BCUT                          [BCUTc-1h, BCUTc-1l, BCUTdv-1h, BCUTdv-1l, BCU...
    BalabanJ                                                             [BalabanJ]
    BaryszMatrix                  [SpAbs_DzZ, SpMax_DzZ, SpDiam_DzZ, SpAD_DzZ, S...
    BertzCT                                                               [BertzCT]
    BondCount                     [nBonds, nBondsO, nBondsS, nBondsD, nBondsT, n...
    CPSA                          [PNSA1, PNSA2, PNSA3, PNSA4, PNSA5, PPSA1, PPS...
    CarbonTypes                   [C1SP1, C2SP1, C1SP2, C2SP2, C3SP2, C1SP3, C2S...
    Chi                           [Xch-3d, Xch-4d, Xch-5d, Xch-6d, Xch-7d, Xch-3...
    Constitutional                [SZ, Sm, Sv, Sse, Spe, Sare, Sp, Si, MZ, Mm, M...
    DetourMatrix                  [SpAbs_Dt, SpMax_Dt, SpDiam_Dt, SpAD_Dt, SpMAD...
    DistanceMatrix                [SpAbs_D, SpMax_D, SpDiam_D, SpAD_D, SpMAD_D, ...
    EState                        [NsLi, NssBe, NssssBe, NssBH, NsssB, NssssB, N...
    EccentricConnectivityIndex                                            [ECIndex]
    ExtendedTopochemicalAtom      [ETA_alpha, AETA_alpha, ETA_shape_p, ETA_shape...
    FragmentComplexity                                                    [fragCpx]
    Framework                                                                 [fMF]
    GeometricalIndex              [GeomDiameter, GeomRadius, GeomShapeIndex, Geo...
    GravitationalIndex                                 [GRAV, GRAVH, GRAVp, GRAVHp]
    HydrogenBond                                                   [nHBAcc, nHBDon]
    InformationContent            [IC0, IC1, IC2, IC3, IC4, IC5, TIC0, TIC1, TIC...
    KappaShapeIndex                                           [Kier1, Kier2, Kier3]
    Lipinski                                                [Lipinski, GhoseFilter]
    LogS                                                             [FilterItLogS]
    McGowanVolume                                                        [VMcGowan]
    MoRSE                         [Mor01, Mor02, Mor03, Mor04, Mor05, Mor06, Mor...
    MoeType                       [LabuteASA, PEOE_VSA1, PEOE_VSA2, PEOE_VSA3, P...
    MolecularDistanceEdge         [MDEC-11, MDEC-12, MDEC-13, MDEC-14, MDEC-22, ...
    MolecularId                   [MID, AMID, MID_h, AMID_h, MID_C, AMID_C, MID_...
    MomentOfInertia                                        [MOMI-X, MOMI-Y, MOMI-Z]
    PBF                                                                       [PBF]
    PathCount                     [MPC2, MPC3, MPC4, MPC5, MPC6, MPC7, MPC8, MPC...
    Polarizability                                                     [apol, bpol]
    RingCount                     [nRing, n3Ring, n4Ring, n5Ring, n6Ring, n7Ring...
    RotatableBond                                                  [nRot, RotRatio]
    SLogP                                                              [SLogP, SMR]
    TopoPSA                                                  [TopoPSA(NO), TopoPSA]
    TopologicalCharge             [GGI1, GGI2, GGI3, GGI4, GGI5, GGI6, GGI7, GGI...
    TopologicalIndex              [Diameter, Radius, TopoShapeIndex, PetitjeanIn...
    VdwVolumeABC                                                             [Vabc]
    VertexAdjacencyInformation                                            [VAdjMat]
    WalkCount                     [MWC01, MWC02, MWC03, MWC04, MWC05, MWC06, MWC...
    Weight                                                                [MW, AMW]
    WienerIndex                                                       [WPath, WPol]
    ZagrebIndex                              [Zagreb1, Zagreb2, mZagreb1, mZagreb2]
    Name: name, dtype: object
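Since the question asks for a dictionary rather than a Series, the result can presumably be converted with `.to_dict()`. The sketch below shows the same ffill, groupby, agg(list) chain on a hypothetical hand-built dataframe `toy` (a stand-in for the scraped table, assuming the same `module` and `name` columns), followed by that conversion.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the scraped table: NaN marks rows that belong
# to the module named on the closest non-NaN row above.
toy = pd.DataFrame({
    "module": ["ABCIndex", np.nan, "AcidBase", np.nan],
    "name": ["ABC", "ABCGG", "nAcid", "nBase"],
})

# Forward-fill propagates each module name down over the NaN rows below it.
filled = toy["module"].ffill()

# Group the names by the filled module column and collect them into lists.
grouped = toy.groupby(filled)["name"].agg(list)

# Convert the resulting Series into a plain dictionary.
dict_df = grouped.to_dict()
print(dict_df)  # {'ABCIndex': ['ABC', 'ABCGG'], 'AcidBase': ['nAcid', 'nBase']}
```

Forward-filling only the grouping key, instead of overwriting `df['module']`, leaves the original dataframe unchanged.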