
Have read_html read cell content and tooltip (text bubble) separately, instead of concatenating them


This page shows a tooltip when hovering over a value in the "Score" or "XP LVL" column.

It appears that read_html concatenates the cell content and the tooltip. Splitting them in post-processing is not always straightforward, so I am looking for a way to have read_html handle them separately, possibly returning them as two columns.

This is how the first row appears online:

(Rank)#  Name        Score  XP LVL  Victories / Total  Victory Ratio
1        Rainin☆☆☆☆  6129   447     408 / 531          76%
  • where "Score"'s "6129" carries tooltip "Max6129"
  • where, more annoyingly, "XP LVL"'s "447" carries tooltip "21173534 pts"

This is how it appears after reading:

pd.read_html('https://stats.gladiabots.com/pantheon?', header=0, flavor="html5lib")[0]

        #            Name         Score           XP LVL Victories / Total  \
0       1      Rainin☆☆☆☆  6129Max 6129  44721173534 pts         408 / 531   

Note that "44721173534 pts" is the concatenation of "447" and "21173534 pts". "XP LVL" values have a variable number of digits, so splitting the string in post-processing would take some care, and I would like to explore the "let read_html do the split" option first.

(The special flavor="html5lib" was added because the page is dynamically-generated)

I have not found any mention of tooltips in the docs.


Solution

  • This happens because pandas reads the .text attribute of each <td> bs4.element.Tag object, and .text concatenates (without any separator) the text of everything nested inside the tag.

    In the first row of the table, the Score cell contains two pieces of text, 6129 and Max 6129, hence the concatenation.

    <td nowrap="" class="barContainer">
      <div class="scoreBar" style="width: 100%;"></div>
      <div class="maxScoreBar" style="width: 0%;"></div>
      <span class="barLabel tooltipable">
        "6129"
        <span class="tooltip">
          "Max 6129"
        </span>
      </span>
    </td>
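
The difference between .text and get_text can be reproduced on this one cell. The sketch below is only an illustration: it assumes the bs4 package is installed, uses the standard-library "html.parser" rather than html5lib, and compacts the whitespace of the markup above. The variable names are made up for the example.

```python
from bs4 import BeautifulSoup

# The "Score" cell of the first row, whitespace compacted.
html = """
<td nowrap="" class="barContainer">
  <div class="scoreBar" style="width: 100%;"></div>
  <div class="maxScoreBar" style="width: 0%;"></div>
  <span class="barLabel tooltipable">6129<span class="tooltip">Max 6129</span></span>
</td>
"""

td = BeautifulSoup(html, "html.parser").td

# .text glues all nested text together with no separator.
plain = td.text.strip()                             # "6129Max 6129"

# get_text can put a separator between the pieces instead.
separated = td.get_text(separator="_", strip=True)  # "6129_Max 6129"

# The tooltip can also be read on its own and then removed from the tree,
# which leaves only the visible value in the cell.
tooltip = td.find("span", class_="tooltip")
tooltip_text = tooltip.get_text(strip=True)         # "Max 6129"
tooltip.extract()
value_text = td.get_text(strip=True)                # "6129"

print(plain, separated, tooltip_text, value_text, sep=" | ")
```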
    

    A quick/hacky solution is to override the _text_getter method of the parser class used by pandas, replacing .text with get_text, which accepts a separator parameter:

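    import pandas as pd

    def _text_getter(self, obj):
        # Join the cell's text pieces with "_" (an arbitrary choice) instead of nothing
        return obj.get_text(separator="_", strip=True)

    # Monkeypatch the private parser class that read_html uses for flavor="html5lib"
    pd.io.html._BeautifulSoupHtml5LibFrameParser._text_getter = _text_getter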
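
Since the assignment above changes read_html for the whole session, one possible refinement is to apply it only temporarily. The following is a minimal sketch and not part of pandas: text_separator is a hypothetical helper name, and it relies on the same private class, _BeautifulSoupHtml5LibFrameParser, which may change between pandas versions.

```python
from contextlib import contextmanager

import pandas as pd
import pandas.io.html as pd_html


@contextmanager
def text_separator(sep="_"):
    """Temporarily make read_html join a cell's text pieces with `sep`.

    Hypothetical helper: it patches a private pandas class and restores
    the original method when the `with` block exits.
    """
    parser_cls = pd_html._BeautifulSoupHtml5LibFrameParser
    original = parser_cls._text_getter

    def _text_getter(self, obj):
        return obj.get_text(separator=sep, strip=True)

    parser_cls._text_getter = _text_getter
    try:
        yield
    finally:
        # Put the stock behaviour back, even if read_html raised.
        parser_cls._text_getter = original


# Possible usage (needs network access plus bs4 and html5lib installed):
# with text_separator("_"):
#     df = pd.read_html("https://stats.gladiabots.com/pantheon?",
#                       header=0, flavor="html5lib")[0]
```

Outside the with block, read_html behaves as usual, so other tables read in the same session are not affected.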
    

    With this modification, read_html returns this DataFrame (df):

            #            Name          Score            XP LVL Victories / Total Victory_Ratio
    0       1      Rainin☆☆☆☆  6129_Max 6129  447_21173534 pts         408 / 531           76%
    1       2        ZM_XL☆☆☆  5888_Max 6025  344_15942978 pts       3685 / 6748           54%
    2       3     UzuraGames☆  5555_Max 5586   119_4688941 pts        610 / 1109           55%
    ..    ...             ...            ...               ...               ...           ...
    997   998          Tekuma  3183_Max 3460     27_370585 pts         151 / 304           49%
    998   999            hemi  3183_Max 3227      10_49432 pts           29 / 62           46%
    999  1000  wanna bet kid?  3183_Max 3304      13_85777 pts           51 / 95           53%
    
    [1000 rows x 6 columns]
    

    You can then split the two affected columns into their separate values:

    scores = df.pop("Score").str.extract(r"(?P<Score>\d+)_Max (?P<Max>\d+)")
    xplvls = df.pop("XP LVL").str.extract(r"(?P<XPLVL>\d+)_(?P<PTS>\d+)")
    
    out = pd.concat([df, scores, xplvls], axis=1)
    

    Output:

    print(out) # with only `scores` and `xplvls`
    
        Score   Max XPLVL       PTS
    0    6129  6129   447  21173534
    1    5888  6025   344  15942978
    2    5555  5586   119   4688941
    ..    ...   ...   ...       ...
    997  3183  3460    27    370585
    998  3183  3227    10     49432
    999  3183  3304    13     85777
    
    [1000 rows x 4 columns]
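
str.extract returns strings, so the new columns may need a numeric conversion before sorting or arithmetic. A small self-contained sketch, assuming the "_" separator used above and using three rows copied from the output as sample data (the real df comes from read_html):

```python
import pandas as pd

# Sample rows copied from the read_html output shown earlier.
df = pd.DataFrame({
    "Name": ["Rainin☆☆☆☆", "ZM_XL☆☆☆", "UzuraGames☆"],
    "Score": ["6129_Max 6129", "5888_Max 6025", "5555_Max 5586"],
    "XP LVL": ["447_21173534 pts", "344_15942978 pts", "119_4688941 pts"],
})

# Named groups become the column names of the extracted frames.
scores = df.pop("Score").str.extract(r"(?P<Score>\d+)_Max (?P<Max>\d+)")
xplvls = df.pop("XP LVL").str.extract(r"(?P<XPLVL>\d+)_(?P<PTS>\d+)")

# Convert the extracted digit strings to integers before joining them back.
out = pd.concat([df, scores.astype(int), xplvls.astype(int)], axis=1)

print(out.dtypes)
print(out.sort_values("PTS", ascending=False))
```

If some cells might not match the pattern, str.extract yields NaN there and astype(int) would raise; applying pd.to_numeric column by column is an option in that case.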