This site page has tooltips that appear when hovering over values in the "Score" and "XP LVL" columns.
It appears that read_html concatenates the cell content and the tooltip. Splitting them in post-processing is not always straightforward, so I am looking for a way to have read_html handle them separately, possibly returning them as two columns.
This is how the first row appears online:
(Rank)# | Name | Score | XP LVL | Victories / Total | Victory Ratio |
---|---|---|---|---|---|
1 | Rainin☆☆☆☆ | 6129 | 447 | 408 / 531 | 76% |
The "Score" value "6129" carries the tooltip "Max 6129".
The "XP LVL" value "447" carries the tooltip "21173534 pts".
This is how it appears after reading:
pd.read_html('https://stats.gladiabots.com/pantheon?', header=0, flavor="html5lib")[0]
# Name Score XP LVL Victories / Total \
0 1 Rainin☆☆☆☆ 6129Max 6129 44721173534 pts 408 / 531
Note that "44721173534 pts" is the concatenation of "447" and "21173534 pts". "XP LVL" values have a variable number of digits, so splitting the string in post-processing would require some cleverness, and I would like to explore the "let read_html do the split" option first.
(The flavor="html5lib" argument was added because the page is dynamically generated.)
I have not found any mention of tooltips in the docs.
It turns out that this is because pandas uses the .text attribute of the <td> bs4.element.Tag objects, which concatenates (without any separator) the text of everything nested inside the tag.
In the first row of the table, the Score cell contains two text nodes, 6129 and Max 6129, hence the concatenation.
<td nowrap="" class="barContainer">
<div class="scoreBar" style="width: 100%;"></div>
<div class="maxScoreBar" style="width: 0%;"></div>
<span class="barLabel tooltipable">
"6129"
<span class="tooltip">
"Max 6129"
</span>
</span>
</td>
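To see the concatenation in isolation, here is a minimal sketch that parses just this cell with BeautifulSoup. The markup is the snippet above with whitespace trimmed, and the "html.parser" backend is my own choice for the demo (pandas goes through html5lib here), so treat it as an illustration rather than exactly what pandas runs.

```python
from bs4 import BeautifulSoup

# The Score cell from the first row, as shown above (whitespace trimmed).
cell_html = (
    '<td nowrap="" class="barContainer">'
    '<div class="scoreBar" style="width: 100%;"></div>'
    '<div class="maxScoreBar" style="width: 0%;"></div>'
    '<span class="barLabel tooltipable">6129'
    '<span class="tooltip">Max 6129</span>'
    '</span></td>'
)

# "html.parser" is the standard-library backend, used here only to avoid extra installs.
td = BeautifulSoup(cell_html, "html.parser").td

joined = td.text                    # every nested string glued together, no separator
pieces = list(td.stripped_strings)  # the individual text nodes, whitespace stripped

print(repr(joined))  # '6129Max 6129'
print(pieces)        # ['6129', 'Max 6129']
```

The two entries in pieces are the visible value and the tooltip, which is why a separator between them is enough to tell them apart later.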
A quick/hacky solution is to override the _text_getter method of the parser used by pandas, replacing .text with get_text, which accepts a separator parameter:
def _text_getter(self, obj):
    # Join the cell's text fragments with "_" (an arbitrary choice) instead of nothing
    return obj.get_text(separator="_", strip=True)

pd.io.html._BeautifulSoupHtml5LibFrameParser._text_getter = _text_getter
With this modification, read_html gives this df:
# Name Score XP LVL Victories / Total Victory_Ratio
0 1 Rainin☆☆☆☆ 6129_Max 6129 447_21173534 pts 408 / 531 76%
1 2 ZM_XL☆☆☆ 5888_Max 6025 344_15942978 pts 3685 / 6748 54%
2 3 UzuraGames☆ 5555_Max 5586 119_4688941 pts 610 / 1109 55%
.. ... ... ... ... ... ...
997 998 Tekuma 3183_Max 3460 27_370585 pts 151 / 304 49%
998 999 hemi 3183_Max 3227 10_49432 pts 29 / 62 46%
999 1000 wanna bet kid? 3183_Max 3304 13_85777 pts 51 / 95 53%
[1000 rows x 6 columns]
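Since the override touches a private pandas class and stays in effect for the whole session, it may be worth scoping it. The following is only a sketch: cell_text_separator is a hypothetical helper name, the SAMPLE table is a tiny stand-in for the real page built from the first row's values, and it assumes bs4 and html5lib are installed.

```python
from contextlib import contextmanager
from io import StringIO

import pandas as pd
from pandas.io import html as pd_html


@contextmanager
def cell_text_separator(sep="_"):
    """Temporarily join a cell's text fragments with `sep` (hypothetical helper)."""
    parser = pd_html._BeautifulSoupHtml5LibFrameParser  # private pandas class
    original = parser._text_getter

    def _text_getter(self, obj):
        return obj.get_text(separator=sep, strip=True)

    parser._text_getter = _text_getter
    try:
        yield
    finally:
        parser._text_getter = original  # restore the default behaviour


# Tiny stand-in for the real page, reusing values from the first row.
SAMPLE = """
<table>
  <thead><tr><th>Name</th><th>Score</th></tr></thead>
  <tbody><tr>
    <td>Rainin</td>
    <td><span class="barLabel tooltipable">6129<span class="tooltip">Max 6129</span></span></td>
  </tr></tbody>
</table>
"""

with cell_text_separator("_"):
    patched = pd.read_html(StringIO(SAMPLE), flavor="html5lib")[0]

# Outside the block the parser is back to its default behaviour.
default = pd.read_html(StringIO(SAMPLE), flavor="html5lib")[0]

print(patched.loc[0, "Score"])  # 6129_Max 6129
print(default.loc[0, "Score"])  # 6129Max 6129
```

Because _text_getter is not part of the public API, this could break in a future pandas release, so it is safer to keep the patched region as small as possible.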
With the separator in place, you can extract the values of the two affected columns:
scores = df.pop("Score").str.extract(r"(?P<Score>\d+)_Max (?P<Max>\d+)")
xplvls = df.pop("XP LVL").str.extract(r"(?P<XPLVL>\d+)_(?P<PTS>\d+)")
out = pd.concat([df, scores, xplvls], axis=1)
Output:
print(out) # with only `scores` and `xplvls`
Score Max XPLVL PTS
0 6129 6129 447 21173534
1 5888 6025 344 15942978
2 5555 5586 119 4688941
.. ... ... ... ...
997 3183 3460 27 370585
998 3183 3227 10 49432
999 3183 3304 13 85777
[1000 rows x 4 columns]
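Note that str.extract returns strings, so the new columns are not numeric yet. Here is a self-contained sketch of the same split on the first three rows shown earlier, with an added astype(int) step. That conversion is my addition and assumes every row matches the pattern; a non-matching row would produce NaN and make astype(int) raise.

```python
import pandas as pd

# First three rows of the separator-patched output shown earlier.
df = pd.DataFrame({
    "Name": ["Rainin☆☆☆☆", "ZM_XL☆☆☆", "UzuraGames☆"],
    "Score": ["6129_Max 6129", "5888_Max 6025", "5555_Max 5586"],
    "XP LVL": ["447_21173534 pts", "344_15942978 pts", "119_4688941 pts"],
})

# Named groups become column names; astype(int) assumes every row matched.
scores = df.pop("Score").str.extract(r"(?P<Score>\d+)_Max (?P<Max>\d+)").astype(int)
xplvls = df.pop("XP LVL").str.extract(r"(?P<XPLVL>\d+)_(?P<PTS>\d+)").astype(int)

out = pd.concat([df, scores, xplvls], axis=1)
print(out)
```

If some rows might not match, pd.to_numeric(..., errors="coerce") on each column would be a more forgiving alternative to astype(int).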