I'd like to capture all the data from an HTML table, and place it into a data-set. Preferably, a DSV, or a JSON. Both of which, can be conveniently ported to other data-sets, and data-set containers, e.g., XML, or a database.
The following Wikipedia article, contains a table of values I'd like to capture. Wikipedia – List of Solar System objects by size – Objects with radius over 400 km.
I tried using XPath expressions, and not only is the syntax complex, the task of relaying each element class, or id, is simply too daunting. Not to mention that XPath is intended for XML, and not HTML.
The Internet provides a wealth of data, that can be used by developers, and coders.
Capturing this data, although, is what proves trivial for most. Considering, I find the task incredibly simple.
My general tools are, regular expression patterns, and scripting-languages.
I tend to avoid "web scraping", in the classical sense, e.g., using Beautiful Soup, jsoup, etc.
For example, if you were to copy and paste an HTML table into a text-editor, you'd find the data to be delimited by tab characters.
This is much simpler to parse.
Although, to reiterate, feel free to provide a solution, using the like.
Start by picking a text-editor, specifically one which provides a find-and-replace function, that allows for regular expression patterns.
Some examples are, Notepad++, and SublimeText; even online tools, such as regex101, will suffice.
I will be using the IntelliJ IDE, and a blank file.
Start by highlighting, and copying, the entire table; starting with the header text, "Body", up until the last column of the "2002 MS4" entry.
Paste the text into a new file.
If you quickly peruse the text, you'll find it contains several notes within brackets, e.g., "[123]", or "[abc]".
Use a find-and-replace to remove the values; here is a pattern.
\[.+?\]
Additionally, some of the numeric values contain commas.
Use a find-and-replace to remove these as well.
(?<=\d),(?=\d)
Some of the "Body" column entries have two lines of text.
We can use a capture pattern to bring this data onto the first line, and place it within parentheses.
Note, the "2002 MS4" required an accommodation; being more than one word.
(?s)^(\w+(?: MS4)?)$.+?^(.+?)(\t.+?)$
Replace this match, with the following specifiers.
$1 ($2)$3
Similarly, the "Radius", and "Type", columns have entries with multiple lines.
Although, there are only a few, we can do this manually.
As a final step, we'll need to adjust the first line; the headers.
The HTML table is using a column span attribute, so we'll need to move these sub-header values accordingly.
This can also be done manually.
Body Image Radius (km) Radius (R🜨) Volume (109 km3) Volume (V🜨) Mass (1021 kg) Mass (M🜨) Surface area (106 km2) Surface area 🜨 Density (g/cm3) Gravity (m/s2) Gravity (🜨) Type Discovery
And, that is it for formatting. We now have a TSV data-set.
This data can be pasted into a spread-sheet, or even parsed to a Python dictionary, or JSON set.
Body Image Radius (km) Radius (R🜨) Volume (109 km3) Volume (V🜨) Mass (1021 kg) Mass (M🜨) Surface area (106 km2) Surface area 🜨 Density (g/cm3) Gravity (m/s2) Gravity (🜨) Type Discovery
Sun 695508 ± ? 109.2 1409300000 1301000 1989100000 333000 6078700 11918 1.409 274.0 27.94 G2V-class star prehistoric
Jupiter 69911±6 10.97 1431280 1321 1898187±88 317.83 61419 120.41 1.3262±0.0003 24.79 2.528 gas giant planet; has rings prehistoric
Saturn 58232±6 (136775 for A Ring) 9.140 827130 764 568317±13 95.162 42612 83.54 0.6871±0.0002 10.44 1.065 gas giant planet; has rings prehistoric
Uranus 25362±7 3.981 68340 63.1 86813±4 14.536 8083.1 15.85 1.270±0.001 8.87 0.886 ice giant planet; has rings 1781
Neptune 24622±19 3.865 62540 57.7 102413±5 17.147 7618.3 14.94 1.638±0.004 11.15 1.137 ice giant planet; has rings 1846
Earth 6371.0084±0.0001 1 1083.21 1 5972.4±0.3 1 510.06447 1 5.5136±0.0003 9.81 1 terrestrial planet prehistoric
Venus 6052±1 0.9499 928.43 0.857 4867.5±0.2 0.815 460.2 0.903 5.243±0.003 8.87 0.905 terrestrial planet prehistoric
Mars 3389.5±0.2 0.5320 163.18 0.151 641.71±0.03 0.107 144.37 0.283 3.9341±0.0007 3.71 0.379 terrestrial planet prehistoric
Ganymede (Jupiter III) 2634.1±0.3 0.4135 76.30 0.0704 148.2 0.0248 86.999 0.171 1.936 1.428 0.146 moon of Jupiter (icy) 1610
Titan (Saturn VI) 2574.73±0.09 0.4037 71.50 0.0658 134.5 0.0225 83.3054 0.163 1.880±0.004 1.354 0.138 moon of Saturn (icy) 1655
Mercury 2439.4±0.1 0.3829 60.83 0.0562 330.11±0.02 0.0553 74.797 0.147 5.4291±0.007 3.70 0.377 terrestrial planet prehistoric
Callisto (Jupiter IV) 2410.3±1.5 0.3783 58.65 0.0541 107.6 0.018 73.005 0.143 1.834±0.003 1.23603 0.126 moon of Jupiter (icy) 1610
Io (Jupiter I) 1821.6±0.5 0.2859 25.32 0.0234 89.32 0.015 41.698 0.082 3.528±0.006 1.797 0.183 moon of Jupiter (terrestrial) 1610
Moon (Earth I) 1737.5±0.1 0.2727 21.958 0.0203 73.46 0.0123 37.937 0.074 3.344±0.005 1.625 0.166 moon of Earth (terrestrial) prehistoric
Europa (Jupiter II) 1560.8±0.5 0.2450 15.93 0.0147 48.00 0.008035 30.613 0.06 3.013±0.005 1.316 0.134 moon of Jupiter (terrestrial) 1610
Triton (Neptune I) 1353.4±0.9 0.2124 10.38 0.0096 21.39±0.03 0.003599 23.018 0.045 2.061 0.782 0.0797 moon of Neptune (icy) 1846
Pluto (134340) 1188.3±0.8 0.187 7.057 0.00651 13.03±0.03 0.0022 17.79 0.034 1.854±0.006 0.620 0.063 dwarf planet; plutino; multiple 1930
Eris (136199) 1163±6 0.1825 6.59 0.0061 16.6±0.2 0.0028 17 0.033 2.52±0.07 0.824 0.083 dwarf planet; SDO; binary 2003
Haumea (136108) 798±6 to 816 0.12 1.98 0.0018 4.01±0.04 0.00066 8.14 0.016 2.018 0.401 0.0409 dwarf planet; resonant KBO (7:12); trinary; has rings 2004
Titania (Uranus III) 788.9±1.8 0.1237 2.06 0.0019 3.40±0.06 0.00059 7.82 0.015 1.711±0.005 0.378 0.0385 moon of Uranus 1787
Rhea (Saturn V) 763.8±1.0 0.1199 1.87 0.0017 2.307 0.00039 7.34 0.014 1.236±0.005 0.26 0.027 moon of Saturn 1672
Oberon (Uranus IV) 761.4±2.6 0.1195 1.85 0.0017 3.08±0.09 0.0005 7.285 0.014 1.63±0.05 0.347 0.035 moon of Uranus 1787
Iapetus (Saturn VIII) 735.6±1.5 0.1153 1.66 0.0015 1.806 0.00033 6.8 0.013 1.088±0.013 0.223 0.0227 moon of Saturn 1671
Makemake (136472) 715+19−11 0.112 1.53 0.0014 ≈ 3.1 0.00053 6.4 0.013 ≈ 2.1 0.57 0.0581 dwarf planet; cubewano 2005
Gonggong (225088) 615±25 0.0983 1.03 0.0009 1.75±0.07 0.00029 4.753 0.009 1.72±0.16 0.3 0.0306 dwarf planet?; resonant SDO (3:10) 2007
Charon (Pluto I) 606.0±0.5 0.0951 0.932 0.0009 1.586±0.015 0.00025 4.578 0.009 1.70±0.02 0.288 0.0294 moon of Pluto 1978
Umbriel (Uranus II) 584.7±2.8 0.0918 0.837 0.0008 1.28±0.03 0.00020 4.3 0.008 1.39±0.16 0.234 0.024 moon of Uranus 1851
Ariel (Uranus I) 578.9±0.6 0.0909 0.813 0.0007 1.25±0.02 0.000226 4.211 0.008 1.66±0.15 0.269 0.027 moon of Uranus 1851
Dione (Saturn IV) 561.7±0.45 0.0881 0.741 0.0007 1.095 0.000183 3.965 0.008 1.478±0.003 0.232 0.0237 moon of Saturn 1684
Quaoar (50000) 543±2 0.0879 0.737 0.0007 1.20±0.05 0.0002 3.83 0.008 2.0±0.5 0.3 0.0306 cubewano; binary 2002
Tethys (Saturn III) 533.0±0.7 0.0834 0.624 0.0006 0.617 0.000103 3.57 0.007 0.984±0.003 0.145 0.015 moon of Saturn 1684
Sedna (90377) 498±40 0.0785 0.516 0.0005 sednoid; detached object 2003
Ceres (1) 469.7±0.1 0.0742 0.433 0.0004 0.938 0.000157 2.85 0.006 2.17 0.28 0.029 dwarf planet; belt asteroid 1801
Orcus (90482) 455+25−20 0.0719 0.404 0.0004 0.548±0.010 0.000092 1.4±0.2 0.2 0.0204 plutino; binary 2004
Salacia (120347) 423±11 0.0664 0.3729 0.0003 0.492±0.007 0.000082 1.5±0.1 0.165 0.0168 cubewano; binary 2004
2002 MS4 (307261) 400±12 0.0628 0.2681 0.0002 cubewano 2002
Here is an example parse, in Python.
import csv as c
sso = {}
with open('wikipedia_sso.tsv', encoding='utf-8') as f:
r = c.reader(f, delimiter='\t')
next(it := iter(r))
for (b, i, rkm, re, vkm3, ve, mkg, me, akm2, ae, d, ms2, ge, t, ds) in it:
sso.update(
{b: {'radius': {'km': rkm, 're': re},
'volume': {'km^3-10e9': vkm3, 've': ve},
'mass': {'kg-10e21': mkg, 'me': me},
'surface-area': {'km^2-10e6': akm2, 'e': ae},
'density': {'g/cm^3': d},
'gravity': {'m/s^2': ms2, 'e': ge},
'type': t,
'discovery': ds}})
import pprint as p
p.pprint(sso['Earth'])
Output
{'density': {'g/cm^3': '5.5136±0.0003'},
'discovery': 'prehistoric',
'gravity': {'e': '1', 'm/s^2': '9.81'},
'mass': {'kg-10e21': '5972.4±0.3', 'me': '1'},
'radius': {'km': '6371.0084±0.0001', 're': '1'},
'surface-area': {'e': '1', 'km^2-10e6': '510.06447'},
'type': 'terrestrial planet',
'volume': {'km^3-10e9': '1083.21', 've': '1'}}
And, here is a parse, from Python, to JSON.
import json as j
print(j.dumps(sso['Earth'], sort_keys=True, indent=4))
Output
{
"density": {
"g/cm^3": "5.5136\u00b10.0003"
},
"discovery": "prehistoric",
"gravity": {
"e": "1",
"m/s^2": "9.81"
},
"mass": {
"kg-10e21": "5972.4\u00b10.3",
"me": "1"
},
"radius": {
"km": "6371.0084\u00b10.0001",
"re": "1"
},
"surface-area": {
"e": "1",
"km^2-10e6": "510.06447"
},
"type": "terrestrial planet",
"volume": {
"km^3-10e9": "1083.21",
"ve": "1"
}
}