Search code examples
pythoncsvparsinghtml-tabledataset

How to parse an HTML table, into a portable data-set; DSV, JSON, etc


I'd like to capture all the data from an HTML table, and place it into a data-set. Preferably, a DSV, or a JSON.  Both of which, can be conveniently ported to other data-sets, and data-set containers, e.g., XML, or a database.

The following Wikipedia article, contains a table of values I'd like to capture. Wikipedia – List of Solar System objects by size – Objects with radius over 400 km.

I tried using XPath expressions, and not only is the syntax complex, the task of relaying each element class, or id, is simply too daunting. Not to mention that XPath is intended for XML, and not HTML.


Solution

  • The Internet provides a wealth of data, that can be used by developers, and coders.
    Capturing this data, although, is what proves trivial for most.  Considering, I find the task incredibly simple.

    My general tools are, regular expression patterns, and scripting-languages.
    I tend to avoid "web scraping", in the classical sense, e.g., using Beautiful Soup, jsoup, etc.
    For example, if you were to copy and paste an HTML table into a text-editor, you'd find the data to be delimited by tab characters.
    This is much simpler to parse.

    Although, to reiterate, feel free to provide a solution, using the like.

    Start by picking a text-editor, specifically one which provides a find-and-replace function, that allows for regular expression patterns.
    Some examples are, Notepad++, and SublimeText; even online tools, such as regex101, will suffice.
    I will be using the IntelliJ IDE, and a blank file.

    Start by highlighting, and copying, the entire table; starting with the header text, "Body", up until the last column of the "2002 MS4" entry.
    Paste the text into a new file.

    If you quickly peruse the text, you'll find it contains several notes within brackets, e.g., "[123]", or "[abc]".
    Use a find-and-replace to remove the values; here is a pattern.

    \[.+?\]
    

    Additionally, some of the numeric values contain commas.
    Use a find-and-replace to remove these as well.

    (?<=\d),(?=\d)
    

    Some of the "Body" column entries have two lines of text.
    We can use a capture pattern to bring this data onto the first line, and place it within parentheses.
    Note, the "2002 MS4" required an accommodation; being more than one word.

    (?s)^(\w+(?: MS4)?)$.+?^(.+?)(\t.+?)$
    

    Replace this match, with the following specifiers.

    $1 ($2)$3
    

    Similarly, the "Radius", and "Type", columns have entries with multiple lines.
    Although, there are only a few, we can do this manually.

    As a final step, we'll need to adjust the first line; the headers.
    The HTML table is using a column span attribute, so we'll need to move these sub-header values accordingly.
    This can also be done manually.

    Body    Image   Radius (km) Radius (R🜨)    Volume (109 km3)    Volume (V🜨)    Mass (1021 kg)  Mass (M🜨)  Surface area (106 km2)  Surface area 🜨 Density (g/cm3) Gravity (m/s2)  Gravity (🜨)    Type    Discovery
    

    And, that is it for formatting.  We now have a TSV data-set.
    This data can be pasted into a spread-sheet, or even parsed to a Python dictionary, or JSON set.

    Body    Image   Radius (km) Radius (R🜨)    Volume (109 km3)    Volume (V🜨)    Mass (1021 kg)  Mass (M🜨)  Surface area (106 km2)  Surface area 🜨 Density (g/cm3) Gravity (m/s2)  Gravity (🜨)    Type    Discovery
    Sun     695508 ± ?  109.2   1409300000  1301000 1989100000  333000  6078700 11918   1.409   274.0   27.94   G2V-class star  prehistoric
    Jupiter     69911±6 10.97   1431280 1321    1898187±88  317.83  61419   120.41  1.3262±0.0003   24.79   2.528   gas giant planet; has rings prehistoric
    Saturn      58232±6 (136775 for A Ring) 9.140   827130  764 568317±13   95.162  42612   83.54   0.6871±0.0002   10.44   1.065   gas giant planet; has rings prehistoric
    Uranus      25362±7 3.981   68340   63.1    86813±4 14.536  8083.1  15.85   1.270±0.001 8.87    0.886   ice giant planet; has rings 1781
    Neptune     24622±19    3.865   62540   57.7    102413±5    17.147  7618.3  14.94   1.638±0.004 11.15   1.137   ice giant planet; has rings 1846
    Earth       6371.0084±0.0001    1   1083.21 1   5972.4±0.3  1   510.06447   1   5.5136±0.0003   9.81    1   terrestrial planet  prehistoric
    Venus       6052±1  0.9499  928.43  0.857   4867.5±0.2  0.815   460.2   0.903   5.243±0.003 8.87    0.905   terrestrial planet  prehistoric
    Mars        3389.5±0.2  0.5320  163.18  0.151   641.71±0.03 0.107   144.37  0.283   3.9341±0.0007   3.71    0.379   terrestrial planet  prehistoric
    Ganymede (Jupiter III)      2634.1±0.3  0.4135  76.30   0.0704  148.2   0.0248  86.999  0.171   1.936   1.428   0.146   moon of Jupiter (icy)   1610
    Titan (Saturn VI)       2574.73±0.09    0.4037  71.50   0.0658  134.5   0.0225  83.3054 0.163   1.880±0.004 1.354   0.138   moon of Saturn (icy)    1655
    Mercury     2439.4±0.1  0.3829  60.83   0.0562  330.11±0.02 0.0553  74.797  0.147   5.4291±0.007    3.70    0.377   terrestrial planet  prehistoric
    Callisto (Jupiter IV)       2410.3±1.5  0.3783  58.65   0.0541  107.6   0.018   73.005  0.143   1.834±0.003 1.23603 0.126   moon of Jupiter (icy)   1610
    Io (Jupiter I)      1821.6±0.5  0.2859  25.32   0.0234  89.32   0.015   41.698  0.082   3.528±0.006 1.797   0.183   moon of Jupiter (terrestrial)   1610
    Moon (Earth I)      1737.5±0.1  0.2727  21.958  0.0203  73.46   0.0123  37.937  0.074   3.344±0.005 1.625   0.166   moon of Earth (terrestrial) prehistoric
    Europa (Jupiter II)     1560.8±0.5  0.2450  15.93   0.0147  48.00   0.008035    30.613  0.06    3.013±0.005 1.316   0.134   moon of Jupiter (terrestrial)   1610
    Triton (Neptune I)      1353.4±0.9  0.2124  10.38   0.0096  21.39±0.03  0.003599    23.018  0.045   2.061   0.782   0.0797  moon of Neptune (icy)   1846
    Pluto (134340)      1188.3±0.8  0.187   7.057   0.00651 13.03±0.03  0.0022  17.79   0.034   1.854±0.006 0.620   0.063   dwarf planet; plutino; multiple 1930
    Eris (136199)       1163±6  0.1825  6.59    0.0061  16.6±0.2    0.0028  17  0.033   2.52±0.07   0.824   0.083   dwarf planet; SDO; binary   2003
    Haumea (136108)     798±6 to 816    0.12    1.98    0.0018  4.01±0.04   0.00066 8.14    0.016   2.018   0.401   0.0409  dwarf planet; resonant KBO (7:12); trinary; has rings   2004
    Titania (Uranus III)        788.9±1.8   0.1237  2.06    0.0019  3.40±0.06   0.00059 7.82    0.015   1.711±0.005 0.378   0.0385  moon of Uranus  1787
    Rhea (Saturn V)     763.8±1.0   0.1199  1.87    0.0017  2.307   0.00039 7.34    0.014   1.236±0.005 0.26    0.027   moon of Saturn  1672
    Oberon (Uranus IV)      761.4±2.6   0.1195  1.85    0.0017  3.08±0.09   0.0005  7.285   0.014   1.63±0.05   0.347   0.035   moon of Uranus  1787
    Iapetus (Saturn VIII)       735.6±1.5   0.1153  1.66    0.0015  1.806   0.00033 6.8 0.013   1.088±0.013 0.223   0.0227  moon of Saturn  1671
    Makemake (136472)       715+19−11   0.112   1.53    0.0014  ≈ 3.1   0.00053 6.4 0.013   ≈ 2.1   0.57    0.0581  dwarf planet; cubewano  2005
    Gonggong (225088)       615±25  0.0983  1.03    0.0009  1.75±0.07   0.00029 4.753   0.009   1.72±0.16   0.3 0.0306  dwarf planet?; resonant SDO (3:10)  2007
    Charon (Pluto I)        606.0±0.5   0.0951  0.932   0.0009  1.586±0.015 0.00025 4.578   0.009   1.70±0.02   0.288   0.0294  moon of Pluto   1978
    Umbriel (Uranus II)     584.7±2.8   0.0918  0.837   0.0008  1.28±0.03   0.00020 4.3 0.008   1.39±0.16   0.234   0.024   moon of Uranus  1851
    Ariel (Uranus I)        578.9±0.6   0.0909  0.813   0.0007  1.25±0.02   0.000226    4.211   0.008   1.66±0.15   0.269   0.027   moon of Uranus  1851
    Dione (Saturn IV)       561.7±0.45  0.0881  0.741   0.0007  1.095   0.000183    3.965   0.008   1.478±0.003 0.232   0.0237  moon of Saturn  1684
    Quaoar (50000)      543±2   0.0879  0.737   0.0007  1.20±0.05   0.0002  3.83    0.008   2.0±0.5 0.3 0.0306  cubewano; binary    2002
    Tethys (Saturn III)     533.0±0.7   0.0834  0.624   0.0006  0.617   0.000103    3.57    0.007   0.984±0.003 0.145   0.015   moon of Saturn  1684
    Sedna (90377)       498±40  0.0785  0.516   0.0005                              sednoid; detached object    2003
    Ceres (1)       469.7±0.1   0.0742  0.433   0.0004  0.938   0.000157    2.85    0.006   2.17    0.28    0.029   dwarf planet; belt asteroid 1801
    Orcus (90482)       455+25−20   0.0719  0.404   0.0004  0.548±0.010 0.000092            1.4±0.2 0.2 0.0204  plutino; binary 2004
    Salacia (120347)        423±11  0.0664  0.3729  0.0003  0.492±0.007 0.000082            1.5±0.1 0.165   0.0168  cubewano; binary    2004
    2002 MS4 (307261)       400±12  0.0628  0.2681  0.0002                              cubewano    2002
    

    Here is an example parse, in Python.

    import csv as c
    sso = {}
    with open('wikipedia_sso.tsv', encoding='utf-8') as f:
        r = c.reader(f, delimiter='\t')
        next(it := iter(r))
        for (b, i, rkm, re, vkm3, ve, mkg, me, akm2, ae, d, ms2, ge, t, ds) in it:
            sso.update(
                {b: {'radius': {'km': rkm, 're': re},
                     'volume': {'km^3-10e9': vkm3, 've': ve},
                     'mass': {'kg-10e21': mkg, 'me': me},
                     'surface-area': {'km^2-10e6': akm2, 'e': ae},
                     'density': {'g/cm^3': d},
                     'gravity': {'m/s^2': ms2, 'e': ge},
                     'type': t,
                     'discovery': ds}})
    
    import pprint as p
    p.pprint(sso['Earth'])
    

    Output

    {'density': {'g/cm^3': '5.5136±0.0003'},
     'discovery': 'prehistoric',
     'gravity': {'e': '1', 'm/s^2': '9.81'},
     'mass': {'kg-10e21': '5972.4±0.3', 'me': '1'},
     'radius': {'km': '6371.0084±0.0001', 're': '1'},
     'surface-area': {'e': '1', 'km^2-10e6': '510.06447'},
     'type': 'terrestrial planet',
     'volume': {'km^3-10e9': '1083.21', 've': '1'}}
    

    And, here is a parse, from Python, to JSON.

    import json as j
    print(j.dumps(sso['Earth'], sort_keys=True, indent=4))
    

    Output

    {
        "density": {
            "g/cm^3": "5.5136\u00b10.0003"
        },
        "discovery": "prehistoric",
        "gravity": {
            "e": "1",
            "m/s^2": "9.81"
        },
        "mass": {
            "kg-10e21": "5972.4\u00b10.3",
            "me": "1"
        },
        "radius": {
            "km": "6371.0084\u00b10.0001",
            "re": "1"
        },
        "surface-area": {
            "e": "1",
            "km^2-10e6": "510.06447"
        },
        "type": "terrestrial planet",
        "volume": {
            "km^3-10e9": "1083.21",
            "ve": "1"
        }
    }