python parsing pandas wiki wikipedia

How to convert Wikipedia wikitable to Python Pandas DataFrame?


Wikipedia contains a lot of interesting tabular data that you may want to sort, filter, and so on.

Here is a sample wikitable:

{| class="wikitable sortable"
|-
! Model !! Mhash/s !! Mhash/J !! Watts !! Clock !! SP !! Comment
|-
| ION || 1.8 || 0.067 || 27 ||  || 16 || poclbm;  power consumption incl. CPU
|-
| 8200 mGPU || 1.2 || || || 1200 || 16 || 128 MB shared memory, "poclbm -w 128 -f 0"
|-
| 8400 GS || 2.3 || || ||  ||  || "poclbm -w 128"
|-
|}

I'm looking for a way to import such data into a pandas DataFrame.


Solution

  • Here's a solution using py-wikimarkup and PyQuery to extract all tables as pandas DataFrames from a wikimarkup string, ignoring non-table content.

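    import wikimarkup
    import pandas as pd
    from pyquery import PyQuery
    
    def get_tables(wiki):
        """Return one DataFrame per table found in a wikimarkup string."""
        # Render the wikimarkup to HTML, then select every <table> element.
        html = PyQuery(wikimarkup.parse(wiki))
        frames = []
        for table in html('table'):
            # One list per <tr>. itertext() also copes with empty cells
            # (where .text is None) and with cells holding nested markup.
            data = [[''.join(cell.itertext()).strip() for cell in row]
                    for row in table.iter('tr')]
            # The first row holds the headers, the remaining rows the data.
            df = pd.DataFrame(data[1:], columns=data[0])
            frames.append(df)
        return frames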
    

    Given the following input,

    wiki = """
    =Title=
    
    Description.
    
    {| class="wikitable sortable"
    |-
    ! Model !! Mhash/s !! Mhash/J !! Watts !! Clock !! SP !! Comment
    |-
    | ION || 1.8 || 0.067 || 27 ||  || 16 || poclbm;  power consumption incl. CPU
    |-
    | 8200 mGPU || 1.2 || || || 1200 || 16 || 128 MB shared memory, "poclbm -w 128 -f 0"
    |-
    | 8400 GS || 2.3 || || || || || "poclbm -w 128"
    |-
    |}
    
    {| class="wikitable sortable"
    |-
    ! A !! B !! C
    |-
    | 0
    | 1
    | 2
    |-
    | 3
    | 4
    | 5
    |}
    """
    

    get_tables returns the following DataFrames.

           Model Mhash/s Mhash/J Watts Clock  SP                                     Comment
    0        ION     1.8   0.067    27        16        poclbm;  power consumption incl. CPU
    1  8200 mGPU     1.2                1200  16  128 MB shared memory, "poclbm -w 128 -f 0"
    2    8400 GS     2.3                                                     "poclbm -w 128"
    

     

       A  B  C
    0  0  1  2
    1  3  4  5
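
    Every cell in these frames is a string, and missing values are empty strings, so sorting or filtering by value will not behave numerically yet. A minimal sketch of one way to convert the numeric columns is shown here. It assumes a hand-built frame that mirrors part of the first table (a stand-in for the output of get_tables), and the names numeric_cols and fastest are illustrative only.

```python
import pandas as pd

# Hypothetical stand-in for the first frame returned by get_tables:
# every cell is a string and missing values are empty strings.
df = pd.DataFrame(
    [["ION", "1.8", "0.067", "27", "", "16"],
     ["8200 mGPU", "1.2", "", "", "1200", "16"],
     ["8400 GS", "2.3", "", "", "", ""]],
    columns=["Model", "Mhash/s", "Mhash/J", "Watts", "Clock", "SP"],
)

# Columns that should hold numbers (an assumption based on the headers).
numeric_cols = ["Mhash/s", "Mhash/J", "Watts", "Clock", "SP"]

# errors="coerce" turns anything unparseable, including "", into NaN.
df[numeric_cols] = df[numeric_cols].apply(pd.to_numeric, errors="coerce")

# With numeric dtypes in place, sorting compares values, not strings.
fastest = df.sort_values("Mhash/s", ascending=False)
```

    After the conversion, empty cells show up as NaN, which pandas skips in most aggregations such as mean() or max().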