Tags: python, pandas, shapely

Elegant way to convert Shapely Multipoint to a Pandas Dataframe


I need to convert a dict of Shapely MultiPoints to a DataFrame with one row per point. I've written a nested for loop to do it, but I'd like to know if there's a better way.

Sample data and current code:

from shapely import wkb
import pandas as pd

data = {
    "A": "010400000002000000010100000000000000000008400000000000001440010100000000000000000008400000000000000840",
    "B": "01040000000200000001010000000000000000A061C00000000000A0894001010000000000000000708C400000000000C074C0",
    "C": "01040000000200000001010000000000000000EEB34000000000006CBB4001010000000000000000003E4000000000008DD3C0"
}

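# Start with an empty frame, then append one row per point.
df = pd.DataFrame(columns=["ID", "X", "Y"])
for key, wkb_val in data.items():
    # Decode the hex WKB string into a MultiPoint and walk its points.
    for point in wkb.loads(wkb_val, hex=True):
        df = df.append({
            "ID": key, "X": point.x, "Y": point.y
        }, ignore_index=True)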

This works, if a little slowly and clunkily. Can it be done better, and if so, how?


Solution

  • A list comprehension passed to the DataFrame constructor is likely the best option here:

    df = pd.DataFrame(
        [[k, point.x, point.y]
         for k, v in data.items()
         for point in wkb.loads(v, hex=True)],
        columns=['ID', 'X', 'Y']
    )
    
      ID       X        Y
    0  A     3.0      5.0
    1  A     3.0      3.0
    2  B  -141.0    820.0
    3  B   910.0   -332.0
    4  C  5102.0   7020.0
    5  C    30.0 -20020.0
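
    On Shapely 2.0 and later, multi-part geometries such as MultiPoint are no longer directly iterable, so looping over the result of wkb.loads fails there. A minimal sketch of the same comprehension going through the .geoms accessor instead, which should also work on 1.x; the sample data is repeated only so the block runs on its own:

```python
import pandas as pd
from shapely import wkb

data = {
    "A": "010400000002000000010100000000000000000008400000000000001440010100000000000000000008400000000000000840",
    "B": "01040000000200000001010000000000000000A061C00000000000A0894001010000000000000000708C400000000000C074C0",
    "C": "01040000000200000001010000000000000000EEB34000000000006CBB4001010000000000000000003E4000000000008DD3C0",
}

# One [ID, X, Y] row per point; .geoms exposes the individual Point
# objects that make up each MultiPoint.
df = pd.DataFrame(
    [[k, point.x, point.y]
     for k, v in data.items()
     for point in wkb.loads(v, hex=True).geoms],
    columns=["ID", "X", "Y"],
)
print(df)
```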
    

    pandas operations are expensive here, especially append in a loop, which has to create a copy of the whole DataFrame on every iteration. (DataFrame.append was deprecated in pandas 1.4 and removed in 2.0.)
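
    The difference between the two patterns can be sketched without Shapely. The rows list below is made up for illustration (it is not the question's data), and pd.concat stands in for the row-at-a-time append. Growing the frame copies everything accumulated so far on each pass, while collecting plain dicts defers all pandas work to a single constructor call:

```python
import pandas as pd

# Hypothetical (ID, X, Y) records standing in for decoded points.
rows = [{"ID": "A", "X": float(i), "Y": float(-i)} for i in range(5)]

# Slow pattern: each pd.concat call allocates a new frame and copies
# every previously accumulated row into it.
grown = pd.DataFrame([rows[0]])
for row in rows[1:]:
    grown = pd.concat([grown, pd.DataFrame([row])], ignore_index=True)

# Faster pattern: keep the rows as plain Python objects and build the
# frame once at the end.
built = pd.DataFrame(rows, columns=["ID", "X", "Y"])

print(grown.equals(built))
```

    Both frames hold the same data; only the amount of intermediate copying differs.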


    Some timing information via %timeit, measured on the sample data above:

    This Answer

    def fn(data):
        return pd.DataFrame(
            [[k, point.x, point.y]
             for k, v in data.items()
             for point in wkb.loads(v, hex=True)],
            columns=['ID', 'X', 'Y']
        )
    
    %timeit fn(data)
    552 µs ± 11.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
    

    OP's solution

    def fn2(data):
        df = pd.DataFrame(columns=["ID", "X", "Y"])
        for key, wkb_val in data.items():
            for point in wkb.loads(wkb_val, hex=True):
                df = df.append({
                    "ID": key, "X": point.x, "Y": point.y
                }, ignore_index=True)
        return df
    
    %timeit fn2(data)
    10.3 ms ± 77.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    

    Steele Farnsworth's Solution

    def fn3(data):
        return pd.concat(
            (
                (
                    pd.concat(
                        (pd.Series({"ID": key, "X": point.x, "Y": point.y}) for
                         point in
                         wkb.loads(wkb_val, hex=True)), axis=1)
                )
                for key, wkb_val in data.items()
            ), axis=1
        ).T
    
    %timeit fn3(data)
    3.42 ms ± 132 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
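
    %timeit is an IPython magic, so the measurements above cannot be rerun as-is in a plain script. A rough sketch of an equivalent harness using the standard timeit module follows. The function names fn_listcomp and fn_coords are introduced here for illustration; fn_coords is an extra variant, not one of the answers timed above, and it assumes Shapely 2.0+ for shapely.get_coordinates. Absolute numbers will differ from those quoted, since they depend on the machine and library versions.

```python
import timeit

import pandas as pd
import shapely
from shapely import wkb

data = {
    "A": "010400000002000000010100000000000000000008400000000000001440010100000000000000000008400000000000000840",
    "B": "01040000000200000001010000000000000000A061C00000000000A0894001010000000000000000708C400000000000C074C0",
    "C": "01040000000200000001010000000000000000EEB34000000000006CBB4001010000000000000000003E4000000000008DD3C0",
}


def fn_listcomp(data):
    """List comprehension fed to the DataFrame constructor (via .geoms)."""
    return pd.DataFrame(
        [[k, point.x, point.y]
         for k, v in data.items()
         for point in wkb.loads(v, hex=True).geoms],
        columns=["ID", "X", "Y"],
    )


def fn_coords(data):
    """Variant that extracts all coordinates in one call (Shapely 2.0+)."""
    keys = list(data)
    geoms = [wkb.loads(v, hex=True) for v in data.values()]
    # coords has one (x, y) row per point; idx[i] is the position in
    # geoms of the geometry that point i came from.
    coords, idx = shapely.get_coordinates(geoms, return_index=True)
    return pd.DataFrame({
        "ID": [keys[i] for i in idx],
        "X": coords[:, 0],
        "Y": coords[:, 1],
    })


runs = 200
for fn in (fn_listcomp, fn_coords):
    total = timeit.timeit(lambda: fn(data), number=runs)
    print(f"{fn.__name__}: {total / runs * 1e6:.0f} us per call")
```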