Tags: python, pandas, dataframe, split, assign

Efficient way of computing a DataFrame using concat and split


I am new to python/pandas/numpy and I need to create the following DataFrame:

DF = pd.concat([pd.Series(x[2]).apply(lambda r: pd.Series(re.split(r'@|/', r))).assign(id=x[0]) for x in hDF])

where hDF is a dataframe that has been created by:

hDF = pd.DataFrame(h.DF)

and h.DF is a list whose elements look like this:

['5203906',
 ['highway=primary',
  'maxspeed=30',
  'oneway=yes',
  'ref=N 22',
  'surface=asphalt'],
 ['3655224911@1.735928/42.543651',
  '3655224917@1.735766/42.543561',
  '3655224916@1.735694/42.543523',
  '3655224915@1.735597/42.543474',
  '4817024439@1.735581/42.543469']]

However, in some cases the outer list is very long (on the order of 10^7 elements), and the node lists in h.DF[*][2] are also very long, so I run out of memory.

I can obtain the same result, avoiding the lambda function, like so:

DF = pd.concat([pd.Series(x[2]).str.split(r'@|/', expand=True).assign(id=x[0]) for x in hDF])
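For illustration, here is a minimal, self-contained sketch of what that split-and-assign expression produces for a single record (the record is abbreviated from the sample above):

```python
import pandas as pd

# One record in the shape of h.DF's elements: [way id, tags, "node@lon/lat" strings]
x = [
    "5203906",
    ["highway=primary", "maxspeed=30"],
    [
        "3655224911@1.735928/42.543651",
        "3655224917@1.735766/42.543561",
    ],
]

# Splitting each "node@lon/lat" string on '@' and '/' gives three string columns
# (node id, lon, lat); .assign then tags every row with the way id.
df = pd.Series(x[2]).str.split(r"@|/", expand=True).assign(id=x[0])
print(df)
```

Because the pattern is longer than one character, pandas treats it as a regular expression, so `@|/` splits on either separator in one pass.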

But this still runs out of memory when the lists are very long.

Can you think of a possible way to obtain the same result without exhausting memory?


Solution

  • I managed to make it work using the following code:

    import itertools

    import numpy as np
    import pandas as pd

    bl = []
    for x in h.DF:
        # First pass splits "node@lon/lat" on '@' and keeps the "lon/lat" part;
        # second pass splits on '/' and parses the coordinates as floats.
        data = np.loadtxt(
            np.loadtxt(x[2], dtype=str, delimiter="@")[:, 1], dtype=float, delimiter="/"
        ).tolist()
        for row in data:  # tag every coordinate row with the way id
            row.append(x[0])
        bl.append(data)
    bbl = list(itertools.chain.from_iterable(bl))
    DF = pd.DataFrame(bbl).rename(columns={0: "lon", 1: "lat", 2: "wayid"})
    

    Now it's super fast :)
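The main win in the code above is building one flat list of rows and constructing a single DataFrame, instead of concatenating millions of tiny DataFrames. A plain-Python sketch of the same idea, without NumPy's text parsing (the sample data mirrors the shapes shown in the question; the variable names are illustrative):

```python
import pandas as pd

# Records shaped like h.DF: [way id, tags, "node@lon/lat" strings].
h_DF = [
    [
        "5203906",
        ["highway=primary"],
        [
            "3655224911@1.735928/42.543651",
            "3655224917@1.735766/42.543561",
        ],
    ],
]

rows = []
for way_id, _tags, nodes in h_DF:
    for node in nodes:
        _node_id, coords = node.split("@")  # drop the node id, keep "lon/lat"
        lon, lat = coords.split("/")
        rows.append((float(lon), float(lat), way_id))

# One DataFrame built from a flat list of tuples: no per-row concat.
DF = pd.DataFrame(rows, columns=["lon", "lat", "wayid"])
```

Either way, the memory saving comes from accumulating lightweight Python objects and paying the DataFrame construction cost exactly once.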