I am new to python/pandas/numpy and I need to create the following DataFrame:
DF = pd.concat([pd.Series(x[2]).apply(lambda r: pd.Series(re.split(r'@|/', r))).assign(id=x[0]) for x in hDF.itertuples(index=False)])
where hDF is a DataFrame created by:
hDF = pd.DataFrame(h.DF)
and h.DF is a list whose elements look like this:
['5203906',
 ['highway=primary',
  'maxspeed=30',
  'oneway=yes',
  'ref=N 22',
  'surface=asphalt'],
 ['3655224911@1.735928/42.543651',
  '3655224917@1.735766/42.543561',
  '3655224916@1.735694/42.543523',
  '3655224915@1.735597/42.543474',
  '4817024439@1.735581/42.543469']]
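So, for the first node string above, the split just separates the node id from the longitude and latitude:

import re
re.split(r'@|/', '3655224911@1.735928/42.543651')
# ['3655224911', '1.735928', '42.543651']

and .assign(id=x[0]) then tags every such row with the way id ('5203906' here).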
However, in some cases the list is very long (on the order of 10^7 elements), and the nested node list in h.DF[*][2]
is also very long, so I run out of memory.
I can obtain the same result while avoiding the lambda, like so:
DF = pd.concat([pd.Series(x[2]).str.split(r'@|/', expand=True).assign(id=x[0]) for x in hDF.itertuples(index=False)])
But I am still running out of memory in the cases where the lists are very long.
Can you think of a possible solution to obtain the same result without running out of memory?
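One idea to reduce the memory pressure is to avoid building one small DataFrame per way (pd.concat over millions of tiny frames is what blows up) and instead collect plain tuples and construct a single DataFrame at the end. A rough, untested sketch, assuming h.DF has the structure shown above (the column names here are my own choice):

import pandas as pd

rows = []
for x in h.DF:
    for node in x[2]:
        node_id, coords = node.split("@")  # 'node@lon/lat' -> node id, 'lon/lat'
        lon, lat = coords.split("/")       # 'lon/lat' -> lon, lat
        rows.append((node_id, float(lon), float(lat), x[0]))
DF = pd.DataFrame(rows, columns=["node", "lon", "lat", "id"])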
I managed to make it work using the following code:
import itertools
import numpy as np
import pandas as pd

bl = []
for x in h.DF:
    # parse every 'node@lon/lat' string: split on '@', keep the 'lon/lat'
    # part, then split on '/' and convert both coordinates to floats
    data = np.loadtxt(
        np.loadtxt(x[2], dtype=str, delimiter="@")[:, 1], dtype=float, delimiter="/"
    ).tolist()
    for row in data:  # tag each node row with its way id
        row.append(x[0])
    bl.append(data)
bbl = list(itertools.chain.from_iterable(bl))  # flatten the per-way lists
DF = pd.DataFrame(bbl).rename(columns={0: "lon", 1: "lat", 2: "wayid"})
Now it's super fast :)
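For what it's worth, I believe the speedup (and the lower memory use) comes from np.loadtxt parsing all of a way's strings in one vectorized pass, and from the DataFrame being constructed exactly once from a flat list, instead of concatenating millions of tiny DataFrames as in my first attempts.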