Is there a way to build a DataFrame from a list of Python generator objects? I used a list comprehension to create the list with the data for the DataFrame:
data_list.append([record.Timestamp,record.Value, record.Name, record.desc] for record in records)
I did it this way because a normal list.append() in a for-loop takes about 20 times longer:
for record in records:
    # list.append() takes exactly one argument, so the four fields are wrapped in a list
    data_list.append([record.Timestamp, record.Value, record.Name, record.desc])
I tried to create the dataframe but it doesn't work:
dataframe = pd.DataFrame(data_list, columns=['timestamp', 'value', 'name', 'desc'])
Throws exception:
ValueError: 4 columns passed, passed data had 142538 columns.
I also tried itertools, like this:
dataframe = pd.DataFrame(data=([list(elem) for elem in itt.chain.from_iterable(data_list)]), columns=['timestamp', 'value', 'name', 'desc'])
This results in an empty DataFrame:
Empty DataFrame
Columns: [timestamp, value, name, desc]
Index: []
data_list looks like this:
[<generator object St...51DB0>, <generator object St...56EB8>,<generator object St...51F10>, <generator object St...51F68>]
Code for generating the list looks like this:
for events in events_list:
    for record in events:
        data_list.append([record.Timestamp,record.Value, record.Name, record.desc] for record in records)
This is required because of the structure of the events list.
Is there a way for me to create a DataFrame out of a list of generators? If there is, will it be time-efficient? I save a lot of time by replacing the normal for-loop with a list comprehension, but if creating the DataFrame then takes more time, the change will be pointless.
Just turn your data_list into a generator expression as well. For example:
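from collections import namedtuple
import pandas as pd
MyData = namedtuple("MyData", ["a"])
# generator expression over another generator; nothing is materialized until pandas reads it
data = (d.a for d in (MyData(i) for i in range(100)))
df = pd.DataFrame(data)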
will work just fine. So what you should do is have:
data = ((record.Timestamp,record.Value, record.Name, record.desc) for record in records)
df = pd.DataFrame(data, columns=["Timestamp", "Value", "Name", "Desc"])
The actual reason your approach does not work is that you have a single entry in your data_list, which is a generator over (I suppose) 142538 records. Pandas will try to cram that single entry into a single row (so all 142538 entries, each a list of four elements) and fails, since it expects only 4 columns to be passed.
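To make the structure concrete, here is a minimal sketch with made-up records (the Record namedtuple and its values are stand-ins, not your real data). It shows that appending a generator expression stores one entry per generator rather than one entry per row, and that itertools.chain.from_iterable can flatten such a list into plain rows if you do want to keep data_list as it is:

```python
from collections import namedtuple
from itertools import chain

import pandas as pd

# Hypothetical stand-in for the record objects in the question.
Record = namedtuple("Record", ["Timestamp", "Value", "Name", "desc"])
events_list = [
    [Record(1, 10.0, "a", "first"), Record(2, 20.0, "b", "second")],
    [Record(3, 30.0, "c", "third")],
]

# Appending a generator expression stores the generator itself, not its rows.
data_list = []
for events in events_list:
    data_list.append([r.Timestamp, r.Value, r.Name, r.desc] for r in events)

n_entries = len(data_list)  # one entry per generator, regardless of row count

# chain.from_iterable walks each generator in turn and yields its rows.
rows = list(chain.from_iterable(data_list))
df = pd.DataFrame(rows, columns=["timestamp", "value", "name", "desc"])
print(df)
```

Here data_list has two entries even though there are three records, and only the flattened rows have the four-column shape the constructor expects.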
Edit: you can of course make the generator expression more complex. Here's an example along the lines of your additional loop over events:
from collections import namedtuple
MyData = namedtuple("MyData", ["a", "b"])
data = ((d.a, d.b) for j in range(100) for d in (MyData(j, j+i) for i in range(100)))
pd.DataFrame(data, columns=["a", "b"])
Edit: here's also an example using data structures like yours:
Record = namedtuple("Record", ["Timestamp", "Value", "Name", "desc"])
event_list = [[Record(Timestamp=1, Value=1, Name=1, desc=1),
               Record(Timestamp=2, Value=2, Name=2, desc=2)],
              [Record(Timestamp=3, Value=3, Name=3, desc=3)]]
data = ((r.Timestamp, r.Value, r.Name, r.desc) for events in event_list for r in events)
pd.DataFrame(data, columns=["timestamp", "value", "name", "desc"])
Output:
   timestamp  value  name  desc
0          1      1     1     1
1          2      2     2     2
2          3      3     3     3
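One caveat with the generator-based approach: a generator can only be read once. The sketch below uses toy data (the column names and values are made up) and shows that a second DataFrame built from the same generator comes out empty. This may also be why your itertools attempt printed an empty DataFrame, if the generators in data_list had already been consumed by the earlier pd.DataFrame call, though that is a guess from the output you posted:

```python
import pandas as pd

columns = ["a", "b"]
data = ((i, i * i) for i in range(3))

first = pd.DataFrame(data, columns=columns)   # consumes the generator
second = pd.DataFrame(data, columns=columns)  # nothing left to read

print(len(first), len(second))
print(second)
```

If you need the rows more than once, build the generator again or keep the rows in a list.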