I am extracting quotes from text in the following manner and with the following output:
import textacy

data = [
    ("\"Hello, nice to meet you,\" said John. Jane said, \"It is nice to meet you as well.\"", {"url": "example1.com", "date": "Jan 1"}),
    ("\"Hello, nice to meet you,\" said John", {"url": "example2.com", "date": "Jan 2"}),
]
for record in data:
    doc = textacy.make_spacy_doc(record, lang="en_core_web_sm")
    print(list(textacy.extract.triples.direct_quotations(doc)))
[DQTriple(speaker=[John], cue=[said], content="Hello, nice to meet you,"), DQTriple(speaker=[Jane], cue=[said], content="It is nice to meet you as well.")]
[DQTriple(speaker=[John], cue=[said], content="Hello, nice to meet you,")]
My goal is to convert the output into a pandas dataframe along with the metadata from the original dataset. Specifically, I would like it to look like this:
import pandas as pd
output = {"url": ["example1.com", "example1.com", "example2.com"],
          "date": ["Jan 1", "Jan 1", "Jan 2"],
          "speaker": ["John", "Jane", "John"],
          "cue": ["said", "said", "said"],
          "content": ["Hello, nice to meet you", "It is nice to meet you as well", "Hello, nice to meet you"]}
df = pd.DataFrame(output)
            url   date  speaker   cue                         content
0  example1.com  Jan 1     John  said         Hello, nice to meet you
1  example1.com  Jan 1     Jane  said  It is nice to meet you as well
2  example2.com  Jan 2     John  said         Hello, nice to meet you
Is there an efficient way to do this?
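In your case, collect the quotes from each record together with that record's metadata, then expand them into one row per quote:
l = []
for record in data:
    doc = textacy.make_spacy_doc(record, lang="en_core_web_sm")
    # pair this record's metadata (record[1]) with each quote extracted from it
    quotes = textacy.extract.triples.direct_quotations(doc)
    l.append([{**record[1], **q._asdict()} for q in quotes])
out = pd.Series(l).explode().apply(pd.Series)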
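If the explode step feels opaque, the same reshaping can be sketched as a plain loop that builds one flat dict per quote. This is a minimal sketch, not the answer's method: `quotes_to_rows` and `extracted` are hypothetical names, and `DQTriple` here is a stand-in namedtuple holding plain strings so the example runs without textacy or spaCy installed.

```python
from collections import namedtuple

# Stand-in for textacy's DQTriple so the sketch runs without textacy or spaCy;
# real triples hold spaCy tokens and spans rather than plain strings.
DQTriple = namedtuple("DQTriple", ["speaker", "cue", "content"])


def quotes_to_rows(metadata, triples):
    """Return one flat dict per quote: the metadata plus the triple's fields."""
    rows = []
    for triple in triples:
        rows.append({
            **metadata,
            # str() works for plain strings and for spaCy Token/Span objects alike
            "speaker": " ".join(str(tok) for tok in triple.speaker),
            "cue": " ".join(str(tok) for tok in triple.cue),
            "content": str(triple.content),
        })
    return rows


# Hypothetical extraction results mirroring the output shown in the question
extracted = [
    ({"url": "example1.com", "date": "Jan 1"},
     [DQTriple(["John"], ["said"], "Hello, nice to meet you,"),
      DQTriple(["Jane"], ["said"], "It is nice to meet you as well.")]),
    ({"url": "example2.com", "date": "Jan 2"},
     [DQTriple(["John"], ["said"], "Hello, nice to meet you,")]),
]

# Flatten: every quote becomes its own row carrying its record's metadata
rows = [row for metadata, triples in extracted
        for row in quotes_to_rows(metadata, triples)]
```

With textacy in place, `triples` would come from `textacy.extract.triples.direct_quotations(doc)` and `metadata` from `record[1]`. Passing `rows` to `pd.DataFrame(rows)` should then give one row per quote with the columns url, date, speaker, cue and content. Note that the content strings keep their trailing punctuation, so strip it separately if you want output exactly like the table in the question.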