
How to extract (speaker, text) tuples from earnings call transcripts with regex?


For my master's thesis, I need to extract (speaker, text) tuples from corporate earnings call transcripts.

A transcript comes in the following form:

OPERATOR: Some text with numbers, special characters and linebreaks.

NAME, COMPANY, POSITION: Some text with numbers, special characters and linebreaks.

NAME: Some text with numbers, special characters and linebreaks.

I want to extract all (speaker, text) tuples from the document. For example:

[("OPERATOR", "Some text with numbers, special characters and linebreaks."), ..]

So far, I have tried different regular expressions with the re.findall function in Python.

Here is an example extract:

example = """OPERATOR: Good day, ladies and gentlemen, and welcome to the first-quarter 2012
Agilent Technologies earnings conference call. My name is Keith, and I will be
your operator for today. At this time, all participants are in a listen-only
mode. Later on, we will have a question and answer session. (Operator
Instructions) As a reminder, today's conference is being recorded for replay
purposes.

And I would now like to turn the conference over to your host for today, Ms.
Alicia Rodriguez, Vice President of Investor Relations. Please go ahead, ma'am.

ALICIA RODRIGUEZ, VP - IR, AGILENT TECHNOLOGIES INC: Thank you, Keith, and
welcome, everyone, to Agilent's first quarter conference call for fiscal-year
2012. With me are Agilent's President and CEO, Bill Sullivan, as well as Senior
Vice President and CFO, Didier Hirsch. Joining in the Q&A after Didier's
comments will be Agilent's Chief Operating Officer, Ron Nersesian, and the
Presidents of our Electronic Measurement, Life Sciences, and Chemical Analysis
Groups -- Guy Sene, Nick Roelofs, and Mike McMullen.

You can find the press release and information to supplement today's discussion
on our website at www.investor.agilent.com. While there, please click on the
link for financial results, where you will find revenue breakouts and historical
financials for Agilent's operations. We will also post a copy of the prepared
remarks following this call. For any non-GAAP financial measures, you will find
the most directly comparable GAAP financial metrics and reconciliations on our
website.

We will make forward-looking statements about the financial performance of the
Company. These statements are subject to risks and uncertainties, and are only
valid as of today. The Company assumes no obligation to update them. Please look
at the Company's recent SEC filings for a more complete picture of our risks and
other factors.

Before turning the call over to Bill, I would like to remind you that Agilent
will host its annual analysts meeting in New York City on March 8. Details about
the meeting and webcast will be available on the Agilent investor relations
website two weeks prior.

And now, I'd like to turn the call over to Bill.

BILL SULLIVAN, PRESIDENT AND CEO, AGILENT TECHNOLOGIES INC: Thanks, Alicia, and
hello, everyone. Agilent's Q1 orders of $1.62 billion were flat versus last
year. Q1 revenues of $1.64 billion were up 7% year-over-year. Non-GAAP EPS was
$0.69 per share, and operating margin was 19%."""

And here is my code:

import re

# First approach:
r = re.compile(r"^([^a-z:]+?):([\s\S]+?)", flags=re.MULTILINE)
re.findall(r, example)

# Second approach:
r = re.compile(r"^([^a-z:]+?):([\s\S]+)", flags=re.MULTILINE)
re.findall(r, example)

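The problem with the first (non-greedy) approach is that it does not capture the full text of a speaker: the trailing lazy quantifier is already satisfied after a single character.

The problem with the second (greedy) approach is that it does not stop when the next speaker appears: it consumes everything up to the end of the document.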
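
Both failure modes can be reproduced on a tiny input. The snippet below is only an illustrative sketch: the `sample` string and the speaker JANE DOE are made up for the demonstration and are not part of the real transcript.

```python
import re

# Made-up two-speaker sample (an assumption, not from the real transcript).
sample = (
    "OPERATOR: Welcome to the call.\n"
    "Please hold.\n"
    "JANE DOE, CFO: Thank you."
)

# The two patterns from the question, unchanged.
lazy = re.compile(r"^([^a-z:]+?):([\s\S]+?)", flags=re.MULTILINE)
greedy = re.compile(r"^([^a-z:]+?):([\s\S]+)", flags=re.MULTILINE)

# Lazy: both speakers are found, but each text group stops after one
# character (the space that follows the colon).
lazy_result = lazy.findall(sample)

# Greedy: a single match whose text group swallows the second speaker too.
greedy_result = greedy.findall(sample)

print(lazy_result)
print(greedy_result)
```

Because nothing follows `[\s\S]+?` in the first pattern, the lazy quantifier has no reason to expand beyond its minimum of one character.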

EDIT: Additional information

  • The text group can also contain colons. In some cases a colon even occurs immediately after the first word of a line, e.g. "For\nexample: ..."
  • The speaker group can also span multiple lines, e.g. when the company name and position description are very long

Solution

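  • You could get your match without using [\s\S]+, because that matches any character, including newlines, and therefore runs past the next speaker.

    For the second capturing group, you could match .* and then use a repeating group with a negative lookahead, (?!\n[^a-z\r\n]+:), which keeps matching further lines as long as the next line does not start with a speaker label of the form [^a-z\r\n]+: (use the pattern with re.MULTILINE, so that ^ matches at the start of every line).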

    ^([^a-z\r\n]+):(.*(?:(?!\n[^a-z\r\n]+:)[\r\n].*)*)
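
As a rough sketch of how the pattern could be used with re.findall: the shortened `sample` transcript below is hypothetical, and stripping whitespace from both groups is an extra step assumed here rather than something the pattern does itself.

```python
import re

# Hypothetical shortened transcript, for illustration only.
sample = (
    "OPERATOR: Good day, and welcome.\n"
    "For example: this line continues the operator's text.\n"
    "\n"
    "ALICIA RODRIGUEZ, VP - IR: Thank you, Keith.\n"
    "With me is our CEO.\n"
    "\n"
    "BILL SULLIVAN, PRESIDENT AND CEO: Thanks, Alicia."
)

# re.MULTILINE lets ^ match at the start of every line, not just the string.
pattern = re.compile(
    r"^([^a-z\r\n]+):(.*(?:(?!\n[^a-z\r\n]+:)[\r\n].*)*)",
    flags=re.MULTILINE,
)

# findall returns one (speaker, text) tuple per match; strip() removes the
# space after the colon and the trailing newline before the next speaker.
pairs = [(speaker.strip(), text.strip()) for speaker, text in pattern.findall(sample)]

for speaker, text in pairs:
    print(repr(speaker), repr(text))
```

The line starting with "For example:" stays inside the operator's text because it contains lowercase letters before the colon, so the lookahead does not treat it as a new speaker. A continuation line whose text before the colon has no lowercase letters (for instance one starting with "Q1:") would still be taken for a speaker label, and a speaker label that wraps across several lines is not covered by this pattern.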
    
