python regex text-segmentation citations

Python - How to extract sentences that contain a citation mark?


import re

text = "Trondheim is a small city with a university and 140000 inhabitants. Its central bus systems has 42 bus lines, serving 590 stations, with 1900 (departures per) day in average. T h a t gives approximately 60000 scheduled bus station passings per day, which is somehow represented in the route data base. The starting point is to automate the function (Garry Weber, 2005) of a route information agent."
print(re.findall(r"([^.]*?\(.+ [0-9]+\)[^.]*\.)", text))

I'm using the code above to extract the sentence that has a citation in it. As you can see, the final sentence contains the citation (Garry Weber, 2005).

But I got this result:

[' Its central bus systems has 42 bus lines, serving 590 stations, with 1900 (departures per) day in average. T h a t gives approximately 60000 scheduled bus station passings per day, which is somehow represented in the route data base. The starting point is to automate the function (Garry Weber, 2005) of a route information agent.']

The result should be only the sentence that contains the citation, like this:
The starting point is to automate the function (Garry Weber, 2005) of a route information agent.

I guess the problem is caused by the text inside parentheses: as you can see, the second sentence contains (departures per). Is there any fix for my code?


Solution

  • My attempt:

    \b[^.]+\([^()]+\b(\d{2}|\d{4})\s*\)[^.]*\.
    

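    It captures exactly the sentence that contains the citation and is stricter about the year than your pattern: it requires two or four digits right before the closing parenthesis, and [^()]+ keeps the match inside a single pair of parentheses, so (departures per) can no longer start the citation.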
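
A minimal sketch of how the pattern above might be used from Python 3. The sample text is a shortened copy of the one in the question, and the names CITATION_SENTENCE and sentences are only illustrative. One thing to watch: the pattern contains a capturing group for the year, so re.findall would return just the year; iterating with finditer and taking group(0) returns the whole sentence.

```python
import re

# Shortened version of the text from the question (two sentences removed).
text = (
    "Trondheim is a small city with a university and 140000 inhabitants. "
    "Its central bus systems has 42 bus lines, serving 590 stations, "
    "with 1900 (departures per) day in average. "
    "The starting point is to automate the function (Garry Weber, 2005) "
    "of a route information agent."
)

# Pattern from the answer, unchanged.
CITATION_SENTENCE = re.compile(r"\b[^.]+\([^()]+\b(\d{2}|\d{4})\s*\)[^.]*\.")

# group(0) is the full match, i.e. the whole sentence.
# findall() would return only the captured year group instead.
sentences = [m.group(0) for m in CITATION_SENTENCE.finditer(text)]
print(sentences)
# ['The starting point is to automate the function (Garry Weber, 2005) of a route information agent.']
```

Because the pattern treats every period as a sentence boundary, a sentence with an abbreviation or a decimal number outside the parentheses would probably be cut short; for such text a real sentence splitter followed by a simpler per-sentence check may work better.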