python nlp sentence

Insert dots/points into a messy string for textual analysis in Python


I am given a long, messy string that lacks sentence structure, i.e., the string does not consistently contain dots/points.

As a result, I currently cannot break the long string down into sentences, which my textual analysis requires.

The following example shows what I am given and what I need as output.

example_string = "Football is the world's most popular sport Played on rectangular fields, two teams of eleven players each compete to score goals One of the most famous teams is Real Madrid."

output_string = "Football is the world's most popular sport. Played on rectangular fields, two teams of eleven players each compete to score goals. One of the most famous teams is Real Madrid."

My first idea was to insert a dot/point wherever one is missing between a lower-case word and a capitalized word. However, since certain words, and names in particular, also start with a capital letter, this would add dots/points in the wrong places (e.g., in the example, it would add a dot/point before "Real Madrid").
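To make the problem concrete, here is a minimal sketch of that capitalization heuristic, assuming a plain regex over single spaces. The function name `insert_periods_naive` is made up for this illustration; it is not from any library.

```python
import re

def insert_periods_naive(text: str) -> str:
    # Hypothetical helper: put ". " wherever a lower-case letter is followed
    # by a single space and then an upper-case letter.
    return re.sub(r"(?<=[a-z]) (?=[A-Z])", ". ", text)

example_string = (
    "Football is the world's most popular sport Played on rectangular fields, "
    "two teams of eleven players each compete to score goals One of the most "
    "famous teams is Real Madrid."
)

print(insert_periods_naive(example_string))
# The two real sentence boundaries are fixed, but the ending becomes
# "... famous teams is. Real. Madrid." because proper nouns are capitalized too.
```

The rule cannot tell a sentence start from a name, which is exactly why it breaks on "Real Madrid".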

Any help is appreciated. Thank you!


Solution

  • How about leveraging an LLM (via an API) for that?

    Quick test run with GPT-4:

    Prompt
    
        Separate the following string into sentences. List each sentence with a bullet point: "Football is the world's most popular sport Played on rectangular fields, two teams of eleven players each compete to score goals One of the most famous teams is Real Madrid."
    
    Output
    
        - Football is the world's most popular sport.
        - Played on rectangular fields, two teams of eleven players each compete to score goals.
        - One of the most famous teams is Real Madrid.
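
A rough sketch of how this could be wired into Python, under a few assumptions: the model call is passed in as a plain function (`complete`, a placeholder for whatever API client you use), and the reply is assumed to come back as one bullet per sentence, as in the test run above. The names `build_prompt`, `parse_bullets`, `restore_sentences` and `fake_complete` are invented for this example.

```python
from typing import Callable, List

def build_prompt(text: str) -> str:
    # Same wording as the prompt used in the test run above.
    return (
        "Separate the following string into sentences. "
        "List each sentence with a bullet point: "
        f'"{text}"'
    )

def parse_bullets(reply: str) -> List[str]:
    # Keep only lines that start with a bullet marker and strip the marker.
    sentences = []
    for line in reply.splitlines():
        line = line.strip()
        if line.startswith(("-", "*", "•")):
            sentences.append(line[1:].strip())
    return sentences

def restore_sentences(text: str, complete: Callable[[str], str]) -> str:
    # `complete` is an assumption: any function that sends a prompt to an LLM
    # and returns the reply text.
    reply = complete(build_prompt(text))
    return " ".join(parse_bullets(reply))

def fake_complete(prompt: str) -> str:
    # Stand-in for a real API call; returns the output shown above.
    return (
        "- Football is the world's most popular sport.\n"
        "- Played on rectangular fields, two teams of eleven players each "
        "compete to score goals.\n"
        "- One of the most famous teams is Real Madrid."
    )

example_string = (
    "Football is the world's most popular sport Played on rectangular fields, "
    "two teams of eleven players each compete to score goals One of the most "
    "famous teams is Real Madrid."
)
output_string = restore_sentences(example_string, fake_complete)
print(output_string)
```

Swapping `fake_complete` for a function that calls your provider's API is the only change needed. Since a model may rephrase or drop words, it is probably worth comparing the result against the input before relying on it.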