python-3.x, google-translate

Cut a long string into paragraphs containing full sentences


I have a task: translate a very long text (more than 50k characters) with an online translation API (Google, Yandex, etc.). All of them limit the request length, so I want to cut my text into a list of strings, each shorter than that limit, while keeping every sentence intact.

For example, if I want to process this text with a limit of 300 characters:

The Stanford NLP Group makes some of our Natural Language Processing software available to everyone! We provide statistical NLP, deep learning NLP, and rule-based NLP tools for major computational linguistics problems, which can be incorporated into applications with human language technology needs.These packages are widely used in industry, academia, and government. This code is actively being developed, and we try to answer questions and fix bugs on a best-effort basis. All our supported software distributions are written in Java. Current versions of our software from October 2014 forward require Java 8+. (Versions from March 2013 to September 2014 required Java 1.6+; versions from 2005 to Feb 2013 required Java 1.5+. The Stanford Parser was first written in Java 1.1.) Distribution packages include components for command-line invocation, jar files, a Java API, and source code. You can also find us on GitHub and Maven. A number of helpful people have extended our work, with bindings or translations for other languages. As a result, much of this software can also easily be used from Python (or Jython), Ruby, Perl, Javascript, F#, and other .NET and JVM languages.

I should get this output:

['The Stanford NLP Group makes some of our Natural Language Processing software available to everyone! We provide statistical NLP, deep learning NLP, and rule-based NLP tools for major computational linguistics problems, which can be incorporated into applications with human language technology needs.', 
'These packages are widely used in industry, academia, and government. This code is actively being developed, and we try to answer questions and fix bugs on a best-effort basis. All our supported software distributions are written in Java.', 
'Current versions of our software from October 2014 forward require Java 8+. (Versions from March 2013 to September 2014 required Java 1.6+; versions from 2005 to Feb 2013 required Java 1.5+. The Stanford Parser was first written in Java 1.1.)', 
'Distribution packages include components for command-line invocation, jar files, a Java API, and source code. You can also find us on GitHub and Maven. A number of helpful people have extended our work, with bindings or translations for other languages.', 
'As a result, much of this software can also easily be used from Python (or Jython), Ruby, Perl, Javascript, F#, and other .NET and JVM languages.']  

What's the most Pythonic way to do this? Is there a regex that can achieve it?


Solution

  • Regex is not the right tool for parsing sentences out of paragraphs; you should look at NLTK:

    import nltk
    
    # this line only needs to be run once per environment:
    nltk.download('punkt') 
    
    text = """The Stanford NLP Group makes some of our Natural Language Processing software available to everyone! We provide statistical NLP, deep learning NLP, and rule-based NLP tools for major computational linguistics problems, which can be incorporated into applications with human language technology needs.These packages are widely used in industry, academia, and government. This code is actively being developed, and we try to answer questions and fix bugs on a best-effort basis. All our supported software distributions are written in Java. Current versions of our software from October 2014 forward require Java 8+. (Versions from March 2013 to September 2014 required Java 1.6+; versions from 2005 to Feb 2013 required Java 1.5+. The Stanford Parser was first written in Java 1.1.) Distribution packages include components for command-line invocation, jar files, a Java API, and source code. You can also find us on GitHub and Maven. A number of helpful people have extended our work, with bindings or translations for other languages. As a result, much of this software can also easily be used from Python (or Jython), Ruby, Perl, Javascript, F#, and other .NET and JVM languages."""
    
    sents = nltk.sent_tokenize(text)
    
    sents
    # outputs:
    ['The Stanford NLP Group makes some of our Natural Language Processing software available to everyone!',
     'We provide statistical NLP, deep learning NLP, and rule-based NLP tools for major computational linguistics problems, which can be incorporated into applications with human language technology needs.These packages are widely used in industry, academia, and government.',
     'This code is actively being developed, and we try to answer questions and fix bugs on a best-effort basis.',
     'All our supported software distributions are written in Java.',
     'Current versions of our software from October 2014 forward require Java 8+.',
     '(Versions from March 2013 to September 2014 required Java 1.6+; versions from 2005 to Feb 2013 required Java 1.5+.',
     'The Stanford Parser was first written in Java 1.1.)',
     'Distribution packages include components for command-line invocation, jar files, a Java API, and source code.',
     'You can also find us on GitHub and Maven.',
     'A number of helpful people have extended our work, with bindings or translations for other languages.',
     'As a result, much of this software can also easily be used from Python (or Jython), Ruby, Perl, Javascript, F#, and other .NET and JVM languages.']
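
    Note that the tokenizer left "needs.These" fused into one sentence, because the source text is missing a space after the period; the expected output in the question splits that pair. A small preprocessing step can repair it before tokenizing. This is my addition, not part of the original answer, and the rest of the answer keeps the unpreprocessed sents:

    import re

    # Repair a missing space where sentence punctuation is glued to the
    # next sentence's capital letter, e.g. "needs.These" -> "needs. These".
    # Requiring a lowercase letter before the mark leaves abbreviations
    # such as "U.S." alone.
    fixed = re.sub(r'([a-z][.!?])([A-Z])', r'\1 \2', text)

    # nltk.sent_tokenize(fixed) should then yield the two sentences separately.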
    

    One way to aggregate sentences based on cumulative length is to use a generator function:

    Here, the function g yields a joined chunk as soon as adding the next sentence would push the cumulative length past 300 characters, and yields whatever remains once the iterable is exhausted. It assumes that no single sentence exceeds the 300-character limit.

    def g(sents):
        idx = 0          # index of the first sentence in the current chunk
        text_length = 0  # cumulative length of the sentences collected so far
        for i, s in enumerate(sents):
            if text_length + len(s) > 300:
                # adding s would exceed the limit: emit the chunk before it
                yield ' '.join(sents[idx:i])
                text_length = len(s)
                idx = i
            else:
                text_length += len(s)
        # emit whatever is left after the last cut
        yield ' '.join(sents[idx:])
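
    One caveat: text_length counts only the sentences themselves, not the single spaces that ' '.join inserts between them, so a chunk can in principle come out a few characters over the limit. If the API enforces the limit strictly, a variant that counts the separators too could look like this (my sketch, not part of the original answer):

    def g_strict(sents, limit=300):
        chunk = []   # sentences collected for the current segment
        length = 0   # exact length of ' '.join(chunk)
        for s in sents:
            # +1 for the joining space, except before the first sentence
            extra = len(s) + (1 if chunk else 0)
            if chunk and length + extra > limit:
                yield ' '.join(chunk)
                chunk, length = [s], len(s)
            else:
                chunk.append(s)
                length += extra
        if chunk:
            yield ' '.join(chunk)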
    

    The original aggregator g can be called like this:

    for s in g(sents):
        print(s)
    # outputs:
    The Stanford NLP Group makes some of our Natural Language Processing software available to everyone!
    We provide statistical NLP, deep learning NLP, and rule-based NLP tools for major computational linguistics problems, which can be incorporated into applications with human language technology needs.These packages are widely used in industry, academia, and government.
    This code is actively being developed, and we try to answer questions and fix bugs on a best-effort basis. All our supported software distributions are written in Java. Current versions of our software from October 2014 forward require Java 8+.
    (Versions from March 2013 to September 2014 required Java 1.6+; versions from 2005 to Feb 2013 required Java 1.5+. The Stanford Parser was first written in Java 1.1.) Distribution packages include components for command-line invocation, jar files, a Java API, and source code.
    You can also find us on GitHub and Maven. A number of helpful people have extended our work, with bindings or translations for other languages. As a result, much of this software can also easily be used from Python (or Jython), Ruby, Perl, Javascript, F#, and other .NET and JVM languages.
    

    Examining the length of each text segment shows that all the segments have fewer than 300 characters:

    [len(s) for s in g(sents)]
    # outputs:
    [100, 268, 244, 276, 289]
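
    Finally, each chunk can be sent to the translation service separately and the results rejoined. The question doesn't name a concrete client, so translate_chunk below is a hypothetical placeholder; substitute whatever API call you actually use:

    # translate_chunk is a hypothetical stand-in for a real API call
    # (Google, Yandex, ...); replace its body with your client's request.
    def translate_chunk(chunk):
        ...

    translated = ' '.join(translate_chunk(chunk) for chunk in g(sents))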