Tags: java, nlp, stanford-nlp, gate

Tools for text simplification (Java)


What is the best tool that can do text simplification using Java?

Here is an example of text simplification:

John, who was the CEO of a company, played golf.
                       ↓
John played golf. John was the CEO of a company.

Solution

  • I see your problem as the task of converting a complex or compound sentence into simple sentences. Based on the literature (Sentence Types), a simple sentence is built from one independent clause, while a compound or complex sentence is built from at least two clauses. Each clause must have a subject and a verb.
    So your task is to split the sentence into the clauses that form it.

    Dependency parsing from Stanford CoreNLP is well suited to splitting compound and complex sentences into simple sentences. You can try the online demo.
    For your sample sentence, the parser produces the following result in Stanford typed dependency (SD) notation:

    nsubj(CEO-6, John-1)
    nsubj(played-11, John-1)
    cop(CEO-6, was-4)
    det(CEO-6, the-5)
    rcmod(John-1, CEO-6)
    det(company-9, a-8)
    prep_of(CEO-6, company-9)
    root(ROOT-0, played-11)
    dobj(played-11, golf-12)
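
    If you copy the parser output as text (for example from the online demo), each line can be turned into a small structured object before any further processing. The following is only a minimal sketch: SdParser and Dep are hypothetical helper names, not part of Stanford CoreNLP, and the regular expression assumes nothing beyond the relation(head-index, dependent-index) shape shown above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not a Stanford CoreNLP class.
class SdParser {

    /** One typed dependency such as nsubj(CEO-6, John-1). */
    static class Dep {
        final String rel;    // relation name, e.g. "nsubj"
        final String head;   // head word, e.g. "CEO"
        final int headIdx;   // position of the head in the sentence, e.g. 6
        final String dep;    // dependent word, e.g. "John"
        final int depIdx;    // position of the dependent, e.g. 1

        Dep(String rel, String head, int headIdx, String dep, int depIdx) {
            this.rel = rel;
            this.head = head;
            this.headIdx = headIdx;
            this.dep = dep;
            this.depIdx = depIdx;
        }
    }

    // Matches: relation(headWord-headIndex, dependentWord-dependentIndex)
    private static final Pattern LINE =
            Pattern.compile("\\s*(\\w+)\\((\\S+)-(\\d+),\\s*(\\S+)-(\\d+)\\)\\s*");

    /** Parses a single SD line; throws if the line has a different shape. */
    static Dep parseLine(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.matches()) {
            throw new IllegalArgumentException("Not an SD line: " + line);
        }
        return new Dep(m.group(1),
                m.group(2), Integer.parseInt(m.group(3)),
                m.group(4), Integer.parseInt(m.group(5)));
    }

    /** Parses several SD lines separated by line breaks, skipping blank lines. */
    static List<Dep> parse(String text) {
        List<Dep> deps = new ArrayList<>();
        for (String line : text.split("\\R")) {
            if (!line.trim().isEmpty()) {
                deps.add(parseLine(line));
            }
        }
        return deps;
    }
}
```

    For example, SdParser.parseLine("prep_of(CEO-6, company-9)") yields a Dep with rel "prep_of", head "CEO" at index 6, and dependent "company" at index 9.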

    A clause can be identified from a relation (in SD) whose category is subject, e.g. nsubj or nsubjpass. See the Stanford Dependency Manual.
    A basic clause can be extracted by taking the head as the verb (predicate) part and the dependent as the subject part. From the SD above, there are two basic clauses:

    • John CEO
    • John played
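
    The extraction of basic clauses can be sketched in a few lines of Java. This assumes the dependencies are already available as (relation, head, dependent) triples; the BasicClauses and Dep names are hypothetical, and only the two subject relations mentioned above are checked.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper for illustration; not a Stanford CoreNLP class.
class BasicClauses {

    /** A typed dependency; head and dep keep their index suffix, e.g. "CEO-6". */
    static class Dep {
        final String rel, head, dep;

        Dep(String rel, String head, String dep) {
            this.rel = rel;
            this.head = head;
            this.dep = dep;
        }
    }

    // Subject relations named in the answer; extend this set if you need more.
    static final Set<String> SUBJECT_RELS =
            new HashSet<>(Arrays.asList("nsubj", "nsubjpass"));

    /** Returns one "subject predicate" pair per subject relation, in input order. */
    static List<String> extract(List<Dep> deps) {
        List<String> clauses = new ArrayList<>();
        for (Dep d : deps) {
            if (SUBJECT_RELS.contains(d.rel)) {
                // dependent = subject part, head = verb (predicate) part
                clauses.add(stripIndex(d.dep) + " " + stripIndex(d.head));
            }
        }
        return clauses;
    }

    /** "John-1" -> "John" */
    static String stripIndex(String token) {
        return token.substring(0, token.lastIndexOf('-'));
    }
}
```

    Applied to the nine dependencies above, extract returns "John CEO" and "John played".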

    Once you have the basic clauses, you can add the remaining parts to turn each clause into a complete, meaningful sentence. To do so, consult the Stanford Dependency Manual.

    By the way, your question might be related to "Finding meaningful sub-sentences from a sentence".


    Answer to 3rd comment:

    Once you have a subject-verb pair, e.g. nsubj(CEO-6, John-1), collect all dependencies linked to that dependency, except any dependency whose category is subject, then extract the unique words from these dependencies.

    Taking nsubj(CEO-6, John-1) as the example, if you start traversing from John-1 you'll find nsubj(played-11, John-1), but you should ignore it since its category is subject.

    The next step is traversing from CEO-6. You'll get

    cop(CEO-6, was-4)
    det(CEO-6, the-5)
    rcmod(John-1, CEO-6)
    prep_of(CEO-6, company-9)

    From the result above, you have new words to traverse (i.e. find other dependencies that have was-4, the-5, or company-9 as either head or dependent).
    Now your dependencies are

    cop(CEO-6, was-4)
    det(CEO-6, the-5)
    rcmod(John-1, CEO-6)
    prep_of(CEO-6, company-9)
    det(company-9, a-8)

    At this point, you've finished traversing all dependencies linked to nsubj(CEO-6, John-1). Next, extract the words from every head and dependent, then arrange them in ascending order by the number appended to each word. This number indicates the word's position in the original sentence.

    John was the CEO a company

    Our new sentence is missing one part, i.e. "of". This part is hidden in prep_of(CEO-6, company-9). The Stanford Dependency Manual describes two kinds of SD, collapsed and non-collapsed. Please read about them to understand why this "of" is hidden and how to recover its position in the word order.
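
    In the collapsed representation, the preposition is folded into the relation name (prep_of), so one way to bring it back is to read it from that name. The sketch below is a heuristic of my own rather than something taken from the manual: it emits the preposition before the first clause word that comes after the head and no later than the dependent. CollapsedPrep and restore are hypothetical names, and the sketch only handles a single-word prep_X relation whose head precedes its dependent.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper for illustration; not a Stanford CoreNLP class.
class CollapsedPrep {

    /**
     * words:   clause words keyed by their position in the original sentence
     * rel:     relation name, e.g. "prep_of"
     * headIdx: position of the relation's head (CEO-6 -> 6)
     * depIdx:  position of the relation's dependent (company-9 -> 9)
     */
    static String restore(TreeMap<Integer, String> words, String rel,
                          int headIdx, int depIdx) {
        if (!rel.startsWith("prep_")) {
            // Nothing is hidden in this relation; keep the words as they are.
            return String.join(" ", words.values());
        }
        String prep = rel.substring("prep_".length()); // "prep_of" -> "of"

        List<String> out = new ArrayList<>();
        boolean inserted = false;
        for (Map.Entry<Integer, String> e : words.entrySet()) {
            int idx = e.getKey();
            // Assumption: the preposition sits directly before the first
            // clause word located between the head and the dependent.
            if (!inserted && idx > headIdx && idx <= depIdx) {
                out.add(prep);
                inserted = true;
            }
            out.add(e.getValue());
        }
        return String.join(" ", out);
    }
}
```

    With the words John-1, was-4, the-5, CEO-6, a-8, company-9 and the relation prep_of(CEO-6, company-9), restore returns "John was the CEO of a company".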

    With the same approach, you'll get the second sentence

    John played golf
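
    The traversal described in this answer can be sketched in Java as follows. Treat it as an illustration under stated assumptions, not a definitive implementation: ClauseBuilder and Dep are hypothetical names, and besides ignoring subject relations as described, the sketch also skips the root relation and never expands from the subject word itself. Those two extra rules are my assumptions; without them, starting from nsubj(played-11, John-1) would follow rcmod(John-1, CEO-6) and pull the relative clause into the second sentence.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical helper for illustration; not a Stanford CoreNLP class.
class ClauseBuilder {

    /** A typed dependency; head and dep keep their index suffix, e.g. "CEO-6". */
    static class Dep {
        final String rel, head, dep;

        Dep(String rel, String head, String dep) {
            this.rel = rel;
            this.head = head;
            this.dep = dep;
        }
    }

    static boolean isSubject(Dep d) {
        return d.rel.equals("nsubj") || d.rel.equals("nsubjpass");
    }

    static int index(String token) {   // "CEO-6" -> 6
        return Integer.parseInt(token.substring(token.lastIndexOf('-') + 1));
    }

    static String word(String token) { // "CEO-6" -> "CEO"
        return token.substring(0, token.lastIndexOf('-'));
    }

    /** Builds the clause for one subject relation (the seed). */
    static String build(Dep seed, List<Dep> all) {
        // TreeMap keeps the words sorted by their original position.
        TreeMap<Integer, String> words = new TreeMap<>();
        words.put(index(seed.dep), word(seed.dep));
        words.put(index(seed.head), word(seed.head));

        // Assumption: expand only from the head; the subject is marked as
        // visited so that it is never used as a starting point.
        Set<String> visited = new HashSet<>(Arrays.asList(seed.head, seed.dep));
        Deque<String> toVisit = new ArrayDeque<>();
        toVisit.add(seed.head);

        while (!toVisit.isEmpty()) {
            String current = toVisit.poll();
            for (Dep d : all) {
                // Ignore subject relations (per the answer) and root (assumption).
                if (isSubject(d) || d.rel.equals("root")) continue;
                String other;
                if (d.head.equals(current)) other = d.dep;
                else if (d.dep.equals(current)) other = d.head;
                else continue;
                if (visited.add(other)) {
                    words.put(index(other), word(other));
                    toVisit.add(other);
                }
            }
        }
        return String.join(" ", words.values());
    }
}
```

    Given the nine dependencies from the example, build returns "John was the CEO a company" for nsubj(CEO-6, John-1) and "John played golf" for nsubj(played-11, John-1). The word hidden in prep_of is still missing at this stage, as explained above.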