
grep full sentences containing a word into a document


Given a word, I would like to extract every full sentence that contains it (everything from one "." to the next ".") into a document. For example, given this text:

Dijkstra's original algorithm does not use a min-priority queue. For a given source vertex (node) in the graph, the algorithm finds the path with lowest cost (i.e. the shortest path) between that vertex and every other vertex. It can also be used for finding costs of shortest paths from a single vertex to a single destination vertex by stopping the algorithm once the shortest path to the destination vertex has been determined.

I would like to get the entire sentence that contains "graph":

For a given source vertex (node) in the graph, the algorithm finds the path with lowest cost (i.e. the shortest path) between that vertex and every other vertex.

It would also be useful if the results included the first sentence of the text when it contains "graph", even though there is no dot before it.


Solution

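  • Assuming the text file dijk doesn't actually contain any newlines (the -n switch processes the input one line at a time, so each sentence has to sit on a single line), you could do this in Perl: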

    perl -MLingua::EN::Sentence=get_sentences -ne '
    print "$_\n" for grep { /graph/ } @{get_sentences($_)}' dijk
    

    The Lingua::EN::Sentence module (available from CPAN) is smart enough to deal with well-known abbreviations such as "i.e.", and you can add your own if necessary (for example with its add_acronyms function).

    Output:

    For a given source vertex (node) in the graph, the algorithm finds the path with lowest cost (i.e. the shortest path) between that vertex and every other vertex.
    

    If the newlines do actually exist in the input, it should be possible to adapt the script without too much difficulty.


    edit

    If there are newlines in the input, you could do this instead:

    perl -MLingua::EN::Sentence=get_sentences -0777 -e '
    $t = <>;         # slurp the whole file (-0777 is slurp mode; -00 would stop at the first blank line)
    $t =~ tr{\n}{ }; # convert newlines to spaces
    print "$_\n" for grep { /graph/ } @{get_sentences($t)}' dijk
    

    Of course, by now this is looking a lot more like a full-blown Perl script than a one-liner!

    Alternatively, as mentioned by @mklement0, you could use the external tool tr to perform the translation and pass the result to the original script through process substitution (supported by bash, zsh and ksh, but not by plain POSIX sh):

    perl -MLingua::EN::Sentence=get_sentences -ne '
    print "$_\n" for grep { /graph/ } @{get_sentences($_)}' <(tr '\n' ' ' < dijk)
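
If the module is not available, a rough fallback can be put together from tr, grep and sed alone. This is only a minimal sketch and not a replacement for the Perl approach: it assumes every "." ends a sentence, so an abbreviation such as "i.e." cuts the sentence short. The helper name sentences_with and the file names sample.txt and graph_sentences.txt are made up for illustration, and grep -o is a GNU/BSD extension rather than POSIX. The word is interpolated into the pattern as-is, so it should not contain regex metacharacters.

```shell
#!/usr/bin/env bash

# Hypothetical helper: print every "sentence" of FILE that contains WORD.
# Naive assumption: a sentence is any run of non-dot characters ending in ".".
sentences_with() {
    local word=$1 file=$2
    tr '\n' ' ' < "$file" |               # join lines so a sentence may span them
        grep -o "[^.]*${word}[^.]*\." |   # dot-free text around the word, up to the next dot
        sed 's/^ *//'                     # trim the space left over after the previous dot
}

# Made-up sample input without abbreviations, spread over two lines
printf '%s\n' 'A tree is a graph. It has' 'no cycles. Every graph has vertices.' > sample.txt

# Write the matching sentences into a document, then show it
sentences_with graph sample.txt > graph_sentences.txt
cat graph_sentences.txt
# A tree is a graph.
# Every graph has vertices.
```

Because [^.]* can match from the very start of the input, the first sentence is included even though no dot precedes it. On the Dijkstra text from the question, however, this sketch would stop at "(i." instead of returning the whole sentence, which is exactly the case Lingua::EN::Sentence handles.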