Regex between two specific patterns including newline


I have a text file with the following pattern:

Prof. Imperdiet montes, metus elementum eleifend eget eget adipiscing augue.
Abstract title: Lorem ipsum dolor sit amet, consectetuer adipiscing

A, nec, quam eleifend quis, magnis sit pretium. leo augue. amet, elit. vel
Vel, dis eget nascetur justo. imperdiet consequat et sit Nam Aenean a, Quisque
Enim. a, dui. Aenean lorem Phasellus commodo quis, pretium ultricies nascetur
tincidunt. sem. vitae,
montes, tellus. amet, venenatis natoque enim. fringilla
quis, vitae, Aenean Etiam viverra ipsum dapibus ut elementum Aenean Lorem eget,
nisi mollis Curabitur Quisque Aenean rhoncus sociis justo, sem. justo, vel
Aenean ultricies nec, eu laoreet.

Dr. Enim. vitae, feugiat in, Aenean
Abstract title: Massa. sociis dis dapibus dolor semper ipsum
jalor

Semper tincidunt. ullamcorper commodo magnis viverra pede elit. eget aliquet
eleifend vel, eleifend feugiat pede Vivamus ridiculus vitae, a, ligula, et Nulla
ligula vulputate ac, nisi. enim dapibus. Donec metus In sit dolor Nam ultricies
imperdiet. pellentesque Cras eu, massa quis porttitor parturient varius ut,
Phasellus arcu. pretium. quam augue. eu, adipiscing felis, enim. ante,
vulputate Integer dui. ultricies a, dictum rutrum. Nullam nec, quis,
consequat Cum tellus. dis felis dolor. nulla Aliquam Donec massa. justo. in,
nascetur
Semper tincidunt. ullamcorper commodo magnis viverra pede elit. eget aliquet
eleifend vel, eleifend feugiat pede Vivamus ridiculus vitae, a, ligula, et Nulla


Dr. Justo. nisi elementum ante, Donec Aenean Nulla
Abstract title:

Aenean consectetuer leo penatibus eget imperdiet nisi. consequat
lorem pretium mus. 

Prof. Dr. Aliquam metus semper
Abstract title: Aliquet augue. amet, enim ut justo, nec, eleifend lorem enim. nisi. ipsum
eleifend
More information will be available soon.

I want to extract these parts:

Abstract title: Lorem ipsum dolor sit amet, consectetuer adipiscing

Abstract title: Massa. sociis dis dapibus dolor semper ipsum jalor

Abstract title:

and

Abstract title: Aliquet augue. amet, enim ut justo, nec, eleifend lorem enim. nisi. ipsum eleifend More information will be available soon.

I found a few resources helpful, but (?<=(Abstract title:))(.*)(?=\n{2}) returns only

Abstract title: Lorem ipsum dolor sit amet, consectetuer adipiscing

and

Abstract title:

Also, I am not sure which software tool would be most efficient. Please forgive me if this is a noob question; I am open to suggestions.


Solution

  • In R, you can extract the matches and normalize every run of whitespace inside them to a single regular space:

    x <- "Prof. Imperdiet montes, metus elementum eleifend eget eget adipiscing augue.\nAbstract title: Lorem ipsum dolor sit amet, consectetuer adipiscing\n\nA, nec, quam eleifend quis, magnis sit pretium. leo augue. amet, elit. vel\n\nVel, dis eget nascetur justo. imperdiet consequat et sit Nam Aenean a, Quisque\nEnim. a, dui. Aenean lorem Phasellus commodo quis, pretium ultricies nascetur\ntincidunt. sem. vitae,\nmontes, tellus. amet, venenatis natoque enim. fringilla\nquis, vitae, Aenean Etiam viverra ipsum dapibus ut elementum Aenean Lorem eget,\nnisi mollis Curabitur Quisque Aenean rhoncus sociis justo, sem. justo, vel\nAenean ultricies nec, eu laoreet.\n\nDr. Enim. vitae, feugiat in, Aenean\nAbstract title: Massa. sociis dis dapibus dolor semper ipsum\njalor\n\nSemper tincidunt. ullamcorper commodo magnis viverra pede elit. eget aliquet\neleifend vel, eleifend feugiat pede Vivamus ridiculus vitae, a, ligula, et Nulla\nligula vulputate ac, nisi. enim dapibus. Donec metus In sit dolor Nam ultricies\nimperdiet. pellentesque Cras eu, massa quis porttitor parturient varius ut,\nPhasellus arcu. pretium. quam augue. eu, adipiscing felis, enim. ante,\nvulputate Integer dui. ultricies a, dictum rutrum. Nullam nec, quis,\nconsequat Cum tellus. dis felis dolor. nulla Aliquam Donec massa. justo. in,\nnascetur\nSemper tincidunt. ullamcorper commodo magnis viverra pede elit. eget aliquet\neleifend vel, eleifend feugiat pede Vivamus ridiculus vitae, a, ligula, et Nulla\n\n\nDr. Justo. nisi elementum ante, Donec Aenean Nulla\nAbstract title:\n\nAenean consectetuer leo penatibus eget imperdiet nisi. consequat\nlorem pretium mus. \n\nProf. Dr. Aliquam metus semper\nAbstract title: Aliquet augue. amet, enim ut justo, nec, eleifend lorem enim. nisi. ipsum\neleifend\nMore information will be available soon.\n"
    library(stringr)
    pattern <- "(?<=Abstract title:).*(?:\n(?!\n).*)*"
    results <- lapply(str_extract_all(x, pattern), function(z) trimws(gsub("\\s+", " ", z)))
    

    The results will look like

    [[1]]
    [1] "Lorem ipsum dolor sit amet, consectetuer adipiscing"                                                                        
    [2] "Massa. sociis dis dapibus dolor semper ipsum jalor"                                                                         
    [3] ""                                                                                                                           
    [4] "Aliquet augue. amet, enim ut justo, nec, eleifend lorem enim. nisi. ipsum eleifend More information will be available soon."
    

    Regex details:

    • (?<=Abstract title:) - a positive lookbehind that matches a position immediately preceded by Abstract title:
    • .* - zero or more characters other than line break characters, as many as possible
    • (?:\n(?!\n).*)* - zero or more sequences of
      • \n(?!\n) - a line feed character not immediately followed by another line feed character
      • .* - zero or more characters other than line break characters, as many as possible

    The lapply(..., function(z) trimws(gsub("\\s+", " ", z))) call collapses every run of whitespace in each match, including the line breaks inside multi-line titles, to a single space and trims both ends.
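
As a quick check of the pattern described above, here is a minimal sketch in base R (no stringr), assuming the PCRE engine selected with perl = TRUE treats the lookarounds the same way. The sample string txt and the variable names are made up for illustration:

```r
# Small made-up sample: one two-line title and one empty title.
txt <- "Dr. A\nAbstract title: First line\nsecond line\n\nBody text.\n\nDr. B\nAbstract title:\n\nMore body."

# Same pattern as above; perl = TRUE is needed for the lookbehind/lookahead.
pattern <- "(?<=Abstract title:).*(?:\n(?!\n).*)*"

# gregexpr() finds all matches (including the zero-length one),
# regmatches() extracts the matched substrings.
raw <- regmatches(txt, gregexpr(pattern, txt, perl = TRUE))[[1]]

# Collapse whitespace runs (including newlines) and trim both ends.
titles <- trimws(gsub("\\s+", " ", raw))
print(titles)
```

The second element comes back empty because that "Abstract title:" line is followed directly by a blank line.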

    Parsing the text file into two columns

    You can use

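    library(readr)
    library(stringr)
    # path: location of the input text file
    file <- read_lines(path)
    file_string <- paste(file, collapse="\n")
    # Group 1: the line before "Abstract title:"; group 2: the title up to the next blank line
    pattern <- "(?m)^(.+)\n(Abstract title:.*(?:\n(?!\n).*)*)"
    res <- str_match_all(file_string, pattern)
    # Drop the full-match column (drop = FALSE keeps a matrix even with one match),
    # then collapse whitespace runs in each cell
    res <- lapply(res, function(z) trimws(gsub("\\s+", " ", z[, -1, drop = FALSE])))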
    

    The output is

    [[1]]
         [,1]                                                                           [,2]                                                                                                                                         
    [1,] "Prof. Imperdiet montes, metus elementum eleifend eget eget adipiscing augue." "Abstract title: Lorem ipsum dolor sit amet, consectetuer adipiscing"                                                                        
    [2,] "Dr. Enim. vitae, feugiat in, Aenean"                                          "Abstract title: Massa. sociis dis dapibus dolor semper ipsum jalor"                                                                         
    [3,] "Dr. Justo. nisi elementum ante, Donec Aenean Nulla"                           "Abstract title:"                                                                                                                            
    [4,] "Prof. Dr. Aliquam metus semper"                                               "Abstract title: Aliquet augue. amet, enim ut justo, nec, eleifend lorem enim. nisi. ipsum eleifend More information will be available soon."
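
If a data frame is more convenient than a matrix, one possible follow-up is sketched below. The matrix m copies two rows of the output above by hand, and the column names speaker and title are hypothetical, not part of the original answer:

```r
# Two rows copied by hand from the matrix output above (filled column-wise).
m <- matrix(c("Dr. Enim. vitae, feugiat in, Aenean",
              "Dr. Justo. nisi elementum ante, Donec Aenean Nulla",
              "Abstract title: Massa. sociis dis dapibus dolor semper ipsum jalor",
              "Abstract title:"),
            ncol = 2)

# Hypothetical column names; the "Abstract title:" prefix is stripped from column 2.
df <- data.frame(speaker = m[, 1],
                 title = trimws(sub("^Abstract title:", "", m[, 2])),
                 stringsAsFactors = FALSE)
print(df)
```

An entry with no title ends up as an empty string in the title column, which makes it easy to filter with df[df$title != "", ].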