Remove a line from list and all successive lines to N?


I have a list in R, which is a set of lines from a relatively unstructured document that I am scraping for data. At the top of each page is a page number, preceded by the string "Page", and several lines of header information which I would like to drop.

Each document has a different number of header lines. My solution so far:

RawFeed.1<- grep("Page",RawFeed)
RawFeed.1a<-length(RawFeed.1)
RawFeed.1<-RawFeed.1[-1]

Note that the first instance is dropped here because the first page always has more header lines than the rest of the pages, and it's dropped later anyway.

y<-RawFeed.1[1]
ya<-c(y:length(RawFeed))

NSearch<-RawFeed[ya]
NSearch.1<-grep("Start", NSearch)
y1<-NSearch.1[1]
y1<-y1-1

y2<-c(0:y1)

As 'Start' is always found on the line before the data begins, this consistently gives me the document-specific number of header lines.

Next I attempt to remove them by:

PageBreak <-function(x, y) {
RawFeed<-RawFeed[-x-y]
}

RawFeedTemp<-lapply(RawFeed.1,PageBreak,y=y2)

Which does work, sort of: I am left with a list such that RawFeedTemp[[n]] has the header information removed only for page n.

So how can I perform a similar operation that leaves me with a single list in which every page's header information has been removed? Or is there a way to combine the elements of RawFeedTemp so that it contains only one set of lines, excluding those I am trying to remove?

Edit: An example of the data

[306] N 46 10/08/12 10/08/12  Stuff :30 NM 0 $0.00" 
[307] Week: 10/08/12 10/14/12 Other Stuff $6,500.00 0.00
[308] " Contract Agreement Between: Print Date 10/05/12 Page 5 of 6"                                                                                                                                                                  
[309] ""                                                                                                                                                                                                                              
[310] ""                                                                                                                                                                                                                              
[311] " Contract / Revision Alt Order #"                                                                                                                                                                                              
[312] " Person                                                                                                                                                                                                                
[313] " Address 1                                                                                                                                                                                                          
[314] " Address 2                                                                                                                                                                                                            
[315] " Address 3                                                                                                                                                                  
[316] " Address 4                                                                                                                                                                   
[317] ""                                                                                                                                                                                                                              
[318] " Original Date / Revision"                                                                                                                                                                                          
[319] ""                                                                                                                                                                                                                 
[320] "08/10/12 / 10/04/12"                                                                                                                                                                                        
[321] ""                                                                                                                                                                                                                              
[322] ""                                                                                                                                                                                                                              
[323] ""                                                                                                                                                                                                                        
[324] "* Line Ch Start Date End Date Description Start
[325] MORE DATA

Another file might have a different number of these header lines. Also note that records occupy more than one line. Most files finish a record before starting a new page, but a few insist on pushing the second line of the record to a new page, which is why I need to remove them all.

Thanks for your help!


Solution

  • Since you don't give a clear example of your data, I am not sure this solution fits your case.

    If I understand correctly, you have a document with parts (headers) between 'Page' and 'start' that you want to remove. Here is a sample of your data with 2 headers:

    str <- 'Page ......        ### header1 
    alalalala
    lalalalalal
    aalalala
    lslslsls start ksksksks
    keep me 1
    keep me 2
    Page ......               ### header 2
    aalalala
    lslslsls start ksksksks
    keep me 3
    keep me 4'
    

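    Here I use readLines to read the document, find the header boundary lines using grep, and remove the combined line indices from the vector of lines. This assumes the 'Page' and 'start' lines strictly alternate, so that the matches pair up into the rows of a two-column matrix.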

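    ll <- readLines(textConnection(str))
    # one row per header: column 1 = 'Page' line, column 2 = 'start' line
    ids <- matrix(grep('Page|start',ll),ncol=2,byrow=TRUE)
    # drop every line from each 'Page' line through its 'start' line
    ll[-unlist(apply(ids,1,function(x)seq(x[1],x[2])))]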
    
    [1] "keep me 1" "keep me 2" "keep me 3" "keep me 4"
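
The same idea can be written so that it does not depend on the 'Page' and 'Start' matches pairing up exactly. The following is only a sketch under some assumptions: `drop_headers` is a hypothetical helper name, the default patterns are guesses based on the sample in the question, and each header is taken to run from a line matching the first pattern through the next line matching the second one.

```r
# Hypothetical helper: drop every header block that runs from a line
# matching `start_pat` through the next line matching `end_pat` (inclusive).
drop_headers <- function(lines, start_pat = "Page", end_pat = "Start") {
  starts <- grep(start_pat, lines)
  ends   <- grep(end_pat, lines)
  if (length(starts) == 0 || length(ends) == 0) return(lines)

  # For each header start, find the first end line at or after it.
  drop <- unlist(lapply(starts, function(s) {
    e <- ends[ends >= s]
    if (length(e) == 0) return(integer(0))  # unterminated header: keep it
    seq(s, e[1])
  }))

  # lines[-integer(0)] would return an empty vector, so guard against it.
  if (length(drop) == 0) return(lines)
  lines[-unique(drop)]
}

# Made-up input shaped like the sample in the question.
feed <- c("Page 1 of 2", "address", "* Line Ch Start Date",
          "record 1", "record 2",
          "Page 2 of 2", "address", "* Line Ch Start Date",
          "record 3")
drop_headers(feed)
```

Because every page's indices are collected before subsetting, the result is a single vector with all headers removed, rather than one partially cleaned copy per page. Both patterns are case-sensitive, and any data line that happens to match the first pattern would be treated as the start of a header, so a tighter pattern such as "Page [0-9]+ of [0-9]+" may be safer on real files.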