I have a list in R, which is a set of lines from a relatively unstructured document that I am scraping for data. At the top of each page is a page number, preceded by the string "Page", and several lines of header information which I would like to drop.
Each document has a different number of header lines. My solution so far:
RawFeed.1<- grep("Page",RawFeed)
RawFeed.1a<-length(RawFeed.1)
RawFeed.1<-RawFeed.1[-1]
Note that the first instance is dropped here because the first page always has more header lines than the rest of the pages, and it's dropped later anyway.
y<-RawFeed.1[1]
ya<-c(y:length(RawFeed))
NSearch<-RawFeed[ya]
NSearch.1<-grep("Start", NSearch)
y1<-NSearch.1[1]
y1<-y1-1
y2<-c(0:y1)
As 'Start' is always found on the line before the data begins, this consistently gives me the document-specific number of header lines.
Next I attempt to remove them by:
PageBreak <- function(x, y) {
  # x: line index of a "Page" match; y: header offsets to drop from that line on
  RawFeed[-(x + y)]
}
RawFeedTemp<-lapply(RawFeed.1,PageBreak,y=y2)
Which does work, sort of: I am left with a list such that RawFeedTemp[[n]] has the header information removed only for that page.
So how can I perform a similar operation that leaves me with a single set of lines in which every page's header information has been removed? Alternatively, is there a way to combine the elements of the list so that it contains only one set of lines, excluding those I am trying to remove?
Edit: An example of the data
[306] N 46 10/08/12 10/08/12 Stuff :30 NM 0 $0.00"
[307] Week: 10/08/12 10/14/12 Other Stuff $6,500.00 0.00
[308] " Contract Agreement Between: Print Date 10/05/12 Page 5 of 6"
[309] ""
[310] ""
[311] " Contract / Revision Alt Order #"
[312] " Person
[313] " Address 1
[314] " Address 2
[315] " Address 3
[316] " Address 4
[317] ""
[318] " Original Date / Revision"
[319] ""
[320] "08/10/12 / 10/04/12"
[321] ""
[322] ""
[323] ""
[324] "* Line Ch Start Date End Date Description Start
[325] MORE DATA
Another file might have a different number of these header lines. Also note that records occupy more than one line. Most files finish a record before starting a new page, but a few push the second line of a record onto a new page, which is why I need to remove all of the headers.
Thanks for your help!
Since you don't give a clear example of your data, I am not sure this solution fits your case exactly.
If I understand correctly, you have a document with parts (headers) between 'Page' and 'start' that you want to remove. Here is a sample of your data with 2 headers:
str <- 'Page ...... ### header1
alalalala
lalalalalal
aalalala
lslslsls start ksksksks
keep me 1
keep me 2
Page ...... ### header 2
aalalala
lslslsls start ksksksks
keep me 3
keep me 4'
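Here I use readLines to read the document, find the header lines using grep, and remove the union of the header line indices from the vector of lines. This assumes the 'Page' and 'start' matches strictly alternate, so that the grep result can be reshaped into (Page, start) pairs.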
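ll <- readLines(textConnection(str))
# each row of ids: (line index of a 'Page' match, line index of the next 'start' match)
ids <- matrix(grep('Page|start', ll), ncol = 2, byrow = TRUE)
# drop every line from each 'Page' line through its 'start' line
ll[-unlist(apply(ids, 1, function(x) seq(x[1], x[2])))]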
[1] "keep me 1" "keep me 2" "keep me 3" "keep me 4"
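If the 'Page' and 'Start' matches do not alternate cleanly in the real files (for example, if 'Start' also turns up inside a data line), the matrix reshaping above may pair the wrong lines. A possible variation is sketched below: for each 'Page' line it looks for the first 'Start' line at or after it and drops that range. The helper name drop_headers, its arguments, and the toy feed vector are made up for illustration and are not taken from your data, so treat this as a sketch under those assumptions rather than a tested solution for your files.

```r
# Hypothetical helper (illustrative only): drop every block of lines that
# runs from a "Page" line through the first "Start" line at or after it.
drop_headers <- function(lines, page_pat = "Page", start_pat = "Start") {
  page_idx  <- grep(page_pat, lines)
  start_idx <- grep(start_pat, lines)
  drop <- unlist(lapply(page_idx, function(p) {
    s <- start_idx[start_idx >= p]
    if (length(s) == 0) return(integer(0))  # no "Start" follows: keep lines
    seq(p, s[1])
  }))
  # Guard: lines[-integer(0)] would return an empty vector, so return early.
  if (length(drop) == 0) return(lines)
  lines[-unique(drop)]
}

# Toy input (assumed, not real data): page 1 has one more header line than page 2.
feed <- c("Contract Agreement Page 1 of 2", "Address 1",
          "* Line Ch Start Date End Date", "record 1", "record 1, line 2",
          "Contract Agreement Page 2 of 2",
          "* Line Ch Start Date End Date", "record 2")
cleaned <- drop_headers(feed)
print(cleaned)
```

Because each page's range is found separately, this does not need the number of header lines to be the same on every page, so the first page would not have to be handled on its own. It would still remove too much if a data line happened to contain 'Page', so check the patterns against your files.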