Non-spreadsheet-esque Data Import and Organization


Basic Problem

I have data of the form

doc
id 1
var1 A
var2 B
...
varStar 453
varStar 3432
varStar 32
...
varN 
doc
var1 A
var2 B
(And so on)

Here doc marks the beginning of each record/observation, varI denotes a variable, and varStar denotes a variable of interest that may appear more than once within a record.

I would like to

  1. Import the data using R (or Python, but I'm a bit rusty there).

  2. For each record retrieve the id and all instances of varStar.

  3. Store them in some manner for later, easy manipulation/merger. For example, a list or a binary (possibly sparse) matrix (for my application varStar is a category, so overlap across records is to be expected).

It seems like this should be easy, but I am only familiar with CSV/spreadsheet data and perhaps just don't know the right words to Google. I would prefer not to, for example, create an entire SQL database, because I don't need the entire database. Of course, it may be easier to do some elaborate organization first and then pick out the choice pieces.

Context - My Application

I want to retrieve U.S. patent classes via the Google/USPTO Bulk Downloads. So id is the patent number, while varStar is the patent class fsc. Then I want to merge this with the NBER Patent data. My application hinges crucially on patent class designation. The NBER data, while nice in many respects, only report a single "main" class for each patent. This is a problem because, based on casual perusal of the data and a paper by Volodin (2010), patents are commonly assigned several top-level classes.

Volodin, Dmitry. (2010) "NBER Patent Data Technological Classification Issues Relevant for Research in Inventor Mobility", Working Paper. udel.edu/~volodin/pat/draft.pdf.
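
To be concrete, the merge I have in mind is an ordinary many-to-one join on the patent number, something like the sketch below. The file names and the NBER column name (patent) are placeholders, not the actual field names in either data set.

# Sketch only: file names and column names below are placeholders
classes <- read.csv("patent_classes.csv", stringsAsFactors = FALSE)  # one row per (patent id, class) pair
nber    <- read.csv("nber_patents.csv",   stringsAsFactors = FALSE)  # NBER file with a patent-number column

# many-to-one merge: each NBER variable is repeated once per class of the patent
merged <- merge(classes, nber, by.x = "id", by.y = "patent", all.x = TRUE)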


Solution

  • Assuming dat.txt looks like:

    doc
    id 1
    var1 A
    var2 B
    ...
    varStar 453
    varStar 3432
    varStar 32
    ...
    varN
    doc
    id 2
    var1 A
    var2 B
    varStar 111
    varStar 222
    varStar 333333
    ...
    

    Then this is a possible framework:

    library(dplyr)
    
    dat <- readLines("dat.txt")
    
    doc_starts <- which(grepl("^doc", dat))      # line on which each record begins
    doc_ends <- lead(doc_starts) - 1             # a record ends just before the next "doc"
    doc_ends[length(doc_ends)] <- length(dat)    # the last record runs to the end of the file
    
    # list-ified
    
    lapply(seq_along(doc_starts), function(i) {
    
      chunk <- dat[doc_starts[i]:doc_ends[i]]
    
      id <- gsub("^id +", "", chunk[which(grepl("^id", chunk))])
      varStars <- gsub("^varStar +", "", chunk[which(grepl("^varStar", chunk))])
    
      list(id=id, varStar=varStars)
    
    })
    
    ## [[1]]
    ## [[1]]$id
    ## [1] "1"
    ## 
    ## [[1]]$varStar
    ## [1] "453"  "3432" "32"  
    ## 
    ## 
    ## [[2]]
    ## [[2]]$id
    ## [1] "2"
    ## 
    ## [[2]]$varStar
    ## [1] "111"    "222"    "333333"
    
    
    # data.frame-d
    
    bind_rows(lapply(seq_along(doc_starts), function(i) {
    
      chunk <- dat[doc_starts[i]:doc_ends[i]]
    
      id <- gsub("^id +", "", chunk[which(grepl("^id", chunk))])
      varStars <- gsub("^varStar +", "", chunk[which(grepl("^varStar", chunk))])
    
      data_frame(id=id, varStar=varStars)  # tibble() in current dplyr; data_frame() is deprecated
    
    }))
    
    ## Source: local data frame [6 x 2]
    ## 
    ##   id varStar
    ## 1  1     453
    ## 2  1    3432
    ## 3  1      32
    ## 4  2     111
    ## 5  2     222
    ## 6  2  333333
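
    If you also want the binary (possibly sparse) patent-by-class matrix mentioned in point 3 of the question, one possible follow-up uses the Matrix package. This is only a sketch: it assumes the data frame produced by the bind_rows() call above has been saved as res (a name not used above).

    library(Matrix)
    
    # res is assumed to hold the id/varStar data frame from the bind_rows() call above
    ids     <- factor(res$id)
    classes <- factor(res$varStar)
    
    # one row per patent, one column per class; an entry of 1 means the patent carries that class
    class_mat <- sparseMatrix(i = as.integer(ids),
                              j = as.integer(classes),
                              x = 1,
                              dimnames = list(levels(ids), levels(classes)))
    
    # if the same class could repeat within a record, class_mat > 0 gives a strict 0/1 matrix

    If you prefer a formula interface, xtabs(~ id + varStar, data = res, sparse = TRUE) should give an equivalent sparse contingency matrix.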