Basic Problem
I have data of the form
doc
id 1
var1 A
var2 B
...
varStar 453
varStar 3432
varStar 32
...
varN
doc
var1 A
var2 B
(And so on)
Where doc
is denoting the beginning of each record/observation, varI
is denoting a variable, and varStar
is denoting a variable of interest that may have more than one entry.
I would like to
Import the data using R (or Python, but I'm a bit rusty there).
For each record retrieve the id
and all instances of varStar
.
Store them in some manner for later, easy manipulation/merger. For example, a list or binary (possibly sparse) matrix (for my application varStar
is a category so overlap is to be expected).
It seems like this should be easy but I am only familiar with csv/spreadsheet data and perhaps just don't know the right words to Google. I would prefer not to, e.g., create an entire SQL database, etc. because I don't need the entire database. Of course, it may be easier to do some elaborate organization then pick out the choice pieces.
Context - My Application
I want to retrieve U.S. patent classes via the Google/USPTO Bulk Downloads. So id
is the patent number while varStar
is the patent class fsc
. Then, I want to merge this with the NBER Patent data. My application hinges crucially on patent class designation. The NBER data, while nice in many respects, only reports a single "main" class for each patent. This is not good because based on casual perusal of the data and a paper by Volodin (2010), patents are commonly given several top-level classes.
Volodin, Dmitry. (2010) "NBER Patent Data Technological Classification Issues Relevant for Research in Inventor Mobility", Working Paper. udel.edu/~volodin/pat/draft.pdf.
Assuming dat.txt
looks like:
doc
id 1
var1 A
var2 B
...
varStar 453
varStar 3432
varStar 32
...
varN
doc
id 2
var1 A
var2 B
varStar 111
varStar 222
varStar 333333
...
Then this is a possible framework:
library(dplyr)
dat <- readLines("dat.txt")
doc_starts <- which(grepl("^doc", dat))
doc_ends <- lead(doc_starts)-1
doc_ends[length(doc_ends)] <- length(dat)
# list-ified
lapply(seq_along(doc_starts), function(i) {
chunk <- dat[doc_starts[i]:doc_ends[i]]
id <- gsub("^id\ +", "", chunk[which(grepl("^id", chunk))])
varStars <- gsub("^varStar\ +", "", chunk[which(grepl("^varStar", chunk))])
list(id=id, varStar=varStars)
})
## [[1]]
## [[1]]$id
## [1] "1"
##
## [[1]]$varStar
## [1] "453" "3432" "32"
##
##
## [[2]]
## [[2]]$id
## [1] "2"
##
## [[2]]$varStar
## [1] "111" "222" "333333"
# data.frame-d
bind_rows(lapply(seq_along(doc_starts), function(i) {
chunk <- dat[doc_starts[i]:doc_ends[i]]
id <- gsub("^id\ +", "", chunk[which(grepl("^id", chunk))])
varStars <- gsub("^varStar\ +", "", chunk[which(grepl("^varStar", chunk))])
data_frame(id=id, varStar=varStars)
}))
## Source: local data frame [6 x 2]
##
## id varStar
## 1 1 453
## 2 1 3432
## 3 1 32
## 4 2 111
## 5 2 222
## 6 2 333333