Non-spreadsheet-esque Data Import and Organization


Basic Problem

I have data of the form

doc
id 1
var1 A
var2 B
...
varStar 453
varStar 3432
varStar 32
...
varN 
doc
var1 A
var2 B
(And so on)

Here doc marks the beginning of each record/observation, varI denotes a variable, and varStar denotes a variable of interest that may appear more than once within a record.

I would like to

  1. Import the data using R (or Python, but I'm a bit rusty there).

  2. For each record retrieve the id and all instances of varStar.

  3. Store them in some manner for later, easy manipulation/merger. For example, a list or a binary (possibly sparse) matrix (for my application varStar is a category, so overlap across records is to be expected).

It seems like this should be easy, but I am only familiar with CSV/spreadsheet data and perhaps just don't know the right words to Google. I would prefer not to, for example, create an entire SQL database, because I don't need the entire database. Of course, it may be easier to do some elaborate organization first and then pick out the choice pieces.

Context - My Application

I want to retrieve U.S. patent classes via the Google/USPTO Bulk Downloads. So id is the patent number, while varStar is the patent class fsc. Then I want to merge this with the NBER Patent data. My application hinges crucially on patent class designation. The NBER data, while nice in many respects, only report a single "main" class for each patent. This is a problem because, based on casual perusal of the data and a paper by Volodin (2010), patents are commonly assigned several top-level classes.

Volodin, Dmitry. (2010) "NBER Patent Data Technological Classification Issues Relevant for Research in Inventor Mobility", Working Paper. udel.edu/~volodin/pat/draft.pdf.
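
To be concrete, the merge I have in mind is an ordinary many-to-one join on the patent number, something like the sketch below. The file names and the NBER column name (patent) are placeholders, not the actual field names in either data set.

# Sketch only: file names and column names below are placeholders
classes <- read.csv("patent_classes.csv", stringsAsFactors = FALSE)  # one row per (patent id, class) pair
nber    <- read.csv("nber_patents.csv",   stringsAsFactors = FALSE)  # NBER file with a patent-number column

# many-to-one merge: each NBER variable is repeated once per class of the patent
merged <- merge(classes, nber, by.x = "id", by.y = "patent", all.x = TRUE)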


Solution

  • Assuming dat.txt looks like:

    doc
    id 1
    var1 A
    var2 B
    ...
    varStar 453
    varStar 3432
    varStar 32
    ...
    varN
    doc
    id 2
    var1 A
    var2 B
    varStar 111
    varStar 222
    varStar 333333
    ...
    

    Then this is a possible framework:

    library(dplyr)
    
    dat <- readLines("dat.txt")
    
    doc_starts <- which(grepl("^doc", dat))      # line on which each record begins
    doc_ends <- lead(doc_starts) - 1             # a record ends just before the next "doc"
    doc_ends[length(doc_ends)] <- length(dat)    # the last record runs to the end of the file
    
    # list-ified
    
    lapply(seq_along(doc_starts), function(i) {
    
      chunk <- dat[doc_starts[i]:doc_ends[i]]
    
      id <- gsub("^id +", "", chunk[which(grepl("^id", chunk))])
      varStars <- gsub("^varStar +", "", chunk[which(grepl("^varStar", chunk))])
    
      list(id=id, varStar=varStars)
    
    })
    
    ## [[1]]
    ## [[1]]$id
    ## [1] "1"
    ## 
    ## [[1]]$varStar
    ## [1] "453"  "3432" "32"  
    ## 
    ## 
    ## [[2]]
    ## [[2]]$id
    ## [1] "2"
    ## 
    ## [[2]]$varStar
    ## [1] "111"    "222"    "333333"
    
    
    # data.frame-d
    
    bind_rows(lapply(seq_along(doc_starts), function(i) {
    
      chunk <- dat[doc_starts[i]:doc_ends[i]]
    
      id <- gsub("^id +", "", chunk[which(grepl("^id", chunk))])
      varStars <- gsub("^varStar +", "", chunk[which(grepl("^varStar", chunk))])
    
      data_frame(id=id, varStar=varStars)  # tibble() in current dplyr; data_frame() is deprecated
    
    }))
    
    ## Source: local data frame [6 x 2]
    ## 
    ##   id varStar
    ## 1  1     453
    ## 2  1    3432
    ## 3  1      32
    ## 4  2     111
    ## 5  2     222
    ## 6  2  333333
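
    If you also want the binary (possibly sparse) patent-by-class matrix mentioned in point 3 of the question, one possible follow-up uses the Matrix package. This is only a sketch: it assumes the data frame produced by the bind_rows() call above has been saved as res (a name not used above).

    library(Matrix)
    
    # res is assumed to hold the id/varStar data frame from the bind_rows() call above
    ids     <- factor(res$id)
    classes <- factor(res$varStar)
    
    # one row per patent, one column per class; an entry of 1 means the patent carries that class
    class_mat <- sparseMatrix(i = as.integer(ids),
                              j = as.integer(classes),
                              x = 1,
                              dimnames = list(levels(ids), levels(classes)))
    
    # if the same class could repeat within a record, class_mat > 0 gives a strict 0/1 matrix

    If you prefer a formula interface, xtabs(~ id + varStar, data = res, sparse = TRUE) should give an equivalent sparse contingency matrix.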