Tags: r, statistics, economics

Subsetting panel data


I'm very new to this, so let me know if this is asking too much. I am trying to subset panel data in R into two categories: one that has complete information for each variable and one that has incomplete information. My data looks like this:

Person     Year Income Age Sex
    1      2003  1500   15  1
    1      2004  1700   16  1
    1      2005  2000   17  1
    2      2003  1400   25  0
    2      2004  1900   26  0
    2      2005  2000   27  0

What I need to do is go through each column (except columns 1 and 2) and, if the data is complete for a variable, return it to one data set; otherwise, put it in a different data set. A variable is defined by the id in the first column plus the column name; in the data above, for instance, person1Income is the first three rows of column three. Here is my meta code and an example of what it should do given the above data.

for (each variable in all columns except 1 and 2 in data set) {
    if (variable is FULL) { put in data set "completes" }
    else { put in data set "incompletes" }
}
completes = person1Income, person2Income, person1Age, person2Age, person1Sex, person2Sex
incompletes = {empty because the above info is full}

I understand if someone can't answer this question completely, but any help is appreciated. Also if my goal is not clear, let me know and I will try to clarify.

tl;dr I can't yet explain it in one sentence so...sorry.

Edit: a visualization of what I mean by complete and incomplete variables (screenshot).


Solution

  • Using your picture, here's a stab at what you want. It may be long-winded and others may have a more elegant way of doing it, but it gets the job done:

    library("reshape2")
    
    con <- textConnection("Person Year Income Age Sex
      1      2003  1500   15  1
      1      2004  1700   16  1
      1      2005  2000   17  1
      2      2003  1400   25  0
      2      2004  1900   NA  0
      2      2005  2000   27  0
      3      2003  NA   25  0
      3      2004  1900   NA  0
      3      2005  2000   27  0")
    pnls <- read.table(con, header=TRUE)
    close(con)
    
    # reshape to long format: one row per Person/variable/value
    # (Year, Income, Age and Sex all become values of `variable`)
    pnls2 <- melt(pnls, id.vars = "Person")
    # keep only the rows that hold Income and Age values
    pnls2 <- subset(pnls2, variable %in% c("Income", "Age"))
    
    # create a column of names in the desired format (e.g. Person1Age)
    pnls2$name <- paste("Person", pnls2$Person, pnls2$variable, sep="")
    
    # Collect full set of unique names
    name.set <- unique(pnls2$name)
    # find the incomplete set
    incomplete <- unique( pnls2$name[ is.na(pnls2$value) ]) 
    # then find the complement of the incomplete set
    complete <- setdiff(name.set, incomplete) 
    
    # These two now hold the names of the complete and incomplete variables
    complete
    incomplete
    

    If you are not familiar with melting and the reshape2 package, you may want to run it line by line, and examine the value of pnls2 at different stages to see how this works.
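
For readers without reshape2 at hand, the long table that the melt step produces can be approximated in base R. This is only a sketch for illustration: the names `long` and `vars` are made up here, and the real `melt` output stores `variable` as a factor rather than a character column.

```r
# Same example data as in the answer, read with read.table(text = ...)
pnls <- read.table(header = TRUE, text = "
Person Year Income Age Sex
1 2003 1500 15 1
1 2004 1700 16 1
1 2005 2000 17 1
2 2003 1400 25 0
2 2004 1900 NA 0
2 2005 2000 27 0
3 2003 NA   25 0
3 2004 1900 NA 0
3 2005 2000 27 0
")

# Hand-built long table: one row per Person/variable/value,
# with all Income rows first and all Age rows after them
vars <- c("Income", "Age")
long <- data.frame(
  Person   = rep(pnls$Person, times = length(vars)),
  variable = rep(vars, each = nrow(pnls)),
  value    = unlist(pnls[vars], use.names = FALSE)
)

# Name each row after its Person/variable pair, e.g. Person1Income
long$name <- paste0("Person", long$Person, long$variable)

# Look at the first few rows, then list the names that have a missing value
print(head(long, 3))
print(unique(long$name[is.na(long$value)]))
```

Each wide column becomes a block of rows in the long table, which is why a single `is.na()` test on `value` is enough to find every incomplete variable.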

    EDIT: adding code to compile the values, as requested by @bstockton. I am sure there is a more appropriate R idiom for this but, once again, in the absence of better answers, this works:

    # use these lists of complete and incomplete variable names
    # as keys to collect the values for each variable name
    compile <- function(keys) {
        holder <- list()
        for (n in keys) {
            # values recorded for this Person/variable name
            holder[[n]] <- pnls2$value[pnls2$name == n]
        }
        # one column per name; assumes every name has the same number of rows
        as.data.frame(holder)
    }
    
    complete.recs <- compile(complete)
    incomplete.recs <- compile(incomplete)
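
As one possible alternative to the explicit loop over names, here is a base R sketch that builds the same two data frames with `split()` and `anyNA()`, without reshape2. The names `measure_vars`, `series` and `has_na` are introduced for this example only, and the final `as.data.frame()` calls assume a balanced panel (every person has the same number of years).

```r
# Same example data as in the answer, read with read.table(text = ...)
pnls <- read.table(header = TRUE, text = "
Person Year Income Age Sex
1 2003 1500 15 1
1 2004 1700 16 1
1 2005 2000 17 1
2 2003 1400 25 0
2 2004 1900 NA 0
2 2005 2000 27 0
3 2003 NA   25 0
3 2004 1900 NA 0
3 2005 2000 27 0
")

# Columns to check (the answer above also restricts itself to these two)
measure_vars <- c("Income", "Age")

# Build one vector of values per Person/variable pair,
# named in the Person1Income style used in the question
series <- list()
for (v in measure_vars) {
  pieces <- split(pnls[[v]], pnls$Person)
  names(pieces) <- paste0("Person", names(pieces), v)
  series <- c(series, pieces)
}

# TRUE for every pair that has at least one missing value
has_na <- vapply(series, anyNA, logical(1))

# One column per pair; relies on all pairs having the same length
complete.recs   <- as.data.frame(series[!has_na])
incomplete.recs <- as.data.frame(series[has_na])

print(names(complete.recs))
print(names(incomplete.recs))
```

To include Sex as well, as in the original question, it should be enough to add `"Sex"` to `measure_vars`.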