Search code examples
rdata-cleaning

Loop to clean up table where observations are stored as column


I have a table that stores observations in column x and variable names in column y the following way.

I am trying to write an R loop to create a matrix where each observation is a row and each variable is a column.

The problem is that not all observations have all the variables.

Original data:

x y
Apple Fruit
Austria Origin
Summer Season
Orange Fruit
Spain Origin
Pear Fruit
Tomato Fruit
Italy Origin
Summer Season

Desired output:

Fruit Origin Season
Apple Austria Summer
Orange Spain
Pear
Tomato Italy Summer

My thinking so far (pseudo R code):

df_old <- data.frame( x = c( "Apple", "Austria", "Summer", "Orange", "Spain", "Pear", "Tomato", "Italy", "Summer" ),
                      y = c( "Fruit", "Origin", "Season", "Fruit", "Origin", "Fruit", "Fruit", "Origin", "Season" ) )

df_new <- data.frame( matrix( ncol = 3, nrow = 0 ) )
colnames( df_new ) <- c( "Fruit", "Origin", "Season")

for ( i in seq_along( df_old ) ) {
  if ( y == "Fruit" ) {
    # add new row
    df_new$Fruit <- df_old$x
  } else if ( y == "Origin" ) {
    df_new$Origin <- df_old$x
  } else ( y == "Season" ) {
    df_new$Season <- df_old$x
  }
}

Thank you for helping.


Solution

  • Here is a solution based on the idea you were giving using a for-loop.

    df_old <- data.frame( x = c( "Apple", "Austria", "Summer", "Orange", "Spain", "Pear", "Tomato", "Italy", "Summer" ),
                              y = c( "Fruit", "Origin", "Season", "Fruit", "Origin", "Fruit", "Fruit", "Origin", "Season" ) ,stringsAsFactors=F)
        
        
    df_new <- as.data.frame(matrix(NA, nrow=sum(df_old$y == "Fruit"), ncol=length(unique(df_old$y))))
    names(df_new) <- c("Fruit", "Origin", "Season")
    
    
    j <- 0
    for (i in 1:(nrow(df_old))){
      print(df_old$y[i])
      if (df_old$y[i] == "Fruit") { j <- j + 1 ; df_new$Fruit[j] <-  df_old$x[i]
        print("new colum")
        if ((df_old$y[i+1] == "Origin")){ df_new$Origin[j] <-  df_old$x[i+1] }
        print("add origin")
        if ((df_old$y[i+1] == "Season") |  (df_old$y[i+2] == "Season")){
          df_new$Season[j] <-  df_old$x[c(i+1,i+2)][df_old$y[c(i+1,i+2)] == "Season"]
          print("add Season")
        }
      }
    }