Search code examples
rdata-dictionarydata-mapping

Mapping Data in R with missing values


I'm very new with R, and I'm trying to map data dictionary definitions to a set of data to make more readable text.

For example, based on the data dictionary within the Ames Iowa housing dataset currently on Kaggle, I'm trying to map the zoning of houses.

mapping <- list(
  'A'='Agriculture',
  'C (all)'='Commercial',
  'FV'='Floating Village Residential',
  'I'='Industrial',
  'RH'='Residential High Density',
  'RL'='Residential Low Density',
  'RP'='Residential Low Density Park',
  'RM'='Residential Medium Density'
)

housingData$MSZoning <- as.factor(as.character(mapping[origData$MSZoning]))

The original data set does not contain values for all of these data points, however.

> table(origData$MSZoning)

C (all)      FV      RH      RL      RM 
     10      65      16    1151     218 

After mapping with my code, the key value pairs don't align. (Agriculture is mapped to "C", for example.) I believe that the empty values in the source data are throwing off my mapping.

> table(housingData$MSZoning, origData$MSZoning)

                               C (all)   FV   RH   RL   RM
  Agriculture                       10    0    0    0    0
  Commercial                         0   65    0    0    0
  Floating Village Residential       0    0   16    0    0
  Industrial                         0    0    0 1151    0
  Residential High Density           0    0    0    0  218

What is a more appropriate way to ensure that these keys and values align appropriately?


Solution

  • Using the recode command, I was able to make this code work appropriately.

    library(car)
    
    housingData$MSZoning <- recode(housingData$MSZoning,
      "'A'='Agriculture';
      'C (all)'='Commercial';
      'FV'='Floating Village Residential';
      'I'='Industrial';
      'RH'='Residential High Density';
      'RL'='Residential Low Density';
      'RP'='Residential Low Density Park';
      'RM'='Residential Medium Density'"
    )
    

    Now, running the table cross-tab, I see the values map correctly.

    > table (housingData$MSZoning, origData$MSZoning)
    
                                   C (all)   FV   RH   RL   RM
      Commercial                        10    0    0    0    0
      Floating Village Residential       0   65    0    0    0
      Residential High Density           0    0   16    0    0
      Residential Low Density            0    0    0 1151    0
      Residential Medium Density         0    0    0    0  218