Tags: r, cluster-analysis, scale, hierarchical-data

After scaling my numerical data I do not get my country variable


I have scaled my data in R because I am doing some clustering, which makes scaling necessary. The problem is that the scaled result no longer contains my factor variable, country.

How can I attach the factor variable country back onto the scaled data?

This is a sample of data:

structure(list(country = structure(c(1L, 3L, 14L, 41L, 42L, 48L
), .Label = c("Afganistan", "Albania", "Algeria", "American Samoa", 
"Andorra", "Angola", "Anguilla", "Antigua & Barbuda", "Argentina", 
"Aruba", "Australia", "Bahamas", "Bahrain", "Bangladesh", "Barbados", 
"Belize", "Benin", "Bhutan", "Bolivia", "Bonaire", "Botswana", 
"Brazil", "British Indian Ocean Ter", "Brunei", "Bulgaria", "Burkina Faso", 
"Cambodia", "Cameroon", "Canada", "Channel Islands", "Chile", 
"Cuba", "Curaco", "Cyprus", "Denmark", "Dominican Republic", 
"Ecuador", "Egypt", "El Salvador", "Eritrea", "Ethiopia", "Fiji", 
"Finland", "France", "Georgia", "Germany", "Ghana", "United Kingdom", 
"Greece", "Grenada", "Guinea", "Guyana", "Honduras", "Hungary", 
"India", "Indonesia", "Iraq", "Ireland", "Italy", "Jamaica", 
"Japan", "Jordan", "Kenya", "Korea Sout", "Kuwait", "Lebanon", 
"Lesotho", "Liberia", "Libya", "Macedonia", "Malaysia", "Maldives", 
"Mali", "Malta", "Mauritius", "Mexico", "Morocco", "Myanmar", 
"Nambia", "Nepal", "Netherlands", "New Zealand", "Nicaragua", 
"Nigeria", "Norway", "Oman", "Pakistan", "Peru", "Phillipines", 
"Portugal", "Puerto Rico", "Qatar", "Republic of Serbia", "Romania", 
"Russia", "Samoa", "Saudi Arabia", "Singapore", "Slovakia", "Slovenia", 
"Somalia", "South Africa", "Spain", "Sri Lanka", "Suriname", 
"Swaziland", "Sweden", "Switzerland", "Tanzania", "Thailand", 
"Trinidad & Tobago", "Tunisia", "Turkey", "Turks & Caicos Is", 
"United Arab Erimates", "USA", "Vietnam", "Yemen", "Zambia", 
"Zimbabwe"), class = "factor"), chills = c(8, 8, 52, 2, 1, 841
), cough = c(7, 8, 167, 8, 1, 1321), diarrhoea = c(5, 3, 33, 
3, 1, 566), fatigue = c(13, 4, 156, 10, 1, 1703), headache = c(7, 
5, 104, 6, 1, 1331), loss_smell_taste = c(4L, 2L, 50L, 6L, 1L, 
777L), muscle_ache = c(11, 5, 120, 7, 1, 1329), nasal_congestion = c(4, 
8, 100, 4, 1, 837), nausea_vomiting = c(2L, 2L, 28L, 2L, 1L, 
218L), shortness_breath = c(9, 4, 95, 9, 1, 1061), sore_throat = c(14, 
4, 123, 11, 1, 1099), sputum = c(9, 4, 131, 7, 1, 882), temperature = c(9, 
8, 112, 6, 1, 535)), row.names = c(NA, -6L), class = c("tbl_df", 
"tbl", "data.frame"), na.action = structure(c(`2` = 2L, `4` = 4L, 
`5` = 5L, `7` = 7L, `8` = 8L, `9` = 9L, `10` = 10L, `11` = 11L, 
`12` = 12L, `13` = 13L, `14` = 14L, `15` = 15L, `16` = 16L, `19` = 19L, 
`20` = 20L, `21` = 21L, `25` = 25L, `26` = 26L, `28` = 28L, `29` = 29L, 
`32` = 32L, `34` = 34L, `37` = 37L, `41` = 41L, `42` = 42L, `44` = 44L, 
`46` = 46L, `48` = 48L, `51` = 51L, `53` = 53L, `56` = 56L, `57` = 57L, 
`58` = 58L, `59` = 59L, `60` = 60L, `61` = 61L, `64` = 64L, `67` = 67L, 
`70` = 70L, `71` = 71L, `72` = 72L, `73` = 73L, `74` = 74L, `75` = 75L, 
`76` = 76L, `77` = 77L, `78` = 78L, `79` = 79L, `80` = 80L, `81` = 81L, 
`82` = 82L, `83` = 83L, `84` = 84L, `85` = 85L, `86` = 86L, `87` = 87L, 
`88` = 88L, `89` = 89L, `90` = 90L, `91` = 91L, `92` = 92L, `93` = 93L, 
`94` = 94L, `95` = 95L, `96` = 96L, `97` = 97L, `98` = 98L, `99` = 99L, 
`100` = 100L, `101` = 101L, `102` = 102L, `103` = 103L, `104` = 104L, 
`105` = 105L, `106` = 106L, `107` = 107L, `108` = 108L, `109` = 109L, 
`110` = 110L, `111` = 111L, `112` = 112L, `113` = 113L, `114` = 114L, 
`115` = 115L, `116` = 116L), class = "omit"))

This is my code for scaling only the numerical variables:

df_scaled <- scale(test_data[2:14])

Is there a way to scale my data while keeping the factor variable in the scaled dataset? As written, the factor is dropped.


Solution

  • Two ways: convert the scaled matrix to a data frame and add country back as a column, or keep the matrix and store the countries as its row names:

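    # Option 1: scale() returns a matrix, so convert it to a data frame
    # and add the factor back as a regular column
    df_scaled <- as.data.frame(scale(test_data[2:14]))
    df_scaled$country <- test_data$country
    
    # Option 2: keep the matrix and store the countries as row names
    df_scaled_2 <- scale(test_data[2:14])
    rownames(df_scaled_2) <- test_data$country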
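
A third variant, not part of the answer above, is to scale the numeric columns in place so that country never leaves the data frame. The sketch below is only an illustration under some assumptions: it uses a tiny made-up stand-in for test_data (three rows, two symptom columns) and picks the numeric columns by type rather than by the positions 2:14 used in the question.

```r
# Small stand-in for test_data (assumption: a plain data.frame with the
# same layout as in the question: one factor column plus numeric columns).
test_data <- data.frame(
  country = factor(c("Afganistan", "Algeria", "Bangladesh")),
  chills  = c(8, 8, 52),
  cough   = c(7, 8, 167)
)

# Find the numeric columns instead of hard-coding positions 2:14.
num_cols <- vapply(test_data, is.numeric, logical(1))

# Copy the data, then overwrite only the numeric columns with scaled values.
# as.numeric() drops the matrix attributes that scale() attaches.
df_scaled <- test_data
df_scaled[num_cols] <- lapply(test_data[num_cols],
                              function(x) as.numeric(scale(x)))

str(df_scaled)
```

Because country stays in the result, clustering functions that expect purely numeric input would still need it excluded, for example by passing df_scaled[num_cols].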