Identify missing countries in a dataframe in R


I have a dataframe which includes the column "country" with various country names.

I want to find out which countries (say, UN member states) are missing.

Is there any quick way to do that in an automated way, perhaps with the package countrycode?

Here is my dput:

structure(list(country = c("Albania", "Algeria", "Angola", "Antigua and Barbuda", 
"Argentina", "Armenia", "Australia", "Austria", "Azerbaijan", 
"Bahamas", "Bahrain", "Bangladesh", "Barbados", "Belarus", "Belgium", 
"Bhutan", "Bolivia", "Bosnia and Herzegovina", "Botswana", "Brazil", 
"Brunei", "Bulgaria", "Burkina Faso", "Cambodia", "Canada", "Chile", 
"Colombia", "Costa Rica", "Cote d'Ivoire", "Croatia", "Cuba", 
"Czechia", "Democratic Republic of the Congo", "Denmark", "Djibouti", 
"Dominica", "Dominican Republic", "Ecuador", "Egypt", "El Salvador", 
"Eritrea", "Estonia", "Ethiopia", "Fiji", "Finland", "France", 
"Gabon", "Georgia", "Germany", "Ghana", "Greece", "Guatemala", 
"Guinea", "Guyana", "Honduras", "Hungary", "Iceland", "India", 
"Indonesia", "Iran", "Iraq", "Ireland", "Israel", "Italy", "Jamaica", 
"Japan", "Jordan", "Kazakhstan", "Kenya", "Kuwait", "Kyrgyzstan", 
"Laos", "Latvia", "Lebanon", "Lesotho", "Liechtenstein", "Lithuania", 
"Luxembourg", "Macedonia", "Madagascar", "Malawi", "Malaysia", 
"Malta", "Mauritania", "Mauritius", "Mexico", "Micronesia", "Moldova", 
"Monaco", "Mongolia", "Morocco", "Myanmar", "Namibia", "Nepal", 
"Netherlands", "New Zealand", "Nicaragua", "Niger", "Nigeria", 
"Norway", "Oman", "Pakistan", "Palau", "Panama", "Papua New Guinea", 
"Paraguay", "People's Republic of China", "Peru", "Philippines", 
"Poland", "Portugal", "Qatar", "Romania", "Russia", "Rwanda", 
"Samoa", "San Marino", "Saudi Arabia", "Senegal", "Serbia", "Singapore", 
"Slovakia", "Slovenia", "South Africa", "South Korea", "Spain", 
"Sri Lanka", "Sudan", "Suriname", "Sweden", "Switzerland", "Syria", 
"Taiwan", "Tajikistan", "Tanzania", "Thailand", "Tonga", "Trinidad and Tobago", 
"Tunisia", "Turkey", "U.K.", "U.S.A.", "Uganda", "Ukraine", "United Arab Emirates", 
"Uruguay", "Uzbekistan", "Venezuela", "Vietnam", "Yemen", "Zambia", 
"Zimbabwe")), row.names = c(NA, -152L), class = c("tbl_df", "tbl", 
"data.frame"))

Solution

  • You can certainly get a vector of the "countries" stored in countrycode's codelist that are missing from your own data (here assumed to be a data frame called df):

    library(countrycode)
    
    # Keep each codelist name whose regex matches none of the names in df$country
    codelist$country.name.en[sapply(codelist$country.name.en.regex, function(x) {
      !any(grepl(x, df$country, perl = TRUE, ignore.case = TRUE))
    })]
    #>   [1] "Afghanistan"                               
    #>   [2] "Åland Islands"                             
    #>   [3] "American Samoa"                            
    #>   [4] "Andorra"                                   
    #>   [5] "Anguilla"                                  
    #>   [6] "Antarctica"                                
    #>   [7] "Aruba"                                     
    #>   [8] "Austria-Hungary"                           
    #>   [9] "Baden"                                     
    #>  [10] "Bavaria"                                   
    #>  [11] "Belize"                                    
    #>  [12] "Benin"                                     
    #>  [13] "Bermuda"                                   
    #>  [14] "Bouvet Island"                             
    #>  [15] "British Indian Ocean Territory"            
    #>  [16] "British Virgin Islands"                    
    #>  [17] "Brunswick"                                 
    #>  [18] "Burundi"                                   
    #>  [19] "Cameroon"                                  
    #>  [20] "Cape Verde"                                
    #>  [21] "Caribbean Netherlands"                     
    #>  [22] "Cayman Islands"                            
    #>  [23] "Central African Republic"                  
    #>  [24] "Chad"                                      
    #>  [25] "Channel Islands"                           
    #>  [26] "Christmas Island"                          
    #>  [27] "Cocos (Keeling) Islands"                   
    #>  [28] "Comoros"                                   
    #>  [29] "Congo - Brazzaville"                       
    #>  [30] "Cook Islands"                              
    #>  [31] "Curaçao"                                   
    #>  [32] "Cyprus"                                    
    #>  [33] "Czechoslovakia"                            
    #>  [34] "Equatorial Guinea"                         
    #>  [35] "Eswatini"                                  
    #>  [36] "Falkland Islands"                          
    #>  [37] "Faroe Islands"                             
    #>  [38] "French Guiana"                             
    #>  [39] "French Polynesia"                          
    #>  [40] "French Southern Territories"               
    #>  [41] "Gambia"                                    
    #>  [42] "German Democratic Republic"                
    #>  [43] "Gibraltar"                                 
    #>  [44] "Greenland"                                 
    #>  [45] "Grenada"                                   
    #>  [46] "Guadeloupe"                                
    #>  [47] "Guam"                                      
    #>  [48] "Guernsey"                                  
    #>  [49] "Guinea-Bissau"                             
    #>  [50] "Haiti"                                     
    #>  [51] "Hamburg"                                   
    #>  [52] "Hanover"                                   
    #>  [53] "Heard & McDonald Islands"                  
    #>  [54] "Hesse Electoral"                           
    #>  [55] "Hesse Grand Ducal"                         
    #>  [56] "Hesse-Darmstadt"                           
    #>  [57] "Hesse-Kassel"                              
    #>  [58] "Hong Kong SAR China"                       
    #>  [59] "Isle of Man"                               
    #>  [60] "Jersey"                                    
    #>  [61] "Kiribati"                                  
    #>  [62] "Kosovo"                                    
    #>  [63] "Liberia"                                   
    #>  [64] "Libya"                                     
    #>  [65] "Macao SAR China"                           
    #>  [66] "Maldives"                                  
    #>  [67] "Mali"                                      
    #>  [68] "Marshall Islands"                          
    #>  [69] "Martinique"                                
    #>  [70] "Mayotte"                                   
    #>  [71] "Mecklenburg Schwerin"                      
    #>  [72] "Micronesia (Federated States of)"          
    #>  [73] "Modena"                                    
    #>  [74] "Montenegro"                                
    #>  [75] "Montserrat"                                
    #>  [76] "Mozambique"                                
    #>  [77] "Nassau"                                    
    #>  [78] "Nauru"                                     
    #>  [79] "Netherlands Antilles"                      
    #>  [80] "New Caledonia"                             
    #>  [81] "Niue"                                      
    #>  [82] "Norfolk Island"                            
    #>  [83] "North Korea"                               
    #>  [84] "Northern Mariana Islands"                  
    #>  [85] "Oldenburg"                                 
    #>  [86] "Orange Free State"                         
    #>  [87] "Palestinian Territories"                   
    #>  [88] "Parma"                                     
    #>  [89] "Piedmont-Sardinia"                         
    #>  [90] "Pitcairn Islands"                          
    #>  [91] "Prussia"                                   
    #>  [92] "Puerto Rico"                               
    #>  [93] "Republic of Vietnam"                       
    #>  [94] "Réunion"                                   
    #>  [95] "Saint Martin (French part)"                
    #>  [96] "São Tomé & Príncipe"                       
    #>  [97] "Sardinia"                                  
    #>  [98] "Saxe-Weimar-Eisenach"                      
    #>  [99] "Saxony"                                    
    #> [100] "Serbia and Montenegro"                     
    #> [101] "Seychelles"                                
    #> [102] "Sierra Leone"                              
    #> [103] "Sint Maarten"                              
    #> [104] "Solomon Islands"                           
    #> [105] "Somalia"                                   
    #> [106] "Somaliland"                                
    #> [107] "South Georgia & South Sandwich Islands"    
    #> [108] "South Sudan"                               
    #> [109] "St. Barthélemy"                            
    #> [110] "St. Helena"                                
    #> [111] "St. Kitts & Nevis"                         
    #> [112] "St. Lucia"                                 
    #> [113] "St. Pierre & Miquelon"                     
    #> [114] "St. Vincent & Grenadines"                  
    #> [115] "Svalbard & Jan Mayen"                      
    #> [116] "Timor-Leste"                               
    #> [117] "Togo"                                      
    #> [118] "Tokelau"                                   
    #> [119] "Turkmenistan"                              
    #> [120] "Turks & Caicos Islands"                    
    #> [121] "Tuscany"                                   
    #> [122] "Tuvalu"                                    
    #> [123] "Two Sicilies"                              
    #> [124] "U.S. Virgin Islands"                       
    #> [125] "United Arab Republic"                      
    #> [126] "United Province CA"                        
    #> [127] "United States Minor Outlying Islands (the)"
    #> [128] "Vanuatu"                                   
    #> [129] "Vatican City"                              
    #> [130] "Wallis & Futuna"                           
    #> [131] "Western Sahara"                            
    #> [132] "Wuerttemburg"                              
    #> [133] "Würtemberg"                                
    #> [134] "Yemen Arab Republic"                       
    #> [135] "Yemen People's Republic"                   
    #> [136] "Yugoslavia"                                
    #> [137] "Zanzibar"
    

    However, while this contains many extant countries that are missing from your data (such as Afghanistan, Belize and Benin), some entries are semi-autonomous territories that are not countries in their own right (Jersey, Zanzibar, Gibraltar), and others are historical states that no longer exist (e.g. Yugoslavia).
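
    If the matching step above is unclear, here is a minimal base R sketch of the same idea on toy data. The table toy_codelist and its regexes are made up for illustration and are much simpler than the patterns shipped with countrycode; the point is only that each reference regex is tested against every name in the data, and a reference entry counts as missing when nothing matches.

    ```r
    # Toy stand-in for countrycode::codelist (names and regexes are illustrative only)
    toy_codelist <- data.frame(
      name  = c("United Kingdom", "United States", "Belize"),
      regex = c("^u\\.?k\\.?$|united.?kingdom",
                "^u\\.?s\\.?a\\.?$|united.?states",
                "belize"),
      stringsAsFactors = FALSE
    )

    # Toy stand-in for df$country
    toy_data <- c("U.K.", "U.S.A.", "France")

    # TRUE for each reference regex that matches none of the names in the data
    is_missing <- vapply(
      toy_codelist$regex,
      function(pattern) !any(grepl(pattern, toy_data, perl = TRUE, ignore.case = TRUE)),
      logical(1),
      USE.NAMES = FALSE
    )

    missing_toy <- toy_codelist$name[is_missing]
    print(missing_toy)
    ```

    Matching on regexes rather than exact strings is what lets abbreviations such as "U.K." and "U.S.A." count as present even though the reference names are spelled differently.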

    To filter out entries that are not current countries, I might use something like rnaturalearth:

    missing <- codelist$country.name.en[sapply(codelist$country.name.en.regex, 
                                    function(x) {
      !any(grepl(x, df$country, perl = TRUE, ignore.case = TRUE))
      })]
    
    missing[missing %in% 
              rnaturalearth::ne_countries(scale = 110, returnclass = "sf")$name]
    #>  [1] "Afghanistan"   "Antarctica"    "Belize"        "Benin"        
    #>  [5] "Burundi"       "Cameroon"      "Chad"          "Cyprus"       
    #>  [9] "Gambia"        "Greenland"     "Guinea-Bissau" "Haiti"        
    #> [13] "Kosovo"        "Liberia"       "Libya"         "Mali"         
    #> [17] "Montenegro"    "Mozambique"    "New Caledonia" "North Korea"  
    #> [21] "Puerto Rico"   "Sierra Leone"  "Somalia"       "Somaliland"   
    #> [25] "Timor-Leste"   "Togo"          "Turkmenistan"  "Vanuatu"
    

    This gives you a reasonable list of 28 current countries and territories that are not in your original list. Most are UN members, but not all: to the best of my knowledge, Greenland, Antarctica, Kosovo, New Caledonia, Somaliland and Puerto Rico do not have independent representation at the UN at the time of writing.
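
    Note that filtering by name also drops some small states that appear in the first list (Andorra, for example, is a UN member). If you have your own reference list, an alternative is to compare standardized codes instead of names. The sketch below is only an illustration under assumptions: reference_iso3 is a hypothetical, deliberately tiny vector of ISO3 codes, not the real list of UN members, which you would need to take from an authoritative source. The country vector is a small sample standing in for df$country.

    ```r
    library(countrycode)

    # Small sample standing in for df$country
    country <- c("Albania", "U.K.", "U.S.A.", "People's Republic of China")

    # Convert free-text names to ISO 3166-1 alpha-3 codes
    present_iso3 <- countrycode(country, origin = "country.name", destination = "iso3c")

    # Hypothetical reference set: illustrative subset only, NOT the full UN membership
    reference_iso3 <- c("ALB", "GBR", "USA", "CHN", "AFG", "BLZ")

    # Codes in the reference set that never occur in the data
    missing_iso3 <- setdiff(reference_iso3, present_iso3)

    # Translate the missing codes back to readable names
    countrycode(missing_iso3, origin = "iso3c", destination = "country.name")
    ```

    Comparing codes avoids spelling mismatches between two lists of names, and countrycode() warns about any names it cannot match, which is worth checking before trusting the result.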

    Created on 2023-09-28 with reprex v2.0.2