Search code examples
rencoding

Combining Datasets based on Two Columns Failing Despite Matches


I am new to R, and I am working on a project where I want to sum sales amounts in cities that a company has made sales in. I want to then account for population size of each city, to moderate the summed sales data.

In order to do so, I have to combine two datasets: one that has all of the sales data, and one that has all of the cities in the USA and their populations.

I downloaded my population data from the last file on this link: https://www2.census.gov/programs-surveys/popest/datasets/2010-2014/cities/totals/

I downloaded my sales data from: https://www.kaggle.com/datasets/roopeshbharatwajkr/ecommerce-dataset

I think there is an issue with the encoding types of the two CSV's I am working with. They do not seem to be the same, values that are exactly identical are not being treated as the same, so merging is not working correctly. I've tried to force UTF-8 and it would not work on the census data CSV. Forcing Windows-1252 worked on both csv's, but did not help with my matching issue.

require(tidyverse)
library(dplyr)
library(stringr)

###loading and cleaning sales data
fullData <-  read.csv("Us-Ecommerce Dataset.csv",header=TRUE,stringsAsFactors = FALSE)
#formatting Date column
fullData$Date <- as.Date(fullData$Date,format="%d/%m/%Y")                          
fullData$Date <- format(fullData$Date,"%m/%d/%Y")
#fix typos
fullData <- fullData %>%
  mutate(City = str_replace(City, "Los Angles", "Los Angeles"))
#make state and city lowercase
fullData$State <- tolower(fullData$State)
fullData$City <- tolower(fullData$City)
#remove whitespace
fullData$State <- trimws(fullData$State)
fullData$City <- trimws(fullData$City)


###loading and cleaning population data
censusData <-  read.csv("sub-est2014_all.csv",header=TRUE,stringsAsFactors = FALSE,fileEncoding = ("UTF-8"))
#remove rows where state name is == to city name, not sure why data was formatted like that
censusData <- subset(censusData,NAME != STNAME)
#remove unnecessary columns, need both state and city in order to join w/ other data
censusData <- select(censusData,NAME,STNAME,POPESTIMATE2013)
##make state and city lowercase
censusData$STNAME <- tolower(censusData$STNAME)
#replacing invalid characters
censusData$NAME <- str_replace_all(censusData$NAME,"[^[:graph:]]", " ")
censusData$NAME <- tolower(censusData$NAME)
#removing whitespace
censusData$NAME <- trimws(censusData$NAME)
censusData$STNAME <- trimws(censusData$STNAME)

# Merge the two datasets based on City and State columns
merged_data <- merge(fullData, censusData, by.x = c("City", "State"), by.y = c("NAME", "STNAME"), all.x = TRUE)

# Rename the POPESTIMATE2013 column to Population
merged_data <- rename(merged_data, Population = POPESTIMATE2013)

# If there are missing values in the Population column, replace them with 0
merged_data$Population[is.na(merged_data$Population)] <- 0

Running this code produces a new dataframe in which every value of merged_data$Population except for 1 is 0 (converted from NA). Of course, this is not what I am looking to do.

The only value of merged_data$Population that is not 0 is when merged_data$City == "new york city" & merged_data$State == "new york". In that case the correct value, ~8.4 million is present.

As you can see, I tried a few different methods of data cleaning in order to get them to match. tolower() and trimws(). There were also problematic characters in censusData that had to be replaced using str_replace_all()

When I run the below code (before turning NA's to 0's), it shows me that most values have not been matched.

unmatched_rows <- merged_data[is.na(merged_data$Population), c("City", "State")]
print(unmatched_rows)

Solution

  • The census data includes different geographic units:

    • State
    • County
    • Census Place (city, town, village)

    Census Place best fits the City column in the commerce data. The NAME column in the census data includes includes “city”, “town”, and “village”, which is a classification, not a part of the actual name. We have to take this into regard when harmonizing the datasets for merging.

    The commerce data only includes data for Los Angeles, New York, and Seattle. The code below only targets these three cities.

    library(tidyverse)
    
    census_data <- read_csv("sub-est2014_all.csv")
    commerce_data <- read_csv("Us-Ecommerce Dataset.csv")
    
    # Remove "City" from "New York City" and correct "Los Angeles"
    commerce_data <- 
      commerce_data |> 
      mutate(City = str_remove(City, " City"),
             City = str_replace(City, "Los Angles", "Los Angeles"))
    
    # Filter down to relevant NAMEs of class "city", select relevant columns,
    # and remove "city" from NAME
    census_data <- 
      census_data |> 
      filter(str_detect(NAME, "New York|Los Angeles|Seattle"),
             str_detect(NAME, "city")) |> 
      distinct(NAME, STNAME, POPESTIMATE2013) |> 
      mutate(NAME = str_remove(NAME, " city"))
    
    # join/merge
    res <- left_join(commerce_data,
              census_data,
              by = join_by(City == NAME, State == STNAME))
    
    # check results
    res |> 
      distinct(State, City, POPESTIMATE2013)
    #> # A tibble: 3 × 3
    #>   State      City        POPESTIMATE2013
    #>   <chr>      <chr>                 <dbl>
    #> 1 New York   New York            8438379
    #> 2 California Los Angeles         3897940
    #> 3 Washington Seattle              653404