Search code examples
rdatabase-normalizationrpostgresql

Normalizing data in R - dataframe construction


I'm using R to connect to a database I have stored in PGAdmin so that I can add data into the database through R. Before adding the data into the database, it has to be normalized.

The dataset consists of employee data. If an employee is a manager, the managing column lists the employee ids of the employees that they manage.

I want to create a new dataframe in R that would represent my Managers table that has two columns: employee_id and manager_id (where manager_id is just the employee_id of the manager) that are both pulled from the original dataframe

How would I do this in R? Here's reproducible data:

> dput(test)
structure(list(first_name = c("Carrol", "Scott", "Michael", "Mary", 
"Jane", "Alex"), last_name = c("Dhin", "Peters", "Scott", "Smith", 
"Johnson", "Barter"), employee_id = c(412153L, 534253L, 643645L, 
765453L, 627234L, 174543L), email = c("[email protected]", 
"[email protected]", "[email protected]", "[email protected]", 
"[email protected]", "[email protected]"), managing = c("174543", 
"", "", "", "534253, 643645", ""), department = c("Accounting", 
"Sales", "Sales", "Marketing", "Sales", "Accounting"), department_budget = c(500000L, 
1100000L, 1100000L, 750000L, 1100000L, 500000L), serial_number = c("KX6234", 
"FS5235", "LP5242", "GK6246", "KX6902", "BN7451"), manufacturer = c("Lenovo", 
"Lenovo", "Lenovo", "Lenovo", "Lenovo", "Lenovo"), model = c("X1 Gen 10", 
"T14s", "P1", "X1 Gen 10", "T15", "T14s"), date_assigned = c(44576L, 
44471L, 44341L, 44681L, 44606L, 44378L), installed_software = c("MS Office, Adobe Acrobat, Slack", 
"MS Office", "Mathcad, Adobe Acrobat", "", "MS Office, Slack", 
"Google Chrome")), class = "data.frame", row.names = c(NA, -6L
))

Solution

  • Given your sample data as df, you could use select() and filter() from dplyr to create a subset of managers.

    library(dplyr)
    managers <- df |> select("manager_id" =employee_id, 
                       "employee_id" =managing) %>%
      filter(.$employee_id!= "")
    

    Result:

    > managers
      manager_id    employee_id
    1     412153         174543
    2     627234 534253, 643645