Search code examples
ralphabetical

R combine Columns in DF with character strings and sort them alphabetically based on specific Columnorder


I have a dataframe that has 4 columns specifying Firstnames and Surnames of 2 People:

    Surname Firstname Surname2 Firstname2
  1    Wolf    Stefan   Schmit       Paul
  2  Schmit      Paul     Wolf     Stefan
  3  Schmit      Paul     Fore     Sabine
  4    Fore    Sabine   Schmit       Hans
  5  Schmit      Hans     Wolf     Stefan
  6  Schmit      Paul   Schmit       Hans
  7  Bracht     Armin   Brecht      Alwin
  8  Brecht     Alwin   Bracht      Armin

Now i want to add an fifth extra column, where the two persons are alphabetically grouped according to the Surname, but if this is the same then according to the Firstname... In the new fifth column should be both Persons, first with Firstname, then with Surname and seperated by a comma f.e:

  Surname Firstname Surname2 Firstname2                         Team
1    Wolf    Stefan   Schmit       Paul   Paul Schmit ,  Stefan Wolf
2  Schmit      Paul     Wolf     Stefan   Paul Schmit ,  Stefan Wolf
3  Schmit      Paul     Fore     Sabine   Sabine Fore ,  Paul Schmit
4    Fore    Sabine   Schmit       Hans   Sabine Fore ,  Hans Schmit
5  Schmit      Hans     Wolf     Stefan   Hans Schmit ,  Stefan Wolf
6  Schmit      Paul   Schmit       Hans   Hans Schmit ,  Paul Schmit
7  Bracht     Armin   Brecht      Alwin Armin Bracht ,  Alwin Brecht
8  Brecht     Alwin   Bracht      Armin Armin Bracht ,  Alwin Brecht

I have a working code based on a for-loop, but i look for a more effective version for bigger dataframes and more comfortable to work with since individual columns for each name would possibly be more than 2...

# Simple Code:
Surname <- c("Wolf", "Schmit", "Schmit", "Fore", "Schmit", "Schmit", "Bracht", "Brecht")
Firstname <- c("Stefan", "Paul", "Paul", "Sabine", "Hans", "Paul", "Armin", "Alwin")
Surname2 <- c("Schmit", "Wolf", "Fore", "Schmit", "Wolf", "Schmit", "Brecht", "Bracht")
Firstname2 <- c("Paul", "Stefan", "Sabine", "Hans", "Stefan", "Hans", "Alwin", "Armin")
library(reshape2)
tester <- melt(data.frame(Surname, Firstname, Surname2, Firstname2))
tester[] <- lapply(tester, as.character)
tester

namescomp <- function(data, i){
    if (data[i, "Surname"] < data[i, "Surname2"]){
      paste(data[i, "Firstname"], data[i, "Surname"], ", ", data[i, "Firstname2"], data[i, "Surname2"])
     } else if (data[i, "Surname"] > data[i, "Surname2"]){
       paste(data[i, "Firstname2"], data[i, "Surname2"], ", ", data[i, "Firstname"], data[i, "Surname"])
    } else 
       { if(data[i, "Firstname"] < data[i, "Firstname2"]){
         paste(data[i, "Firstname"], data[i, "Surname"], ", ", data  [i, "Firstname2"], data[i, "Surname2"])
     } else {
      paste(data[i, "Firstname2"], data[i, "Surname2"], ", ", data[i, "Firstname"], data[i, "Surname"])
      }
    }
  }


for(y in 1:nrow(tester)){
  i <- y
  tester[i, "Team"] <- namescomp(tester, i)
}
tester

Solution

  • A tidyverse solution:

    library(tibble)
    library(dplyr)
    library(tidyr)
    library(stringr)
    
    Surname <- c("Wolf", "Schmit", "Schmit", "Fore", "Schmit", "Schmit", "Bracht", "Brecht")
    Firstname <- c("Stefan", "Paul", "Paul", "Sabine", "Hans", "Paul", "Armin", "Alwin")
    Surname2 <- c("Schmit", "Wolf", "Fore", "Schmit", "Wolf", "Schmit", "Brecht", "Bracht")
    Firstname2 <- c("Paul", "Stefan", "Sabine", "Hans", "Stefan", "Hans", "Alwin", "Armin")
    
    df <- data_frame(Surname, Firstname, Surname2, Firstname2)
    
    df %>%
      # create an identifier for each team
      rownames_to_column(var = 'team_id') %>%
      # split all name components into separate rows
      gather(component, value, -team_id) %>%
      # extract a person_id from the number behind first/last name. If there's no number there, use 1
      mutate(person_id = coalesce(as.numeric(str_extract(component, '[0-9]+$')), 1)) %>%
      # remove the number from the first/last name, then pivot the data.frame so that there's a row for every team x person
      mutate(component = str_replace(component, '[0-9]+$', '')) %>%
      spread(component, value) %>%
      # order by team_id (not strictly necessary), then by Surname, then by Firstname (if you want the order reversed, wrap the variable in `desc()`)
      arrange(team_id, Surname, Firstname) %>%
      # collapse Surname and Firstname into a `full_name` column
      unite(full_name, Firstname, Surname, sep = ' ') %>%
      # collapse the full names within each team into a single line, separated by commas
      group_by(team_id) %>%
      summarize(Team = paste(full_name, collapse=', '))
    

    Doesn't produce exactly the output you want but you could join what it does produce back into the original table on rownames.