Search code examples
rsortingmultiple-columns

Arranging character values of multiple columns that contain the same name with R


I use R and I am merging mutiple data sets horizontically (by column), where each data set has one same column. For example, consider

`df1=data.frame('name'=c('9504','07123','40355','59003'),
             'age 16'=c(21,22,23,25), 
              'income 43_'=c(3,4,7,2))`

and

 `df2=data.frame('name'=c('9504','40355','59003','07123'),
             'new 21'=c(21,22,23,25),
              'GDP 98'=c(3,4,7,2) )`

Here, the two data sets have the same 'name' column. I merge the two data sets horizontally and I have

`df=data.frame('name'=c('9504','07123','40355','59003'),
             'age 16'=c(21,22,23,25),
              'income 43'=c(3,4,7,2),
             'name'=c('9504','40355','59003','07123'),
             'new 21'=c(21,22,23,25),
              'GDP 98'=c(3,4,7,2))`

Note that merging is not my problem. It is taken as given. Given that data frame, df, I want to arrange the values of the "name" columns (in an ascending or descrnding order) so as to have the same order in both 'name' columns and ofcourse to rearrange accordingly the other columns

For example, I want to have

`dfN1=data.frame('name'=c('07123','40355','59003','9504'),
             'age 16'=c(22,23,25,21),
              'income 43'=c(4,7,2,3),
             'name'=c('07123','40355','59003','9504'),
             'new 21'=c(25,22,23,21),
              'GDP 98'=c(2,4,7,3))`

 name age.16 income.43 name.1 new.21 GDP.98
1 07123     22         4  07123     25      2
2 40355     23         7  40355     22      4
3 59003     25         2  59003     23      7
4  9504     21         3   9504     21      3

I do not want to process df1 and df2 but directly df to get 'dfN1'

So I tried

`dfN<-df[with(df, order(name)),]`     
   name age.16 income.43 name.1 new.21 GDP.98
2 07123     22         4  40355     22      4
3 40355     23         7  59003     23      7
4 59003     25         2  07123     25      2
1  9504     21         3   9504     21      3

but as you can see only the first two columns are corrected properly but not the last two. So dfN is not the same as dfN1, which is what I want to achieve.

What can I do?

Note that in reality I merge many data frames before getting df. So my real df contains many times the word "name" and also I have many more variables between the "name" columns. So a more general code would be helpful.


Solution

  • Since your data sets share a "key" column, this is a natural place to use a join with df1 & df2:

    library(dplyr)
    df1 |>
      left_join(df2)     # this will join by the columns in common, i.e. "name"
     #left_join(df2, join_by(name))  # good practice to be explicit if possible
    

    Result

       name age.16 income.43_ new.21 GDP.98
    1  9504     21          3     21      3
    2 07123     22          4     25      2
    3 40355     23          7     22      4
    4 59003     25          2     23      7
    

    If you need to start from df for some reason (I'd avoid it if possible), you could get to the same place with manual subsets of df:

    left_join(df[,1:3], df[,4:6], join_by(name == "name.1"))
    

    Note that in reality I merge many data frames before getting df. So my real df contains many times the word "name" and also I have many more variables between the "name" columns. So a more general code would be helpful.

    Again, this is more straightforward to do each time as a join, such that each added table will add variables, unified by key columns like "name." Keeping the data "tidy," where each row represents one observation, will make your analysis much simpler, robust, and reliable.

    https://vita.had.co.nz/papers/tidy-data.pdf

    The main pitfall to watch out for is to make sure your keys are unique; if they're not, a row could get matched to multiple rows, inadvertently growing the output each time. The multiple parameter of left_join gives options for avoiding this, or you could perform a summarization step to make sure the key columns are unique for each row in the tables to be joined.