Search code examples
rdplyrbind-rows

How to merge rows with names that start with the same string within one dataframe?


I want to apologize in advance because I am very inexperienced with R or coding in general, but I am trying to create a script for a research project, and couldn't find a solution for this problem after several hours of googling.

I want to create a large dataframe from a large number of .csv files that contain measurements for one or more elements per year, but I running into an issue where a string (...1,...2 etc., maybe an artefact from creating a list??) is added to row names so that separate rows are created for data from the same year from different files. I would like to have measurements from the same year in the same, single row, instead.

What I did:

I tried creating a dataframe using dplyr's bind_rowsfunction as follows:

dataset_files <- list.files(pattern = "*.csv")
dataset_files_list <- lapply(dataset_files, read.csv)
raw_data <- bind_rows(dataset_files_list)

As an example, the resulting dataframe looks somewhat like this:

File A File B
1900...1 0.8 NA
1901...2 2.1 NA
1900...3 NA 1.0
1901...4 NA 1.6

Upon someone else's advice, I tried simply removing the added strings with gsub, which results in an error because duplicate row names are not allowed.

rownames(raw_data) = gsub(pattern="...*", replacement = "", x=rownames(raw_data))

In short, my goal is to create a data frame where all measurements from the same year appear in the same row, either by using a better method to create the data frame, or merging the rows that start with the same date (or any other solution you would suggest instead). What is the best way to resolve this issue?

Thanks in advance!


Solution

  • The issue is that (as you said) duplicate rownames aren't allowed.

    The solution is to turn the rownames into a column, and then transform that:

    library(tidyverse)
    
    df |>
      rownames_to_column() |> 
      mutate(year = str_remove(rowname, "\\..*"), .before = 1, .keep = "unused")
    

    Output:

      year File.A File.B
    1 1900    0.8     NA
    2 1901    2.1     NA
    3 1900     NA    1.0
    4 1901     NA    1.6
    

    Data:

    df <- read.table(text="
    'File A'    'File B'
    1900...1    0.8     NA
    1901...2    2.1     NA
    1900...3    NA  1.0
    1901...4    NA  1.6", header=T)