Search code examples
rsummary

Summarize column in multiple files in R


I have multiple files which is formatted like:

> Test1.txt
            NameNo  Team  etc
1:         AS001-A.    8  773
2:         AS002-S.    7  631
3:         AS003-G.    8  970

> Test2.txt
            NameNo  Team  etc
1:         AB001-A.    2  773
2:         AB002-S.    6  631
3:         AB003-G.    6  970

> Test2.txt
            NameNo  Team  etc
1:         AR001-A.    1  773
2:         AR002-S.    1  631
3:         AR003-G.    1  970

Here is my code:

files <- list.files(pattern ="*.txt", full.names = TRUE) 
files
> files
Test1.txt
Test2.txt
Test3.txt

I then used table(Team) one by one.

I need to summarize column Team

Desired output:

Item      Team
Test1      7,8
Test2      2,6
Test3        1

Thanks for all help.


Solution

  • # get all the textfiles starting with "Test" 
    list.files(pattern ="^Test.*\\.txt$") |>
    
      # loop through the files, reading them in, adding a column for their filename
      map_dfr(~read_csv(.) |> mutate(file = str_remove(.x, ".txt"))) |>
    
      # separate the weird format into columns using regex
      separate_wider_regex(`NameNo  Team  etc`, patterns = c("\\d+:\\s+", NameNo = "\\S+", "\\s+", Team = "\\d+", "\\s+", etc = "\\d+")) |>
    
      # get the unique teams for each file, sorted
      summarise(Team = paste0(sort(unique(`Team`)), collapse = ","), .by = file)
    

    Output:

    # A tibble: 3 × 2
      file  Team 
      <chr> <chr>
    1 Test1 7,8  
    2 Test2 2,6  
    3 Test3 1
    

    If the row numbers aren't actually part of the files, you can do this:

    f = list.files(pattern ="^Test.*\\.txt$")
    
    tibble(Item = str_remove(f, "\\.txt"),
           Team = map_chr(f, ~read.table(., header = TRUE)$Team |> unique() |> sort()  |> paste0(collapse = ",")))