Search code examples
rdplyrpurrrplyr

Use part of row data for new columns in R


I have a very large df with a column that contains the file directory for each row's data.

Example: D:Mouse_2174/experiment/13/trialsummary.txt.1

I would like to create 2 new columns, one with only the mouse ID (2174) and one with the session number (13). There will be different IDs and session numbers based on the row.

I've used sub as recommended here (match part of names in data.frame to new column), but only can get the subject column to say "D:Mouse_2174" I've added an additional line and can get it down to "D:Mous2174"

Is there a way to eliminate all chars before _ and after / to obtain mouse ID? For session number, I'm not quite as sure what to do with multiple / in the directory name.

percent_correct_list$mouse_id <- sub("/.+", "", percent_correct_list$rn)
#gives me D:Mouse_2174
percent_correct_list$mouse_id <- sub("+._", "", percent_correct_list$mouse_id)
#gives me D:Mous2174

Here is sample code for the directories:

df <- data.frame(
             rn = c("D:Mouse_2174/iti_intervals/9/trialsummary.txt.1",
                    "D:Mouse_2181/iti_intervals/33/trialsummary.txt.1",
                    "D:Mouse_2183/iti_intervals/107/trialsummary.txt.2",
                    "D:Mouse_2185/iti_intervals/87/trialsummary.txt.1")
)

What I want:

rn id session
D:.. 2174 9
D:.. 2181 33
D:.. 2183 107
D:.. 2185 87

Maybe there's some way to do this earlier along in the process too (like when I import all the data into a df using lapply - but this is good as well)


Solution

  • For sure isnt an elegant solution. Only works if your ID and Session are always numbers...

    df <- data.frame(
                rn = c("D:Mouse_2174/iti_intervals/9/trialsummary.txt.1",
                       "D:Mouse_2181/iti_intervals/33/trialsummary.txt.1",
                       "D:Mouse_2183/iti_intervals/107/trialsummary.txt.2",
                       "D:Mouse_2185/iti_intervals/87/trialsummary.txt.1")) %>%
                # Extract all numeric values from the string
                mutate(allnums = regmatches(rn, gregexpr("+[[:digit:]]+", rn)))%>%
                # Separate them
                separate(allnums, into = c("id", "session", "idk"), sep = "\\,") %>%
                # Extract them individually
                mutate(id = as.numeric(regmatches(id, gregexpr("+[[:digit:]]+", id,))),
                       session = as.numeric(regmatches(session, gregexpr("+[[:digit:]]+", session)))) %>%
                select(-idk)
    

    Output:

    1  D:Mouse_2174/iti_intervals/9/trialsummary.txt.1   2174       9
    2  D:Mouse_2181/iti_intervals/33/trialsummary.txt.1  2181      33
    3  D:Mouse_2183/iti_intervals/107/trialsummary.txt.2 2183     107
    4  D:Mouse_2185/iti_intervals/87/trialsummary.txt.1  2185      87