
How can I automatically extract specific values from a .csv file and add them to specific cells of a pre-existing data frame/tibble in R?


I want to automatically extract specific values from a .csv file, which is generated by our measuring device, into a data frame/tibble in R that has a pre-defined layout. The name of the measured sample can be used as an identifier. The top of the .csv file contains information I don't want to extract; further down there is a table with measurement data from which I want to extract specific values. My problem is that the column names of the measurements are not the headers of the file, because of the information at the top. Therefore I don't know how to filter the table and extract the values I want based on the columns. In addition, I failed to insert the values at the correct row of the pre-existing output table.

The data from the machine looks like this:

# generating minimal input data
library(tibble)
A <- c("Excel Analysed Data Export", "Copyright (c) 2013 QIAGEN GmbH. All Rights Reserved.", "File", "Date", "Time", NA, "Operator", "Channel", "Cycling A.Green (Page 1)", NA, "Quantitative analysis of Cycling A.Green (Page 1)", NA, "No.", "A1", "A2", "A3", "B6", "B7", "C8", "D1")

B <- c(NA, NA, "20241103_Test-Data.rex", "03.11.2024", "18:06:22", NA, "AB", "Threshold", 0.02, NA, NA, NA, "Name", "H2O", "B01M", "B01M", "B07M", "B07M", "B12M", "B12M")

C <- c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "Ct", NA, 28.57, 28.76, 27.95, 27.82, NA, NA)

D <- c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "Ct Comment", "NEG (NTC)", NA, NA, NA, NA, "NEG (NTC)", "NEG (NTC)")

E <- c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "Calc Conc (copies/ul)", NA, 7.51E+02, 6.74E+02, 1.06E+03, 1.13E+03, NA, NA)

F <- c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "Rep. Calc. Conc.", NA, 712, NA, 1100, NA, NA, NA)


minimal.input.data <- tibble(A, B, C, D, E, F)

The table I would like to add the values to looks like this:

# pre-defined output table
ID <- c(1, 2, 4, 5)

Method <- c(rep("A", 4))

Channel <- c(rep("green", 4))

Sample <- c("H2O", "B12M", "B01M", "B07M")

Spiking <- c("none", "Culex", "DNV1", "DNV2")

Copies <- c(rep(NA, 4))

empty.output.data <- tibble(ID, Method, Channel, Sample, Spiking, Copies)

Row 13 of the input data (the .csv file generated by the machine) contains the headers of the table with the actual measurements. From this table I want to extract the values in the column "Rep. Calc. Conc." for the samples defined in the output table and add them to the column "Copies" of the output table. As an example: for sample "B01M" (header "Name" in row 13) I want to extract the value "712" from "minimal.input.data" and add it to the column "Copies" of "empty.output.data" in row 3 (where the Sample is "B01M").

The desired output table should look like this:

# desired output data
ID <- c(1, 2, 4, 5)

Method <- c(rep("A", 4))

Channel <- c(rep("green", 4))

Sample <- c("H2O", "B12M", "B01M", "B07M")

Spiking <- c("none", "Culex", "DNV1", "DNV2")

Copies <- c(NA, NA, 712, 1100)

desired.output.data <- tibble(ID, Method, Channel, Sample, Spiking, Copies)

I think the best way to solve this problem is to split it into two parts:

  1. generate a table with only the measurement data from the machine .csv and then
  2. add the values at the correct position in the empty output table

I am not sure whether step one can or should be done in R or is better done somewhere else. I'm using a Windows machine, so unfortunately I cannot use bash.

Thank you very much for reading and for any suggestions.


Solution

  • Here is my proposal:

    Create a small data frame named d1 with only two columns: the column with the sample names (named B in your example) and the column with the values you want to merge. I get the latter with select_if, since the only thing we know about this column is that it contains the value "Rep. Calc. Conc." at some point.

    Then a simple left_join adds the column with the new values at the correct position, matched on the Sample column (which is present in both d1 and empty.output.data).

    Afterwards you can rename the merged column as you prefer.

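    # step 1
    library(dplyr)
    library(tidyr)  # for drop_na()

    d1 <- minimal.input.data %>%
      # keep only the column that contains the text "Rep. Calc. Conc.";
      # na.rm = TRUE makes any() return FALSE (not NA) for the other columns
      select_if(~ any(. == "Rep. Calc. Conc.", na.rm = TRUE)) %>%
      mutate(Sample = minimal.input.data$B) %>%
      drop_na() %>%
      filter(Sample != "Name")
    
    # step 2
    final <- empty.output.data %>%
      left_join(d1, by = "Sample")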
    
    final
    
         ID Method Channel Sample Spiking Copies F    
      <dbl> <chr>  <chr>   <chr>  <chr>   <lgl>  <chr>
    1     1 A      green   H2O    none    NA     NA   
    2     2 A      green   B12M   Culex   NA     NA   
    3     4 A      green   B01M   DNV1    NA     712  
    4     5 A      green   B07M   DNV2    NA     1100
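
The result above still has an empty Copies column next to the merged column F, and the merged values are text. The rename can be folded into the join. This is a minimal sketch, assuming d1 looks the way step 1 leaves it (a character column F plus Sample). The small d1 built here is a hand-written stand-in so the example runs on its own; it is not read from the machine file.

```r
library(dplyr)
library(tibble)

# Stand-in for d1 as step 1 would produce it (assumption for illustration).
# The values are character because the column also held the header text.
d1 <- tibble(F = c("712", "1100"), Sample = c("B01M", "B07M"))

# Pre-defined output table from the question (length-1 values are recycled)
empty.output.data <- tibble(
  ID      = c(1, 2, 4, 5),
  Method  = "A",
  Channel = "green",
  Sample  = c("H2O", "B12M", "B01M", "B07M"),
  Spiking = c("none", "Culex", "DNV1", "DNV2"),
  Copies  = NA
)

final <- empty.output.data %>%
  select(-Copies) %>%                                     # drop the empty placeholder column
  left_join(rename(d1, Copies = "F"), by = "Sample") %>%  # merge the values in as "Copies"
  mutate(Copies = as.numeric(Copies))                     # convert text to numbers

final
```

Dropping the placeholder Copies column before the join avoids ending up with Copies.x and Copies.y, and as.numeric() is needed because the values arrive as text. Samples without a match in d1 (here H2O and B12M) keep NA.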