Search code examples
rdplyrdata.tabledata-cleaningread.csv

Read CSV file it include several numbers


I have a data like this and for male and female column, I just need the first row (highlited ones). Could you please how can I read just these numbers and exclude rests.

df <- structure(list(
  col1 = c("First", "Frequency\nPercent", "CA", "TX"),
  col2 = c("Sex_3585", "Male", "298026\n5\n9", "45678\n15\n89"),
  col3 = c("", "Female", "57039\n10\n25", "64290\n100\n258")
),
class = "data.frame",
row.names = c(NA,-4L))

                col1          col2            col3
1              First      Sex_3585                
2 Frequency\nPercent          Male          Female
3                 CA  298026\n5\n9   57039\n10\n25
4                 TX 45678\n15\n89 64290\n100\n258


Solution

  • First, I create a simple example of your data.

    df <- structure(list(
      col1 = c("First", "Frequency\nPercent", "CA", "TX"),
      col2 = c("Sex_3585", "Male", "298026\n5\n9", "45678\n15\n89"),
      col3 = c("", "Female", "57039\n10\n25", "64290\n100\n258")
    ),
    class = "data.frame",
    row.names = c(NA,-4L))
    
                    col1          col2            col3
    1              First      Sex_3585                
    2 Frequency\nPercent          Male          Female
    3                 CA  298026\n5\n9   57039\n10\n25
    4                 TX 45678\n15\n89 64290\n100\n258
    

    Second, after reading in the file with read.csv, one option is to separate the rows that have carriage returns (i.e., \n). Then, we can group by the first column and keep only the first row for each group.

    library(tidyverse)
    
    df %>% 
      separate_rows(everything(), sep = "\n") %>% 
      group_by(col1) %>% 
      filter(row_number()==1)
    

    Output

      col1      col2     col3    
      <chr>     <chr>    <chr>   
    1 First     Sex_3585 ""      
    2 Frequency Male     "Female"
    3 Percent   Male     "Female"
    4 CA        298026   "57039" 
    5 TX        45678    "64290"