r dplyr data.table data-cleaning read.csv

Read CSV file it include several numbers

I have a data like this and for male and female column, I just need the first row (highlited ones). Could you please how can I read just these numbers and exclude rests.

df <- structure(list(
  col1 = c("First", "Frequency\nPercent", "CA", "TX"),
  col2 = c("Sex_3585", "Male", "298026\n5\n9", "45678\n15\n89"),
  col3 = c("", "Female", "57039\n10\n25", "64290\n100\n258")
),
class = "data.frame",
row.names = c(NA,-4L))

                col1          col2            col3
1              First      Sex_3585                
2 Frequency\nPercent          Male          Female
3                 CA  298026\n5\n9   57039\n10\n25
4                 TX 45678\n15\n89 64290\n100\n258

Solution

First, I create a simple example of your data.

df <- structure(list(
  col1 = c("First", "Frequency\nPercent", "CA", "TX"),
  col2 = c("Sex_3585", "Male", "298026\n5\n9", "45678\n15\n89"),
  col3 = c("", "Female", "57039\n10\n25", "64290\n100\n258")
),
class = "data.frame",
row.names = c(NA,-4L))

                col1          col2            col3
1              First      Sex_3585                
2 Frequency\nPercent          Male          Female
3                 CA  298026\n5\n9   57039\n10\n25
4                 TX 45678\n15\n89 64290\n100\n258

Second, after reading in the file with read.csv, one option is to separate the rows that have carriage returns (i.e., \n). Then, we can group by the first column and keep only the first row for each group.

library(tidyverse)

df %>% 
  separate_rows(everything(), sep = "\n") %>% 
  group_by(col1) %>% 
  filter(row_number()==1)

Output

  col1      col2     col3    
  <chr>     <chr>    <chr>   
1 First     Sex_3585 ""      
2 Frequency Male     "Female"
3 Percent   Male     "Female"
4 CA        298026   "57039" 
5 TX        45678    "64290"