Search code examples
rdataframesubsetdata-cleaning

How to locate a structured region of data inside of a not structured data frame in R?


I have a certain kind of data frames that contain a subset of interest. The problem is that this subset, is non consistent between the different data frames. Nonetheless, in a more abstract level, follows a general structure: a rectangular region inside the data frame.

example1 <- data.frame(x = c("name", "129-2", NA, NA, "acc", 2, 3, 4, NA, NA), 
       y = c(NA, NA, NA, NA, "deb", 3, 2, 5, NA, NA),
       z = c(NA, NA, NA, NA, "asset", 1, 1, 2, NA, NA))

print(example1)

      x    y     z
1   name <NA>  <NA>
2  129-2 <NA>  <NA>
3   <NA> <NA>  <NA>
4   <NA> <NA>  <NA>
5    acc  deb asset
6      2    3     1
7      3    2     1
8      4    5     2
9   <NA> <NA>  <NA>
10  <NA> <NA>  <NA>

The example1 contain a clear rectangular región with a structure information:

5    acc  deb asset
6      2    3     1
7      3    2     1
8      4    5     2

As mentioned before, the region is not always consistent,

  1. the position of the columns are not always the same
  2. the name of the variables insde the subset of interest are not always the same

Here another example2:

example2 <- data.frame(x = c("name", "129-2", "wallabe #23", NA, NA, "acc", 2, 3, 4, NA ), 
       y = c(NA, NA, NA, NA, "balance", "deb", 3, 2, 5, NA),
       z = c(NA, NA, NA, NA, NA, "asset", 1, 1, 2, NA),
       u = c(NA, NA, NA, "currency:", NA, NA, NA, NA, NA, NA),
       i = c(NA, NA, NA, "USD", "result", "win", 2, 3, 1, NA),
       o = c(NA, NA, NA, NA, NA, "lose", 2, 2, 1, NA))

print(example2)
> example2
            x       y     z         u      i    o
1         name    <NA>  <NA>      <NA>   <NA> <NA>
2        129-2    <NA>  <NA>      <NA>   <NA> <NA>
3  wallabe #23    <NA>  <NA>      <NA>   <NA> <NA>
4         <NA>    <NA>  <NA> currency:    USD <NA>
5         <NA> balance  <NA>      <NA> result <NA>
6          acc     deb asset      <NA>    win lose
7            2       3     1      <NA>      2    2
8            3       2     1      <NA>      3    2
9            4       5     2      <NA>      1    1
10        <NA>    <NA>  <NA>      <NA>   <NA> <NA>

The example2 contain a not clear rectangular región:

6          acc     deb asset      <NA>    win lose
7            2       3     1      <NA>      2    2
8            3       2     1      <NA>      3    2
9            4       5     2      <NA>      1    1

One method to scan this dataframe to locate this kind of region inside of it?

Any idea is appreciated


Solution

  • You might want to try the longest sequence with same amount of NAs:

    findTable <- function(df){
      naSeq <- rowSums(is.na(df))          # How many NA per row
      myRle <- rle(naSeq )$length          # Find sequences length
      df[rep(myRle == max(myRle), myRle),] # Get longest sequence
    }
    
    findTable(example1)
        x   y     z
    5 acc deb asset
    6   2   3     1
    7   3   2     1
    8   4   5     2
    
    findTable(example2)
        x   y     z    u   i    o
    6 acc deb asset <NA> win lose
    7   2   3     1 <NA>   2    2
    8   3   2     1 <NA>   3    2
    9   4   5     2 <NA>   1    1