Search code examples
rdataframedata-cleaning

Find typos (unidentical rows by factor) in a supposedly flattened dataframe in R


I have a supposedly flattened dataframe, with about 40 columns of different data types. There is a variable that acts as a unique index, for the first 15 columns or so. Because it's a flattened relational database, in these columns, all the rows with the same value in this index variable should be identical. But they aren't. I want to find where the typos are.

I've made this very simplified example:

    structure(list(f = structure(c(1L, 2L, 3L, 3L, 4L, 5L, 6L, 6L, 
7L, 7L), .Label = c("a", "b", "c", "d", "e", "f", "g"), class = "factor"), 
    number = c(1, 2, 3, 3, 4, 5, 6, 7, 21, 21), name = structure(c(1L, 
    2L, 4L, 3L, 5L, 6L, 7L, 7L, 8L, 8L), .Label = c("alfa", "beta", 
    "calostrE", "calostrO", "dedo", "elefante", "fiasco", "general"
    ), class = "factor")), .Names = c("f", "number", "name"), row.names = c(NA, 
-10L), class = "data.frame")

It looks like this:

   f number     name
1  a      1     alfa
2  b      2     beta
3  c      3 calostrO
4  c      3 calostrE
5  d      4     dedo
6  e      5 elefante
7  f      6   fiasco
8  f      7   fiasco
9  g     21  general
10 g     21  general

f is the unique index. In my original dataframe, it's a date that has been converted to factor, but that's irrelevant. As you see, rows 9 and 10 are correct, because all the other variable values are identical. Rows 1,2,5, and 6 are also correct, because there is only one row per factor value. But row pairs 3-4 and 7-8 are incorrect: they have typos, and there are values of the variables that are not identical.

The result I want is something like this:

Rows.with.typos..........Column.names  
.....3......................."name"  
.....7......................."number"

As you see, I have problems with markdown, too.

This example is simple, but if there are inequalities (typos) in more than one column, there should be more than one element under "Column names" in the final result. Note also that my original dataframe is quite wide and has quite a lot of columns, and only some of them should be identical by row for a given value of f.

Clarification a posteriori: the row selected is always the first one of the group (see my response to the comments below).

I have only managed to get the rows with typos, but in a very convoluted manner, that I think it would be not useful to post.


Solution

  • I myself have created a function that does the trick. It's even better, for it creates another Excel file in which the typos can be easily seen because all the other cells are filled with lines. I think it can be useful to many a beginner or to data cleaners, but also sure that the code can be perfected. The variables and functions names are in Spanish.

    detectar_errores<-function(x,variables,index){
    
    #The first argument is the dataframe. The third argument is the index. And the second is a vector, that can be numerical (positional) or of variable names, and specifies which variables are the ones that should have identical values if the variable "index" has the same values.
    
    
    #Checks packages
    
    
    if(require("xlsx")){
        print("xlsx está correctamente cargado")
    } else {
        print("tratando de instalar xlsx")
        install.packages("xlsx")
        if(require(xlsx)){
            print("xlsx instalado y cargado")
        } else {
            stop("no pude instalar y cargar xlsx")
        }
    }
    
    if(require("dplyr")){
        print("dplyr está correctamente cargado")
    } else {
        print("tratando de instalar dplyr")
        install.packages("dplyr")
        if(require(dplyr)){
            print("dplyr instalado y cargado")
        } else {
            stop("no pude instalar y cargar dplyr")
        }
    }
    
    
    #Selects the variables and groups by index
    #Then creates a new variable, that is TRUE is there is more than one row in the group and there are the same rows as unique index values
    #The result is stored in a new dataframe called "primera"
    
    primera<-x %>% select(variables) %>% group_by(index) %>% do({
      clasificador<-nrow(.)==nrow(unique(.)) & nrow(.)>1
      data.frame(.,clasificador) #El punto es un símbolo para el grupo
    })
    
    #Selects the rows that interest us and stores them in another dataframe
    
    segunda<-primera[primera$clasificador==T,]
    
    #Creates a function that takes a vector and checks if all its elements are identical(i.e: 3, 3, 3)
    #If they are, returns as many NAs as the vector length (that variables doesn't have typos)
    #If they aren't, returns the same vector, in order to the discrepancies to be seen
    
    todosiguales<-function(x){
      clase<-class(x)
      if(identical(x,rep(x[1],length(x)))){
      solucion<-rep(NA,length(x))
      class(solucion)<-clase
      return(solucion)
      }else{
    return(x)}
    }
    
    #Creates a function that substitutes the NAs for lines in a character vector
    
    rayas<-function (y){
      y[is.na(y)]<-"--"
      return(y)
    }
    
    #Creates another dataframe by manipulating the previous one
    #It groups by the index and then transforms the variables
    #It coerces them to character, then applies the function todosiguales and then the funcion rayas
    
    tercera<-segunda %>%
          group_by(index) %>%
          mutate_each(funs(as.character)) %>%
          mutate_each(funs(todosiguales)) %>%
          mutate_each(funs(rayas))
    
    #That returns the last dataframe. Now it's written as a new Excel file
    
    write.xlsx(tercera,"Errores_detectados.xls")
    }