Search code examples
rcsvimportdata-processingdata-cleaning

Importing CSV files with CRLF broken lines in R


I'm an urban planner migrating towards spatial data analysis. I am not oblivious to R and programming in general but since I don't have the proper training my skills are limited sometimes.

At the moment I am trying to analyse about 50 CSV files containing financial data concerning public auctions which are from 60000 to 300000 lines long with 39 fields. The files are exports from the Romanian national public auctioning system, which is a form-like platform.

The issue is that some of the lines are broken by CRLF line endings in the middle of the address fields. I suspect that when people entered their address in the form they copy/pasted it from other files where it was multiline.

The issue cannot be resolved by Find&Replace as this will also replace the correct CRLF at the end of the line.

As an example the data is formatted something like this and has a CRLF after each line(They used ^ as the delimiter):

Castigator^CastigatorCUI^CastigatorTara^CastigatorLocalitate^CastigatorAdresa^Tip^TipContract^TipProcedura^AutoritateContractanta^AutoritateContractantaCUI^TipAC^TipActivitateAC^NumarAnuntAtribuire^DataAnuntAtribuire^TipIncheiereContract^TipCriteriiAtribuire^CuLicitatieElectronica^NumarOfertePrimite^Subcontractat^NumarContract^DataContract^TitluContract^Valoare^Moneda^ValoareRON^ValoareEUR^CPVCodeID^CPVCode^NumarAnuntParticipare^DataAnuntParticipare^ValoareEstimataParticipare^MonedaValoareEstimataParticipare^FonduriComunitare^TipFinantare^TipLegislatieID^FondEuropean^ContractPeriodic^DepoziteGarantii^ModalitatiFinantare
S.C. RCTHIA CO S.R.L.^65265644^Romania^Bucharest^DN1
Nr. 1, ^Anunt de atribuire la anunt de participare^Furnizare^Licitatie deschisa^COMPANIA NATIONALA DE TRANSPORT AL ENERGIEI ^R656556^^Electricitate^96594^2007-12-14^Un contract de achizitii publice^Pretul cel mai scazut^^1^^61^2007-11-08 00:00:00.000^Televizoare^304503.95^RON^304503.950000000001^89650.5^45937^323124100-1^344578^2007-10-02^49700.00^RON^^^^^^Nu este cazul;^Surse proprii;
ASOC : SC MNG SRLsi SC AquaiM SA ^56565575;656224^Romania^Ploiesti^Str. Independentei nr.15; 
Str. Carol nr. 45^Anunt de atribuire la anunt de participare^Lucrari^Negociere fara anunt de participare^MUNICIPIUL RAMNICU VALCEA^6562655^Administratie publica locala (municipii, orase, comune), institutie publica in subordonarea/coordonarea administratiei publice locale^Servicii generale ale administratiilor publice^56566^2007-10-10^Un contract de achizitii publice^Pretul cel mai scazut^^1^^65656^2007-09-12^Proiectare si executie lucrari^5665560.00^RON^659966.0^5455222^7140^65689966-2^^^^^^^^^^^

In order to properly process the data I would need the CSV to be read like this, by removing only the CRLF that break lines - which Find&Replace cannot do:

Castigator^CastigatorCUI^CastigatorTara^CastigatorLocalitate^CastigatorAdresa^Tip^TipContract^TipProcedura^AutoritateContractanta^AutoritateContractantaCUI^TipAC^TipActivitateAC^NumarAnuntAtribuire^DataAnuntAtribuire^TipIncheiereContract^TipCriteriiAtribuire^CuLicitatieElectronica^NumarOfertePrimite^Subcontractat^NumarContract^DataContract^TitluContract^Valoare^Moneda^ValoareRON^ValoareEUR^CPVCodeID^CPVCode^NumarAnuntParticipare^DataAnuntParticipare^ValoareEstimataParticipare^MonedaValoareEstimataParticipare^FonduriComunitare^TipFinantare^TipLegislatieID^FondEuropean^ContractPeriodic^DepoziteGarantii^ModalitatiFinantare
S.C. RCTHIA CO S.R.L.^65265644^Romania^Bucharest^DN1 Nr. 1, ^Anunt de atribuire la anunt de participare^Furnizare^Licitatie deschisa^COMPANIA NATIONALA DE TRANSPORT AL ENERGIEI ^R656556^^Electricitate^96594^2007-12-14^Un contract de achizitii publice^Pretul cel mai scazut^^1^^61^2007-11-08 00:00:00.000^Televizoare^304503.95^RON^304503.950000000001^89650.5^45937^323124100-1^344578^2007-10-02^49700.00^RON^^^^^^Nu este cazul;^Surse proprii;
ASOC : SC MNG SRLsi SC AquaiM SA ^56565575;656224^Romania^Ploiesti^Str. Independentei nr.15; Str. Carol nr. 45^Anunt de atribuire la anunt de participare^Lucrari^Negociere fara anunt de participare^MUNICIPIUL RAMNICU VALCEA^6562655^Administratie publica locala (municipii, orase, comune), institutie publica in subordonarea/coordonarea administratiei publice locale^Servicii generale ale administratiilor publice^56566^2007-10-10^Un contract de achizitii publice^Pretul cel mai scazut^^1^^65656^2007-09-12^Proiectare si executie lucrari^5665560.00^RON^659966.0^5455222^7140^65689966-2^^^^^^^^^^^

I have found a possible solution (Is there a way in R to join broken lines of csv file?), but it required some tweaking to fit my needs. The end result is that the code below hangs and does not reach the end of the process, even on small sample files.

My alteration of the accepted solution code from the above mentioned post:

dat <- readLines("filename.csv") # read whatever is in there, one line at a time
varnames <- unlist(strsplit(dat[1], "^", fixed = TRUE)) # extract variable names
nvar <- length(varnames)

k <- 1 # setting up a counter
dat1 <- matrix(NA, ncol = nvar, dimnames = list(NULL, varnames))

while(k <= length(dat)){
  k <- k + 1
  if(dat[k] == "") {k <- k + 1
  print(paste("data line", k, "is an empty string"))
  if(k > length(dat)) {break}
  }
  temp <- dat[k]
  # checks if there are enough commas or if the line was broken
  while(length(gregexpr("^", temp)[[1]]) < nvar-1){
    k <- k + 1
    temp <- paste0(temp, dat[k])
  }
  temp <- unlist(strsplit(temp, "^"))
  message(k)
  dat1 <- rbind(dat1, temp)
}

dat1 = dat1[-1,] # delete the empty initial row    

Counting fields between delimiters seems like a good solution but I am unable to find a good way to do this and my R programming skills are not enough apparently.

So is there any way to fix this type of broken CSV files in R?

Working files sample can be accessed here: http://data.gv.ro/dataset/4a4903c4-b1e3-46d1-82a5-238287f9496c/resource/c6abc0ef-3efb-4aef-bc0a-411f8cab2a28/download/contracte-2007.csv

Thanks for any help you can give!


Solution

  • The trouble seems to be that a ^ is a special character. If you step through your code you will see that you have 627 variables instead of 39. It is making each character a variable. Try this:

    dat <- readLines("filename.csv") # read whatever is in there, one line at a time
    varnames <- unlist(strsplit(dat[1], "\\^"))  # extract variable names
    nvar <- length(varnames)
    
    k <- 1 # setting up a counter
    dat1 <- matrix(NA, ncol = nvar, dimnames = list(NULL, varnames))
    
    while(k <= length(dat)){
      k <- k + 1
      #if(dat[k] == "") {k <- k + 1
      #print(paste("data line", k, "is an empty string"))
      if(k > length(dat)) {break}
      #}
      temp <- dat[k]
      # checks if there are enough commas or if the line was broken
      while(length(gregexpr("\\^", temp)[[1]]) < nvar-1){
        k <- k + 1
        temp <- paste0(temp, dat[k])
      }
      temp <- unlist(strsplit(temp, "\\^"))
      message(k)
      dat1 <- rbind(dat1, temp)
    }
    
    dat1 = dat1[-1,] # delete the empty initial row    
    

    Sorry missed that difference in your code and mine. You don't want fixed=true. changing it to the above gives you this:

    > varnames
     [1] "Castigator"                       "CastigatorCUI"                    "CastigatorTara"                  
     [4] "CastigatorLocalitate"             "CastigatorAdresa"                 "Tip"                             
     [7] "TipContract"                      "TipProcedura"                     "AutoritateContractanta"          
    [10] "AutoritateContractantaCUI"        "TipAC"                            "TipActivitateAC"                 
    [13] "NumarAnuntAtribuire"              "DataAnuntAtribuire"               "TipIncheiereContract"            
    [16] "TipCriteriiAtribuire"             "CuLicitatieElectronica"           "NumarOfertePrimite"              
    [19] "Subcontractat"                    "NumarContract"                    "DataContract"                    
    [22] "TitluContract"                    "Valoare"                          "Moneda"                          
    [25] "ValoareRON"                       "ValoareEUR"                       "CPVCodeID"                       
    [28] "CPVCode"                          "NumarAnuntParticipare"            "DataAnuntParticipare"            
    [31] "ValoareEstimataParticipare"       "MonedaValoareEstimataParticipare" "FonduriComunitare"               
    [34] "TipFinantare"                     "TipLegislatieID"                  "FondEuropean"                    
    [37] "ContractPeriodic"                 "DepoziteGarantii"                 "ModalitatiFinantare"