Search code examples
rbcpfile-conversion

Opening .bcp files in R


I have been trying to convert UK charity commission data which is in .bcp file format into .csv file format which could then be read into R. The data I am referring to is available here: http://data.charitycommission.gov.uk/. What I am trying to do is turn these .bcp files into useable dataframes that I can clean and run analyses on in R.

There are suggestions on how to do this through python on this github page https://github.com/ncvo/charity-commission-extract but unfortunately I haven't been able to get these options to work.

I am wondering if there is any syntax or packages that will allow me to open these data in R directly? I haven't been able to find any.

Another option would be to simply open the files within R as a single character vector using readLines. I have done this and the files are delimited with @**@ for columns and *@@* for rows. (See here: http://data.charitycommission.gov.uk/data-definition.aspx). Is there an R command that would allow me to create a dataframe from a long character string, defining de-limiters for both rows and columns?


Solution

  • R-solution

    edited version

    Not sure if all .bcp files are in the same format.. I downloaded the dataset you mentioned, and tried a solution for the smallest file; extract_aoo_ref.bcp

    library(data.table)
    
    #read the file as-is
    text <- readChar("./extract_aoo_ref.bcp", 
                     nchars = file.info( "./extract_aoo_ref.bcp" )$size, 
                     useBytes = TRUE)
    #replace column and row separator
    text <- gsub( ";", ":", text)
    text <- gsub( "@\\*\\*@", ";", text)
    text <- gsub( "\\*@@\\*", "\n", text, perl = TRUE)
    #read the results
    result <- data.table::fread( text, 
                                 header = FALSE, 
                                 sep = ";", 
                                 fill = TRUE, 
                                 quote = "", 
                                 strip.white = TRUE)
    
    head(result,10)
    
    #    V1 V2                           V3                                           V4 V5 V6
    # 1:  A  1 THROUGHOUT ENGLAND AND WALES At least 10 authorities in England and Wales  N NA
    # 2:  B  1             BRACKNELL FOREST                             BRACKNELL FOREST  N NA
    # 3:  D  1                  AFGHANISTAN                                  AFGHANISTAN  N  2
    # 4:  E  1                       AFRICA                                       AFRICA  N NA
    # 5:  A  2           THROUGHOUT ENGLAND      At least 10 authorities in England only  N NA
    # 6:  B  2               WEST BERKSHIRE                               WEST BERKSHIRE  N NA
    # 7:  D  2                      ALBANIA                                      ALBANIA  N  3
    # 8:  E  2                         ASIA                                         ASIA  N NA
    # 9:  A  3             THROUGHOUT WALES        At least 10 authorities in Wales only  Y NA
    # 10:  B  3                      READING                                      READING  N NA
    

    same for the tricky file; extract_charity.bcp

    head(result[,1:3],10)
    #       V1 V2                                                                                 V3
    # 1: 200000  0                                                          HOMEBOUND CRAFTSMEN TRUST
    # 2: 200001  0                                                          PAINTERS' COMPANY CHARITY
    # 3: 200002  0                                              THE ROYAL OPERA HOUSE BENEVOLENT FUND
    # 4: 200003  0                                                          HERGA WORLD DISTRESS FUND
    # 5: 200004  0 THE WILLIAM GOLDSTEIN LAY STAFF BENEVOLENT FUND (ROYAL HOSPITAL OF ST BARTHOLOMEW)
    # 6: 200005  0                              DEVON AND CORNWALL ROMAN CATHOLIC DEVELOPMENT SOCIETY
    # 7: 200006  0                                                    THE HORLEY SICK CHILDREN'S FUND
    # 8: 200007  0                                            THE HOLDENHURST OLD PEOPLE'S HOME TRUST
    # 9: 200008  0                                                         LORNA GASCOIGNE TRUST FUND
    # 10: 200009  0                                          THE RALPH LEVY CHARITABLE COMPANY LIMITED
    

    so.. looks like it is working :)