Search code examples
rstringbashfix-protocoltidyr

Convert FIX message format ("Tag=Value") into CSV


I have a csv/log file of 35=S (Quote messages; "Tag=Value") and I need to extract the rates into a proper CSV file for data mining. This is not strictly FIX related, it's more of a R related question on how to clean a dataset.

The raw messages look something like this:

190=1.1204 ,191=-0.000029,193=20141008,537=0        ,631=1.12029575,642=0.000145,10=56
190=7.20425,191=0.000141 ,537=0       ,631=7.2034485,10=140        ,            ,
190=1.26237,191=0        ,537=1       ,10=068       ,              ,            ,

I need first to get to an intermediate data set that looks like this, where the same tags are aligned.

190=1.1204 ,191=-0.000029,193=20141008,537=0,631=1.12029575,642=0.000145,10=56
190=7.20425,191=0.000141 ,            ,537=0,631=7.2034485 ,            ,10=140
190=1.26237,191=0        ,            ,537=1,              ,            ,10=068

which in turn will need to be converted to this:

190    ,191      ,193     ,537,631       ,642     ,10
1.1204 ,-0.000029,20141008,0  ,1.12029575,0.000145,56
7.20425,0.000141 ,        ,0  ,7.2034485 ,        ,140
1.26237,0        ,        ,1  ,          ,        ,068

I'm in the midst of developing a bash script with awk but I wonder if I can do that in R. At present, my greatest challenge is arriving to the intermediate table. From the intermediate to the final table I thought of using the R with the tidyr package, specifically function 'separate'. If anybody can suggest a better logic, I'll greatly appreciate!


Solution

  • Another possibility. Start with same scan as @Andrie, but also use arguments strip.white and na.strings:

    x <- scan(text = "190=1.1204 ,191=-0.000029,193=20141008,537=0        ,631=1.12029575,642=0.000145,10=56
    190=7.20425,191=0.000141 ,537=0       ,631=7.2034485,10=140        ,            ,
    190=1.26237,191=0        ,537=1       ,10=068       ,              ,            ,",
               sep = ",",
               what = "character", 
               strip.white = TRUE,
               na.strings = "")
    
    # remove NA
    x <- x[!is.na(x)]
    

    Then use colsplit and dcast from reshape2package:

    library(reshape2)
    
    # split 'x' into two columns
    d1 <- colsplit(string = x, pattern = "=", names = c("x", "y")) 
    
    # create an id variable, needed in dcast
    d1$id <- ave(d1$x, d1$x, FUN = seq_along)   
    
    # reshape from long to wide
    d2 <- dcast(data = d1, id ~ x, value.var = "y")
    
    #   id  10     190       191      193 537      631      642
    # 1  1  56 1.12040 -0.000029 20141008   0 1.120296 0.000145
    # 2  2 140 7.20425  0.000141       NA   0 7.203449       NA
    # 3  3  68 1.26237  0.000000       NA   1       NA       NA
    

    Because you mentioned tidyr:

    library(tidyr)
    d1 <- separate(data = data.frame(x), col = x, into = c("x", "y"), sep = "=")
    d1$id <- ave(d1$x, d1$x, FUN = seq_along)
    spread(data = d1, key = x, value = y)
    #   id  10     190       191      193 537        631      642
    # 1  1  56  1.1204 -0.000029 20141008   0 1.12029575 0.000145
    # 2  2 140 7.20425  0.000141     <NA>   0  7.2034485     <NA>
    # 3  3 068 1.26237         0     <NA>   1       <NA>     <NA>
    

    This retains the values as character. If you want numeric, you can set convert = TRUE in spread.