Search code examples
rbinaryhttr

Convert Dukascopy binary data to text (dataframe) in R


I want ot download and save quote market data from Dukascopy: https://www.dukascopy.com/swiss/english/marketwatch/historical/

I have made succesfull request on their server and get binary data as result of GET request:

library(httr)


url <- "https://datafeed.dukascopy.com/datafeed/USA30IDXUSD/2022/07/15/23h_ticks.bi5"
p <- GET(url,
         add_headers("user-agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36"))
p
content(p)

I don't know how to convert this binary to readable data. I have found some scripts in python: https://www.driftinginrecursion.com/post/dukascopy_opensource_data/ https://github.com/terukusu/download-tick-from-dukascopy/blob/master/download_tick_from_dukascopy.py

For example, first link uses function:

def bi5_to_csv(date_ts, out_dir, files):
    print('Starting Coversion of All .bi5 Files...')

    sort = sorted(files)
    chunk_size = struct.calcsize('>3i2f')
    data = []

    for bi5 in sort:

        try:
            size = os.path.getsize(bi5)
        except (IOError, OSError):
            break

        if size > 0:
            with lzma.open(bi5) as f:
                while True:
                    chunk = f.read(chunk_size)
                    if chunk:
                        data.append(struct.unpack('>3i2f', chunk))
                    else:
                        break

        os.remove(bi5)

    if not data:
        print('All Downloaded Files Where Empty!')
        return 1

    df = pd.DataFrame(data)
    df.columns = ['UTC', 'AskPrice', 'BidPrice', 'AskVolume', 'BidVolume']
    df.AskPrice = df.AskPrice / 100000
    df.BidPrice = df.BidPrice / 100000
    df.UTC = pd.TimedeltaIndex(df.UTC, 'ms')
    df.UTC = df.UTC.astype(str)
    df.UTC = df.UTC.replace(regex=['0 days'], value=[str(date_ts)])
    df.UTC = df.UTC.str[:-3]

    df.to_csv(out_dir + '/daily.csv', index=False)
    print('Finished Converting Files!')
    return 0

and in second scripts it uses:

def tokenize(buffer):
    token_size = 20
    token_count = int(len(buffer) / token_size)
    tokens = list(map(lambda x: struct.unpack_from('>3I2f', buffer, token_size * x), range(0, token_count)))
    return tokens


def normalize_tick(symbol, day, time, ask, bid, ask_vol, bid_vol):
    date = day + timedelta(milliseconds=time)

    # TODO 網羅する。この通過ペア以外も有るかも
    if any(map(lambda x: x in symbol.lower(), ['usdrub', 'xagusd', 'xauusd', 'jpy'])):
        point = 1000
    else:
        point = 100000

I don't know how to appl this code in E and in general how to convert this binary file to tick data. have also found C++ implementation: Reading data from Dukascopy tick binary file


Solution

  • My weird attempt:

    Reading a bit about bi5 files gives us an information, that's binary files archived with lzma (see: reedit. So, first we have to decompress the file. To decompress it I'm using lzma.

    library(httr)
    
    binF <- "/home/sapi/aaa/23h_ticks.lzma"
    url <- "https://datafeed.dukascopy.com/datafeed/USA30IDXUSD/2022/07/15/23h_ticks.bi5"
    p <- GET(url,
             add_headers("user-agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36"), write_disk(binF, overwrite = TRUE))
    
    system(paste("lzma -d", binF))
    

    Having it decompressed we have to read in (as binary) and convert to any meaningful data. From the reddit discussion above we can see, that:

    The data is stored in 20 byte wide rows, with each 4 byte segment corresponding to a piece of data. Example:

    # TIME is a 32-bit big-endian integer representing the number of milliseconds that have passed since the beginning of this hour.
    # ASKP is a 32-bit big-endian integer representing the asking price of the pair, multiplied by 100,000.
    # BIDP is a 32-bit big-endian integer representing the bidding price of the pair, multiplied by 100,000.
    # ASKV is a 32-bit big-endian floating point number representing the asking volume, divided by 1,000,000.
    # BIDV is a 32-bit big-endian floating point number representing the bidding volume, divided by 1,000,000.
    

    Now we have to find a way to convert the big-endian to something. And this stackoverflow question gives us a hint.

    library(dplyr)
    
    binF <- "/home/sapi/aaa/23h_ticks"
    con <- file(binF, "rb")
    
    
    for (i in 0:(file.size(binF)/20-1)) {
      data <- readBin(con = con, "raw", 20)
      TIME <- data[4:1] %>% rawToBits %>% as.logical %>% which %>% {2^(. - 1)} %>% sum
      ASKP <- data[8:5] %>% rawToBits %>% as.logical %>% which %>% {2^(. - 1)} %>% sum
      BIDP <- data[12:9] %>% rawToBits %>% as.logical %>% which %>% {2^(. - 1)} %>% sum
      ASKV <- data[16:13] %>% rawToBits %>% as.logical %>% which %>% {2^(. - 1)} %>% sum
      BIDV <- data[20:17] %>% rawToBits %>% as.logical %>% which %>% {2^(. - 1)} %>% sum
      print(paste(TIME, ASKP, BIDP, ASKV, BIDV))
    }
    #> [1] "246 33867231 33862497 946528651 943230116"
    #> [1] "296 33867291 33861749 947628162 943230116"
    #> [1] "421 33867003 33862299 897988541 943230116"
    #> [1] "472 33867781 33862219 947628162 943230116"
    #> [1] "2440 33867773 33862779 897988541 943230116"
    [...]
    #> [1] "3583902 33882501 33877289 947628162 943230116"
    

    Please note I have no idea if the data are correct/meaningful. However looking on milliseconds column it makes sense, as it ends around 3600.

    Created on 2022-09-05 with reprex v2.0.2