I have high frequency data in a .csv
file that consists of order quotes and trade quotes. Following is an example of the first 13 lines of a 600,000 line dataset: (sorry for the format, copy/paste wouldn't fit all the columns on one line hence I have spaced them out)
1442527200000750850 11539422 15110 1 1 15120 4 3 15105 1 1 15125 17 2 15100 4 3 15130 8 2 15095 7 6 15135 3 1 15090 33 3 15140 3 1 15085 1 1 15145 4 4 15080 2 1 15150 2 2 15075 2 2 15160 2 2 15070 1 1 15165 3 2 0 0 0 15170 6 1
1442527200000750850 11539422 15110 1 1 15120 4 3 15105 1 1 15125 17 2 15100 4 3 15130 8 2 15095 7 6 15135 3 1 15090 33 3 15140 3 1 15085 1 1 15145 4 4 15080 2 1 15150 2 2 15075 2 2 15160 2 2 15070 1 1 15165 3 2 15060 3 3 15170 6 1
1442527200000750850 11539422 15110 1 1 15120 2 2 15105 1 1 15125 17 2 15100 4 3 15130 8 2 15095 7 6 15135 3 1 15090 33 3 15140 3 1 15085 1 1 15145 4 4 15080 2 1 15150 2 2 15075 2 2 15160 2 2 15070 1 1 15165 3 2 15060 3 3 15170 6 1
1442527200024857519 11539479 15115 1 1 15120 2 2 15110 1 1 15125 17 2 15105 1 1 15130 8 2 15100 4 3 15135 3 1 15095 7 6 15140 3 1 15090 33 3 15145 4 4 15085 1 1 15150 2 2 15080 2 1 15160 2 2 15075 2 2 15165 3 2 15070 1 1 15170 6 1
1442527200024960184 11539480 15115 2 2 15120 2 2 15110 1 1 15125 17 2 15105 1 1 15130 8 2 15100 4 3 15135 3 1 15095 7 6 15140 3 1 15090 33 3 15145 4 4 15085 1 1 15150 2 2 15080 2 1 15160 2 2 15075 2 2 15165 3 2 15070 1 1 15170 6 1
1442527200025091242 11539482 15115 2 2 15120 2 2 15110 1 1 15125 17 2 15105 1 1 15130 8 2 15100 4 3 15135 3 1 15095 7 6 15140 3 1 15090 33 3 15145 4 4 15085 1 1 15150 2 2 15080 2 1 15160 2 2 15075 2 2 15165 3 2 15070 1 1 15170 7 2
1442527200060240623 11539555 15115 2 2 15120 2 2 15110 1 1 15125 17 2 15105 1 1 15130 8 2 15100 4 3 15135 3 1 15095 7 6 15140 3 1 15090 33 3 15145 4 4 15085 1 1 15150 2 2 15080 2 1 15160 2 2 15075 3 3 15165 3 2 15070 1 1 15170 7 2
1442527200065644806 11539556 15115 2 2 15120 2 2 15110 1 1 15125 17 2 15105 1 1 15130 8 2 15100 4 3 15135 4 2 15095 7 6 15140 3 1 15090 33 3 15145 4 4 15085 1 1 15150 2 2 15080 2 1 15160 2 2 15075 3 3 15165 3 2 15070 1 1 15170 7 2
1442527200088944431 11539565 15115 2 2 15120 2 2 15110 1 1 15125 17 2 15105 1 1 15130 8 2 15100 4 3 15135 4 2 15095 7 6 15140 3 1 15090 33 3 15145 4 4 15085 2 2 15150 2 2 15080 2 1 15160 2 2 15075 3 3 15165 3 2 15070 1 1 15170 7 2
1442527200088944431 11539565 15115 2 2 15120 2 2 15110 1 1 15125 17 2 15105 1 1 15130 8 2 15100 4 3 15135 4 2 15095 7 6 15140 3 1 15090 33 3 15145 4 4 15085 2 2 15150 2 2 15080 2 1 15160 2 2 15075 2 2 15165 3 2 15070 1 1 15170 7 2
1442527200323230870 11539621 15115 3 3 15120 2 2 15110 1 1 15125 17 2 15105 1 1 15130 8 2 15100 4 3 15135 4 2 15095 7 6 15140 3 1 15090 33 3 15145 4 4 15085 2 2 15150 2 2 15080 2 1 15160 2 2 15075 2 2 15165 3 2 15070 1 1 15170 7 2
1442527200323600606 11539622 15115 3 3 15120 1 1 15110 1 1 15125 17 2 15105 1 1 15130 8 2 15100 4 3 15135 4 2 15095 7 6 15140 3 1 15090 33 3 15145 4 4 15085 2 2 15150 2 2 15080 2 1 15160 2 2 15075 2 2 15165 3 2 15070 1 1 15170 7 2
1442527200324698612 11539623 15120 1 1
As you can see, the last row in this example has only 5 columns while the others have many more. The row with only 5 columns is the trade execution line and rest are order quotes. Basically, I want to read the csv
file while omitting these trade rows with only 5 columns and read the rest of the file so I can start my analysis on the Bid
/Ask
quotes.
I tried using the read.csv()
function, but I can't seem to figure how to code it to skip the trade lines ( 5 column lines ) and only read the order quotes, it reads everything and gives me NaN
for the empty columns in the trade line. I also tried opening a connection and then using the scan function but again faced the same problem. Is there a way I can put in an if
statement somewhere in there and omit the lines or any other way?
In case its needed, the format of the data is following:
ORDER_BOOK: ( timestamp_in_ns ) ( feed-sequence-number ) x bid_price_in_USD, bid_qty, number_of_bid_orders, ask_price_in_USD, ask_qty, number_of_ask_orders, .....
OR
TRADE: ( timestamp_in_ns ) ( feed-sequence-number ) trade_price_in_USD, trade_qty, trade_side ( 1 for Buy and 2 for Sell, might be 0 if CME doesn't give us this information for some trade )
There are no headers in the dataset and objects are separated by ""
Any help would be greatly appreciated. Please let me know if you need any more information.
You can use complete.cases
to detect the complete rows:
X <- read.csv(filename,sep=" ")
X <- X[-which(!complete.cases(X)),]