I have a csv data file created by an instrument with ~1 million lines. I'm creating a GUI program in Matlab on a Windows machine for analyzing this data. I need to detect where the data begins because the file starts with a lot of various experiment data. However, the wrong line number is being returned in Matlab with the grep utility from the file exchange. So I copied the file over to my Mac and found this weird behavior with built-in Unix utilities.
Not only is it not returning the whole line where the searched term exists; it is also deleting the line number that the regular expression or script should return!
I've reduced the file to a small example below. Here are the weird behaviors:
Desired results:
11: Synchronized blah blah some variable,30 ms
17: Synchronized beats for well A1:
Line number is removed and the start of the line, "Synchronized beats for well A1:", is removed:
$ grep -n "Synchronized" example.csv
11: Synchronized blah blah some variable,30 ms
Time (s),var1,var2,var3,var4,Included In Statistics,var5,var6,var7,var8,var9
I wrote a Python script that is giving the same result:
$ python preprocessing.py
11 : Synchronized blah blah some variable,30 ms
Time (s),var1,var2,var3,var4,Included In Statistics,var5,var6,var7,var8,var9
Here's the Python script:
file = 'example.csv'
lineNum = 1
with open(file,'r') as f:
    for line in f:
        if "Synchronized" in line:
            print lineNum, ":", line
        lineNum += 1
With the Matlab grep utility, it looks like there is a newline at the beginning of this line. However, it can still recognize the word "Synchronized" after that.
[fl,p]=grep('-e','Synchronized','C:\Users\Traveler\Documents\20160825\example.csv')
example.csv: Synchronized blah blah some variable,30 ms
example.csv:
Synchronized beats for well A1:
Time (s),var1,var2,var3,var4,Included In Statistics,var5,var6,var7,var8,var9
By the way, this also means that when I want to detect later data, the line number is off by quite a bit because this type of problem occurs on multiple lines.
So my question: Why is this happening, and what can I do about it within the context of a Matlab program? (I can build in anything as long as it can be called from within Matlab, i.e. so the user of this GUI isn't involved.) There's clearly some issue with a newline character that I can't see, but what about with deleting the line number? I'm not even sure what to do with the newline character anyway. I can't load the file into Matlab memory all at once.
Example data file:
Investigator:
Experiment ID:
Description:
,
Some Settings,
File Time,something
Sampling Frequency,12.5 kHz
,
Machine Settings,
Synchronized blah blah some variable,30 ms
Detection Method,Polynomial Regression
,
,
Synchronized beats for well A1:
Time (s),var1,var2,var3,var4,Included In Statistics,var5,var6,var7,var8,var9
1,2,3,4,5,False,1,0,2,3,4
2,3,4,5,6,False,2,0,3,4,5
Line ending conventions are nasty things. They are some combination of carriage return (CR, \r, ASCII 13) and linefeed (LF, \n, ASCII 10). The Windows convention is CR/LF. Mac OS X uses LF. Your instrument may follow neither. It looks to me like at least some of your lines end with a bare carriage return, produced in MATLAB (and many other languages) with \r. If you output a line of text followed by only CR on a UNIX-ish OS, the cursor returns to the start of the line without moving down, so the next line of text overwrites the previous one on your screen. That is why the line number and the start of the line appear to be deleted.
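To confirm which terminators are present, it can help to count them in the raw bytes. The following is a minimal Python sketch, not part of the original setup; the function name `count_line_endings` and the sample bytes are made up for illustration, with the sample modelled on the two lines around "Synchronized beats for well A1:".

```python
def count_line_endings(data: bytes) -> dict:
    """Count CRLF pairs, bare LFs, and bare CRs in a byte string."""
    crlf = data.count(b"\r\n")
    # Every CRLF pair contributes one CR and one LF, so subtract the pairs.
    bare_lf = data.count(b"\n") - crlf
    bare_cr = data.count(b"\r") - crlf
    return {"CRLF": crlf, "LF": bare_lf, "CR": bare_cr}


# Illustrative sample: a bare CR sits between the two "lines" in the middle.
sample = b"Machine Settings,\r\nSynchronized beats for well A1:\rTime (s),var1\r\n"
print(count_line_endings(sample))  # {'CRLF': 2, 'LF': 0, 'CR': 1}
```

For a real file, reading the first few kilobytes in binary mode (`open(path, "rb").read(65536)`) and passing them to this function should be enough to see the mix. A CRLF pair split across the end of that read would be miscounted by one, which does not matter for a quick diagnosis.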
Check for this by looking at the raw ASCII codes in the file. In MATLAB, for example, you can look at uint8(linestring) to see what is there. Then you can either fix the file with an external utility, or use MATLAB to process the whole file one line at a time, trimming the line or adjusting your own line count to compensate for whatever you find. For example:
fid = fopen('file', 'rt');   % Note the t for text
linenum = 0;
while 1
    line = fgetl(fid);
    if ~ischar(line), break, end   % fgetl returns -1 at end of file
    linenum = linenum + 1;         % Count only lines that were actually read
    disp(uint8(line));             % For debug, to see what's going on
    disp(line);
end
fclose(fid);
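If you go the external-utility route instead, one option is a small script that rewrites the file with a single line-ending convention before MATLAB reads it. Here is a hedged Python sketch that works in fixed-size chunks, so the whole file never has to fit in memory. The function name `normalize_newlines`, its parameters, and the choice of LF as the target are assumptions for illustration.

```python
def normalize_newlines(src_path, dst_path, chunk_size=1 << 20):
    """Copy src to dst, rewriting CRLF and bare CR as LF, chunk by chunk."""
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        pending_cr = False  # True if the previous chunk ended with a CR
        while True:
            chunk = src.read(chunk_size)
            if not chunk:
                break
            if pending_cr:
                # Put back the CR that was held over from the previous chunk.
                chunk = b"\r" + chunk
            pending_cr = chunk.endswith(b"\r")
            if pending_cr:
                # Hold the trailing CR back: the next chunk may start with LF.
                chunk = chunk[:-1]
            # Replace CRLF first so a pair never turns into two newlines.
            dst.write(chunk.replace(b"\r\n", b"\n").replace(b"\r", b"\n"))
        if pending_cr:
            # The file ended with a bare CR.
            dst.write(b"\n")
```

Note that line numbers in the normalized copy count every CR, LF, or CRLF as one line break, so they can differ from what a tool reports on the original file.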
Once you've fixed the basic line reading and counting, use regexp or similar to pick out the lines you need and process them directly.
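As a rough illustration of the count-then-match idea outside MATLAB, here is a Python 3 sketch that relies on universal newlines mode (`newline=None`), where CR, LF, and CRLF each end a line. The function name `find_matching_lines` is hypothetical, and the UTF-8 encoding with `errors="replace"` is an assumption about the instrument's output.

```python
import re


def find_matching_lines(path, pattern):
    """Return (line_number, text) for each line that matches pattern.

    Lines are read one at a time, so the file is never loaded whole.
    """
    regex = re.compile(pattern)
    hits = []
    # newline=None: '\r', '\n', and '\r\n' all terminate a line and are
    # translated to '\n' before the line is returned.
    with open(path, "r", newline=None, encoding="utf-8", errors="replace") as f:
        for line_num, line in enumerate(f, start=1):
            if regex.search(line):
                hits.append((line_num, line.rstrip("\n")))
    return hits
```

Calling `find_matching_lines("example.csv", "Synchronized")` would return one tuple per matching line, with numbers that count a bare CR as a line break.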