Dear all,
I am trying to read a very large text file (>100 GB) containing covariances between different variables. Each variable is identified by a mark number (e.g., 14766203, which sits in line[0:19] in the figure). The data are arranged so that the first variable is related to all variables, the second is related to all except the first, and so on (see 1, 2, 3 in the figure). Here is my sample data:
14766203 -10.254364177 105.401485677 0.0049 0.0119 0.0024 0.0014 88.3946 7.340657124e-06 -7.137818870e-06 1.521836659e-06
3.367715952e-05 -6.261063214e-06
3.105358202e-06
14766204 6.126218197e-06 -7.264675283e-06 1.508365235e-06
-7.406839249e-06 3.152004956e-05 -6.020433814e-06
1.576663440e-06 -6.131501924e-06 2.813007315e-06
14766205 4.485532069e-06 -6.601931549e-06 1.508397490e-06
-7.243398379e-06 2.870296214e-05 -5.777139540e-06
1.798277242e-06 -6.343898734e-06 2.291452454e-06
14766204 -10.254727963 105.401101357 0.0065 0.0147 0.0031 0.0019 87.2542 1.293562659e-05 -1.188084039e-05 1.932569051e-06
5.177847716e-05 -7.850639841e-06
4.963314613e-06
14766205 6.259830057e-06 -8.072416685e-06 1.785233052e-06
-8.854538457e-06 3.629463550e-05 -6.703120240e-06
2.047196889e-06 -7.229432710e-06 2.917899913e-06
14766205 -10.254905775 105.400622259 0.0051 0.0149 0.0024 0.0016 88.4723 9.566876325e-06 -1.357014809e-05 2.378290143e-06
5.210766141e-05 -8.356178456e-06
4.016328161e-06
Now I want to extract them as blocks in Python, or at least read one block and then stop reading the file (e.g., block 1, 2, or 3). I couldn't get it to work, but here is my attempt:
with open(inFile, 'rb') as f:
    ListOfData = []
    for line in f:
        MarkNumber = None
        if line[0:19].strip() != '' and line[23:36].strip() != '':
            MarkNumber = str(line[0:19].strip())
        if line[0:19].strip() == MarkNumber and len(line[23:36].strip()) != 0:
            isMark = True
        if line[0:19].strip() != MarkNumber and len(line[23:36].strip()) != 0:
            isMark = False
        if isMark == True:
            ListOfData.append(line)
ListOfData ends up containing every line up to the end of the file, so it does not really help.
Any help to get this thing sorted out will be appreciated.
Thanks Nakhap
Could you use regex to find blocks that have a second chunk of numbers within, say, 15 characters of the chunk of numbers that begins a line?
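import re

inFile = 'C:/path/myData.txt'
# Matches only lines that begin with a number in column 0, so this assumes
# continuation lines start with whitespace (as the line[0:19] check implies).
myregex = re.compile(r'(^[.0-9e-]{1,15}[\W]{1,15}[.0-9e-])')
thisBlock = []
with open(inFile, 'r') as f:  # text mode, so each line is a str
    for line in f:
        if myregex.search(line):
            if thisBlock:
                print("Previous block finished. Here it is in a chunk:")
                print(thisBlock)
            print("\n\n\nNew block starting")
            thisBlock = [line]
        else:
            thisBlock.append(line)
# The last block is not followed by a new start line, so print it here.
if thisBlock:
    print("Final block:")
    print(thisBlock)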
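The regex myregex looks for 1 to 15 ({1,15}) separator characters ([\W]) between a numeric sequence at the start of the line and the first character of a second one (both matched by [.0-9e-]). The class [.0-9e-] matches digits 0-9, as well as decimal points, negative signs, and the exponent e. Strictly speaking, [\W] matches any non-word character, which includes whitespace but also punctuation such as "." and "-"; use \s if you want whitespace only. The first {1,15} in the expression assumes that the numeric expression at the start of the row is between 1 and 15 characters long.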
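If the goal is to read one block and then stop, the same idea can be wrapped in a generator so the file is only read as far as needed. This is just a sketch: iter_blocks and is_start are names I made up for illustration, the pattern is the one from above, and the small in-memory sample (with indented continuation lines) is a stand-in for your real file.

```python
import io
import re

# Same pattern as above: a number in column 0, a gap, then another number.
BLOCK_START = re.compile(r'^[.0-9e-]{1,15}[\W]{1,15}[.0-9e-]')


def iter_blocks(lines, is_start=BLOCK_START.search):
    """Yield one list of lines per block, reading the input lazily."""
    block = []
    for line in lines:
        # A start line closes the block collected so far (if any).
        if is_start(line) and block:
            yield block
            block = []
        block.append(line)
    if block:
        # The last block has no following start line to close it.
        yield block


# Made-up stand-in for the real file; continuation lines are indented.
sample = io.StringIO(
    "14766203 -10.25 105.40 7.3e-06\n"
    "                    3.3e-05 -6.2e-06\n"
    "14766204 6.1e-06 -7.2e-06\n"
    "                    -7.4e-06 3.1e-05\n"
)

# next() pulls only the first block; the rest of the input is not read.
first_block = next(iter_blocks(sample))
print(first_block)
```

With the real file you would do something like next(iter_blocks(f)) inside a with open(inFile) as f: block. Note that this splits at every line that starts with a mark number; if you want the larger groups 1, 2, 3 from the figure, you would probably need to pass a stricter is_start test, and what that test looks like depends on your exact column layout.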