Reading blocks of data from a large text file in Python

Dear all,

I am trying to read a very large text file (>100 GB) containing covariances between different variables. The file is arranged so that the first variable is related to all variables, the second is related to all except the first one (e.g., 14766203, held in line[0:19]), and so on (see 1, 2, 3 in the figure). Here is my sample data:

14766203               -10.254364177  105.401485677     0.0049     0.0119       0.0024       0.0014      88.3946    7.340657124e-06   -7.137818870e-06    1.521836659e-06    
                                                                                                                                       3.367715952e-05   -6.261063214e-06    
                                                                                                                                                          3.105358202e-06    
14766204                                                                                                            6.126218197e-06   -7.264675283e-06    1.508365235e-06    
                                                                                                                   -7.406839249e-06    3.152004956e-05   -6.020433814e-06    
                                                                                                                    1.576663440e-06   -6.131501924e-06    2.813007315e-06    
14766205                                                                                                            4.485532069e-06   -6.601931549e-06    1.508397490e-06    
                                                                                                                   -7.243398379e-06    2.870296214e-05   -5.777139540e-06    
                                                                                                                    1.798277242e-06   -6.343898734e-06    2.291452454e-06    
14766204               -10.254727963  105.401101357     0.0065     0.0147       0.0031       0.0019      87.2542    1.293562659e-05   -1.188084039e-05    1.932569051e-06    
                                                                                                                                       5.177847716e-05   -7.850639841e-06    
                                                                                                                                                          4.963314613e-06    
14766205                                                                                                            6.259830057e-06   -8.072416685e-06    1.785233052e-06    
                                                                                                                   -8.854538457e-06    3.629463550e-05   -6.703120240e-06    
                                                                                                                    2.047196889e-06   -7.229432710e-06    2.917899913e-06    
14766205               -10.254905775  105.400622259     0.0051     0.0149       0.0024       0.0016      88.4723    9.566876325e-06   -1.357014809e-05    2.378290143e-06    
                                                                                                                                       5.210766141e-05   -8.356178456e-06    
                                                                                                                                                          4.016328161e-06

I want to extract these as blocks in Python, or at least read one block (e.g., 1, 2, or 3) and then stop reading the file. I haven't succeeded, but here is my attempt:

ListOfData = []
isMark = False
with open(inFile, 'r') as f:
    for line in f:
        MarkNumber = None
        if line[0:19].strip() != '' and line[23:36].strip() != '':
            MarkNumber = str(line[0:19].strip())
        if line[0:19].strip() == MarkNumber and len(line[23:36].strip()) != 0:
            isMark = True
        if line[0:19].strip() != MarkNumber and len(line[23:36].strip()) != 0:
            isMark = False
        if isMark == True:
            ListOfData.append(line)

ListOfData ends up containing every line up to the end of the file, so this does not really help.

Any help to get this thing sorted out will be appreciated.

Thanks Nakhap

Solution

Could you use a regex to find the lines that start a block, i.e. lines that have a second chunk of numbers within, say, 15 characters of the chunk of numbers that begins the line?

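import re

inFile = 'C:/path/myData.txt'
# Compile the pattern once, outside the loop.
myregex = re.compile(r'(^[.0-9e-]{1,15}[\W]{1,15}[.0-9e-])')
thisBlock = []

# Open in text mode so each line is a str and can be matched by a str pattern.
with open(inFile, 'r') as f:
    for line in f:
        if myregex.search(line):
            # A header line starts a new block; print the one just finished.
            if thisBlock:
                print("Previous block finished.  Here it is in a chunk:")
                print(thisBlock)

            print("\n\n\nNew block starting")
            thisBlock = [line]
        else:
            thisBlock.append(line)

# No header line follows the last block, so print it after the loop.
if thisBlock:
    print("Final block.  Here it is in a chunk:")
    print(thisBlock)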

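The pattern myregex looks for a numeric sequence at the start of the line, then 1 to 15 ({1,15}) separator characters, then the start of a second numeric sequence. [.0-9e-] matches the digits 0-9, as well as decimal points, negative signs, and the exponent e. [\W] matches any non-word character, which includes whitespace; \s would restrict the match to whitespace only. The first {1,15} in the expression assumes that the number at the start of the row is between 1 and 15 characters long. Lines that begin with whitespace, and lines where the gap after the first number is longer than 15 characters, do not match, so they are treated as part of the current block.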
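
Since the question also asks about reading one block and then leaving the file, the same header test could be wrapped in a generator, so that only the current block is held in memory. This is a minimal sketch rather than something tested on the full 100 GB file: iter_blocks and HEADER are names made up for this example, the sample lines are shortened versions of the data in the question, and it assumes that every block starts with a line matching the pattern above.

```python
import io
import re

# Same pattern as above: a number at the start of the line, then at most
# 15 non-word characters, then the start of a second number.
HEADER = re.compile(r'^[.0-9e-]{1,15}[\W]{1,15}[.0-9e-]')


def iter_blocks(lines):
    """Yield one block (a list of lines) at a time from an iterable of lines."""
    block = []
    for line in lines:
        # A header line closes the block collected so far.
        if HEADER.search(line) and block:
            yield block
            block = []
        block.append(line)
    # The last block is not followed by a header, so yield it here.
    if block:
        yield block


# Shortened, made-up stand-in for the file; spacing mimics the real layout.
sample = (
    "14766203" + " " * 15 + "-10.254364177  105.401485677  7.340657124e-06\n"
    + " " * 60 + "3.367715952e-05\n"
    + "14766204" + " " * 40 + "6.126218197e-06\n"
    + "14766204" + " " * 15 + "-10.254727963  105.401101357  1.293562659e-05\n"
    + " " * 60 + "5.177847716e-05\n"
)

blocks = list(iter_blocks(io.StringIO(sample)))
for number, block in enumerate(blocks, start=1):
    print("Block", number, "has", len(block), "lines")

# To read only the first block of a real file and stop:
#     with open(inFile, 'r') as f:
#         first_block = next(iter_blocks(f), None)
```

Because the generator is lazy, calling next() on it reads just past the first block and no further; the file is closed when the with block exits.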