Search code examples
pythonnumpyfor-loopsplit

Deleteing unwanted characters from .dat file then performing calculations on the result


I need to read a .dat file in Python. The file has 3 columns in total and hundreds of rows. The second and third column contain two characters followed by a float that I would like to extract--the second column always starts with "SA" and the third column always starts with "SC". I am currently loading in the data and looping through each row to extract the values, but is there a better way to do it?

Once the data is cleaned, I want to perform some calculations on the result, namely computing the average.

Here is an example of two lines from the .dat file:

9:01:15 SA7.998  SC7.968 
9:01:16 SA7.998  SC7.968 

Here is the code I am currently using.

import numpy as np
import os.path
from statistics import mean

time=[]
s_1=[]
s_2=[]
s1=[]
s2=[]
r1=[]
r2=[]
avgg=[]

# Reading data from file
with open('serial_2.dat','r') as f:
    dat=f.readlines()      
for i in dat: 
     y=i.split()                          # cleaning and getting columns without spaces  
     time.append(y[0])
     s1.append(y[1])
     s2.append(y[2])
#getting only numbers without strings (SA and SC)           
for counter in (range(0,len(s1))):
    S_1=s1[counter]
    r1.append(S_1[2:])
    r1_f=np.array(r1, dtype='float32')  

   S_2=s2[counter]
   r2.append(S_2[2:])
   r2_f=np.array(r2, dtype='float32')  
   avgg=r1_f+r2_f/2
print(np.mean(avgg))

Solution

  • You can use pandas to do that:

    #! pip install pandas
    import pandas as pd
    import numpy as np
    
    df = pd.read_csv('serial_2.dat', sep='\s+', header=None, names=['time', 's1', 's2'])
    df['s1'] = df['s1'].str.extract('^[\D]+(.*)').astype(float)
    df['s2'] = df['s2'].str.extract('^[\D]+(.*)').astype(float)
    

    Output:

    >>> df
          time     s1     s2
    0  9:01:15  7.998  7.968
    1  9:01:16  7.998  7.968
    
    >>> df.dtypes
    time     object
    s1      float64
    s2      float64
    dtype: object
    

    If you have always 2 characters in s1 and s2 columns, you can avoid regex and strip the first two characters:

    df['s1'] = df['s1'].str[2:].astype(float)
    df['s2'] = df['s2'].str[2:].astype(float)
    

    Then compute the global average:

    # With pandas
    >>> df[['s1', 's2']].mean().mean()
    7.983
    
    # With numpy
    >>> np.mean(df[['s1', 's2']])
    7.983
    

    You can also compute the row average:

    df['avg'] = df[['s1', 's2']].mean(axis=1)
    print(df)
    
    # Output
          time     s1     s2    avg
    0  9:01:15  7.998  7.968  7.983
    1  9:01:16  7.998  7.968  7.983