Search code examples
pythoncsvheaderoutputfloating

Using Python v3.5 to load a tab-delimited file, omit some rows, and output max and min floating numbers in a specific column to a new file


I've tried for several hours to research this, but every possible solution hasn't suited my particular needs. I have written the following in Python (v3.5) to download a tab-delimited .txt file.

#!/usr/bin/env /Library/Frameworks/Python.framework/Versions/3.5/bin/python3.5
import urllib.request
import time
timestr = time.strftime("%Y-%m-%d %H-%M-%S")
filename="/data examples/"+ "ace-magnetometer-" + timestr + '.txt'
urllib.request.urlretrieve('http://services.swpc.noaa.gov/text/ace-magnetometer.txt', filename=filename)

This downloads the file from here and renames it based on the current time. It works perfectly.

I am hoping that I can then use the "filename" variable to then load the file and do some things to it (rather than having to write out the full file path and file name, because my ultimate goal is to do the following to several hundred different files, so using a variable will be easier in the long run).

This using-the-variable idea seems to work, because adding the following to the above prints the contents of the file to STDOUT... (so it's able to find the file without any issues):

import csv
with open(filename, 'r') as f:
    reader = csv.reader(f, dialect='excel', delimiter='\t')
    for row in reader:
            print(row)

As you can see from the file, the first 18 lines are informational. Line 19 provides the actual column names. Then there is a line of dashes.

The actual data I'm interested in starts on line 21.

I want to find the minimum and maximum numbers in the "Bt" column (third column from the right). One of the possible solutions I found would only work with integers, and this dataset has floating numbers.

Another possible solution involved importing the pyexcel module, but I can't seem to install that correctly...

import pyexcel as pe
data = pe.load(filename, name_columns_by_row=19)
min(data.column["Bt"])

I'd like to be able to print the minimum Bt and maximum Bt values into two separate files called minBt.txt and maxBt.txt.

I would appreciate any pointers anyone may have, please.


Solution

  • The following will work as long as all the files are formatted in the same way, i.e. data 21 lines in, same number of columns and so on. Also, the file that you linked did not appear to be tab delimited, and thus I've simply used the string split method on each row instead of the csv reader. The column is read from the file into a list, and that list is used to calculate the maximum and minimum values:

    from itertools import islice
    
    # Line that data starts from, zero-indexed.
    START_LINE = 20
    # The column containing the data in question, zero-indexed.
    DATA_COL = 10
    # The value present when a measurement failed.
    FAILED_MEASUREMENT = '-999.9'
    
    with open('data.txt', 'r') as f:
    
        bt_values = []
    
        for val in (row.split()[DATA_COL] for row in islice(f, START_LINE, None)):
    
            if val != FAILED_MEASUREMENT:
                bt_values.append(float(val))
    
        min_bt = min(bt_values)
        max_bt = max(bt_values)
    
    with open('minBt.txt', 'a') as minFile:
        print(min_bt, file=minFile)
    
    with open('maxBt.txt', 'a') as maxFile:
        print(max_bt, file=maxFile)
    

    I have assumed that since you are doing this to multiple files you are looking to accumulate multiple max and min values in the maxBt.txt and minBt.txt files, and hence I've opened them in 'append' mode. If this is not the case, please swap out the 'a' argument for 'w', which will overwrite the file contents each time.

    Edit: Updated to include workaround for failed measurements, as discussed in comments.

    Edit 2: Updated to fix problem with negative numbers, also noted by Derek in separate answer.