
import text file to process specific columns



I am new to Python, but I am learning by practice so that I can use it in my data processing.

I have a large data file in the format shown here.
The number of rows and columns is not known in advance. This example shows 2 consecutive rows.
The 1st column is "time", and the nth column holds the relevant data, chosen by an identifier ('abc' in the 1st line).

................
"2013-01-01 00:00:02" 228 227 15.65 15.84 14.85 14.68 14.53 13.75 12.45 12.55
"2013-01-02 00:01:03" 225 227 16.35 15.99 14.85 14.73 14.43 13.8 12.85 13.2
................

Desired output:

  1. Column 1 = time values, so that time differences can be calculated.
  2. Column (n) = the data to be processed further; should be float.

In my past attempts I ended up with a list, and I was unable to convert either column.

I searched past questions and answers, but as a beginner I could not interpret all of them. I would appreciate quick help with reading the data into columns so that I can process it later. I believe I can handle the further processing myself, as it is mostly mathematical operations.

Thank you for your help.

Regards
Gouri

CORRECTION-1:
I now understand that pandas gives a compact way to extract the column I needed earlier. This was good learning after the suggestion from the group.
The code looks as follows:

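import pandas as pd

fp = "data.txt"  # placeholder path to the tab-separated input file
data = pd.read_csv(fp, sep='\t')
entry = data['u90']  # select one column by its header name
print(entry, '\n', entry[5])  # whole column, then the row labelled 5

# Passing a path lets pandas open and close the output file itself.
entry.to_csv("out.txt")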

Regards
Gouri


Solution

  • As pointed out by Hugo Honorem in the comments, you can use pandas.
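
A minimal sketch of the pandas route, in case it helps. It assumes a space-delimited file with a header line and a quoted timestamp in the first column; the header names `time`, `abc` and `xyz` and the shortened sample rows are hypothetical stand-ins for the real file.

```python
import io
import pandas as pd

# Hypothetical sample: a header line with column identifiers, then data rows.
sample = (
    'time abc xyz\n'
    '"2013-01-01 00:00:02" 228 15.65\n'
    '"2013-01-02 00:01:03" 225 16.35\n'
)

# The quoted timestamp is read as a single field despite the space inside it.
data = pd.read_csv(io.StringIO(sample), sep=' ')
data['time'] = pd.to_datetime(data['time'])   # strings -> timestamps

values = data['abc'].astype(float)            # chosen column as floats
elapsed = data['time'].diff()                 # difference to the previous row

print(values.tolist())
print(elapsed.dt.total_seconds().tolist())    # first entry is NaN (no previous row)
```

For a real file, the `io.StringIO(sample)` argument would be replaced by the file path.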

    If you do not want to introduce more dependencies to your project, you could use a function like this:

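    import csv

    def load_dataset(fp, columns, types=None, delimiter=' ', skip_header=True):
        # csv.reader keeps quoted fields such as "2013-01-01 00:00:02" in one
        # piece and strips the trailing newline from the last field.
        reader = csv.reader(fp, delimiter=delimiter)
        if skip_header:
            next(reader)
        dataset = []
        for parts in reader:
            if not parts:
                continue  # skip blank lines
            # Index one by one so that a single column still yields a list.
            row = [parts[i] for i in columns]
            if types is not None:
                row = [convert(value) for convert, value in zip(types, row)]
            dataset.append(row)
        return dataset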
    

    columns should be a list of integer column indices, and types is a list of callables, one per selected column, that convert each value to the type you want. For floats, just pass in float; for your date, you could pass a custom to_date function that you write yourself.
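
One possible to_date function is sketched below. It assumes the timestamp format seen in the sample rows (`YYYY-MM-DD HH:MM:SS`, optionally wrapped in double quotes); the name `to_date` comes from the answer, but this body is only an illustration.

```python
from datetime import datetime

def to_date(text):
    """Parse a timestamp such as "2013-01-01 00:00:02" (quotes optional)."""
    cleaned = text.strip().strip('"')
    return datetime.strptime(cleaned, "%Y-%m-%d %H:%M:%S")

# Timestamps taken from the two sample rows in the question.
start = to_date('"2013-01-01 00:00:02"')
end = to_date('"2013-01-02 00:01:03"')

elapsed = end - start                 # a datetime.timedelta
print(elapsed.total_seconds())
```

A function like this could be passed in types together with float, for example `types=[to_date, float]` for a time column and one data column, provided the whole timestamp reaches it as a single field.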