Formatting txt files using pandas/python


I have a txt file from a piece of lab equipment that saves data in the following format:

Run1 
Selected data
        Time (s)    Charge Q (nC)   Charge density q (nC/g) Mass (g)    
Initial -   21.53   -2.81E-01   -1.41E-03   200.0   
Flow    -   0.00    0.00E+00    0.00E+00    0.0 
Charge (in Coulomb) temporal evolution
3.61    2.44e-11
4.11    2.44e-11
4.61    2.44e-11
5.11    3.66e-11
5.63    3.66e-11
6.14    2.44e-11
6.66    3.66e-11
7.14    3.66e-11
7.67    2.44e-11
8.19    3.66e-11
8.70    2.44e-11
9.20    2.44e-11
9.72    2.44e-11
10.23   2.44e-11
10.73   2.44e-11

Run2 
Selected data
        Time (s)    Charge Q (nC)   Charge density q (nC/g) Mass (g)    
Initial -   21.53   -2.81E-01   -1.41E-03   200.0   
Flow    -   0.00    0.00E+00    0.00E+00    0.0 
Charge (in Coulomb) temporal evolution
3.61    2.44e-11
4.11    2.44e-11
4.61    2.44e-11
5.11    3.66e-11
5.63    3.66e-11
6.14    2.44e-11
6.66    3.66e-11
7.14    3.66e-11
7.67    2.44e-11
8.19    3.66e-11

Run3 
Selected data
        Time (s)    Charge Q (nC)   Charge density q (nC/g) Mass (g)    
Initial -   21.53   -2.81E-01   -1.41E-03   200.0   
Flow    -   0.00    0.00E+00    0.00E+00    0.0 
Charge (in Coulomb) temporal evolution
3.61    2.44e-11
4.11    2.44e-11
4.61    2.44e-11
5.11    3.66e-11
5.63    3.66e-11
6.14    2.44e-11
6.66    3.66e-11
7.14    3.66e-11
7.67    2.44e-11
8.19    3.66e-11
8.70    2.44e-11
9.20    2.44e-11

I have several of these files in my test folder. I would like to simplify and automate the analysis I do on these data sets, since I had similar success with simpler code for another piece of equipment.

What I want to do is extract the two-column test data for each of the three runs in each file (named FileName) and export it to a comma-separated text file named FileName-Run#.txt.

So far I have tried converting the text file contents into a list of lists and then writing only the numerical data to a new csv, but that hasn't worked well because I am unable to detect the length of the column data I am interested in.

A couple of other Q&As here helped in that regard, including how to run the code on every file in a folder (once it works, that is).

I have used a Jupyter notebook. I can share the code I have written here if it would be useful, although I am ashamed to show it.


Solution

  • Try this:

    import re
    from pathlib import Path
    
    input_path = Path("path/to/input_folder")
    output_path = Path("path/to/output_folder")
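    # Raw strings keep "\d" and "\S" from being treated as string escapes
    run_name_pattern = re.compile(r"Run\d+")
    # Two whitespace-separated fields: time and charge
    data_line_pattern = re.compile(r"(\S+)\s+(\S+)")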
    
    
    def write_output(input_file: Path, run_name: str, data: str):
        output_file = output_path / f"{input_file.stem}-{run_name}.csv"
        with output_file.open("w") as fp_out:
            fp_out.write(data)
    
    
    for input_file in input_path.glob("*.txt"):
        with input_file.open() as fp:
            run_name, data, start_reading = "", "", False
    
            for line in fp:
                # If a line matches "Run...", start a new run name
                if run_name_pattern.match(line):
                    run_name = line.strip()
                # If the line matches "Charge (in Coulomb)...",
                # read in the data, starting with the next line
                elif line.startswith("Charge (in Coulomb) temporal evolution"):
                    start_reading = True
                # For the data lines, replace spaces in the middle with a comma
                elif start_reading and line.strip():
                    data += data_line_pattern.sub(r"\1,\2", line)
                # If we encounter a blank line, that means the end of data.
                # Flush the data to disk, but only if something was collected,
                # so stray blank lines do not create empty output files.
                elif not line.strip():
                    if data:
                        write_output(input_file, run_name, data)
                    run_name, data, start_reading = "", "", False
            else:
                # If we have reached the end of the file but there is still
                # data we haven't written to disk, flush it
                if data:
                    write_output(input_file, run_name, data)
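
If you would rather have the numbers in memory first (for example to hand them to pandas afterwards), the same line-by-line idea can be sketched as a function that returns the data for each run. This is only a minimal sketch, assuming every file follows the layout shown in the question; `parse_runs`, `RUN_HEADER` and `DATA_MARKER` are hypothetical names introduced here, not part of the code above.

```python
import re

# Hypothetical helper names, introduced only for this sketch.
RUN_HEADER = re.compile(r"^Run\d+$")
DATA_MARKER = "Charge (in Coulomb) temporal evolution"


def parse_runs(text: str) -> dict[str, list[tuple[float, float]]]:
    """Return {run_name: [(time_s, charge_C), ...]} for one file's text."""
    runs: dict[str, list[tuple[float, float]]] = {}
    current = None    # name of the run currently being read
    reading = False   # True once the data marker line has been seen
    for line in text.splitlines():
        stripped = line.strip()
        if RUN_HEADER.match(stripped):
            # A "Run<N>" line starts a new run
            current, reading = stripped, False
            runs[current] = []
        elif stripped.startswith(DATA_MARKER):
            # The two-column data starts on the next line
            reading = True
        elif not stripped:
            # A blank line ends the data block
            reading = False
        elif reading and current is not None:
            time_s, charge = stripped.split()[:2]
            runs[current].append((float(time_s), float(charge)))
    return runs


# Shortened sample in the same layout as the question's file
sample = """Run1 
Selected data
        Time (s)    Charge Q (nC)   Charge density q (nC/g) Mass (g)    
Initial -   21.53   -2.81E-01   -1.41E-03   200.0   
Flow    -   0.00    0.00E+00    0.00E+00    0.0 
Charge (in Coulomb) temporal evolution
3.61    2.44e-11
4.11    2.44e-11

Run2 
Selected data
Charge (in Coulomb) temporal evolution
3.61    2.44e-11
5.11    3.66e-11
"""

runs = parse_runs(sample)
for name, rows in runs.items():
    print(name, len(rows))
```

Because the data block ends at the first blank line, you never need to know its length in advance. Each list of rows could then be passed to something like `pandas.DataFrame(rows, columns=["time_s", "charge_C"])` and written out with `to_csv(..., index=False)`, if you prefer pandas for the export step.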