python awk merging-data

How to merge specific lines from multiple text files


I have four files, each containing 153 data points. Each data point consists of 3 lines, i.e.:

File 1:

datapoint_1_name
datapoint_1_info
datapoint_1_data_file1
datapoint_2_name
datapoint_2_info
datapoint_2_data_file1
datapoint_3_name
datapoint_3_info
datapoint_3_data_file1

File 2:

datapoint_1_name
datapoint_1_info
datapoint_1_data_file2
datapoint_2_name
datapoint_2_info
datapoint_2_data_file2
datapoint_3_name
datapoint_3_info
datapoint_3_data_file2

File 3:

datapoint_1_name
datapoint_1_info
datapoint_1_data_file3
datapoint_2_name
datapoint_2_info
datapoint_2_data_file3
datapoint_3_name
datapoint_3_info
datapoint_3_data_file3

File 4:

datapoint_1_name
datapoint_1_info
datapoint_1_data_file4
datapoint_2_name
datapoint_2_info
datapoint_2_data_file4
datapoint_3_name
datapoint_3_info
datapoint_3_data_file4

and so on.

The files are identical except for the third line of each data point. I am trying to merge them so that the output contains datapoint_name and datapoint_info from the first file only, followed by the third line (datapoint_data) from each of the four files, like so:

Output:

datapoint_1_name
datapoint_1_info
datapoint_1_data_file1
datapoint_1_data_file2
datapoint_1_data_file3
datapoint_1_data_file4
datapoint_2_name
datapoint_2_info
datapoint_2_data_file1
datapoint_2_data_file2
datapoint_2_data_file3
datapoint_2_data_file4
datapoint_3_name
datapoint_3_info
datapoint_3_data_file1
datapoint_3_data_file2
datapoint_3_data_file3
datapoint_3_data_file4

I've tried the Python script below (I've replaced the actual patterns with 'some pattern' in these lines; I've verified that the patterns match the lines correctly):

output_file = "combined_sequences_and_data2.txt"

with open(output_file, 'w') as output:
    combined_data = []

    with open('file1', 'r') as file:
        for line in file:
            line = line.strip()
            if line.startswith('some pattern'):
                combined_data.append(line)
            elif line.isalpha():
                combined_data.append(line)
            elif line.startswith('some pattern'):
                combined_data.append(line)
                with open('file2', 'r') as file:
                    for line in file:
                        line = line.strip()
                        if line.startswith('some pattern'):
                            combined_data.append(line)
                            with open('file3', 'r') as file:
                                for line in file:
                                    line = line.strip()
                                    if line.startswith('some pattern'):
                                        combined_data.append(line)
                                        with open('file4', 'r') as file:
                                            for line in file:
                                                line = line.strip()
                                                if line.startswith('some pattern'):
                                                    combined_data.append(line)



        # Write the combined data to the output file
        output.write('\n'.join(combined_data) + '\n')

This never finishes; it just freezes, and I can't work out where.

I also tried a bash script that calls awk:

#!/bin/bash

file1="filename"
file2="filename"
file3="filename"
file4="filename"

group_size=3
line_count=1

while read -r line; do
  if [ $line_count -le $group_size ]; then
    group_lines[$line_count]=$line
    line_count=$((line_count + 1))
  fi

  if [ $line_count -gt $group_size ]; then
    for i in "${group_lines[@]}"; do
      echo "$i"
    done

    awk 'NR == 3' "$file2"
    awk 'NR == 3' "$file3"
    awk 'NR == 3' "$file4"

    line_count=1
    unset group_lines
  fi
done < "$file1"

This one is closer to working, but it doesn't step through the third lines of the remaining three files; it just prints the same lines over and over for every data point in file 1.


Solution

  • You don't need to examine the file contents, because you know that the lines you're interested in come in groups of 3. Therefore:

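    from contextlib import ExitStack
    
    INFILES = "file1", "file2", "file3", "file4"
    OUTFILE = "combined_sequences_and_data2.txt"
    
    # ExitStack closes every input file when the block ends
    with ExitStack() as stack, open(OUTFILE, "w") as output:
        mfd, *ofds = (stack.enter_context(open(file)) for file in INFILES)
        for i, line in enumerate(mfd, 1):
            # Every line of the first file is copied to the output
            output.write(line)
            if i % 3:
                # Lines 1 and 2 of a group: skip the duplicates in the other files
                for fd in ofds:
                    next(fd)
            else:
                # Line 3 of a group: append the data line from each other file
                for fd in ofds:
                    output.write(next(fd))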
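
The same idea can also be written by reading all the files in lockstep with `zip`, which avoids the explicit `next` calls. This is only a sketch under some assumptions: every file holds the same number of lines in the same order, and every line (including the last) ends with a newline. The function name `merge_groups` and the `group_size` parameter are made up for this example and are not part of the original answer.

```python
from contextlib import ExitStack


def merge_groups(in_paths, out_path, group_size=3):
    """Copy the first file; after each group's last line, add the matching line from the other files."""
    with ExitStack() as stack:
        # Open all inputs and the output; ExitStack closes them on exit
        files = [stack.enter_context(open(path)) for path in in_paths]
        out = stack.enter_context(open(out_path, "w"))

        # zip yields one tuple per line number: (line from file 1, line from file 2, ...)
        # and stops at the end of the shortest file
        for i, rows in enumerate(zip(*files), 1):
            out.write(rows[0])
            if i % group_size == 0:
                # Last line of a group: append the same line from the remaining files
                out.writelines(rows[1:])
```

It would be called as `merge_groups(["file1", "file2", "file3", "file4"], "combined_sequences_and_data2.txt")`. Because `zip` stops at the shortest input, a truncated file silently shortens the output instead of raising `StopIteration`, so it may be worth checking the line counts first.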