Search code examples
pythonpandasdata-manipulationfile-manipulation

Merge different text files into one Excel file, and delete first column of all but first


I need to merge different text files (.sav files) into one Excel file (output.xls), but starting from the second input file onwards, I want to exclude the first column present in each file.

What I want to obtain is the following (with placeholder data to easily see the rows and columns).

File 1

A1 A2 A3 A4
B1 B2 B3 B4
C1 C2 C3 C4
and so on

File 2

X1 X2 X3 X4
Y1 Y2 Y3 Y4
Z1 Z2 Z3 Z4
and so on

File output

A1 A2 A3 A4 X2 X3 X4
B1 B2 B3 B4 Y2 Y3 Y4
C1 C2 C3 C4 Z2 Z3 Z4
and so on

Here is my code.

import glob
    
filenames = glob.glob("*.sav")
filenames.sort()
with open('output.txt', 'w') as writer:
    readers = [open(filename) for filename in filenames]
    for lines in zip(*readers):
        print(' '.join([line.strip() for line in lines]), file=writer)
    
    
import pandas as pd
df = pd.read_table('output.txt')
df.to_excel('output.csv', 'DATI', index=False, header=False)
    
import os
os.remove('output.txt')

However, this keeps all the columns. How do I omit the ones I don't need?


Solution

  • It should be as simple as the following.

    from pathlib import Path
    import pandas as pd
    import numpy as np
    
    
    fnames = Path("your-path").glob("*.sav")
    first_df, *dfs = [pd.read_csv(f, sep="\t") for f in fnames] 
    dfs = [df.iloc[:, 1:] for df in dfs]  # Drop first column
    
    df = pd.concat([first_df, *dfs], axis=1)
    df.to_excel("output.xlsx", index=False, header=False)