Merge Columns and Remove Duplication


I have an input file with data in two columns. I need to merge both columns into one and remove the duplicates. Any suggestions on how to start? Thanks!

Input file

5045 2317
5045 1670
5045 2156
5045 1509
5045 3833
5045 1013
5045 3491
5045 32
5045 1482
5045 2495
5045 4280
5045 1380
5045 3998

Expected output

 5045 
 2317
 1670
 2156
 1509
 3833
 1013
 3491
 32
 1482
 2495
 4280
 1380
 3998

Solution

  • To keep the order:

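    from itertools import chain

    # Read every whitespace-separated value, flattening both columns
    with open("in.txt") as f:
        values = list(chain.from_iterable(line.split() for line in f))

    seen = set()  # values already written
    with open("in.txt", "w") as f1:  # overwrites the input file
        for value in values:
            if value not in seen:
                seen.add(value)
                f1.write(value + "\n")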
    

    output:

    5045
    2317
    1670
    2156
    1509
    3833
    1013
    3491
    32
    1482
    2495
    4280
    1380
    3998
    

    If order does not matter:

    from itertools import chain

    with open("in.txt") as f:
        # A set drops duplicates but does not preserve order
        values = set(chain.from_iterable(line.split() for line in f))

    with open("in.txt", "w") as f1:  # overwrites the input file
        f1.writelines(v + "\n" for v in values)
    

    If the first column always holds the same number:

    with open("in.txt") as f:
        first = next(f).split()  # both numbers from the first line
        values = set(x.split()[1] for x in f)  # second column of the remaining lines
        values.update(first)  # add the first-column number and the first line's second number

    with open("in.txt", "w") as f1:  # overwrites the input file
        f1.writelines(v + "\n" for v in values)
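As a variation on the order-preserving approach above, here is a minimal sketch that assumes Python 3.7 or later, where `dict` keeps insertion order. The helper name `merge_unique` and the sample rows are illustrative and not part of the original answer.

```python
def merge_unique(lines):
    """Flatten whitespace-separated columns and drop repeats, keeping first-seen order."""
    tokens = (tok for line in lines for tok in line.split())
    # dict keys are unique and keep insertion order (guaranteed from Python 3.7)
    return list(dict.fromkeys(tokens))


# Hypothetical sample rows in the same shape as the input file
sample = ["5045 2317\n", "5045 1670\n", "5045 2156\n"]
print(merge_unique(sample))  # ['5045', '2317', '1670', '2156']
```

The function accepts any iterable of lines, so an open file object can be passed in directly and the result written out one value per line.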