Search code examples
pythonpython-itertools

How can I make an unique list cells?


I have a txt file which looks like below including 4 rows as an example and each row strings are separated by a ,.

"India1,India2,myIndia     "
"Where,Here,Here   "
"Here,Where,India,uyete"
"AFD,TTT"

https://gist.github.com/anonymous/cee79db7029a7d4e46cc4a7e92c59c50

the file can be downloaded from here

I want to extract all unique cells across all , the output2

   India1
   India2
   myIndia
   Where
   Here
   India
   uyete
   AFD 
   TTT

I tried to read line by line and print it ìf i call my data as df`

myfile = open("df.txt")
lines = myfile.readlines()
for line in lines:
   print lines

Solution

  • Option 1: .csv, .txt Files

    Native Python is unable to read .xls files. If you convert your file(s) to .csv or .txt, you can use the csv module within the Standard Library:

    # `csv` module, Standard Library
    import csv
    
    filepath = "./test.csv"
    
    with open(filepath, "r") as f:
        reader = csv.reader(f, delimiter=',')
        header = next(reader)                                  # skip 'A', 'B'
        items = set()
        for line in reader:
            line = [word.replace(" ", "") for word in line if word]
            line = filter(str.strip, line)
            items.update(line)
    
    print(list(items))
    # ['uyete', 'NHYG', 'QHD', 'SGDH', 'AFD', 'DNGS', 'lkd', 'TTT']
    

    Option 2: .xls, .xlsx Files

    If you want to retain the original .xls format, you have to install a third-party module to handle Excel files.

    Install xlrd from the command prompt:

    pip install xlrd
    

    In Python:

    # `xlrd` module, third-party
    import itertools
    import xlrd
    
    filepath = "./test.xls"
    
    with xlrd.open_workbook(filepath) as workbook:
        worksheet = workbook.sheet_by_index(0)                 # assumes first sheet
        rows = (worksheet.row_values(i) for i in range(1, worksheet.nrows))
        cells = itertools.chain.from_iterable(rows)
        items = list({val.replace(" ", "") for val in cells if val})
    
    print(list(items))
    # ['uyete', 'NHYG', 'QHD', 'SGDH', 'AFD', 'DNGS', 'lkd', 'TTT']
    

    Option 3: DataFrames

    You can handle csv and text files with pandas DataFrames. See documentation for other formats.

    import pandas as pd
    import numpy as np
    
    # Using data from gist.github.com/anonymous/a822647a00087abc12de3053c700b9a8
    filepath = "./test2.txt"
    
    # Determines columns from the first line, so add commas in text file, else may throw an error
    df = pd.read_csv(filepath, sep=",", header=None, error_bad_lines=False)
    df = df.replace(r"[^A-Za-z0-9]+", np.nan, regex=True)      # remove special chars    
    stack = df.stack()
    clean_df = pd.Series(stack.unique())
    clean_df
    

    DataFrame Output

    0     India1
    1     India2
    2    myIndia
    3      Where
    4       Here
    5      India
    6      uyete
    7        AFD
    8        TTT
    dtype: object
    

    Save as Files

    # Save as .txt or .csv without index, optional
    
    # target = "./output.csv"
    target = "./output.txt"
    clean_df.to_csv(target, index=False)
    

    Note: Results from options 1 & 2 can be converted to unordered, pandas columnar objects too with pd.Series(list(items)).

    Finally: As a Script

    Save any of the three options above in a function (stack) within a file (named restack.py). Save this script to a directory.

    # restack.py
    import pandas as pd
    import numpy as np
    
    def stack(filepath, save=False, target="./output.txt"):
        # Using data from gist.github.com/anonymous/a822647a00087abc12de3053c700b9a8
    
        # Determines columns from the first line, so add commas in text file, else may throw an error
        df = pd.read_csv(filepath, sep=",", header=None, error_bad_lines=False)
        df = df.replace(r"[^A-Za-z0-9]+", np.nan, regex=True)      # remove special chars    
        stack = df.stack()
        clean_df = pd.Series(stack.unique())
    
        if save:
            clean_df.to_csv(target, index=False)
            print("Your results have been saved to '{}'".format(target))
    
        return clean_df
    
    if __name__ == "__main__":
        # Set up input prompts
        msg1 = "Enter path to input file e.g. ./test.txt: "
        msg2 = "Save results to a file? y/[n]: "
    
        try:
            # Python 2
            fp = raw_input(msg1)
            result = raw_input(msg2)
        except NameError:
            # Python 3
            fp = input(msg1)
            result = input(msg2)
    
        if result.startswith("y"):
            save = True
        else:
            save = False
    
        print(stack(fp, save=save))
    

    From its working directory, run the script via commandline. Answer the prompts:

    > python restack.py 
    
    Enter path to input file e.g. ./test.txt: ./@data/test2.txt
    Save results to a file? y/[n]: y
    Your results have been saved to './output.txt'
    

    Your results should print in you console and optionally save to a file output.txt. Adjust any parameters to suit your interests.