python pandas dataframe stdin

pandas read_csv add attributes by stdin issue


I want to add a new column to the dataframe. The new column depends on some rules.

This is my code:

#!/usr/bin/python3.6
# coding=utf-8

import sys
import pandas as pd
import numpy as np
import io
import csv


df = pd.read_csv(sys.stdin,sep=',',encoding='utf-8',engine="python")

col_0 = check
df['df_cal'] = df.groupby(col_0)[col_0].transform('count') 
df['status'] = np.where(
                    df['df_cal'] > 1,'change',
                    'New')

df = df.drop_duplicates(
        subset=df.columns.difference(['keep']),keep = False)
df = df[(df.keep == '2')]
df.drop(['keep','df_cal'],axis = 1,inplace = True)

# print(sys.stdin)
df.to_csv(sys.stdout,encoding='utf-8',index = None)

sample csv:

VIP_number,keep
ab1,1
ab1,2
ab2,2
ab3,1

When I run this code, I use the following command:

python3.6 nifi_python.py < test.csv check = VIP_number

and I get the error:

name 'check' is not defined

This does not work because I don't know how to pass the column name to col_0 via stdin. col_0 should be 'VIP_number'. I don't want to hardcode the column name because the script will be reused later with different columns.

How can I add a new column in the dataframe by stdin? Any help would be very much appreciated.


Solution

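  • Read the column name from the command-line arguments (sys.argv) rather than from stdin; stdin carries only the CSV data:

    #!/usr/bin/python3.6
    # coding=utf-8
    
    import sys
    import pandas as pd
    import numpy as np
    import io
    import csv
    
    # Require the column argument, e.g. check=VIP_number
    if len(sys.argv) < 2:
        print("Usage:  nifi_python.py check=<column>")
        sys.exit(1)
    
    df = pd.read_csv(sys.stdin,sep=',',encoding='utf-8',engine="python")
    
    # "check=VIP_number" -> "VIP_number"
    col_0 = sys.argv[1].split('=')[1]
    
    ...

    Run it with no spaces around "=", so that the whole argument arrives as a single item in sys.argv:

    python nifi_python.py check=VIP_number < test.csv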
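
As an alternative to splitting sys.argv by hand, the same idea can be sketched with the standard-library argparse module, which also reports a usage error when the argument is missing. This is only a sketch under assumptions: the `--check` flag and the helper names `parse_args`, `add_status` and `main` are made up for illustration and are not part of the original script, and only the `status` step from the question is reproduced.

```python
import argparse
import io
import sys

import numpy as np
import pandas as pd


def parse_args(argv=None):
    """Read the column name from the command line (hypothetical --check flag)."""
    parser = argparse.ArgumentParser(
        description="Flag repeated keys in a CSV read from stdin."
    )
    parser.add_argument("--check", required=True, help="name of the column to group by")
    # argv=None makes argparse fall back to sys.argv[1:]
    return parser.parse_args(argv)


def add_status(df, col):
    """Return a copy of df with a 'status' column: 'change' if the value in
    col occurs more than once, otherwise 'New'."""
    if col not in df.columns:
        raise KeyError(f"column {col!r} not found; available: {list(df.columns)}")
    counts = df.groupby(col)[col].transform("count")
    out = df.copy()
    out["status"] = np.where(counts > 1, "change", "New")
    return out


def main(argv=None, stdin=sys.stdin, stdout=sys.stdout):
    """The CSV data comes from stdin; the column name comes from the arguments."""
    args = parse_args(argv)
    df = pd.read_csv(stdin)
    add_status(df, args.check).to_csv(stdout, index=False)
```

To use this as a script, call `main()` at the bottom of the file and run something like `python nifi_python.py --check VIP_number < test.csv`. Passing `argv`, `stdin` and `stdout` explicitly makes it possible to try the function on an in-memory `io.StringIO` buffer without a shell.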