How to read two different files in Python by sys.stdin


I want to read two different files from sys.stdin. I can read and print the data, but there is no separation between the first and the second file.

When I run the command below in cmd on Windows 10 (Python 3.6):

D:\digit>cat s.csv s2.csv

Result is:

1
2
3
4
5
1
2
3
4
5
6
7

I can print both files.

My Python code is:

import sys 
import numpy as np

train=[]
test=[]

# Assume the code below is function 1, which must read only s.csv
reader = sys.stdin.readlines()
for row in reader:          
    train.append(int(row[0]))
train = np.array(train)

print(train)

# I need something here to separate the two files
#sys.stdin.close()
#sys.stdin = sys.__stdin__ 
#sys.stdout.flush() 

# Assume the code below is function 2, which must read only s2.csv
reader = sys.stdin.readlines()
for row in reader:          
    test.append(int(row[0]))
test = np.array(test)

print(test)

I run the command below at the command prompt:

D:\digit>cat s.csv s2.csv | python pytest.py

Result is:

[1 2 3 4 5 1 2 3 4 5 6 7]
[]

Do I need to reset sys.stdin for the next file? I tried the lines below, but none of them solved the problem:

sys.stdin.close()
sys.stdin = sys.__stdin__ 
sys.stdout.flush() 

Solution

  • Let me try to explain.

    d:\digit>cat s.csv s2.csv
    

    produces only one output stream, not two. It streams the content of the first file to stdout and then streams the content of the second file to stdout, without any pause or separator.

    That single stream is what you then pipe, using |, into your Python script:

    | pytest.py
    

    So pytest.py receives one stream of input; it has no way to know where one file ends and the next begins.
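
To see why the second readlines() call in the question returns nothing, here is a minimal sketch that simulates the piped input with io.StringIO. The in-memory stream is only a stand-in for sys.stdin so the example runs without a pipe, and the sample numbers are invented.

```python
import io

# Stand-in for the output of `cat s.csv s2.csv`: both files arrive as one
# continuous stream with no marker between them (sample data is invented).
fake_stdin = io.StringIO("1\n2\n3\n" + "7\n8\n")

first = fake_stdin.readlines()   # consumes everything up to end of input
second = fake_stdin.readlines()  # nothing is left to read

print(first)
print(second)
```

The first call already holds the lines of both files, and the second call gets an empty list. Closing or reassigning sys.stdin cannot bring back a boundary that was never in the stream.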

    If you want pytest.py to process the files separately, you can do the following:

    D:\digit>cat s.csv | python pytest.py # process the first file
    D:\digit>cat s2.csv | python pytest.py # process the second file
    

    or as a one-liner:

    D:\digit>cat s.csv | python pytest.py && cat s2.csv | python pytest.py
    

    Just remember that pytest.py then actually runs twice, so you need to adapt your Python script for this.

    But while you are editing your Python script...

    What you should do: if you want both files in pytest.py, write code that reads both files inside your Python script. If the data is CSV-structured, have a look at the csv module for reading and writing CSV files.
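
As a rough sketch of that suggestion, the script could take the two file names as command-line arguments and read each file with the csv module. The helper name read_first_column and the argument order (train file first, test file second) are assumptions made for this illustration, not part of the original script.

```python
import csv
import sys


def read_first_column(path):
    """Return the first column of a CSV file as a list of ints."""
    values = []
    with open(path, newline="") as handle:
        for row in csv.reader(handle):
            if row:  # csv.reader yields [] for empty lines; skip them
                values.append(int(row[0]))
    return values


# Assumed invocation: python pytest.py s.csv s2.csv
if __name__ == "__main__" and len(sys.argv) == 3:
    train = read_first_column(sys.argv[1])
    test = read_first_column(sys.argv[2])
    print(train)
    print(test)
```

Because each file is opened by name, the script always knows which data is "train" and which is "test", and no separator is needed.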

    [EDIT based on comment:]

    I could read multiple files with pandas "pd.read_csv", but my problem is how can I do it with sys.stdin?

    You should really question why you are so focused on using stdin. Reading it from within the python script is likely to be much more effective.

    If you must use stdin, you can add headers, footers or separators to the stream outside of Python. Once these are defined and you are able to emit them, you can change the Python code to run different functions depending on which header, footer or separator is received from stdin.

    This all sounds a bit complex and error-prone. I would strongly advise you to reconsider using stdin as the input for your script. Alternatively, please update your question with the technical requirements and limitations that restrict you to stdin.

    [EDIT based on comment:]

    I want to load these files in the Hadoop ecosystem, and I am using Hadoop streaming for that.

    Somehow, you need to "signal" your python script that it is processing a new file, with new information.

    Suppose you have 2 files. The first line of each needs to be some sort of "header" that identifies the file and tells the script which function to execute on the data that follows, until a new "header" is received.

    So let's say your "train" data is prefixed with the line @is_train@ and your "test" data with the line @is_test@.

    How you do that in your environment is outside the scope of this question.

    Now the redirection to stdin sends these two headers ahead of the data, and you can have Python check for them. Example:

    import sys 
    import numpy as np
    
    train=[]
    test=[]
    
    is_train = False
    is_test = False
    
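    # Iterate over stdin line by line; the loop also ends at end of input,
    # so a missing @stop@ footer cannot cause an endless loop.
    for line in sys.stdin:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if '@stop@' in line:
            break
        if '@is_train@' in line:
            is_train = True
            is_test = False
            continue
        if '@is_test@' in line:
            is_train = False
            is_test = True
            continue
        # if this is csv data, you might want to split on ,
        fields = line.split(',')
        if is_train:
            train.append(int(fields[0]))
        if is_test:
            test.append(int(fields[0]))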
    
    test = np.array(test)
    train = np.array(train)
    
    print(train)
    print(test)
    

    As you can see in the code, it also looks for a "footer" that marks the end of the data; in this example @stop@ is chosen.

    One way of sending the headers and footer could be:

    D:\digit>cat is_train.txt s.csv is_test.txt s2.csv stop.txt | python pytest.py
    

    where each of the three extra files just contains the appropriate header or footer line, ending with a newline (cat does not add one between files).
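
If keeping three extra marker files around is inconvenient, the marked stream could also be produced by a small helper script. This is only a sketch: the function name write_marked_stream and the script name mark.py are hypothetical, and the markers are the ones chosen above. It also ends every data line with a newline, so a file that lacks a trailing newline cannot run into the next header.

```python
import io
import sys


def write_marked_stream(sources, out, footer="@stop@"):
    """Write each (header, lines) pair to `out`, followed by the footer line."""
    for header, lines in sources:
        out.write(header + "\n")
        for line in lines:
            # Normalise line endings so a missing final newline in a file
            # cannot glue its last value onto the next header.
            out.write(line.rstrip("\n") + "\n")
    out.write(footer + "\n")


# Assumed invocation: python mark.py s.csv s2.csv | python pytest.py
if __name__ == "__main__" and len(sys.argv) == 3:
    with open(sys.argv[1]) as train_file, open(sys.argv[2]) as test_file:
        write_marked_stream(
            [("@is_train@", train_file), ("@is_test@", test_file)],
            sys.stdout,
        )
```

The receiving script stays unchanged; it still sees @is_train@, the train rows, @is_test@, the test rows, and finally @stop@.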