f string to pass file path issue


I have a function which accepts a file path. It's as below:

def document_loader(doc_path: str) -> Optional[Document]:
        """ This function takes in a document in a particular format and 
        converts it into a Langchain Document Object 
        
        Args:
            doc_path (str): A string representing the path to the PDF document.

        Returns:
            Optional[DocumentLoader]: An instance of the DocumentLoader class or None if the file is not found.
        """
        
        # try:
        loader = PyPDFLoader(doc_path)
        docs = loader.load()
        print("Document loader done")

PyPDFLoader is a wrapper around PyPDF2 that reads a PDF from a file path.

Now, when I call the function with a hardcoded file path string, as below:

document_loader('/Users/Documents/hack/data/abc.pdf')

The function works fine and reads the PDF at that path.

But if I instead let a user upload their PDF via Streamlit's file_uploader(), as below:

uploaded_file = st.sidebar.file_uploader("Upload a file", key= "uploaded_file")
print(st.session_state.uploaded_file)

if uploaded_file is not None:
    filename = st.session_state.uploaded_file.name
    print(os.path.abspath(st.session_state.uploaded_file.name))
    document_loader(f'"{os.path.abspath(filename)}"')

I get the error:

ValueError: File path "/Users/Documents/hack/data/abc.pdf" is not a valid file or url

The statement print(os.path.abspath(st.session_state.uploaded_file.name)) prints the same path as the hardcoded one.

Note: Streamlit is currently running on localhost on my laptop, and I am the "user" uploading a PDF through the locally running Streamlit app.

Edit1:

So, as per @MAtchCatAnd's suggestion, I added tempfile and it works, but with one issue:

The function that receives the temp file path re-runs every time the user interacts with the app. This is because a new temp file, with a new path, is created on each rerun, so the function's argument changes and it runs again even though I decorated it with @st.cache_data.

The uploaded PDF stays the same, so I don't want the function to re-run, since each run incurs a cost.

How can I fix this? I see that Streamlit has deprecated st.cache and its allow_output_mutation=True parameter.

Here's the code:

@st.cache_data
def document_loader(doc_path: str) -> Optional[Document]:
        """ This function takes in a document in a particular format and 
        converts it into a Langchain Document Object 

        Args:
            doc_path (str): A string representing the path to the PDF document.

        Returns:
            Optional[DocumentLoader]: An instance of the DocumentLoader class or None if the file is not found.
        """

        # try:
        loader = PyPDFLoader(doc_path)
        docs = loader.load()
        print("Document loader done")

uploaded_file = st.sidebar.file_uploader("Upload a file", key= "uploaded_file")

if uploaded_file is not None:
    with tempfile.NamedTemporaryFile(delete=False) as temp_file:
            temp_file.write(uploaded_file.getvalue())
            temp_file_path = temp_file.name
            print(temp_file_path)

    custom_qa = document_loader(temp_file_path)

Solution

  • The object returned by st.file_uploader is a "file-like" object inheriting from BytesIO.

    From the docs:

    The UploadedFile class is a subclass of BytesIO, and therefore it is "file-like". This means you can pass them anywhere where a file is expected.

    While the returned object does have a name attribute, it has no path. It exists in memory and is not associated with a real, saved file. Although Streamlit may be run locally, it still has a server-client structure in which the Python backend is usually on a different computer from the user's. As such, the file_uploader widget is not designed to provide any real access or pointer to the user's file system.
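A small sketch, using only the standard library, of two things that can go wrong in the question's call. The FakeUpload class is a hypothetical stand-in for Streamlit's UploadedFile (bytes in memory plus a bare name), not the real class. It illustrates that os.path.abspath() merely joins the current working directory with the name, and that wrapping a path in literal double quotes inside an f-string produces a different string that no longer names the file, which would be consistent with the quotes visible in the ValueError message.

```python
import io
import os
import tempfile


class FakeUpload(io.BytesIO):
    """Hypothetical stand-in for an uploaded file: in-memory bytes plus a name."""

    def __init__(self, data: bytes, name: str):
        super().__init__(data)
        self.name = name


upload = FakeUpload(b"%PDF-1.4 example bytes", "abc.pdf")

# abspath() does not look anything up; it only prefixes the working directory.
guessed_path = os.path.abspath(upload.name)
expected = os.path.normpath(os.path.join(os.getcwd(), "abc.pdf"))
path_is_cwd_based = guessed_path == expected

# Write the bytes to a real file so there is a genuine path to test against.
with tempfile.NamedTemporaryFile(suffix=".pdf", delete=False) as tf:
    tf.write(upload.getvalue())
real_path = tf.name

# Same pattern as the question: the quotes become part of the path string.
quoted_path = f'"{real_path}"'

plain_exists = os.path.isfile(real_path)
quoted_exists = os.path.isfile(quoted_path)

os.remove(real_path)  # clean up the example file
```

If the app happens to be started from the folder that contains the PDF, the abspath() result can coincide with the real location, so the extra quotes alone may be enough to trigger the error.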

    You should either

    1. use a method that allows you to pass a file buffer instead of a path,
    2. save the file to a new, known path,
    3. use tempfiles
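As an illustration of option 1, here is a minimal sketch that parses an in-memory buffer directly, with no path involved. A plain io.BytesIO holding made-up CSV bytes stands in for the uploaded file (an assumption for the sake of a runnable example), and the standard-library csv module stands in for whatever reader you actually use.

```python
import csv
import io

# Stand-in for the object returned by the uploader: CSV bytes held in memory.
upload = io.BytesIO(b"name,score\nada,3\ngrace,5\n")

# Wrap the byte buffer as text and hand it straight to the parser.
text = io.TextIOWrapper(upload, encoding="utf-8", newline="")
rows = list(csv.DictReader(text))
```

Many readers accept file-like objects in the same way; pandas.read_csv, for example, takes a buffer as well as a path. Whether this option is open to you depends on whether your loader offers a buffer-based entry point.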

    A brief example working with temp files:

    import streamlit as st
    import tempfile
    import pandas as pd
    
    file = st.file_uploader('Upload a file', type='csv')
    
    if file is not None:
        # Copy the in-memory upload to a real file on disk
        with tempfile.NamedTemporaryFile(delete=False) as tf:
            # getvalue() returns all bytes regardless of the read position
            tf.write(file.getvalue())
            tf_path = tf.name
        # The temp file is closed here, so it can be reopened by path
        st.write(tf_path)
        df = pd.read_csv(tf_path)
        st.write(df)
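Option 2 from the list above, saving to a known path, could be sketched as follows. The helper name save_upload, its parameters, and the choice of naming the file after a SHA-256 digest of its content are all assumptions made for illustration; nothing here comes from Streamlit itself.

```python
import hashlib
import tempfile
from pathlib import Path


def save_upload(data: bytes, suffix: str, root: Path) -> Path:
    """Write data under root, using a file name derived from its SHA-256 digest."""
    digest = hashlib.sha256(data).hexdigest()
    target = root / f"{digest}{suffix}"
    if not target.exists():
        # Identical content maps to the same name, so it is written only once.
        target.write_bytes(data)
    return target


root = Path(tempfile.mkdtemp())  # any writable directory would do
first = save_upload(b"%PDF-1.4 example", ".pdf", root)
second = save_upload(b"%PDF-1.4 example", ".pdf", root)
```

Because identical content always yields the same path, a function that is cached on its path argument should see an unchanged argument from one rerun to the next, unlike with a freshly named temp file.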
    

    Response to Edit 1

    I would remove the caching and instead rely on st.session_state to store your results.

    At the beginning of your script, create a spot in session state for the object you want:

    if 'qa' not in st.session_state:
        st.session_state.qa = None
    

    Have your function return the object you want

    def document_loader(doc_path: str) -> Optional[Document]:
        loader = PyPDFLoader(doc_path)
        return loader # or return loader.load(), whichever is more suitable
    

    Check for results in session state before running the document loader

    uploaded_file = st.sidebar.file_uploader("Upload a file", key= "uploaded_file")
    
    if uploaded_file is not None and st.session_state.qa is None:
        with tempfile.NamedTemporaryFile(delete=False) as temp_file:
            temp_file.write(uploaded_file.getvalue())
            temp_file_path = temp_file.name
            print(temp_file_path)
    
        st.session_state.qa = document_loader(temp_file_path)
    
    custom_qa = st.session_state.qa
    
    # put a check on custom_qa before continuing, either "is None" with  
    # stop or "is not None" with the rest of your code nested inside
    if custom_qa is None:
        st.stop()
    

    Add a way to reset by passing on_change=clear_qa to the file uploader, so that changing the uploaded file clears the stored result:

    def clear_qa():
        st.session_state.qa = None
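To see the whole load-once-then-reset flow without running Streamlit, here is a minimal simulation. A plain dict stands in for st.session_state, rerun() mimics one top-to-bottom script run, and fake_document_loader is a hypothetical placeholder for the costly loader; all three are assumptions for illustration.

```python
from typing import Optional

# A plain dict stands in for st.session_state.
session_state = {"qa": None}
load_calls = []  # records each costly load, for illustration only


def fake_document_loader(data: bytes) -> str:
    # Hypothetical placeholder for the expensive document_loader call.
    load_calls.append(len(data))
    return f"loaded {len(data)} bytes"


def clear_qa() -> None:
    # In the app, this is what on_change=clear_qa on the uploader would call.
    session_state["qa"] = None


def rerun(uploaded: Optional[bytes]) -> Optional[str]:
    # Mirrors one script run: load only if nothing is stored yet.
    if uploaded is not None and session_state["qa"] is None:
        session_state["qa"] = fake_document_loader(uploaded)
    return session_state["qa"]


rerun(b"first pdf")    # first run with a file: loads
rerun(b"first pdf")    # later interaction: reuses the stored result
clear_qa()             # the uploaded file changed, so the callback fires
rerun(b"second pdf!")  # loads again for the new file
```

In the real app the reset would be wired up with st.sidebar.file_uploader("Upload a file", key="uploaded_file", on_change=clear_qa); Streamlit runs the callback before the rerun that follows the change.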