I have a function which accepts a file path. It's as below:

def document_loader(doc_path: str) -> Optional[Document]:
    """ This function takes in a document in a particular format and
    converts it into a Langchain Document Object

    Args:
        doc_path (str): A string representing the path to the PDF document.

    Returns:
        Optional[DocumentLoader]: An instance of the DocumentLoader class or None if the file is not found.
    """
    # try:
    loader = PyPDFLoader(doc_path)
    docs = loader.load()
    print("Document loader done")

PyPDFLoader is a wrapper around PyPDF2 that reads a PDF from a file path.
Now, when I call the function with a hardcoded file path string, as below:
document_loader('/Users/Documents/hack/data/abc.pdf')
The function works fine and is able to read the PDF.
But when I let a user upload their PDF file via Streamlit's file_uploader(), as below:

uploaded_file = st.sidebar.file_uploader("Upload a file", key="uploaded_file")
print(st.session_state.uploaded_file)
if uploaded_file is not None:
    filename = st.session_state.uploaded_file.name
    print(os.path.abspath(st.session_state.uploaded_file.name))
    document_loader(f'"{os.path.abspath(filename)}"')

I get the error:
ValueError: File path "/Users/Documents/hack/data/abc.pdf" is not a valid file or url
The statement print(os.path.abspath(st.session_state.uploaded_file.name)) prints out the same path as the hardcoded one.
Note: Streamlit is currently running on localhost on my laptop, and I am the "user" who is trying to upload a PDF via the locally running Streamlit app.
Edit1:
So as per @MAtchCatAnd I added tempfile and it WORKS. But with an issue:
The function that receives the temp file path re-runs every time a user interacts with the app. This is because the temp file path changes on each run, which makes the function re-run even though I decorated it with @st.cache_data.
The uploaded PDF file remains the same, so I don't want the function to re-run, as each run incurs some cost.
How can I fix this, given that Streamlit has deprecated the allow_output_mutation=True parameter of st.cache?
Here's the code:

@st.cache_data
def document_loader(doc_path: str) -> Optional[Document]:
    """ This function takes in a document in a particular format and
    converts it into a Langchain Document Object

    Args:
        doc_path (str): A string representing the path to the PDF document.

    Returns:
        Optional[DocumentLoader]: An instance of the DocumentLoader class or None if the file is not found.
    """
    # try:
    loader = PyPDFLoader(doc_path)
    docs = loader.load()
    print("Document loader done")

uploaded_file = st.sidebar.file_uploader("Upload a file", key="uploaded_file")
if uploaded_file is not None:
    with tempfile.NamedTemporaryFile(delete=False) as temp_file:
        temp_file.write(uploaded_file.getvalue())
        temp_file_path = temp_file.name
    print(temp_file_path)
    custom_qa = document_loader(temp_file_path)

The object returned by st.file_uploader is a "file-like" object inheriting from BytesIO.
From the docs:
The UploadedFile class is a subclass of BytesIO, and therefore it is "file-like". This means you can pass them anywhere where a file is expected.
While the returned object does have a name attribute, it has no path. It exists in memory and is not associated with a real, saved file. Although Streamlit may be run locally, it still has a server-client structure in which the Python backend usually runs on a different computer from the user's. As such, the file_uploader widget is not designed to provide any real access or pointer to the user's file system.
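
To see why the path built in the question cannot work, here is a minimal sketch using only the standard library. FakeUpload is a hypothetical stand-in for the uploaded-file object (a BytesIO subclass with a name attribute), not Streamlit's real class. os.path.abspath only joins the name onto the current working directory; it never checks that a file exists there. The f-string in the question also wraps the result in literal double quotes, which are not part of any real path.

```python
import io
import os
import tempfile


class FakeUpload(io.BytesIO):
    """Hypothetical stand-in for an uploaded file: bytes in memory plus a name."""

    def __init__(self, data: bytes, name: str) -> None:
        super().__init__(data)
        self.name = name


upload = FakeUpload(b"%PDF-1.4 example bytes", "abc.pdf")

# Run from an empty scratch directory so no real abc.pdf can be present.
original_cwd = os.getcwd()
with tempfile.TemporaryDirectory() as scratch:
    os.chdir(scratch)
    try:
        # abspath builds a string from the working directory; it never touches the disk.
        guessed_path = os.path.abspath(upload.name)
        path_exists = os.path.isfile(guessed_path)
        # The f-string from the question wraps the path in literal quote characters.
        quoted_path = f'"{guessed_path}"'
        quoted_exists = os.path.isfile(quoted_path)
    finally:
        os.chdir(original_cwd)

# The content is still available, but only from memory.
content = upload.getvalue()
```

In the question, the printed path only looked right because the app happened to be started from the folder that contains the original file; even then, the extra quote characters make the string an invalid path.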
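You should either work with the in-memory object directly, wherever a file-like object is accepted, or write its contents to a temporary file and pass that file's path.
Below is a brief example working with temp files; there is also another question about them that may be helpful.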

import streamlit as st
import tempfile
import pandas as pd

file = st.file_uploader('Upload a file', type='csv')
tempdir = tempfile.gettempdir()

if file is not None:
    with tempfile.NamedTemporaryFile(delete=False) as tf:
        tf.write(file.read())
        tf_path = tf.name
    st.write(tf_path)
    df = pd.read_csv(tf_path)
    st.write(df)

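
Because delete=False leaves the temporary file on disk, you may want to remove it once it has been read. A minimal sketch of that pattern using only the standard library; process_via_temp_file and read_size are hypothetical names for illustration, with read_size standing in for whatever loader needs a path:

```python
import os
import tempfile
from typing import Callable, List, TypeVar

T = TypeVar("T")


def process_via_temp_file(data: bytes, process: Callable[[str], T], suffix: str = "") -> T:
    """Write data to a temp file, pass its path to process, then delete the file."""
    with tempfile.NamedTemporaryFile(delete=False, suffix=suffix) as tf:
        tf.write(data)
        temp_path = tf.name
    # The file is closed at this point, so a reader sees the full content.
    try:
        return process(temp_path)
    finally:
        os.remove(temp_path)


seen_paths: List[str] = []


def read_size(path: str) -> int:
    """Stand-in for a loader: records the path it was given and returns the file size."""
    seen_paths.append(path)
    return os.path.getsize(path)


size = process_via_temp_file(b"hello", read_size, suffix=".txt")
```

This only suits loaders that read the whole file before returning. A loader that reads lazily would find the file already gone.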
I would remove the caching and instead rely on st.session_state to store your results.

if 'qa' not in st.session_state:
    st.session_state.qa = None

def document_loader(doc_path: str) -> Optional[Document]:
    loader = PyPDFLoader(doc_path)
    return loader  # or return loader.load(), whichever is more suitable

uploaded_file = st.sidebar.file_uploader("Upload a file", key="uploaded_file")
if uploaded_file is not None and st.session_state.qa is None:
    with tempfile.NamedTemporaryFile(delete=False) as temp_file:
        temp_file.write(uploaded_file.getvalue())
        temp_file_path = temp_file.name
    print(temp_file_path)
    st.session_state.qa = document_loader(temp_file_path)

custom_qa = st.session_state.qa

# put a check on custom_qa before continuing, either "is None" with
# stop or "is not None" with the rest of your code nested inside
if custom_qa is None:
    st.stop()

To reset the stored result when a different file is uploaded, add on_change=clear_qa to the file uploader:

def clear_qa():
    st.session_state.qa = None
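
If you would rather keep @st.cache_data, one alternative (not part of the approach above) is to name the temporary file after its content, so the same upload always yields the same path and the cache key stays stable. A minimal sketch using only the standard library; stable_temp_path is a hypothetical helper name and the .pdf suffix is an assumption:

```python
import hashlib
import tempfile
from pathlib import Path


def stable_temp_path(data: bytes, suffix: str = ".pdf") -> str:
    """Write data to a temp file named after its SHA-256 digest and return the path."""
    digest = hashlib.sha256(data).hexdigest()
    path = Path(tempfile.gettempdir()) / f"upload_{digest}{suffix}"
    # Identical content maps to the same file, so it is written only once.
    if not path.exists():
        path.write_bytes(data)
    return str(path)


first = stable_temp_path(b"same pdf bytes")
second = stable_temp_path(b"same pdf bytes")
other = stable_temp_path(b"different pdf bytes")
```

In the app you would call stable_temp_path(uploaded_file.getvalue()) and pass the result to the cached document_loader. Since the path string no longer changes between reruns for the same file, the cached function should not run again.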