Tags: python, memory, streamlit, langchain

How to release memory correctly in a Streamlit app?


Recently, while developing a Streamlit app, I found that the app frequently crashes and requires a manual reboot.

After spending some time investigating, I identified the issue as exceeding the available RAM. The free hosting tier provides only 1 GB of RAM, and my app easily surpasses this limit when multiple users use it simultaneously.

What the App Does

  1. It uses LangChain to build a document GPT.
  2. Users upload PDFs and start asking questions.

Main Problematic Code

Complete code on GitHub

app.py

import streamlit as st

def create_doc_gpt(docs):
    if not docs:
        return
    # ... instantiate docGPT, which uses HuggingFaceEmbedding
    ...

model = None

doc_container = st.container()
with doc_container:
    # When the user uploads a PDF via upload_and_process_pdf(),
    # create_doc_gpt() executes successfully.
    docs = upload_and_process_pdf()
    model = create_doc_gpt(docs)
    del docs
    st.write('---')

What I've Tried

I tried to identify where in the code the issue lies and whether it could be optimized. I ran the following experiments:

  1. Used Windows Task Manager's detailed view.

  2. Executed the app (streamlit run app.py), identified its PID, and observed its memory usage (see the sketch after this list for measuring the same thing from inside the app).

  3. When the app first opens, it occupies about 150,000 KB of memory.

  4. Based on the simplified code above, after a PDF is uploaded, the docGPT instance (my model) is created. At this point, memory rapidly spikes to 1,000,000 KB. I suspect HuggingFaceEmbedding is the cause (when I switched to a lighter embedding, memory usage dropped significantly).

  5. The model instance is therefore the main source of memory usage, yet when I re-upload the same PDF, memory increases again, to 1,750,000 KB. It seems as though two models are occupying memory at the same time.

  6. Additionally, I tried repeatedly uploading the same PDF to my app. After uploading the 8,000 KB file approximately 4 times, the app crashes.
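
(For anyone wanting to reproduce these measurements without Task Manager, here is a minimal sketch, assuming the psutil package is installed; log_memory is just an illustrative helper. It prints the Streamlit process's resident memory on each rerun, which roughly matches the Task Manager figures.)

import os

import psutil
import streamlit as st

def log_memory(label: str) -> None:
    # Resident set size (RSS) of the current process, reported in KB.
    rss_kb = psutil.Process(os.getpid()).memory_info().rss / 1024
    st.write(f"{label}: {rss_kb:,.0f} KB")

log_memory("before creating the model")
# model = create_doc_gpt(docs)
log_memory("after creating the model")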

Question

How should I correctly release the initially instantiated model?

If I decorate create_doc_gpt(docs) with st.cache_resource, I have a few points of confusion:

  1. When the same user uploads the first PDF, the embedding is performed and the model is returned. At that point, does the app create a cache entry that occupies memory? If the user then uploads a new PDF, will the app run the embedding again, return a new model, and create another cache entry that occupies additional memory?

  2. If the assumption in #1 is correct, can I use the ttl and max_entries parameters to avoid excessive caching? (A sketch of what I mean follows this list.)

  3. If the assumptions in #1 and #2 are correct, when two users are online simultaneously and max_entries is set to 2, will the models they cache be counted separately?
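
For concreteness, the decoration I have in mind looks roughly like this (the ttl and max_entries values are only placeholders, and as far as I understand Streamlit must also be able to hash the docs argument to build the cache key):

@st.cache_resource(ttl=3600, max_entries=2)  # e.g. expire after 1 hour, keep at most 2 entries
def create_doc_gpt(docs):
    if not docs:
        return
    # ... instantiate docGPT, which uses HuggingFaceEmbedding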


I'm unsure if this type of question is appropriate to ask here. If it's against the rules, I'm willing to delete the post and seek help elsewhere.


Solution

  • I would recommend using session state in your application as much as possible. In combination with that, you should use @st.cache_data and @st.cache_resource, for example on your function:

    @st.cache_resource
    def create_doc_gpt(docs):
        # Streamlit reuses the cached model when the same docs are passed
        # again, instead of building a second one.
        if not docs:
            return
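
    To combine this with session state, here is a rough sketch (reusing upload_and_process_pdf and create_doc_gpt from your question) that keeps a single model reference per session, so re-uploading a PDF replaces the old model instead of keeping a second one alive:

    docs = upload_and_process_pdf()
    if docs:
        # Replace this session's previous reference; the old model becomes
        # unreachable and can be garbage collected (or evicted from the
        # cache once ttl/max_entries limits apply, if the function is cached).
        st.session_state["model"] = create_doc_gpt(docs)

    model = st.session_state.get("model")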
    

    You could then also clear the session state and refresh the page like this:

    # Iterate over a copy of the keys so entries can be removed while looping.
    for key in list(st.session_state.keys()):
        del st.session_state[key]
    

    You can read more about caching and session state in the Streamlit documentation.
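
    As a usage example, here is a sketch of tying that clean-up to a button (st.rerun() is the current API; older Streamlit versions use st.experimental_rerun() instead):

    if st.button("Reset app"):
        for key in list(st.session_state.keys()):
            del st.session_state[key]
        st.cache_resource.clear()  # drop cached models
        st.cache_data.clear()      # drop cached data
        st.rerun()                 # refresh the page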