Tags: python, pandas, jupyter-notebook, io

Is a file read in an imported module kept in memory in Jupyter Notebook?


There are a bunch of datasets that I have to import/preprocess many times.

What I'm doing is putting all of the pd.read_csv() calls inside a single my_datasets.py file, like this:

# my_datasets.py

import pandas as pd

dataset1 = pd.read_csv('file1.csv')
dataset2 = pd.read_csv('file2.csv')
dataset3 = pd.read_csv('file3.csv')

and then I simply import this module from Jupyter Notebook whenever I need some data.

When I do this in EDA.ipynb, are dataset1, dataset2, and dataset3 stored in RAM, so that I don't trigger file I/O every time I access my_datasets.dataset1?

Are there any other inefficiencies that you'd like to address?


Solution

  • TL;DR:

    Have you tried %run my_datasets.py, rather than import, for your intended use?



    The details:

    You most likely don't want to use import if, as you state, you are doing this to 'import/preprocess many times'. import caches modules so that it doesn't waste time re-importing code it has already loaded, and any subsequent import of the same module name is ignored for the rest of the active session. Thus, if you update file2.csv while your notebook is running and then re-run the statement that imports my_datasets.py, you won't get the updated dataset2 that you are probably expecting. A minimal sketch of this caching behavior follows.
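
    For example, here is a quick way to see the module cache at work in a notebook cell, assuming my_datasets.py sits next to the notebook (the comments describe standard Python behavior):

    import sys
    import my_datasets                   # runs the pd.read_csv() calls once

    import my_datasets                   # no-op: Python finds 'my_datasets' already in
                                         # sys.modules and does NOT re-read the CSV files
    print('my_datasets' in sys.modules)  # True: the module object is cached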

    If you want to run the code in my_datasets.py interactively in the same kernel as your notebook, so that the script can use what you've already defined in the notebook and the notebook can use what the script defines, you can do this inside a cell in your notebook:

    %run -i my_datasets.py
    

    See the IPython documentation about the %run magic; note the use of the -i flag described there.
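
    As a rough illustration of what -i buys you (the file name preprocess.py and the variable data_dir below are hypothetical, not from the question):

    # in a notebook cell, define something first
    data_dir = 'data/'

    # preprocess.py (hypothetical) can then refer to data_dir directly, e.g.
    #     import pandas as pd
    #     dataset1 = pd.read_csv(data_dir + 'file1.csv')

    # running the script with -i lets it see the notebook's namespace
    %run -i preprocess.py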

    If you don't want that script to have access to anything already in the notebook's namespace, but do want the variables (objects) it defines to be accessible in the current notebook afterwards, you can simply use:

    %run my_datasets.py
    

    For your use case here, that may be sufficient.
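
    Either way, once the %run finishes, the top-level names defined in the script are placed in the notebook's namespace, so (unlike with import) you refer to dataset1 directly rather than as my_datasets.dataset1:

    %run my_datasets.py
    dataset1.head()        # no my_datasets. prefix needed after %run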


    And if you need fine-grained control over what is in memory at any given time, loading only certain datasets, you'd do that in the notebook itself (or in a script you run from the notebook), as juanpa.arrivillaga suggests. You wouldn't put all 100 read calls in my_datasets.py and then expect to be able to selectively skip some. Instead, you can clear out objects in your notebook code and read only the big ones you currently need, keeping memory use low, while leaving just the small ones in my_datasets.py so they load safely every time. One way to organize that is sketched below.
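
    A sketch of that arrangement (which datasets count as 'small' versus 'big', and the load_dataset3 name, are illustrative assumptions):

    # my_datasets.py
    import pandas as pd

    # small files: cheap enough to read eagerly every time the script is run
    dataset1 = pd.read_csv('file1.csv')
    dataset2 = pd.read_csv('file2.csv')

    # big files: wrap the read in a function so the notebook loads them on demand
    def load_dataset3():
        return pd.read_csv('file3.csv')

    Then, in the notebook:

    %run my_datasets.py
    dataset3 = load_dataset3()   # read the big file only when you need it
    # ... work with dataset3 ...
    del dataset3                 # drop the reference so the memory can be reclaimed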


    If you do want to use import, you can take advantage of the tricks for reloading modules in Jupyter, as if you were developing a module; for example:
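
    A minimal sketch using the standard mechanisms, importlib.reload() and IPython's autoreload extension:

    import importlib
    import my_datasets

    # ... later, after file2.csv or my_datasets.py has changed:
    importlib.reload(my_datasets)   # re-executes the module, so the pd.read_csv()
                                    # calls run again and dataset2 is refreshed
    my_datasets.dataset2.head()

    # or let IPython reload changed modules automatically before each cell runs
    # (note: autoreload only re-imports when the .py source itself changes):
    %load_ext autoreload
    %autoreload 2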