python, serialization, persistence

Saving an Object (Data persistence)


I've created an object like this:

company1.name = 'banana' 
company1.value = 40

I would like to save this object. How can I do that?


Solution

  • You could use the pickle module in the standard library. Here's an elementary application of it to your example:

    import pickle
    
    class Company(object):
        def __init__(self, name, value):
            self.name = name
            self.value = value
    
    with open('company_data.pkl', 'wb') as outp:
        company1 = Company('banana', 40)
        pickle.dump(company1, outp, pickle.HIGHEST_PROTOCOL)
    
        company2 = Company('spam', 42)
        pickle.dump(company2, outp, pickle.HIGHEST_PROTOCOL)
    
    del company1
    del company2
    
    with open('company_data.pkl', 'rb') as inp:
        company1 = pickle.load(inp)
        print(company1.name)  # -> banana
        print(company1.value)  # -> 40
    
        company2 = pickle.load(inp)
        print(company2.name) # -> spam
        print(company2.value)  # -> 42
    

    You could also define your own simple utility like the following which opens a file and writes a single object to it:

    def save_object(obj, filename):
        with open(filename, 'wb') as outp:  # Overwrites any existing file.
            pickle.dump(obj, outp, pickle.HIGHEST_PROTOCOL)
    
    # sample usage
    save_object(company1, 'company1.pkl')
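
    A matching loader can be sketched the same way. Note that load_object is a hypothetical name (it isn't part of the original answer), and a plain dict stands in for the Company instance so the snippet runs on its own:

```python
import os
import pickle
import tempfile

def save_object(obj, filename):
    with open(filename, 'wb') as outp:  # Overwrites any existing file.
        pickle.dump(obj, outp, pickle.HIGHEST_PROTOCOL)

def load_object(filename):
    """Hypothetical counterpart to save_object: read back one pickled object."""
    with open(filename, 'rb') as inp:
        # Only unpickle files you trust: loading a pickle can execute code.
        return pickle.load(inp)

# Round trip through a temporary directory so no file is left behind.
with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, 'company1.pkl')
    save_object({'name': 'banana', 'value': 40}, path)
    restored = load_object(path)

print(restored)
```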
    

    Update

    Since this is such a popular answer, I'd like to touch on a few slightly advanced usage topics.

    cPickle (or _pickle) vs pickle

    In Python 2, it's almost always preferable to use the cPickle module rather than pickle because the former is written in C and is much faster. There are some subtle differences between them, but in most situations they're equivalent and the C version will provide greatly superior performance. Switching to it couldn't be easier: just change the import statement to this:

    import cPickle as pickle
    

    In Python 3, cPickle was renamed _pickle, but doing this is no longer necessary since the pickle module now uses the C version automatically when it's available; see What difference between pickle and _pickle in python 3?.

    The rundown is you could use something like the following to ensure that your code will always use the C version when it's available in both Python 2 and 3:

    try:
        import cPickle as pickle
    except ImportError:  # Also covers ModuleNotFoundError, which only exists in Python 3.6+.
        import pickle
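
    If you're curious whether Python 3 really picked the C version, something like the following can check. This is only a sketch: it relies on pickle._Pickler, a private name that CPython uses for its pure-Python class, and using_c_pickler is a hypothetical helper, so other interpreters may behave differently:

```python
import pickle

def using_c_pickler():
    """Return True if pickle.Pickler is not the pure-Python class."""
    # pickle._Pickler is a private, CPython-specific name (an assumption here);
    # if it's missing, we can't tell, so report False.
    pure_python = getattr(pickle, '_Pickler', None)
    if pure_python is None:
        return False
    return pickle.Pickler is not pure_python

print('C-accelerated pickler in use:', using_c_pickler())
```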
    

    Data stream formats (protocols)

    pickle can read and write files in several different, Python-specific formats, called protocols, as described in the documentation. "Protocol version 0" is ASCII and therefore "human-readable". Versions > 0 are binary, and the highest one available depends on what version of Python is being used. The default also depends on the Python version: in Python 2 the default was protocol version 0, but in Python 3.8.1 it's protocol version 4. Python 3.x added a pickle.DEFAULT_PROTOCOL attribute to the module, but that doesn't exist in Python 2.
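
    To see what your own interpreter supports, a quick experiment along these lines may help (Python 3 only, because of DEFAULT_PROTOCOL). The record dict and the sizes mapping are just illustrative names:

```python
import pickle

record = {'name': 'banana', 'value': 40}

print('highest protocol:', pickle.HIGHEST_PROTOCOL)
print('default protocol:', pickle.DEFAULT_PROTOCOL)  # Python 3 only

# Pickle the same object with every protocol this interpreter knows about.
sizes = {}
for protocol in range(pickle.HIGHEST_PROTOCOL + 1):
    blob = pickle.dumps(record, protocol)
    assert pickle.loads(blob) == record  # every protocol round-trips
    sizes[protocol] = len(blob)
print('bytes per protocol:', sizes)

# Protocol 0 output for this record is plain ASCII, so it can be printed as text.
text_blob = pickle.dumps(record, 0)
print(text_blob.decode('ascii'))
```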

    Fortunately there's a shorthand for writing pickle.HIGHEST_PROTOCOL in every call (assuming that's what you want, and you usually do): just use the literal number -1, similar to referencing the last element of a sequence via a negative index. So, instead of writing:

    pickle.dump(obj, outp, pickle.HIGHEST_PROTOCOL)
    

    You can just write:

    pickle.dump(obj, outp, -1)
    

    Either way, you'd only have to specify the protocol once if you created a Pickler object for use in multiple pickle operations:

    pickler = pickle.Pickler(outp, -1)
    pickler.dump(obj1)
    pickler.dump(obj2)
    # etc...
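
    Reading such a stream back can be done with the matching Unpickler class. The sketch below uses an in-memory io.BytesIO buffer in place of a real file, and plain dicts in place of obj1 and obj2. Because a Pickler remembers objects it has already written across dump calls, pairing it with a single Unpickler for the whole stream is the safe choice:

```python
import io
import pickle

buffer = io.BytesIO()  # In-memory stand-in for a file opened with 'wb'.

pickler = pickle.Pickler(buffer, -1)  # Protocol given once, here.
pickler.dump({'name': 'banana', 'value': 40})
pickler.dump({'name': 'spam', 'value': 42})

buffer.seek(0)  # Rewind, as if reopening the file with 'rb'.

unpickler = pickle.Unpickler(buffer)  # One Unpickler for the whole stream.
first = unpickler.load()
second = unpickler.load()

print(first)
print(second)
```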
    

    Note: If you're in an environment running different versions of Python, then you'll probably want to explicitly use (i.e. hardcode) a specific protocol number that all of them can read (later versions can generally read files produced by earlier ones).
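
    As a rough illustration of hardcoding, the sketch below pins protocol 2, which was introduced in Python 2.3 and is readable by every Python 3 release. The constant PICKLE_PROTOCOL and the helper dumps_portable are hypothetical names, and the choice of 2 is an assumption about which interpreters need to share the file:

```python
import pickle

# Assumption: the oldest interpreter that must read the data is Python 2.3+.
PICKLE_PROTOCOL = 2

def dumps_portable(obj):
    """Pickle obj with the fixed protocol instead of the interpreter's default."""
    return pickle.dumps(obj, PICKLE_PROTOCOL)

blob = dumps_portable({'name': 'banana', 'value': 40})

# Pickles written with protocol 2 or higher begin with the PROTO opcode (0x80)
# followed by the protocol number, so the choice can be checked in the data.
header = blob[:2]
print(header)
print(pickle.loads(blob))
```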

    Multiple Objects

    While a pickle file can contain any number of pickled objects, as shown in the above samples, when there's an unknown number of them, it's often easier to store them all in some sort of variably-sized container, like a list, tuple, or dict and write them all to the file in a single call:

    tech_companies = [
        Company('Apple', 114.18), Company('Google', 908.60), Company('Microsoft', 69.18)
    ]
    save_object(tech_companies, 'tech_companies.pkl')
    

    and restore the list and everything in it later with:

    with open('tech_companies.pkl', 'rb') as inp:
        tech_companies = pickle.load(inp)
    

    The major advantage is you don't need to know how many object instances are saved in order to load them back later (although doing so without that information is possible, it requires some slightly specialized code). See the answers to the related question Saving and loading multiple objects in pickle file? for details on different ways to do this. Personally I liked @Lutz Prechelt's answer the best, so that's the approach used in the sample code below:

    import pickle

    class Company:
        def __init__(self, name, value):
            self.name = name
            self.value = value
    
    def pickle_loader(filename):
        """ Deserialize a file of pickled objects. """
        with open(filename, "rb") as f:
            while True:
                try:
                    yield pickle.load(f)
                except EOFError:  # Raised once the end of the file is reached.
                    break
    
    # Reads the 'company_data.pkl' file written by the first example above.
    print('Companies in pickle file:')
    for company in pickle_loader('company_data.pkl'):
        print('  name: {}, value: {}'.format(company.name, company.value))