python pandas error-handling json-normalize isbnlib

How to handle returned errors from applying isbnlib.meta with pandas


I'm using isbnlib.meta, which pulls metadata (book title, author, year, publisher, etc.) when you enter an ISBN. I have a dataframe with 482,000 ISBNs (column title: isbn13). When I run the function, I get an error like NotValidISBNError, which stops the code in its tracks. What I want is for the code to skip any row that raises an error and move on to the next one.

Here is my code now:

list_df[0]['publisher_isbnlib'] = list_df[0]['isbn13'].apply(lambda x: isbnlib.meta(x).get('Publisher', None))
list_df[0]['yearpublished_isbnlib'] = list_df[0]['isbn13'].apply(lambda x: isbnlib.meta(x).get('Year', None))
#list_df[0]['language_isbnlib'] = list_df[0]['isbn13'].apply(lambda x: isbnlib.meta(x).get('Language', None))
list_df[0]

list_df[0] is the first 20,000 rows, since I'm trying to chunk through the dataframe. I've just manually entered this code 24 times to handle each chunk.

I attempted try: and except:, but all that happens is that the code stops and I don't get any metadata reported.

Traceback:

---------------------------------------------------------------------------
NotValidISBNError                         Traceback (most recent call last)
<ipython-input-39-a06c45d36355> in <module>
----> 1 df['meta'] = df.isbn.apply(isbnlib.meta)

e:\Anaconda3\lib\site-packages\pandas\core\series.py in apply(self, func, convert_dtype, args, **kwds)
   4198             else:
   4199                 values = self.astype(object)._values
-> 4200                 mapped = lib.map_infer(values, f, convert=convert_dtype)
   4201 
   4202         if len(mapped) and isinstance(mapped[0], Series):

pandas\_libs\lib.pyx in pandas._libs.lib.map_infer()

e:\Anaconda3\lib\site-packages\isbnlib\_ext.py in meta(isbn, service)
     23 def meta(isbn, service='default'):
     24     """Get metadata from Google Books ('goob'), Open Library ('openl'), ..."""
---> 25     return query(isbn, service) if isbn else {}
     26 
     27 

e:\Anaconda3\lib\site-packages\isbnlib\dev\_decorators.py in memoized_func(*args, **kwargs)
     22             return cch[key]
     23         else:
---> 24             value = func(*args, **kwargs)
     25             if value:
     26                 cch[key] = value

e:\Anaconda3\lib\site-packages\isbnlib\_metadata.py in query(isbn, service)
     18     if not ean:
     19         LOGGER.critical('%s is not a valid ISBN', isbn)
---> 20         raise NotValidISBNError(isbn)
     21     isbn = ean
     22     # only import when needed

NotValidISBNError: (abc) is not a valid ISBN

Solution

    • The current implementation for extracting ISBN metadata is very slow and inefficient.
      • As stated, there are 482,000 unique isbn values, and the data for each one is being downloaded multiple times (once for each column, as the code is currently written).
    • It is better to download all the metadata once, and then extract the data from the dict as a separate operation.
    • A try-except block is used to capture the error from invalid isbn values.
      • An empty dict, {}, is returned, because pd.json_normalize won't work with NaN or None.
    • Chunking the isbn column is unnecessary.
    • pd.json_normalize is used to expand the dict returned from .meta.
    • Use pandas.DataFrame.rename to rename columns, and pandas.DataFrame.drop to delete columns.
    • This implementation will be significantly faster than the current implementation, and will make far fewer requests to the API being used to get the metadata.
    • To extract values from lists, such as the 'Authors' column, use df = df.explode('Authors'); if there is more than one author, a new row will be created for each additional author in the list.
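The "download once, extract afterwards" idea can be illustrated offline before looking at the pandas version. The sketch below is only an illustration: fake_meta is a hypothetical stand-in for isbnlib.meta that makes no network requests and returns made-up values, and ValueError stands in for NotValidISBNError. It shows that reading several fields from a stored dict costs no additional lookups.

```python
from typing import Callable, List

calls: List[str] = []  # records every simulated request


def fake_meta(isbn: str) -> dict:
    """Hypothetical stand-in for isbnlib.meta: no network, made-up values."""
    calls.append(isbn)
    if not isbn.isdigit():
        # stands in for isbnlib.NotValidISBNError
        raise ValueError(f'{isbn} is not a valid ISBN')
    return {'ISBN-13': isbn, 'Publisher': 'Example Press', 'Year': '2000'}


def safe_meta(isbn: str, fetch: Callable[[str], dict] = fake_meta) -> dict:
    """Return the metadata dict, or an empty dict if the lookup raises."""
    try:
        return fetch(isbn)
    except ValueError:
        return {}


isbns = ['9780446310789', 'abc', '9781491962299']

# one lookup per ISBN ...
records = [safe_meta(isbn) for isbn in isbns]

# ... then any number of fields can be read without further lookups
publishers = [r.get('Publisher') for r in records]
years = [r.get('Year') for r in records]

print(len(calls))  # number of simulated requests
print(publishers)
```

Because the try-except sits inside the function that handles a single value, one bad ISBN yields an empty dict and the loop carries on with the next value. The pandas version of the same pattern follows.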
    import pandas as pd  # version 1.1.3
    import isbnlib  # version 3.10.3
    
    # sample dataframe
    df = pd.DataFrame({'isbn': ['9780446310789', 'abc', '9781491962299', '9781449355722']})
    
    # function with try-except, for invalid isbn values
    # .apply passes one value of the column at a time, so the argument is a single isbn
    def get_meta(isbn: str) -> dict:
        try:
            return isbnlib.meta(isbn)
        except isbnlib.NotValidISBNError:
            return {}
    
    
    # get the meta data for each isbn or an empty dict
    df['meta'] = df.isbn.apply(get_meta)
    
    # df
                isbn                                                                                                                                                                                                                                                   meta
    0  9780446310789                                                                                   {'ISBN-13': '9780446310789', 'Title': 'To Kill A Mockingbird', 'Authors': ['Harper Lee'], 'Publisher': 'Grand Central Publishing', 'Year': '1988', 'Language': 'en'}
    1            abc                                                                                                                                                                                                                                                     {}
    2  9781491962299  {'ISBN-13': '9781491962299', 'Title': 'Hands-On Machine Learning With Scikit-Learn And TensorFlow - Techniques And Tools To Build Learning Machines', 'Authors': ['Aurélien Géron'], 'Publisher': "O'Reilly Media", 'Year': '2017', 'Language': 'en'}
    3  9781449355722                                                                                                                  {'ISBN-13': '9781449355722', 'Title': 'Learning Python', 'Authors': ['Mark Lutz'], 'Publisher': '', 'Year': '2013', 'Language': 'en'}
    
    # extract all the dicts in the meta column
    df = df.join(pd.json_normalize(df.meta)).drop(columns=['meta'])
    
    # extract values from the lists in the Authors column
    df = df.explode('Authors')
    
    # df
                isbn        ISBN-13                                                                                                         Title         Authors                 Publisher  Year Language
    0  9780446310789  9780446310789                                                                                         To Kill A Mockingbird      Harper Lee  Grand Central Publishing  1988       en
    1            abc            NaN                                                                                                           NaN             NaN                       NaN   NaN      NaN
    2  9781491962299  9781491962299  Hands-On Machine Learning With Scikit-Learn And TensorFlow - Techniques And Tools To Build Learning Machines  Aurélien Géron            O'Reilly Media  2017       en
    3  9781449355722  9781449355722                                                                                               Learning Python       Mark Lutz                            2013       en
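
With 482,000 values, it may also help to flag malformed ISBNs before any request is made, so that rows like 'abc' never reach the lookup at all. The following is a minimal sketch of the standard ISBN-13 check-digit rule (weights 1 and 3 alternating, total divisible by 10). The name is_valid_isbn13 is hypothetical and not part of isbnlib; isbnlib ships its own validators (for example is_isbn13), which would normally be the better choice. The sketch also assumes the column holds only ISBN-13 values, so it is stricter than isbnlib.meta, which accepts other forms as well.

```python
def is_valid_isbn13(value) -> bool:
    """Return True if value has 13 digits (hyphens and spaces ignored) and a correct check digit."""
    digits = str(value).replace('-', '').replace(' ', '')
    if len(digits) != 13 or not (digits.isascii() and digits.isdigit()):
        return False
    # weights alternate 1, 3, 1, 3, ... across all 13 digits
    total = sum(int(d) * (3 if i % 2 else 1) for i, d in enumerate(digits))
    return total % 10 == 0


isbns = ['9780446310789', 'abc', '9781491962299', '9780446310780']

# keep only the values that pass the check-digit test
valid = [isbn for isbn in isbns if is_valid_isbn13(isbn)]
print(valid)
```

In the dataframe above, the same function could serve as a mask, for example df[df.isbn.apply(is_valid_isbn13)], before calling get_meta. The try-except in get_meta is still worth keeping as a safety net.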