
How to group data using pandas by an array column


I have a data frame collected from a CSV in the following format:

Book Name,Languages
"Book 1","['Portuguese','English']"
"Book 2","['English','Japanese']"
"Book 3","['Spanish','Italian','English']"
...

I was able to convert the string representation of the array in the Languages column to a Python list using transform, but now I'm struggling to find a way to group books by language.

I would like to produce from this data set a dict like this:

{
  'Portuguese': 'Book 1',
  'English': ['Book 1', 'Book 2', 'Book 3'],
  'Spanish': 'Book 3',
  'Italian': 'Book 3',
  'Japanese': 'Book 2'
}

I looked into groupby on the array column but could not figure out how to make each entry in the array a grouping key.

Any pointers would be really appreciated.


Solution

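  • You can do this by iterating over the DataFrame rows and building a dictionary as you go. A language maps to a single book name until a second book appears, at which point the value becomes a list.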

    import pandas as pd
    import ast
    
    data = {
        "Book Name": ["Book 1", "Book 2", "Book 3"],
        "Languages": ["['Portuguese','English']", "['English','Japanese']", "['Spanish','Italian','English']"]
    }
    df = pd.DataFrame(data)
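    # Parse each string such as "['Portuguese','English']" into a real Python list.
    df["Languages"] = df["Languages"].apply(ast.literal_eval)

    # Maps each language to one book name (str) or to a list of book names.
    language_dict = {}
    for _, row in df.iterrows():
        book_name = row["Book Name"]
        for lang in row["Languages"]:
            if lang in language_dict:
                if isinstance(language_dict[lang], list):
                    # Third or later book for this language: extend the list.
                    language_dict[lang].append(book_name)
                else:
                    # Second book for this language: promote the string to a list.
                    language_dict[lang] = [language_dict[lang], book_name]
            else:
                # First book for this language: store the bare name.
                language_dict[lang] = book_name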
    
    print(language_dict)
    

    The output will be (reformatted here for readability):

    {
      'Portuguese': 'Book 1',
      'English': ['Book 1', 'Book 2', 'Book 3'],
      'Japanese': 'Book 2',
      'Spanish': 'Book 3',
      'Italian': 'Book 3'
    }
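
As an alternative to the explicit loop, the same grouping can probably be expressed with pandas itself. The sketch below is not part of the answer above; it assumes pandas 0.25 or newer (for `DataFrame.explode`) and the same sample data, and the name `books_by_language` is made up for illustration. Note that it always produces a list per language, even for a single book, which differs slightly from the dict requested in the question.

```python
import ast

import pandas as pd

df = pd.DataFrame({
    "Book Name": ["Book 1", "Book 2", "Book 3"],
    "Languages": [
        "['Portuguese','English']",
        "['English','Japanese']",
        "['Spanish','Italian','English']",
    ],
})
# Turn the string representation into real Python lists.
df["Languages"] = df["Languages"].apply(ast.literal_eval)

books_by_language = (
    df.explode("Languages")                # one row per (book, language) pair
      .groupby("Languages", sort=False)    # sort=False keeps first-seen order
      ["Book Name"]
      .agg(list)                           # collect the book names per language
      .to_dict()
)
print(books_by_language)
```

If a bare string is preferred for languages with only one book, a dict comprehension such as `{k: v[0] if len(v) == 1 else v for k, v in books_by_language.items()}` should collapse the single-item lists, although keeping every value a list is usually easier to work with downstream.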