python pandas dataframe dictionary feature-extraction

Find optimal combination of values of a dictionary of lists (perhaps with pandas)

The following problem is more an algorithm one than a code challenge.

Imagine I have a data structure as follows:

cities = {'price'   : ['malaga','berlin'],
          'food'    : ['milano','barcelona'],
          'shopping': ['milano','barcelona'],
          'weather' : ['barcelona','paris','lisabon','milano'],
          'museums' : ['malaga','berlin','lisabon'],
          'cafes'   : ['paris','roma','lisabon'],
          'kids'    : ['milano','barcelona','paris','roma']}

There are a bunch of characteristics that can be found in different cities. What is the minimal amount of cities that covers all those characteristics. I.e. the minimal amount of cities I have to visit in order to get all the benefits.

So far I started using Counter

totals=[]
for key in cities.keys():
    totals.append(cities[key])
totals_together = [city for cities in totals for city in cities]
totals_together
myCounter = Counter(totals_together)
print(myCounter.most_common())

RESULT so far:

[('milano', 4), ('barcelona', 4), ('paris', 3), ('lisabon', 3), ('malaga', 2), ('berlin', 2), ('roma', 2)]

myCounter gives me an idea of the best cities, but by far not yet the optimal combination of cities. from here I could get the first city, get the characteristics, and go on adding characteristics till all are there. Very tedious.

There should be a better way.

I am even thinking about pandas, but can not see what pandas would bring for this problem. This looks to me a pretty common problem.

NOTE: I am not even looking for the code as such, just ideas about how to embrace the problem are more than welcome.

NOTE2: Be aware that there might be one or more cities with all the characteristics, but there might be the case (normally) where there is no single city with all the characteristics.

SO THE RESULT I AM LOOKING FOR IS: ['milano','lisabon'] assuming that this combination covers all the characteristics.

Solution

One way to go ahead is to create all combinations (using itertools), and then running through them and counting the activities that these combinations give you. You can stop once you have found a combination that gives you all activities.

Using pandas gives you a simple way to calculate the number of possible activities in each city. I am sure you could also do without.

import pandas as pd
import itertools

travel = {'price':['malaga','berlin'],
          'food':['milano','barcelona'],
          'shopping':['milano','barcelona'],
          'weather':['barcelona','paris','lisabon','milano'],
          'museums':['malaga','berlin','lisabon'],
          'cafes':['paris','roma','lisabon'],
         'kids':['milano','barcelona','paris','roma']}

# very ugly way to convert the travel into a data frame
# first we create a list of all cities
c = []
for activity in travel.keys():
    for city in travel[activity]:
        c.append(city)
c = set(c)    
a = list(travel.keys())
df = pd.DataFrame(index=pd.Index(c, name='city'), 
                  columns=pd.Index(a, name='activity'))

# then we set all city/activity crosspoints to True
for activity in travel.keys():
    for city in travel[activity]:
        df.loc[city, activity] = True
# and fill the rest with False
df = df.fillna(False)

# how many activities do we want to do?
all_activities = len(df.columns)

# let's store the results in a dictionary

results = {}
for combo_len in range(1, len(df.index)):
    combos = list(itertools.combinations(df.index, combo_len))
    for c in combos:
        # print(f"Combo: {c}")
        activity_count = df.query(f"city in {c}").any().sum()
        results[c] = activity_count
        if activity_count == all_activities:
            print(f"{c}: {max_activities}")
            break
    else:
        continue
    break

The code will either stop when all combinations have been tried, or when it finds a combination which includes all activities.

The first possible combination it comes up with is:

('barcelona', 'paris', 'berlin'): 7