python dictionary dictionary-comprehension

Set value as key and a list of values as value in Python

I have a big dictionary (250k+ keys) like this:

dict = {
        0: [apple, green],
        1: [banana, yellow],
        2: [apple, red],
        3: [apple, brown],
        4: [kiwi, green],
        5: [kiwi, brown],             
        ...
}

Goal to achieve:

1. I want a new dictionary with the first value of the list as key, and a list of values for the same key. Something like this:

new_dict = {
        apple: [green, red, brown],
        banana: [yellow],
        kiwi: [green, brown],
        ...
}

2. After that I want to count the number of values for each key (e.g. {apple: 3, banana: 1, kiwi: 2}). This could easily be done with a Counter, so it shouldn't be a problem. Then I want to select only the keys that have a certain number of values. For example, if I want to keep only the keys associated with 2 or more values, final_dict will be this:

final_dict = {
              apple:3,
              kiwi:2,
              ....
}

3. Then I want to return the original keys from dict of the elements that have at least 2 values, so at the end I will have:

original_keys_with_at_least_2_values = [0, 2, 3, 4, 5]

My code

# Create new_dict like: new_dict = {apple:None, banana:None, kiwi:None,..}  

new_dict = {k: None for k in dict.values()[0]}  
for k in new_dict.keys():
    for i in dict.values()[0]:
        if i == k:
        new_dict[k] = dict[i][1]

I'm stuck with nested for loops. I know a Python comprehension would be faster, but I really don't know how to solve this. Any solution or idea would be appreciated.

Solution

You can use a defaultdict to group the items by the first entry of each list:

from collections import defaultdict

fruits = defaultdict(list)

data = {
  0: ['apple', 'green'],
  1: ['banana', 'yellow'],
  2: ['apple', 'red'],
  3: ['apple', 'brown'],
  4: ['kiwi', 'green'],
  5: ['kiwi', 'brown']
}

# The original keys are not needed for grouping, so iterate over the values only
for v in data.values():
  fruits[v[0]].extend(v[1:])

print(dict(fruits))
# {'apple': ['green', 'red', 'brown'], 'banana': ['yellow'], 'kiwi': ['green', 'brown']}

If any list has fewer than two entries, you'll need to account for that: an empty list makes v[0] raise an IndexError, and a one-element list adds its key with no values.
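One way to handle such short lists is to skip them while grouping. This is only a sketch: the extra entries 6 and 7 are made-up sample data, and skipping is an assumption, since you may prefer to raise an error or keep the key with an empty list instead.

```python
from collections import defaultdict

# Hypothetical sample data: key 6 has a name but no value, key 7 is empty.
data = {
    0: ['apple', 'green'],
    1: ['banana', 'yellow'],
    2: ['apple', 'red'],
    6: ['mango'],
    7: [],
}

fruits = defaultdict(list)
for values in data.values():
    if len(values) < 2:
        # Skip entries that lack a name or have no value after the name.
        continue
    name, *rest = values
    fruits[name].extend(rest)

print(dict(fruits))
# {'apple': ['green', 'red'], 'banana': ['yellow']}
```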

Then use a dict comprehension to get the counts. Passing this dictionary to Counter won't give you the lengths of those lists.

fruits_count = {k: len(v) for k, v in fruits.items()}
fruits_count_with_at_least_2 = {k: v for k, v in fruits_count.items() if v >= 2}
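If you only need the counts and not the grouped lists, a Counter can still work when it is fed the first element of each list rather than the grouped dictionary. The sketch below assumes every list holds exactly one value after the name, as in the sample data, so counting names gives the same numbers as the list lengths.

```python
from collections import Counter

data = {
    0: ['apple', 'green'],
    1: ['banana', 'yellow'],
    2: ['apple', 'red'],
    3: ['apple', 'brown'],
    4: ['kiwi', 'green'],
    5: ['kiwi', 'brown'],
}

# Count how many entries start with each fruit name.
fruits_count = Counter(values[0] for values in data.values())

# Keep only the names that occur at least twice.
fruits_count_with_at_least_2 = {k: v for k, v in fruits_count.items() if v >= 2}

print(fruits_count_with_at_least_2)
# {'apple': 3, 'kiwi': 2}
```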

And then a loop is needed to collect the original keys:

original_keys_with_2_count = []
for k, values in data.items():
    fruit = values[0]
    count = fruits_count.get(fruit, -1)
    if count >= 2:
        original_keys_with_2_count.append(k)

print(original_keys_with_2_count)
# [0, 2, 3, 4, 5]
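The three steps can also be combined, with the final loop written as a list comprehension. This is a sketch only: the function name keys_with_min_values and its minimum parameter are hypothetical and not part of the code above.

```python
from collections import defaultdict


def keys_with_min_values(data, minimum=2):
    """Return the keys of data whose first list element has at least `minimum` grouped values."""
    # Step 1: group the remaining list elements by the first element.
    grouped = defaultdict(list)
    for values in data.values():
        grouped[values[0]].extend(values[1:])

    # Steps 2 and 3: keep the original keys whose group is large enough.
    return [k for k, values in data.items() if len(grouped[values[0]]) >= minimum]


data = {
    0: ['apple', 'green'],
    1: ['banana', 'yellow'],
    2: ['apple', 'red'],
    3: ['apple', 'brown'],
    4: ['kiwi', 'green'],
    5: ['kiwi', 'brown'],
}

print(keys_with_min_values(data))
# [0, 2, 3, 4, 5]
```

With 250k+ keys this stays linear: one pass to group and one pass to filter, with no nested loops.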