Tags: python, pandas, combinations, venn-diagram, matplotlib-venn

How to dynamically get the combinations from Venn diagrams in Python


I can plot Venn diagrams with pyvenn, choosing which columns to compare either by position, with musiciansdf.iloc[:, 0:3], or by name, with musiciansdf = musiciansdf.loc[:, ["Played at Woodstock", "Members of The Beatles", "Guitarists"]] (anywhere from 2 to 6 keys, here 3), as in:

import pandas as pd
from venn import venn

musiciansdf = pd.DataFrame({
    "Members of The Beatles": ["Paul McCartney", "John Lennon", "George Harrison", "Ringo Starr"],
    "Members of The Beats": ["Paul McCartney", "Lennon", "George Harrison", "Starr"],
    "Guitarists": ["John Lennon", "George Harrison", "Jimi Hendrix", "Eric Clapton"],
    "Played at Woodstock": ["Jimi Hendrix", "Carlos Santana", "Keith Moon", "Carlos Santana"],
    "Played at more": ["Jimi Hendrix", "Santana", "Keith Moon", "Santana"],
    "Cheese factory": ["Jimi", "Carlos Santana", "Keith", "Carlos Santana"]
})
musiciansdf = musiciansdf.iloc[:, 0:3]

Then I put the data in the right format (a dictionary with sets as values) with

vennmus = {}
for k, v in musiciansdf.to_dict('list').items():
    vennmus[k] = set(v)

And plot with

venn(vennmus)

But is there a way to get the values in each part of the Venn diagram, along with the corresponding key combination? Something like a dictionary listing every region of the diagram (each set on its own and each intersection) with the values that fall in it. I know I could check which columns are used and write out the set operations manually for every combination, but I'm looking for a quicker, dynamic way.

For example, if I use musiciansdf.iloc[:, 0:2] I would want a dict like,

{'Members of The Beatles only': {'John Lennon',
  'Ringo Starr'},
 'Members of The Beats only': {'Lennon',
  'Starr'},
 'Members of The Beatles & Members of The Beats': {'George Harrison',
  'Paul McCartney'}
}

matplotlib-venn could be used instead if it's a better option. I'm looking for a solution that works whether the columns are selected with musiciansdf = musiciansdf.loc[:, ["Played at Woodstock", "Members of The Beatles", "Guitarists"]] or with musiciansdf = musiciansdf.iloc[:, 0:3], so the columns may or may not be in their original order.


Solution

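  • If you're tempted to use a pure pandas approach:

    d = (
        musiciansdf.iloc[:, 0:2]  # or `.loc` with a list of column names
        # Long format: one (membership, musician) row per cell.
        .stack().droplevel(0).rename_axis("membership")
        .reset_index(name="musician").drop_duplicates()
        # Label each musician with the combination of sets they appear in.
        .groupby("musician", as_index=False).agg(
            lambda x: " & ".join(x) if len(x) > 1 else x.iloc[0] + " only")
        # Collect the musicians that share a label into one set.
        .groupby("membership")["musician"].agg(set).to_dict()
    )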
    

    Output:

    print(d)
    
    {'Members of The Beatles & Members of The Beats': {'George Harrison',
      'Paul McCartney'},
     'Members of The Beatles only': {'John Lennon', 'Ringo Starr'},
     'Members of The Beats only': {'Lennon', 'Starr'}}
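
  • If plain Python sets are preferable, the same idea can be sketched without groupby. This is not part of the original answer: venn_regions is a hypothetical helper name, and the DataFrame below is a four-column subset of the question's data. For every distinct value, the helper lists the selected columns that contain it, always in the DataFrame's column order, and uses that list as the region label, so the label follows whatever order `.loc` or `.iloc` produced.

```python
import pandas as pd


def venn_regions(df):
    """Map each exclusive Venn region label to the set of values in it.

    Hypothetical helper (not from pyvenn or matplotlib-venn).
    """
    # One set per selected column, kept in the DataFrame's column order.
    sets = {col: set(df[col].dropna()) for col in df.columns}
    regions = {}
    for value in set().union(*sets.values()):
        # Columns that contain this value, listed in column order.
        members = [col for col, items in sets.items() if value in items]
        if len(members) > 1:
            label = " & ".join(members)
        else:
            label = f"{members[0]} only"
        regions.setdefault(label, set()).add(value)
    return regions


# Subset of the question's data, repeated so the sketch runs on its own.
musiciansdf = pd.DataFrame({
    "Members of The Beatles": ["Paul McCartney", "John Lennon", "George Harrison", "Ringo Starr"],
    "Members of The Beats": ["Paul McCartney", "Lennon", "George Harrison", "Starr"],
    "Guitarists": ["John Lennon", "George Harrison", "Jimi Hendrix", "Eric Clapton"],
    "Played at Woodstock": ["Jimi Hendrix", "Carlos Santana", "Keith Moon", "Carlos Santana"],
})

# Selection by name, in a different order than the original columns.
regions = venn_regions(
    musiciansdf.loc[:, ["Played at Woodstock", "Members of The Beatles", "Guitarists"]]
)
```

    With this selection, regions has five keys: 'Played at Woodstock only', 'Members of The Beatles only', 'Guitarists only', 'Played at Woodstock & Guitarists' and 'Members of The Beatles & Guitarists'. Regions that contain no values are simply absent from the dict.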