python, pandas, labeling

cartesian coordinates to label


Is it possible to use coordinates for labelling? I was asked whether it was possible and told to see for myself, so I am trying to program it and read up more on it. I do not know what it is called or where to look, but the general idea is as follows:

Labels are converted into an N-dimensional space, and trajectories are calculated through that space. Based on the direction, a label is assigned with a confidence interval.

The data

basic_data = [
  {"label":"First-Person RPG", "Tags":["Open-world", "Fantasy", "Adventure", "Single-player", "Exploration", "Dragons", "Crafting", "Magic", "Story-rich", "Moddable"]},
  {"label":"Action RPG", "Tags":["Open-world", "Fantasy", "Story-rich", "Adventure", "Single-player", "Monsters", "Crafting", "Horse-riding", "Magic", "Narrative"]},
  {"label":"Adventure", "Tags":["Difficult", "Dark Fantasy", "Action", "Single-player", "Exploration", "Lore-rich", "Combat", "Permadeath", "Monsters", "Atmospheric"]},
  {"label":"Party Game", "Tags":["Multiplayer", "Social Deduction", "Indie", "Strategy", "Casual", "Space", "Deception", "Survival", "Teams", "Interactive"]}
]

Code for the first part is below:

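from typing import List

import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()

# Store each tag list as a comma-separated string
for idx, data in enumerate(basic_data):
    basic_data[idx]["tag_str"] = ",".join(data["Tags"])

pd_basic_data: pd.DataFrame = pd.DataFrame(basic_data)
# Split the strings back into one list of tags per row
tags: List = [str(pd_basic_data.loc[i, 'tag_str']).split(',') for i in range(len(pd_basic_data))]

# Multi-hot encode: one 0/1 column per distinct tag
mlb_result = mlb.fit_transform(tags)
df_final: pd.DataFrame = pd.concat([pd_basic_data['label'], pd.DataFrame(mlb_result, columns=list(mlb.classes_))], axis=1)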

A simple one-word answer naming the theory works as well. I just need to know where to look.


Solution

  • You are most probably referring to embeddings, usually combined with a dimensionality reduction technique such as PCA. Both are widely used in ML for tasks such as classification and clustering.

    If I were you, I would investigate word embeddings first: Word2Vec from the module gensim.models is a really good candidate. Basically, you convert words into a continuous vector space that preserves contextual relationships.

    Here is an example of how to do this:

    import pandas as pd
    from gensim.models import Word2Vec
    from sklearn.decomposition import PCA
    import matplotlib.pyplot as plt
    
    basic_data = [
        {"label": "First-Person RPG", "Tags": ["Open-world", "Fantasy", "Adventure", "Single-player", "Exploration", "Dragons", "Crafting", "Magic", "Story-rich", "Moddable"]},
        {"label": "Action RPG", "Tags": ["Open-world", "Fantasy", "Story-rich", "Adventure", "Single-player", "Monsters", "Crafting", "Horse-riding", "Magic", "Narrative"]},
        {"label": "Adventure", "Tags": ["Difficult", "Dark Fantasy", "Action", "Single-player", "Exploration", "Lore-rich", "Combat", "Permadeath", "Monsters", "Atmospheric"]},
        {"label": "Party Game", "Tags": ["Multiplayer", "Social Deduction", "Indie", "Strategy", "Casual", "Space", "Deception", "Survival", "Teams", "Interactive"]}
    ]
    
    # Treat each game's tag list as one "sentence"
    sentences = [data["Tags"] for data in basic_data]
    
    # Skip-gram (sg=1) model with 50-dimensional vectors; min_count=1 keeps every tag
    model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, sg=1)
    tag_embeddings = {tag: model.wv[tag] for tag in model.wv.index_to_key}
    tag_vectors = [tag_embeddings[tag] for tag in model.wv.index_to_key]
    tag_labels = list(tag_embeddings.keys())
    
    pca = PCA(n_components=2)
    pca_result = pca.fit_transform(tag_vectors)
    
    plt.figure(figsize=(12, 8))
    plt.scatter(pca_result[:, 0], pca_result[:, 1])
    
    for i, tag in enumerate(tag_labels):
        plt.annotate(tag, (pca_result[i, 0], pca_result[i, 1]))
    
    plt.xlabel('PCA Component 1')
    plt.ylabel('PCA Component 2')
    plt.title('Tag Embeddings Visualized with PCA')
    plt.show()
    

    This gives you a 2D scatter plot of the tags, and the plotted positions are your coordinates.

    Note that this involves a PCA. That makes sense here: the embeddings have 50 dimensions, so you need to reduce them to two in order to plot them, while keeping as much of the relationships between words as possible.
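
To make the PCA step less of a black box, here is a minimal sketch of what it does under the hood, assuming only NumPy is available. The helper name `pca_project` and the toy `points` are made up for illustration and are not part of scikit-learn; in practice `sklearn.decomposition.PCA` does this job for you.

```python
import numpy as np

def pca_project(vectors, n_components=2):
    """Project row vectors onto their top principal components (PCA via SVD)."""
    X = np.asarray(vectors, dtype=float)
    X_centered = X - X.mean(axis=0)  # PCA operates on mean-centred data
    # Rows of Vt are the principal directions, sorted by decreasing variance
    _, singular_values, Vt = np.linalg.svd(X_centered, full_matrices=False)
    coords = X_centered @ Vt[:n_components].T  # coordinates in the reduced space
    variance = singular_values ** 2
    explained_ratio = variance[:n_components] / variance.sum()
    return coords, explained_ratio

# Hypothetical toy data: four 3-D points that lie almost on a straight line
points = [[0.0, 0.0, 0.0], [1.0, 1.0, 0.1], [2.0, 2.0, -0.1], [3.0, 3.0, 0.0]]
coords, explained_ratio = pca_project(points, n_components=2)
print(coords.shape)       # one 2-D coordinate pair per input point
print(explained_ratio)    # share of the total variance kept by each component
```

The explained-variance ratio is worth checking on your own embeddings: if the first two components keep only a small share of the variance, the 2D plot hides most of the structure.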

    Another alternative is FastText, also from gensim.models:

    import pandas as pd
    import numpy as np
    from gensim.models import FastText
    from sklearn.decomposition import PCA
    import matplotlib.pyplot as plt
    
    basic_data = [
        {"label": "First-Person RPG", "Tags": ["Open-world", "Fantasy", "Adventure", "Single-player", "Exploration", "Dragons", "Crafting", "Magic", "Story-rich", "Moddable"]},
        {"label": "Action RPG", "Tags": ["Open-world", "Fantasy", "Story-rich", "Adventure", "Single-player", "Monsters", "Crafting", "Horse-riding", "Magic", "Narrative"]},
        {"label": "Adventure", "Tags": ["Difficult", "Dark Fantasy", "Action", "Single-player", "Exploration", "Lore-rich", "Combat", "Permadeath", "Monsters", "Atmospheric"]},
        {"label": "Party Game", "Tags": ["Multiplayer", "Social Deduction", "Indie", "Strategy", "Casual", "Space", "Deception", "Survival", "Teams", "Interactive"]}
    ]
    
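    # Treat each game's tag list as one "sentence"
    sentences = [data["Tags"] for data in basic_data]
    # FastText also learns from character n-grams, so similarly spelled tags end up related
    model = FastText(sentences, vector_size=50, window=3, min_count=1, sg=1, epochs=10)
    tag_embeddings = {tag: model.wv[tag] for tag in model.wv.key_to_index}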
    
    tag_vectors = np.array([tag_embeddings[tag] for tag in tag_embeddings])
    tag_labels = list(tag_embeddings.keys())
    
    pca = PCA(n_components=2)
    pca_result = pca.fit_transform(tag_vectors)
    
    plt.figure(figsize=(12, 8))
    plt.scatter(pca_result[:, 0], pca_result[:, 1])
    
    for i, tag in enumerate(tag_labels):
        plt.annotate(tag, (pca_result[i, 0], pca_result[i, 1]))
    
    plt.xlabel('PCA Component 1')
    plt.ylabel('PCA Component 2')
    plt.title('Tag Embeddings Visualized with PCA')
    plt.show()
    
    

    with tag_vectors given by

    array([[ 3.5692262e-03,  1.5384286e-03,  1.7154109e-03, ...,
             5.1359989e-04,  1.0005912e-03, -1.4637399e-03],
           [-3.5827586e-03,  2.6330323e-04,  2.4984824e-04, ...,
             2.1814678e-03, -7.5217336e-06,  2.8979264e-03],
           [-1.4136693e-03, -1.3430609e-03, -1.2442525e-03, ...,
             2.1025788e-03,  3.1783513e-04, -1.0448305e-05],
           ...,
           [-3.3974617e-03,  4.9481675e-04, -2.5317934e-04, ...,
            -1.1619454e-03,  1.1570274e-03, -2.4804280e-03],
           [ 1.7241882e-03,  9.6893904e-04, -2.9550551e-04, ...,
            -1.6130345e-04, -1.8300014e-03, -8.8712422e-04],
           [ 3.8428712e-04, -6.7049061e-04, -2.3678755e-03, ...,
             1.6739646e-03, -2.6099158e-03,  2.2148804e-03]], dtype=float32)
    


    There are of course other methods, and my tip to you would be to look for references on word embeddings and to understand dimensionality reduction techniques, such as principal component analysis.

    Note also that depending on the technique you choose, you'll get different results. Look into what these techniques actually do.
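
Coming back to the idea in the question of assigning a label "based on the direction": one possible reading, and it is only an assumption on my part, is cosine similarity, which compares the direction of two vectors and ignores their length. The sketch below uses plain multi-hot tag vectors (the same encoding MultiLabelBinarizer produces) rather than learned embeddings, so it stays deterministic. The helpers `to_vector`, `cosine` and `rank_labels` are hypothetical names for illustration, and the score is a similarity, not a calibrated confidence interval.

```python
import math

basic_data = [
    {"label": "First-Person RPG", "Tags": ["Open-world", "Fantasy", "Adventure", "Single-player", "Exploration", "Dragons", "Crafting", "Magic", "Story-rich", "Moddable"]},
    {"label": "Action RPG", "Tags": ["Open-world", "Fantasy", "Story-rich", "Adventure", "Single-player", "Monsters", "Crafting", "Horse-riding", "Magic", "Narrative"]},
    {"label": "Adventure", "Tags": ["Difficult", "Dark Fantasy", "Action", "Single-player", "Exploration", "Lore-rich", "Combat", "Permadeath", "Monsters", "Atmospheric"]},
    {"label": "Party Game", "Tags": ["Multiplayer", "Social Deduction", "Indie", "Strategy", "Casual", "Space", "Deception", "Survival", "Teams", "Interactive"]},
]

# One axis of the N-dimensional space per distinct tag
vocab = sorted({tag for row in basic_data for tag in row["Tags"]})

def to_vector(tags):
    """Multi-hot encoding: 1.0 on the axis of every tag that is present."""
    present = set(tags)
    return [1.0 if tag in present else 0.0 for tag in vocab]

def cosine(u, v):
    """Cosine of the angle between u and v (0.0 if either vector is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

label_vectors = {row["label"]: to_vector(row["Tags"]) for row in basic_data}

def rank_labels(tags):
    """Return (label, similarity) pairs, best match first."""
    query = to_vector(tags)
    scores = {label: cosine(query, vec) for label, vec in label_vectors.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical unlabeled game described by four tags
ranking = rank_labels(["Open-world", "Fantasy", "Dragons", "Magic"])
for label, score in ranking:
    print(f"{label}: {score:.3f}")
```

The same ranking function works unchanged if you swap the multi-hot vectors for averaged Word2Vec or FastText tag vectors, although with only four training "sentences" those embeddings will be close to random.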