Using coordinates for labelling? I was asked if it was possible and, to see for myself, I am trying to program it and read up more on it. I do not know what it is called or where to look, but the general idea is as follows:
Labels are converted into points in an N-dimensional space, and trajectories are calculated through that space. Based on the direction of a trajectory, a label is assigned together with a confidence interval.
The data:
basic_data = [
{"label":"First-Person RPG", "Tags":["Open-world", "Fantasy", "Adventure", "Single-player", "Exploration", "Dragons", "Crafting", "Magic", "Story-rich", "Moddable"]},
{"label":"Action RPG", "Tags":["Open-world", "Fantasy", "Story-rich", "Adventure", "Single-player", "Monsters", "Crafting", "Horse-riding", "Magic", "Narrative"]},
{"label":"Adventure", "Tags":["Difficult", "Dark Fantasy", "Action", "Single-player", "Exploration", "Lore-rich", "Combat", "Permadeath", "Monsters", "Atmospheric"]},
{"label":"Party Game", "Tags":["Multiplayer", "Social Deduction", "Indie", "Strategy", "Casual", "Space", "Deception", "Survival", "Teams", "Interactive"]}
]
The code for the first part is below:
from typing import List

import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()
# Join each game's tags into a single comma-separated string
for idx, data in enumerate(basic_data):
    basic_data[idx]["tag_str"] = ",".join(data["Tags"])
pd_basic_data: pd.DataFrame = pd.DataFrame(basic_data)
# Split the strings back into lists and one-hot encode the tags
tags: List = [str(pd_basic_data.loc[i, 'tag_str']).split(',') for i in range(len(pd_basic_data))]
mlb_result = mlb.fit_transform(tags)
df_final: pd.DataFrame = pd.concat([pd_basic_data['label'], pd.DataFrame(mlb_result, columns=list(mlb.classes_))], axis=1)
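For reference, once the snippet above has run, the resulting one-hot matrix can be inspected like this (just an illustrative print, nothing essential):

# Each row is one game; each tag column holds 1 if the game has that tag
print(df_final.set_index('label').loc['Party Game'])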
A simple one-word answer telling me the name of the theory works just as well. I just need to know where to look.
You are most probably referring to embeddings, combined with dimensionality reduction techniques such as PCA. These are often used in ML for tasks such as classification and clustering.
If I were you, I would investigate word embeddings first: Word2Vec, from the module gensim.models, is a really good candidate. Basically, it converts words into a continuous vector space while preserving contextual relationships.
Here is an example of how to do this:
import pandas as pd
from gensim.models import Word2Vec
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
basic_data = [
{"label": "First-Person RPG", "Tags": ["Open-world", "Fantasy", "Adventure", "Single-player", "Exploration", "Dragons", "Crafting", "Magic", "Story-rich", "Moddable"]},
{"label": "Action RPG", "Tags": ["Open-world", "Fantasy", "Story-rich", "Adventure", "Single-player", "Monsters", "Crafting", "Horse-riding", "Magic", "Narrative"]},
{"label": "Adventure", "Tags": ["Difficult", "Dark Fantasy", "Action", "Single-player", "Exploration", "Lore-rich", "Combat", "Permadeath", "Monsters", "Atmospheric"]},
{"label": "Party Game", "Tags": ["Multiplayer", "Social Deduction", "Indie", "Strategy", "Casual", "Space", "Deception", "Survival", "Teams", "Interactive"]}
]
# Each game's tag list acts as a "sentence" for Word2Vec
sentences = [data["Tags"] for data in basic_data]
model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, sg=1)

# One 50-dimensional vector per tag
tag_embeddings = {tag: model.wv[tag] for tag in model.wv.index_to_key}
tag_vectors = [tag_embeddings[tag] for tag in model.wv.index_to_key]
tag_labels = list(tag_embeddings.keys())

# Project the 50-dimensional vectors down to 2 dimensions for plotting
pca = PCA(n_components=2)
pca_result = pca.fit_transform(tag_vectors)

plt.figure(figsize=(12, 8))
plt.scatter(pca_result[:, 0], pca_result[:, 1])
for i, tag in enumerate(tag_labels):
    plt.annotate(tag, (pca_result[i, 0], pca_result[i, 1]))
plt.xlabel('PCA Component 1')
plt.ylabel('PCA Component 2')
plt.title('Tag Embeddings Visualized with PCA')
plt.show()
which gives you a scatter plot of the tags (and here you have your coordinates).
Note that this involves a PCA step. That makes sense, since you need to reduce dimensionality while keeping the contextual relationships between words.
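If you want a quick sense of how much structure the 2-D projection keeps, the fitted pca object exposes this directly (a minimal check, reusing the objects from the example above):

# Fraction of the total variance captured by each of the two components;
# low values mean the 2-D picture is only a rough summary of the 50-D space
print(pca.explained_variance_ratio_)
print(pca.explained_variance_ratio_.sum())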
Another alternative is FastText, also from gensim.models:
import pandas as pd
import numpy as np
from gensim.models import FastText
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
basic_data = [
{"label": "First-Person RPG", "Tags": ["Open-world", "Fantasy", "Adventure", "Single-player", "Exploration", "Dragons", "Crafting", "Magic", "Story-rich", "Moddable"]},
{"label": "Action RPG", "Tags": ["Open-world", "Fantasy", "Story-rich", "Adventure", "Single-player", "Monsters", "Crafting", "Horse-riding", "Magic", "Narrative"]},
{"label": "Adventure", "Tags": ["Difficult", "Dark Fantasy", "Action", "Single-player", "Exploration", "Lore-rich", "Combat", "Permadeath", "Monsters", "Atmospheric"]},
{"label": "Party Game", "Tags": ["Multiplayer", "Social Deduction", "Indie", "Strategy", "Casual", "Space", "Deception", "Survival", "Teams", "Interactive"]}
]
# Each game's tag list again acts as a "sentence"
sentences = [data["Tags"] for data in basic_data]
# FastText additionally learns character n-gram (subword) information
model = FastText(sentences, vector_size=50, window=3, min_count=1, sg=1, epochs=10)
tag_embeddings = {tag: model.wv[tag] for tag in model.wv.key_to_index}
tag_vectors = np.array([tag_embeddings[tag] for tag in tag_embeddings])
tag_labels = list(tag_embeddings.keys())
pca = PCA(n_components=2)
pca_result = pca.fit_transform(tag_vectors)
plt.figure(figsize=(12, 8))
plt.scatter(pca_result[:, 0], pca_result[:, 1])
for i, tag in enumerate(tag_labels):
    plt.annotate(tag, (pca_result[i, 0], pca_result[i, 1]))
plt.xlabel('PCA Component 1')
plt.ylabel('PCA Component 2')
plt.title('Tag Embeddings Visualized with PCA')
plt.show()
with tag_vectors given by:
array([[ 3.5692262e-03, 1.5384286e-03, 1.7154109e-03, ...,
5.1359989e-04, 1.0005912e-03, -1.4637399e-03],
[-3.5827586e-03, 2.6330323e-04, 2.4984824e-04, ...,
2.1814678e-03, -7.5217336e-06, 2.8979264e-03],
[-1.4136693e-03, -1.3430609e-03, -1.2442525e-03, ...,
2.1025788e-03, 3.1783513e-04, -1.0448305e-05],
...,
[-3.3974617e-03, 4.9481675e-04, -2.5317934e-04, ...,
-1.1619454e-03, 1.1570274e-03, -2.4804280e-03],
[ 1.7241882e-03, 9.6893904e-04, -2.9550551e-04, ...,
-1.6130345e-04, -1.8300014e-03, -8.8712422e-04],
[ 3.8428712e-04, -6.7049061e-04, -2.3678755e-03, ...,
1.6739646e-03, -2.6099158e-03, 2.2148804e-03]], dtype=float32)
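One practical difference worth noting: because FastText builds vectors from character n-grams, it can still produce a vector for a tag it never saw during training (a quick check, reusing the FastText model above; "Spellcrafting" is just a made-up out-of-vocabulary tag):

# "Spellcrafting" is not in the training tags, but FastText synthesizes
# a vector for it from its character n-grams
oov_vector = model.wv["Spellcrafting"]
print(oov_vector.shape)  # -> (50,)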
There are of course other methods; my tip would be to look for references on word embeddings and to understand dimensionality reduction techniques such as principal component analysis.
Note also that depending on the technique you choose, you will get different results, so look into what these techniques actually do.
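To connect this back to your original idea of assigning labels by direction: once each tag has a vector, you can average a game's tag vectors into a single point and score it against each known label by cosine similarity, which measures direction. A minimal sketch, assuming the Word2Vec model and basic_data from the first example are in scope (game_vector, cosine, and the new_tags example are illustrative names, not a standard API):

import numpy as np

def game_vector(tags, model):
    # Average the tag vectors to get one point per game;
    # skip tags the model has never seen
    vectors = [model.wv[t] for t in tags if t in model.wv]
    return np.mean(vectors, axis=0)

def cosine(a, b):
    # Cosine similarity: 1.0 means same direction, 0.0 means orthogonal
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# One reference point ("centroid") per known label
centroids = {d["label"]: game_vector(d["Tags"], model) for d in basic_data}

# Score a new tag set against every label and pick the closest direction
new_tags = ["Fantasy", "Magic", "Crafting"]
v = game_vector(new_tags, model)
scores = {label: cosine(v, c) for label, c in centroids.items()}
print(max(scores, key=scores.get), scores)

The similarity scores here are not calibrated probabilities; if you need a proper confidence interval, you would have to look into probabilistic classifiers on top of the embeddings.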