I have two lists: 1. a list of IPA symbols, M, and 2. a list of single words, N.
Now I need to create a third list X of shape [len(N), len(M)], where for each word I assign 1 at the position of every IPA symbol found in that word and 0 otherwise. For example, if M = ['ɓ', 'u', 'l', 'i', 'r', 't', 'ə', 'w', 'a', 'b'] and, for simplicity, N has only two words, N = ['ɓuli', 'rutə'], then the output should look like X = [[1,1,1,1,0,0,0,0,0,0], [0,1,0,0,1,1,1,0,0,0]]
So it's a kind of co-occurrence matrix, but simpler, because I do not need to count how many times a symbol occurs in a word. I just need to set the entry in X to 1 at the proper position when a symbol occurs in a word. Maybe I am overthinking this, but I can't seem to find a way to keep track of the indices of both lists. Here is my code snippet:
import numpy as np

M = ['ɓ', 'u', 'l', 'i', 'r', 't', 'ə', 'w', 'a', 'b']
N = ['ɓuli', 'rutə']
X = np.zeros((len(N), len(M)))
for n_idx in range(len(N)):
    print('Current word index', n_idx)
    for symbol in N[n_idx]:
        if symbol in M:
            print(symbol, 'found, at word index', n_idx, ', and symbol index')
            # if found, then add to X at the proper position
# Expected result
X = [[1,1,1,1,0,0,0,0,0,0],
     [0,1,0,0,1,1,1,0,0,0]]
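You can build such a matrix with this line:
X = [[1 if e in s else 0 for e in M] for s in N]
This is a nested list comprehension: the outer loop runs over the words in N, and the inner loop runs over the symbols in M. For larger datasets, however, a library such as scikit-learn will perform this kind of operation more efficiently, e.g. CountVectorizer (https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html), which supports character-level analysis (analyzer='char') and presence/absence output (binary=True).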
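If you would rather keep the explicit loops from the question, the missing piece is a way to get from a symbol to its column. Here is a minimal sketch, assuming every IPA symbol in M is a single character and that plain Python lists are acceptable in place of the NumPy array. The names `col`, `i` and `j` are introduced here for illustration only.

```python
M = ['ɓ', 'u', 'l', 'i', 'r', 't', 'ə', 'w', 'a', 'b']
N = ['ɓuli', 'rutə']

# Map each symbol to its column index once, so each lookup is a dict access
# instead of a scan through M.
col = {symbol: j for j, symbol in enumerate(M)}

# One row per word, one column per symbol, all initialised to 0.
X = [[0] * len(M) for _ in N]

# enumerate() gives the row index i alongside each word.
for i, word in enumerate(N):
    for symbol in word:
        j = col.get(symbol)  # None if the symbol is not in M
        if j is not None:
            X[i][j] = 1

print(X)
```

For the two example words this gives the same rows as the one-line comprehension. Iterating over a word character by character will not match a symbol written with a combining diacritic or a tie bar, so in that case the words would need to be split into symbols first.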