I have a giant list of words, corpus, and a particular word w. I know the index of every occurrence of w in the corpus. I want to look at a window of size n around every occurrence of w and create a dictionary of the other words that occur within that window. The dictionary is a mapping from int to list[str], where the key is how many positions away from the target word I am, either to the left (negative) or to the right (positive), and the value is a list of the words at that position.
For example, if I have the corpus ["I", "made", "burgers", "Jack", "made", "sushi"], my word is "made", and I am looking at a window of size 1, then I ultimately want to return {-1: ["I", "Jack"], 1: ["burgers", "sushi"]}.
There are two problems that can occur. My window may go out of bounds (for example, with a window of size 2 in the example above), and I can encounter the target word itself again inside the window. I want to ignore both cases. I have written the following code, which seems to work, but I want to make it cleaner.
    def find_neighbor(word: str, corpus: list[str], n: int = 1) -> dict[int, list[str]]:
        # One empty list per non-zero offset in the window.
        mapping = {k: [] for k in range(-n, n + 1) if k != 0}
        # Indices of every occurrence of the target word.
        idxs = [k for k, v in enumerate(corpus) if v == word]
        for idx in idxs:
            for i in mapping:
                if idx + i < 0:
                    # A negative index wraps around instead of raising IndexError.
                    continue
                try:
                    item = corpus[idx + i]
                    if item != word:
                        mapping[i].append(item)
                except IndexError:
                    continue
        return mapping
Is there a way to incorporate options and pattern matching so that I can remove the try block and have something like this?
    match corpus[idx + i]:
        case None: continue  # If it doesn't exist (out of bounds), continue (I could also break)
        case word: continue  # If it is the word itself, continue
        case _: mapping[i].append(corpus[idx + i])  # Otherwise, add it to the dictionary
Introduce a helper function that returns corpus[i] if i is a legal index and None otherwise:
    corpus = ["foo", "bar", "baz"]

    def get(i):
        # Check both bounds: a negative index would otherwise wrap around to the end of the list.
        return corpus[i] if 0 <= i < len(corpus) else None

    print([get(0), get(1), get(2), get(3)])
The result of the above is:
['foo', 'bar', 'baz', None]
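The lower bound is worth checking too, because Python accepts negative indices and counts them from the end of the list, so a window that runs off the left edge never raises IndexError. Here is a small sketch of that behavior; safe_get is a hypothetical name used only for illustration, and it takes the list as a parameter instead of reading a global:

```python
corpus = ["foo", "bar", "baz"]


def safe_get(seq, i):
    """Return seq[i] if i is a valid non-negative index, otherwise None."""
    return seq[i] if 0 <= i < len(seq) else None


# A negative index does not raise IndexError; it wraps around to the end.
wrapped = corpus[-1]

# With both bounds checked, positions left of the start come back as None.
results = [safe_get(corpus, i) for i in (-1, 0, 2, 3)]

print(wrapped)
print(results)
```

With the bounds check in place, -1 and 3 are both treated as out of range, while 0 and 2 return the first and last words.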
Now you can write:
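    match get(idx + i):
        case None: ...  # out of bounds
        case w if w == word: ...  # the target word itself
        case _: ...  # any other word
The guard (w if w == word) is needed because a bare case word: is a capture pattern: it matches any value and rebinds word rather than comparing against it.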