I have a sequence of data from which I want to build n-grams. An excerpt of the sequence looks as follows.
8c b0 00 f0 05 fc 04 46 00 f0 fe fb 40 f2 00 05 c2 f2 00 05 28 78 00
I currently use nltk's ngrams()
function to build 4-grams from this data as
8c b0 00 f0
, b0 00 f0 05
,00 f0 05 fc
...etc., which just creates 4-grams by sliding one token at a time. However, instead of sliding by one, I need to slide by two tokens while creating the n-grams. So the expected output is 8c b0 00 f0
, 00 f0 05 fc
,05 fc 04 46
...etc. I searched but could not find a way to do this instead of shifting by one as I currently do. The following is the part of my code that does the current work:
import re
from nltk import ngrams

s = finalString.lower()  # finalString is defined earlier (the raw input)
s = re.sub(r'[^a-zA-Z0-9\s]', ' ', s)
tokens = [token for token in s.split(" ") if token != ""]
output = list(ngrams(tokens, 4))
You can use the following trick:
from nltk import ngrams

s = '8c b0 00 f0 05 fc 04 46 00 f0 fe fb 40 f2 00 05 c2 f2 00 05 28 78 00'
# Build all 4-grams, then keep every second one.
# The slice step (2) is the stride, i.e. how many tokens to slide each time.
output = list(ngrams(s.split(), 4))[::2]
[('8c', 'b0', '00', 'f0'), ('00', 'f0', '05', 'fc'), ('05', 'fc', '04', '46'), ('04', '46', '00', 'f0'), ('00', 'f0', 'fe', 'fb'), ('fe', 'fb', '40', 'f2'), ('40', 'f2', '00', '05'), ('00', '05', 'c2', 'f2'), ('c2', 'f2', '00', '05'), ('00', '05', '28', '78')]
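The slicing trick builds every 4-gram first and then throws half of them away. If that matters for long inputs, a minimal sketch that steps through the tokens directly might look like the following. Note that `strided_ngrams` and its `step` parameter are names made up for this illustration; they are not part of nltk.

```python
def strided_ngrams(tokens, n, step):
    """Return n-grams as tuples, moving `step` tokens between grams.

    Hypothetical helper for illustration; not an nltk function.
    """
    tokens = list(tokens)
    # The last valid start index is len(tokens) - n, so the range stops
    # at len(tokens) - n + 1. Trailing tokens that cannot form a full
    # n-gram at the given stride are dropped.
    return [tuple(tokens[i:i + n])
            for i in range(0, len(tokens) - n + 1, step)]


s = '8c b0 00 f0 05 fc 04 46 00 f0 fe fb 40 f2 00 05 c2 f2 00 05 28 78 00'
result = strided_ngrams(s.split(), 4, 2)
print(result[:3])
```

For this input it should produce the same tuples as the `[::2]` version, since both start a gram at every even token index.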