How do I generate bigrams using LibreOffice Basic?
I can do that in Python like this...
import nltk, sys
from nltk.tokenize import word_tokenize
sys.stdout = open("mygram1.txt", "w")
with open("mytext.txt") as f:
    for text in f:
        tokens = nltk.word_tokenize(text)
        bigrm = (nltk.bigrams(tokens))
        print(*map(' '.join, bigrm), sep='\n')
But I need a macro that I can run in LibreOffice Writer. I do not want to use Python.
Update:
Just like bigrams, NLTK has a trigrams method that I can call using nltk.trigrams. And if I need four- or five-grams, there is everygrams:
from nltk import everygrams
import nltk, sys
from nltk.tokenize import word_tokenize
sys.stdout = open("myfourgram1.txt", "w")
with open("/home/ubuntu/mytext.txt") as f:
    for text in f:
        tokens = nltk.word_tokenize(text)
        for i in list(everygrams(tokens, 4, 4)):
            print((" ".join(i)))
Is this possible in LibreOffice Basic?
You could replicate the behaviour of your Python code by recycling the code in my answer to your previous question (Can you Print the wavy lines generated by Spell check in writer?). First strip out everything relating to spell checking, generating alternatives and sorting, which makes it considerably shorter, then change the line that inserts the results into the new document so that it inserts pairs of words. Rather than having your input text in a .txt file, you would have to put it into a Writer document, and the results would appear in a new Writer document.
It should look something like the listing below, which also includes the subsidiary function IsWordSeparator().
Option Explicit
Sub ListBigrams
    Dim oSource As Object
    oSource = ThisComponent
    Dim oSourceCursor As Object
    oSourceCursor = oSource.getText.createTextCursor()
    oSourceCursor.gotoStart(False)
    oSourceCursor.collapseToStart()
    Dim oDestination As Object
    oDestination = StarDesktop.loadComponentFromURL( "private:factory/swriter", "_blank", 0, Array() )
    Dim oDestinationText As Object
    oDestinationText = oDestination.getText()
    Dim oDestinationCursor As Object
    oDestinationCursor = oDestinationText.createTextCursor()
    Dim s As String, sParagraph As String, sPreviousWord As String, sThisWord As String
    Dim i As Long, j As Long, nWordStart As Long, nWordEnd As Long, nChar As Long
    Dim bFirst As Boolean
    sPreviousWord = ""
    bFirst = True
    Do
        oSourceCursor.gotoEndOfParagraph(True)
        'It is necessary to add a space to the end of the string,
        'otherwise the last word of the paragraph is not recognised.
        sParagraph = oSourceCursor.getString() & " "
        nWordStart = 1
        nWordEnd = 1
        For i = 1 To Len(sParagraph)
            nChar = Asc(Mid(sParagraph, i, 1))
            If IsWordSeparator(nChar) Then '1
                If nWordEnd > nWordStart Then '2
                    sThisWord = Mid(sParagraph, nWordStart, nWordEnd - nWordStart)
                    If bFirst Then
                        bFirst = False
                    Else
                        oDestinationText.insertString(oDestinationCursor, sPreviousWord & " " & sThisWord & Chr(13), False)
                    End If
                    sPreviousWord = sThisWord
                End If '2
                nWordEnd = nWordEnd + 1
                nWordStart = nWordEnd
            Else
                nWordEnd = nWordEnd + 1
            End If '1
        Next i
    Loop While oSourceCursor.gotoNextParagraph(False)
End Sub
'----------------------------------------------------------------------------
' OOME Listing 360.
Function IsWordSeparator(iChar As Long) As Boolean
    ' Horizontal tab     \t    9
    ' New line           \n   10
    ' Carriage return    \r   13
    ' Space                   32
    ' Non-breaking space     160
    Select Case iChar
        Case 9, 10, 13, 32, 160
            IsWordSeparator = True
        Case Else
            IsWordSeparator = False
    End Select
End Function
Even though it might be easier to do this in Python, as Jim K suggested, the Basic approach makes it easier to distribute the functionality to users, since they would not have to install Python and the NLTK library (which is not straightforward).
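If you want to check the macro's output against a reference without installing NLTK, the same scan can be sketched in plain Python. This is only an illustrative sketch and not part of the macro: the names split_words and ngrams are hypothetical, the separator set is copied from IsWordSeparator(), and the window size n is an assumption added here to cover the trigram and four-gram cases from the update to the question. Unlike nltk.word_tokenize, it does not split punctuation from words, and like the macro it lets word groups run across paragraph boundaries.

```python
from collections import deque

# Same separator set as IsWordSeparator(): tab, LF, CR, space, non-breaking space.
SEPARATORS = {9, 10, 13, 32, 160}


def split_words(paragraph):
    """Split a paragraph on the separator characters, one character at a time."""
    words, current = [], []
    # As in the macro, a trailing space makes sure the last word is flushed.
    for ch in paragraph + " ":
        if ord(ch) in SEPARATORS:
            if current:
                words.append("".join(current))
                current = []
        else:
            current.append(ch)
    return words


def ngrams(paragraphs, n=2):
    """Yield groups of n consecutive words as space-joined strings.

    The window is not reset between paragraphs, which mirrors how
    sPreviousWord is carried over in the macro.
    """
    window = deque(maxlen=n)  # keeps only the last n words
    for paragraph in paragraphs:
        for word in split_words(paragraph):
            window.append(word)
            if len(window) == n:
                yield " ".join(window)


if __name__ == "__main__":
    for gram in ngrams(["the quick brown", "fox jumps"], n=2):
        print(gram)
```

With n=2 this should produce the same pairs as ListBigrams for the same paragraphs. The analogous change in Basic would presumably be to keep the last n words in an array instead of the single sPreviousWord, but that variant is untested here.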