I'm making a spam classifier using a Naive Bayes Classifier model from the Julia TextAnalysis.jl package.
The text pre-processing functions (like remove_corrupt_utf8!(sd), where sd is a StringDocument) can only be applied to the package's own Document types and not to a plain String.
Is there any way I can convert this StringDocument back into a String so I can put it back into my DataFrame?
Current code:
#global messageLis = []
for row in eachrow(data)
    message = row.v2
    #push!(messageLis, message)
    StringDoc = StringDocument(message)
    remove_corrupt_utf8!(StringDoc) # remove corrupt characters (if any) from the message so that the model doesn't fail
    # convert StringDoc back into a string so that the text is preprocessed in the dataframe itself
end
Any help would be appreciated.
Use text to access the processed string:
julia> str = StringDocument("here are some punctuations !!!...");
julia> prepare!(str, strip_punctuation)
julia> text(str)
"here are some punctuations "
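Applied to the loop in the question, this could look like the sketch below. It is only a sketch under some assumptions: the two-row data frame is made up for illustration (only the column name v2 comes from the question), clean_message is a hypothetical helper name, and the strip_punctuation step is optional and taken from the example above.

```julia
using DataFrames
using TextAnalysis

# Made-up stand-in for the real dataset; only the column name `v2` is from the question.
data = DataFrame(v2 = ["Free entry!!! Win now...", "Are we still meeting at 5?"])

# Hypothetical helper: wrap the string in a StringDocument, preprocess it,
# and return the cleaned text as a plain String again.
function clean_message(message::AbstractString)
    doc = StringDocument(String(message))
    remove_corrupt_utf8!(doc)          # same call as in the question
    prepare!(doc, strip_punctuation)   # optional extra step, as in the example above
    return text(doc)                   # back to a String
end

# Broadcast over the column and overwrite it with the cleaned strings.
data.v2 = clean_message.(data.v2)
```

Returning text(doc) from a small function keeps the StringDocument as a temporary, so the data frame only ever holds ordinary strings.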