
Can I convert a StringDocument back into a string? (TextAnalysis.jl)


I'm making a spam classifier using a Naive Bayes Classifier model from the Julia TextAnalysis.jl package.

The text pre-processing functions (such as remove_corrupt_utf8!(sd), where sd is a StringDocument) can only be applied to the package's Document types, not to plain strings.

Is there any way to convert this StringDocument back into a string so that I can put it back into my dataframe?

Current code:

#global messageLis = []
for row in eachrow(data)
    message = row.v2
    #push!(messageLis, message)
    StringDoc = StringDocument(message)
    remove_corrupt_utf8!(StringDoc) # remove corrupt characters (if any) from the message so that the model doesn't fail
    # convert StringDoc back into a string so that the preprocessed text can be stored in the dataframe
end

Any help would be appreciated.


Solution

  • Use the text function to access the document's processed string:

    julia> str = StringDocument("here are some punctuations !!!...");
    
    julia> prepare!(str, strip_punctuation)
    
    julia> text(str)
    "here are some punctuations "
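
    Applied to the loop from the question, the same idea might look like the sketch below. The helper name clean_message and the sample messages are made up for illustration, and the vector stands in for the data.v2 column; only StringDocument, remove_corrupt_utf8! and text come from TextAnalysis.jl.

```julia
using TextAnalysis

# Hypothetical helper (not part of TextAnalysis.jl): wrap one message in a
# StringDocument, clean it in place, and return the cleaned text as a String.
function clean_message(message::AbstractString)
    doc = StringDocument(String(message))
    remove_corrupt_utf8!(doc)   # mutates the document's text in place
    return text(doc)            # pull the processed string back out
end

# Made-up stand-in for the data.v2 column from the question.
messages = ["Free entry in a weekly competition!", "Ok, see you at 5"]
cleaned = map(clean_message, messages)

# With the dataframe from the question, the column could presumably be
# overwritten the same way:
# data.v2 = map(clean_message, data.v2)
```

    Returning a plain String from the helper keeps the dataframe column as ordinary text, so the Document type only exists for the duration of the pre-processing step.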