python scikit-learn string-concatenation

Concatenate the single characters in texts


I have a list of company names, some of which contain abbreviations. For example:

compNames = ['Costa Limited', 'D B M LTD']

I need to convert compNames to a matrix of token counts using the following, but the output has no columns for B, D, and M from 'D B M LTD':

from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer(analyzer='word')
count_vect.fit_transform(compNames).toarray()
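For context, the missing columns most likely come from CountVectorizer's default tokenization. As far as I know, it lowercases the text and uses the token pattern `(?u)\b\w\w+\b`, which only keeps tokens of two or more word characters. The sketch below imitates that default with the standard `re` module; `default_tokens` is a hypothetical helper for illustration, not part of scikit-learn.

```python
import re

# Assumption: this mirrors CountVectorizer's default token_pattern,
# which requires at least two word characters per token.
DEFAULT_TOKEN_PATTERN = r"(?u)\b\w\w+\b"


def default_tokens(text):
    """Approximate CountVectorizer's default word tokenization (hypothetical helper)."""
    # The vectorizer lowercases by default, so do the same here.
    return re.findall(DEFAULT_TOKEN_PATTERN, text.lower())


comp_names = ['Costa Limited', 'D B M LTD']
for name in comp_names:
    # Single letters such as D, B and M never become tokens.
    print(name, '->', default_tokens(name))
```

If separate columns for the single letters were wanted instead, passing a looser pattern such as `token_pattern=r"(?u)\b\w+\b"` to CountVectorizer should keep them, but that yields d, b and m as three tokens rather than one dbm token.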

What is the best way to concatenate the single characters in a text?

For example, 'D B M LTD' should become 'DBM LTD'.

Solution

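  • import re
    string = 'D B M LTD'
    # Inner sub: pad every word of 2+ characters with an extra leading space.
    # Outer sub: drop the single space that follows any non-space character.
    print(re.sub(r"([^ ]) ", r"\1", re.sub(r" ([^ ]{2,})", r"  \1", string)))  # DBM LTD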
    

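    Awkward, but it should work. The inner substitution introduces an additional space in front of LTD (and in front of any other word of two or more characters). The outer substitution then removes the single space that follows any non-space character, replacing "D " with "D", "B " with "B" and so on, which leaves exactly one space in front of LTD.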
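One caveat with the nested substitution: when the single letters come after a longer word, the word is merged as well ('Costa D B M' becomes 'CostaDBM'). A possible alternative, offered as a minimal sketch rather than a definitive fix, is a single substitution with word boundaries and a lookahead. `join_single_chars` is a hypothetical helper name, and the sketch assumes the letters are separated by exactly one space.

```python
import re


def join_single_chars(text):
    """Join runs of single characters separated by one space (hypothetical helper)."""
    # \b(\w)      a word character that starts a word
    # " "         followed by one space
    # (?=\w\b)    and then another one-character word (lookahead, not consumed)
    return re.sub(r"\b(\w) (?=\w\b)", r"\1", text)


comp_names = ['Costa Limited', 'D B M LTD', 'Costa D B M']
# Multi-character words are left alone; only the single letters are joined.
print([join_single_chars(name) for name in comp_names])
```

Because `\w` also matches digits and underscores, a name like 'A 1 LTD' would be joined to 'A1 LTD'; whether that is desirable depends on the data. The cleaned names can then be passed to fit_transform in place of compNames.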