Search code examples
c++arraysstringtext-search

Search a string for any occurence of a word in a list of strings


I want to know how, in C++, to search a string for the first instance of ANY of a list of strings. A kind of full-word version of std::string::find_first_of(): "Searches the string for the first character that matches any of the characters specified in its arguments".

I want something that will search the string for the first WORD that matches any of the words in a provided list/array. To be clear, I don't want to search an array for an instance of a string. I want to search a string, for an instance of something in an array.

My goal is to be able to take a sentence, and remove all words that I have in a list. For example, if I give it the list {"the" "brown", "over"}; and the sentence, "the quick brown fox jumped over the lazy dog", I want it to output, " quick fox jumped lazy dog". And I want to be able to give it a list of even 100 words if I want; I need this to be expandable.

The only solution I can think of is to use std::find(stringArray[0]) in a while loop on my block of text, and save the indexes that that word is found at, then put all that in another for loop and do that on every single word in my array, saving the indexes of each word into one huge list. Optionally then numerically sorting that list, and finally then going through and removing each word that is at a position in that list.

I'm really hoping there's a function or an easier way to do it, because my solution seems difficult and VERY slow, especially since I need to use it many times, on many different strings, to go through all the sentences of a 50,000 character block of text. Anything better optimized would be preferred.


Solution

  • If you look for standard functions, there are somme possiblities if you dare to store your sentences as a container of strings:

    string input="Hello, world ! I whish you all \na happy new year 2016 !";
    vector<string> sentence; 
    
    stringstream sst(input);    // split the string into its pieces 
    string tmp; 
    while (sst>>tmp) 
        sentence.push_back(tmp); 
    

    Of course, in the real world you would do the split not just on whitespaces, but also on punctuations. This is just a proof of concept.

    Once you have it in this form, it's easy to use the <algorithm> form of find_first_of():

    vector<string> search{"We", "You", "I"}; 
    auto it =  find_first_of(sentence.begin(), sentence.end(), 
                               search.begin(), search.end()); 
    
                               // display remaining of the sentence
    copy(it , sentence.end(), ostream_iterator<string>(cout,"/"));    
    cout<<endl;     
    

    And deleting words from a vector shouldn't then be anymore an issue. I let it to you as an exercise.

    Once you have your cleaned vector you can rebuild a string:

    stringstream so;
    copy(it , sentence.end(), ostream_iterator<string>(so," ")); 
    string result = so.str(); 
    

    Here an online demo.

    This solution won't address all your performance issues however. For this you need to analyse further where your performance bottleneck comes from: do you make a lot of unnecessary copies of objects ? Is it that your own algorithm triggers a lot of inefficient memory allocations ? Or is it really the sheer volume of text ?

    Some ideas for further work:

    • build an alphabetical index to the words in the sentence (map> where the unsigned
    • consider a trie data structure (trie and not tree !!)
    • Use regular expressions in <regex>