Search code examples
google-apps-scriptgoogle-sheetsnlpgoogle-sheets-formulastop-words

In Google Sheets remove serie of ngrams from cells containing lists of comma separated ngrams in primary sheet


Have been working in Google Sheets on a general table containing approximately a thousand texts. In one column derived form the column containing the texts in their original "written" form, are ngrams (words and the like) extracted from them, and listed in alphabetic order, one list of ngrams corresponding to each text. I’ve been trying without success to derive a second column, from these lists of such ngrams, from which I want to remove instances of certain ngrams of which I have a list (a long list, hundreds of ngrams, and a list to which I could make additions later). In other words, from the text mining vocabulary, I want to remove stop words from lists of tokens.

enter image description here

I tried with SPLIT and REGEXREPLACE functions, or a combination of both, but with no success.

=JOIN(",";SORT(TRANSPOSE(SPLIT(REGEXREPLACE(AL3;"\bau\b;\baux\b;\bavec\b;\bce\b;\bces\b;\bdans\b;\bde\b;\bdes\b;\bdu\b;\belle\b;\ben\b;\bet\b;\beux\b;\bil\b;\bje\b;\bla\b;\ble\b;\bleur\b;\blui\b;\bma\b;\bmais\b;\bme\b;\bmême\b;\bmes\b;\bmoi\b;\bmon\b;\bne\b;\bnos\b;\bnotre\b;\bnous\b;\bon\b;\bou\b;\bpar\b;\bpas\b;\bpour\b;\bqu\b;\bque\b;\bqui\b;\bsa\b;\bse\b;\bses\b;\bson\b;\bsur\b;\bta\b;\bte\b;\btes\b;\btoi\b;\bton\b;\btu\b;\bun\b;\bune\b;\bvos\b;\bvotre\b;\bvous\b;\bc\b;\bd\b;\bj\b;\bl\b;\bà\b;\bm\b;\bn\b;\bs\b;\bt\b;\by\b;\bété\b;\bétée\b;\bétées\b;\bétés\b;\bétant\b;\bsuis\b;\bes\b;\best\b;\bsommes\b;\bêtes\b;\bsont\b;\bserai\b;\bseras\b;\bsera\b;\bserons\b;\bserez\b;\bseront\b;\bserais\b;\bserait\b;\bserions\b;\bseriez\b;\bseraient\b;\bétais\b;\bétait\b;\bétions\b;\bétiez\b;\bétaient\b;\bfus\b;\bfut\b;\bfûmes\b;\bfûtes\b;\bfurent\b;\bsois\b;\bsoit\b;\bsoyons\b;\bsoyez\b;\bsoient\b;\bfusse\b;\bfusses\b;\bfût\b;\bfussions\b;\bfussiez\b;\bfussent\b;\bayant\b;\beu\b;\beue\b;\beues\b;\beus\b;\bai\b;\bas\b;\bavons\b;\bavez\b;\bont\b;\baurai\b;\bauras\b;\baura\b;\baurons\b;\baurez\b;\bauront\b;\baurais\b;\baurait\b;\baurions\b;\bauriez\b;\bauraient\b;\bavais\b;\bavait\b;\bavions\b;\baviez\b;\bavaient\b;\beut\b;\beûmes\b;\beûtes\b;\beurent\b;\baie\b;\baies\b;\bait\b;\bayons\b;\bayez\b;\baient\b;\beusse\b;\beusses\b;\beût\b;\beussions\b;\beussiez\b;\beussent\b;\bceci\b;\bcela\b;\bcelà\b;\bcet\b;\bcette\b;\bici\b;\bils\b;\bles\b;\bleurs\b;\bquel\b;\bquels\b;\bquelle\b;\bquelles\b;\bsans\b;\bsoi\b";"");" ")));"")

Dumky here has done something neat, writing a script with custom functions that does a couple of things, among them calling a list from a secondary sheet, to clean such lists of ngrams-words-tokens, but what he has done exceed both what I want to do, and what I can do in Google Sheets on my own.

An exemple of a list of stop words (ngrams to remove): à,ai,aie,aient,aies,ait,as,au,aura,aurai,auraient,aurais,aurait,auras,aurez,auriez,aurions,aurons,auront,aux,avaient,avais,avait,avec,avez,aviez,avions,avons,ayant,ayez,ayons,c,ce,ceci,cela,celà,ces,cet,cette,d,dans,de,des,du,elle,en,es,est,et,étaient,étais,était,étant,été,étée,étées,êtes,étés,étiez,étions,eu,eue,eues,eûmes,eurent,eus,eusse,eussent,eusses,eussiez,eussions,eut,eût,eûtes,eux,fûmes,furent,fus,fusse,fussent,fusses,fussiez,fussions,fut,fût,fûtes,ici,il,ils,j,je,l,la,le,les,leur,leurs,lui,m,ma,mais,me,même,mes,moi,mon,n,ne,nos,notre,nous,on,ont,ou,par,pas,pour,qu,que,quel,quelle,quelles,quels,qui,s,sa,sans,se,sera,serai,seraient,serais,serait,seras,serez,seriez,serions,serons,seront,ses,soi,soient,sois,soit,sommes,son,sont,soyez,soyons,suis,sur,t,ta,te,tes,toi,ton,tu,un,une,vos,votre,vous,y

An exemple of the output I am looking for —

List of ngrams derived from a short text: 10,11,21,à,à,ami,ami,amour,aux,baiser,brusko,brusko,car,ce,comme,comme,compagnon,de,de,déclare,déclare,déteste,électricité,elle,en,est,est,est,et,étudiant,fluide,gare,germes,heures,idéalistes,il,infectueux,je,l,l,la,la,le,lumière,médecine,meilleur,moi,moi,mon,montréal,pée,pensez,poète,que,que,qui,rancune,répondez,sans,science,ses,sic,un,un,vébicule,vôtre,vous

Output: ami,ami,amour,baiser,brusko,brusko,car,comme,comme,compagnon,déclare,déclare,déteste,électricité,étudiant,fluide,gare,germes,heures,idéalistes,infectueux,lumière,médecine,meilleur,montréal,pée,pensez,poète,rancune,répondez,science,sic,vébicule,vôtre

enter image description here


Solution

  • I'm not sure if I understand you correctly. If you want to remove some words from some string then basically it can be done this way:

    var txt = "a,b,c,d,aa,bb,cc,dd";
    
    var to_remove = "a,bb,d,dd".split(',');
    
    results = txt.split(',').filter(word => !to_remove.includes(word)).join();
    
    console.log(results); //output: 'b,c,aa,cc'

    But there are need additional steps to make it works with a spreadsheet data of course.

    Update

    The same code with the test data:

    var txt = "10,11,21,à,à,ami,ami,amour,aux,baiser,brusko,brusko,car,ce,comme,comme,compagnon,de,de,déclare,déclare,déteste,électricité,elle,en,est,est,est,et,étudiant,fluide,gare,germes,heures,idéalistes,il,infectueux,je,l,l,la,la,le,lumière,médecine,meilleur,moi,moi,mon,montréal,pée,pensez,poète,que,que,qui,rancune,répondez,sans,science,ses,sic,un,un,vébicule,vôtre,vous";
    
    var to_remove = "à,ai,aie,aient,aies,ait,as,au,aura,aurai,auraient,aurais,aurait,auras,aurez,auriez,aurions,aurons,auront,aux,avaient,avais,avait,avec,avez,aviez,avions,avons,ayant,ayez,ayons,c,ce,ceci,cela,celà,ces,cet,cette,d,dans,de,des,du,elle,en,es,est,et,étaient,étais,était,étant,été,étée,étées,êtes,étés,étiez,étions,eu,eue,eues,eûmes,eurent,eus,eusse,eussent,eusses,eussiez,eussions,eut,eût,eûtes,eux,fûmes,furent,fus,fusse,fussent,fusses,fussiez,fussions,fut,fût,fûtes,ici,il,ils,j,je,l,la,le,les,leur,leurs,lui,m,ma,mais,me,même,mes,moi,mon,n,ne,nos,notre,nous,on,ont,ou,par,pas,pour,qu,que,quel,quelle,quelles,quels,qui,s,sa,sans,se,sera,serai,seraient,serais,serait,seras,serez,seriez,serions,serons,seront,ses,soi,soient,sois,soit,sommes,son,sont,soyez,soyons,suis,sur,t,ta,te,tes,toi,ton,tu,un,une,vos,votre,vous,y".split(',');
    
    results = txt.split(',').filter(word => !to_remove.includes(word)).join();
    
    console.log(results);

    Here is the simple script that goes through all rows (starts from second row), takes the stop words from column B, text from column C, and puts result into column D:

    function myFunction() {
      var sheet    = SpreadsheetApp.getActiveSheet();
      var last_row = sheet.getLastRow();
      var range    = sheet.getRange('B2:C' + last_row);
      var data     = range.getValues();
      var output   = [];
    
      for (row of data) {
        var stop_words = row[0].split(',');
        var txt        = row[1].split(',');
        var result     = txt.filter(word => !stop_words.includes(word)).join();
        output.push([result]);
      }
    
      sheet.getRange('D2:D'+last_row).setValues(output);
    }
    

    If the stop words are the same for every row the script can be even simpler shorter and faster:

    function myFunction2() {
      var sheet      = SpreadsheetApp.getActiveSheet();
      var last_row   = sheet.getLastRow();
      var stop_words = sheet.getRange('B2').getValue().split(','); // get stop words from B2
      var data       = sheet.getRange('C2:C' + last_row).getValues().flat();
      
      var output = data.map(row => 
        [row.split(',').filter(word => !stop_words.includes(word)).join()]
      );
    
      sheet.getRange('D2:D'+last_row).setValues(output);
    }
    

    Update 2

    If you want a custom function it's here:

    function NGRAMSFILTER(stop_words, ngrams) {
      var stop_words = stop_words.split(',');
      return ngrams.split(',').filter(word => !stop_words.includes(word)).join();
    }
    

    It takes a cell with stop words and cells with "ngrams" and returns the the "ngrams" that aren't among the stop words.

    Update 3

    Here is the custom function that collect all 'stop words' from column B:

    function NGRAMSFILTER_ALL(ngrams) {
      var sheet      = SpreadsheetApp.getActiveSheet();
      var last_row   = sheet.getLastRow();
      var stop_words = sheet.getRange('B2:B' + last_row).getValues().flat().join(',').split(',');
      return ngrams.split(',').filter(word => !stop_words.includes(word)).join();
    }
    

    Technically it works. But actually it could froze quite often if there are too much words and rows. And I see no way to speed it up. I'd advice to use a static script for such task.