
What's the best way to search and replace all for multiple terms?


I have a naive function that removes HTML entities, but it does a full string search for each entity. What is the best way to do a multi-string search and replace?

string replace_entities(ref string x){
  return sanitize(x).replace("’","'").replace("‘","'").replace("&#39;","'").replace("–","-").replace("—","-")
    .replace("“","\"").replace("”","\"").replace("”","\"").replace("&apos;","'")
    .replace("&amp;", "&").replace("&ndash","-").replace("&mdash","-").replace("&quot;", "\"").strip();
}

Solution

  • You can try a regex. I made a full example focused on performance :)

    import std.stdio : writeln;
    import std.algorithm : reduce, find;
    import std.regex : ctRegex, Captures, replaceAll;   
    
    /*
    Compile time conversion table:
    ["from", "to"]
    */
    enum HTMLEntityTable = [
        ["’"  ,"'"  ],
        ["‘"  ,"'"  ],
        ["&#39;"   ,"'"  ],
        ["–"  ,"-"  ],
        ["—"  ,"-"  ],
        ["“"  ,"\"" ],
        ["”"  ,"\"" ],
        ["”"  ,"\"" ],
        ["&apos;"    ,"'"  ],
        ["&amp;"    ,"&"  ],
        ["&ndash"   ,"-"  ],
        ["&mdash"   ,"-"  ],
        ["&quot;"   ,"\"" ]
    ];
    
    /*
    Compile time Regex String:
    Use reduce to join the "from" column (index 0) of HTMLEntityTable into "’|‘|..."
    */
    enum regex_replace = ctRegex!( 
        reduce!((a, b)=>a~"|"~b[0])(HTMLEntityTable[0][0],HTMLEntityTable[1..$]) 
    );
    
    /*
    Replace Function:
    Find the matched string in HTMLEntityTable and return its replacement.
    (Maybe I should make HTMLEntityTable an associative array,
     but I think this way is faster.)
    */
    auto HTMLReplace(Captures!string str){      
        return HTMLEntityTable.find!(a=>a[0] == str.hit)[0][1];
    }
    
    //User Function.
    auto replace_entities( ref string html ){   
        return replaceAll!HTMLReplace( html, regex_replace);
    }
    
    void main(){
        auto html = "Start’‘&#39;–—“””&apos;&amp;&ndash&mdash&quot;End";
        replace_entities( html ).writeln;
        //Output:
        //Start'''--"""'&--"End
    }
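
If you would rather avoid regex altogether, the same single-pass idea can be sketched by hand: walk the string once and only consult the table when you hit a `&`. This is a minimal sketch, not a tuned implementation. The `Entity` struct, the `entities` table and the name `replaceEntitiesOnePass` are all made up for illustration, and the table only covers a few `&`-prefixed entities.

```d
import std.stdio : writeln;

// Hypothetical table entry for this sketch: entity name and its replacement.
struct Entity
{
    string name;
    string replacement;
}

// Assumed sample table. When one name is a prefix of another
// ("&ndash" vs "&ndash;"), the longer one is listed first so it wins.
immutable Entity[] entities = [
    Entity("&ndash;", "-"),
    Entity("&ndash", "-"),
    Entity("&mdash;", "-"),
    Entity("&mdash", "-"),
    Entity("&quot;", "\""),
    Entity("&#39;", "'"),
    Entity("&amp;", "&"),
];

string replaceEntitiesOnePass(string input)
{
    string output;
    output.reserve(input.length); // every replacement here is shorter than its entity

    size_t i = 0;
    while (i < input.length)
    {
        size_t consumed = 0;

        // Every entity in the table starts with '&', so skip the lookup otherwise.
        if (input[i] == '&')
        {
            foreach (e; entities)
            {
                immutable end = i + e.name.length;
                if (end <= input.length && input[i .. end] == e.name)
                {
                    output ~= e.replacement;
                    consumed = e.name.length;
                    break;
                }
            }
        }

        if (consumed == 0)
        {
            output ~= input[i]; // copy one UTF-8 code unit unchanged
            consumed = 1;
        }
        i += consumed;
    }
    return output;
}

void main()
{
    writeln(replaceEntitiesOnePass("Tom &amp; Jerry &ndash &quot;hi&quot;"));
    // Prints: Tom & Jerry - "hi"
}
```

Because the output is never rescanned, `&amp;ndash` comes out as `&ndash` rather than being decoded a second time into `-`, which is what the chained `replace` calls in the question would do.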