javascript, keyword, dynamic-keyword

JavaScript: auto-pick keywords from HTML


Given a body of HTML, has anyone written a function that automatically extracts, say, the top 10 keywords from a chunk of HTML, excluding any HTML tags (i.e. just the plain text)?

It should ignore common words like "and", "is", "but", etc. and list the most frequent uncommon words.

Example input:

Mary had a <strong>snow</strong> lamb. <img src=lamb.jpg /> The <i>lamb</i> was snow white, it lay in the snow all white.

Output:

Snow (3)
White (2)
Lamb (2)

jQuery is fine!


Solution

  • In short:

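    1) take the innerHTML of your body;

    2) strip all tags with a .replace() (/<[^>]*>/g), replacing each tag with a space so neighbouring words do not merge; do this before touching punctuation, otherwise the < and > that delimit the tags are already gone;

    3) strip all punctuation and \n so you have a single-line string;

    4) strip all common words (/\band\b/g, /\bbut\b/g, ...). E.g. if your useless words are those with fewer than 4 chars, then strip /\b\w{1,3}\b/g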

    • now you should have a one-line string (str) without markup and useless words

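    4a) Optional: if you don't care about WoRdCAse, convert everything to lowercase (str.toLowerCase()); doing this before stripping the common words lets those patterns match "And" or "The" as well

    5) split on whitespace (str.split(/\s+/), which also copes with runs of several spaces, unlike str.split(' ')); you obtain an array (arr)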

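    6) count how many times each word occurs:

    var words = {},
        i = arr.length;

    // walk the array backwards; i-- (not --i) so that index 0 is counted too
    while (i--) {
        var extWord = arr[i];
        if (extWord === '') continue; // skip empty strings left by repeated spaces
        // own-property check so that words like "constructor" are counted correctly
        words[extWord] = Object.prototype.hasOwnProperty.call(words, extWord) ? words[extWord] + 1 : 1;
    }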
    

    7) loop over the words object with for...in to obtain each key (a single word) and its value (the number of occurrences of that word)
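    The steps above can be put together roughly as follows. This is only a sketch under a few assumptions: the names topKeywords, STOP_WORDS and limit are made up for illustration, the stop-word list is deliberately tiny, the punctuation pattern only handles ASCII letters and digits, and the final sort by count (to get a "top N") is an addition not spelled out in the steps. It takes the HTML as a string, so in a browser you would pass something like document.body.innerHTML.

```javascript
// Hypothetical helper; the names and the stop-word list are illustrative only.
var STOP_WORDS = ['a', 'all', 'and', 'but', 'had', 'in', 'is', 'it', 'the', 'was'];

function topKeywords(html, limit) {
    var text = html
        .replace(/<[^>]*>/g, ' ')       // strip tags first, leaving a space behind
        .toLowerCase()                  // ignore case
        .replace(/[^a-z0-9\s]/g, ' ');  // strip punctuation (ASCII-only assumption)

    var arr = text.split(/\s+/);        // split on any run of whitespace
    var words = Object.create(null);    // no inherited keys such as "constructor"

    for (var i = 0; i < arr.length; i++) {
        var w = arr[i];
        // skip empty strings and common words
        if (w === '' || STOP_WORDS.indexOf(w) !== -1) continue;
        words[w] = (words[w] || 0) + 1;
    }

    // collect word/count pairs, then sort by count (ties alphabetically)
    var result = [];
    for (var key in words) {
        result.push({ word: key, count: words[key] });
    }
    result.sort(function (a, b) {
        return b.count - a.count || a.word.localeCompare(b.word);
    });
    return result.slice(0, limit || 10);
}
```

    With the example input from the question, topKeywords(html, 3) gives snow (3), lamb (2) and white (2). A regex is a rough way to remove markup; in a browser, reading the element's textContent instead of innerHTML may be a more robust way to get the plain text.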

    Hope this helps.