Search code examples
javastringdata-structureshashmapfind-occurrences

Calculate Keyword Density of each Unique Element in List in Java?


I am able to calculate keyword density of a body of text using the following code:

HashMap<String, Integer> frequencies = new HashMap<String, Integer>();

String[] splitTextArr = StringX.splitStrIntoWordsRtrnArr(text);

int articleWordCount = splitTextArr.length;

for (String splitText : splitTextArr) {


 if (frequencies.containsKey(splitText)) {

    Integer previousInt = frequencies.get(splitText);
    Integer newInt = ++previousInt;
    frequencies.put(splitText, newInt);
    } else {

    frequencies.put(splitText, 1);
    }
}

I then sort keyword density list in order of most occurring keywords to least occurring keywords using the following call:

    Map<String, Integer> sortedMap = new LinkedHashMapX<String, Integer>().sortByValueInDescendingOrder(frequencies);

The above code works as expected however I now have to implement a unique requirement.

SUMMARY: Given a list of title entries I need to extract enough keywords so that each title in the list is represented by exactly one keyword and then calculate the total number of titles represented by that keyword (see below for example).

EXAMPLE: Suppose I have the following 5 titles in the form of a LinkedHashMap<String, String>:

title1: canon photo printer 
title2: canon camera canon
title3: wireless mouse
title4: wireless charger
title5: mouse trap

Where the KEY represents the titleID (i.e, title1, title2 etc) and the VALUE represents the actual title.

Raw occurrences of each keyword (in descending order, case insensitive) is as follows:

canon: 2 | mouse: 2 | wireless: 2 | camera: 1 | charger: 1 | photo: 1 | printer: 1 | trap: 1

Note: Each keyword is only counted once per title. So although the keyword canon appears 3 times, since it appears twice in the same title (i.e, title2) it is only counted once.

In the previous map, the keyword canon is appears in both title1 & title2. Since each title needs to be represented by exactly one keyword, both titles can be represented by the keyword canon. It is not necessary to include the other keywords from title1 and title2 (such as: photo, printer and camera) as each title should be represented by exactly one keyword (not more, not less). Although we can tech choose to represent title1 and title2 using the keywords photo and camera (or printer and camera) - since this will have the effect of increasing the total number of keywords necessary to represent all titles, it is not desired. In other words, we want to represent all titles by the least number of keywords possible.

The important part is to extract the least number of keywords that are able to "represent" all titles in list one time and keep track of the number of titles each keyword is linked to and the titleID. If instead of 5 titles we had a list of 100 titles where the keyword photo appeared 95 times (i.e, more times than the keyword canon) the keyword photo would be used to replace the keyword canon and title2 would be represented by the keyword camera.

If two or more keywords can represent the same number of titles we would select the first one in alphabetical order. Thus, the keyword mouse would be used to represent titles title3 and title5, instead of the keyword wireless. Similarly, to represent title4 the keyword charger would be used as the letter C comes before the letter W in the alphabet (this is true even though the keyword wireless appears twice and the keyword charger only appears once since title3 which contains the keyword wireless is already being represented by the keyword mouse and not by the keyword wireless so we revert to using first keyword in alphabet when two keywords represent same number of titles)

In the previous example of 5 titles, the expected output would have the following format:

LinkedHashMap<String, LinkedHashMap<Integer, ArrayList<String>> map = new LinkedHashMap<String, LinkedHashMap<Integer, ArrayList<String>>();

Where String represents the keyword (such as canon or mouse), Integer represents the number of unique titles represented by that keyword (canon = 2, mouse = 2, charger = 1) and ArrayList<String> is a list of titleIDs which is linked to that keyword. For example, the keyword canon would be linked to titles: title1 and title2. The keyword mouse would be linked to titles: title3 and title5 and the keyword charger would be linked to title: title4.

Note: wireless = 0, trap = 0 so it is omitted from the final result.

What is the most efficient way to achieve this?

Thanks


Solution

  • Basically, you need to create a data structure of some kind where you need to keep track of a keyword, the number of titles it is linked to, and the title IDs it is associated to; something like this:

    [canon, {2, [ title1, title2 ] } ]
    

    If this is the case, This is bad for at least two reasons:

    1. Using the count as a key is bad because the count will repeat for other elements
    2. You can simply calculate the count by returning the size or length of the list of title IDs for a given keyword.

    My proposed solution is two-fold: Using a class to capture your keyword and title ID mappings, and then a set of those mappings objects.

    The Class

    public class KeywordMapping implements Comparable<KeywordMapping> {
        
        private final String keyword;
        private TreeSet<String> titleIds = new TreeSet<String>();
        private int frequency;
        
        public KeywordMapping(String keyword, String titleId) {
            this.keyword = keyword;
            titleIds.add(titleId);
        }
        
        public String getKeyword () {
            return keyword;
        }
        
        public int getTitleCount () {
            return titleIds.size();
        }
        
        public void addTitleId (String titleId) {
            titleIds.add(titleId);
        }
        
        public String getFirstTitleId () {
            return titleIds.first();
        }
        
        public void incrementFrequency(String keyword) {
            if (this.keyword.equals(keyword)) {
                frequency++;            
            }
        }
    
        public int getFrequency() {
            return frequency;
        }
    
        public boolean containsTitle(String titleId) {
            return titleIds.contains(titleId);
        }
    
        @Override
        public int hashCode () {
            final int prime = 31;
            int result = 1;
            result = prime * result + ( (keyword == null) ? 0 : keyword.hashCode());
            result = prime * result + ( (titleIds == null) ? 0 : titleIds.hashCode());
            return result;
        }
    
        @Override
        public boolean equals (Object obj) {
            if (this == obj) return true;
            if (obj == null) return false;
            if (getClass() != obj.getClass()) return false;
            KeywordMapping other = (KeywordMapping) obj;
            if (keyword == null) {
                if (other.keyword != null) return false;
            } else if (!keyword.equals(other.keyword)) return false;
            if (titleIds == null) {
                if (other.titleIds != null) return false;
            } else if (!titleIds.equals(other.titleIds)) return false;
            return true;
        }
    
        @Override
        public String toString () {
            return "KeywordMapping [keyword=" + keyword + ", titleIds=" + titleIds + ", frequency=" + frequency
                + "]\n";
        }
    
        @Override
        public int compareTo (KeywordMapping o) {
            return COMPARATOR.compare(this, o);
        }
    
        private static final Comparator<KeywordMapping> COMPARATOR =
            Comparator.comparing( (KeywordMapping keyMapping) -> keyMapping.keyword);
    }
    

    This solution is OK. I would rather use Builder Pattern to build this and make this class immutable so that every instance of this class is built only when all the title IDs have been added to the class. However, let me explain a couple of details of this implementation. I used a TreeSet to store the title IDs so that they are arranged in natural order (alphabetically). This is not important for storing title IDs, but it will be important later on. Lastly, the class must implement Comparable<T> if you need to insert these objects into a sortable collection.

    The TreeMap of keyword mappings

    TreeMap<String, KeywordMapping> mappings = new TreeMap<>();
    

    Using a tree map is necessary to arrange entries in natural order by their key.

    I created this simple main method to demonstrate this implementation

    public static void main (String[] args) {
        List<String> titles = List.of("title2: canon camera canon",
            "title3: wireless mouse",
            "title1: canon photo printer",
            "title4: wireless charger", "title5: mouse trap");
        
        MasterMap mappings = new MasterMap(); // A class wrapping TreeMap<String, KeywordMapping> mappings (the code is shown later on)
        
        for (String title : titles) {
            String[] tokens = title.split(": ");
            Set<String> keywords = new HashSet<>(Arrays.asList(tokens[1].split(" ")));
            for (String keyword : keywords) {
                KeywordMapping km = mappings.get(keyword); // don't create duplicate keyword mapping objects
                if (km == null) {
                    km = new KeywordMapping(keyword, tokens[0]);
                }
                km.addTitleId(tokens[0]);
                km.incrementFrequency(keyword);
                mappings.put(keyword, km); // update existing or new keyword mapping entry
            }
        }
        System.out.println(mappings);
        System.out.println("First entry: " + mappings.firstMapping());
        System.out.println("First title in first entry: " + mappings.firstMapping().getValue().getFirstTitleId());
    }
    

    The output of this program is:

    camera=KeywordMapping [keyword=camera, titleIds=[title2], frequency=1]
    canon=KeywordMapping [keyword=canon, titleIds=[title1, title2], frequency=2]
    charger=KeywordMapping [keyword=charger, titleIds=[title4], frequency=1]
    mouse=KeywordMapping [keyword=mouse, titleIds=[title3, title5], frequency=2]
    photo=KeywordMapping [keyword=photo, titleIds=[title1], frequency=1]
    printer=KeywordMapping [keyword=printer, titleIds=[title1], frequency=1]
    trap=KeywordMapping [keyword=trap, titleIds=[title5], frequency=1]
    wireless=KeywordMapping [keyword=wireless, titleIds=[title3, title4], frequency=2]
    
    First entry: camera=KeywordMapping [keyword=camera, titleIds=[title2], frequency=1]
    First title in first entry: title2
    

    Notice how, even though the original title list is not in natural order, after insertion into the KeywordMapping object, the set backing the list of title IDs arranges them in order. This is the same with the order in which the entries in the tree map are printed out.

    Lastly, the frequency count is only incremented once per title because the list of title IDs is backed by a Set rather than a List and sets in Java do not allow duplicates. Therefore, for a case like title2: canon camera canon, the keyword canon is counted only once because for that title, the set of keywords is only {cannon, camera}.

    Wrapping TreeMap for added functionality

    public class MasterMap {
        private TreeMap<String, KeywordMapping> mappings = new TreeMap<>();
        
        public KeywordMapping put(String key, KeywordMapping value) {
            return mappings.put(key, value);
        }
        
        public KeywordMapping get(String key) {
            return mappings.get(key);
        }
        
        public KeywordMapping firstMapping() {
            return mappings.firstEntry().getValue();
        }
        
        public String getKeywordForTitle (String titleId) {
            
            List<KeywordMapping> keywords = new ArrayList<>();
            
            Collection<KeywordMapping> values = mappings.values();
            Iterator<KeywordMapping> iter = values.iterator();
            
            while (iter.hasNext()) {
                KeywordMapping value = iter.next();
                if (value.containsTitle(titleId)) {
                    keywords.add(value);
                }
            }
            
            KeywordMapping temp = keywords.stream()
                .max(Comparator.comparingInt(KeywordMapping::getFrequency)
                    .thenComparing(KeywordMapping::getKeyword).reversed())
                .get();
            
            return temp.getKeyword();
        }
    }
    

    In order to satisfy the last requirement, I decided to wrap the TreeMap into a class so that I could add methods to perform the last action: return the keyword associated with a given title based on the following criteria: return the keyword with the highest frequency or the first keyword in alphabetical order if the frequency is the same. The implemented function DOES NOT manipulate the KeywordMapping objects or the wrapped tree map in any way. What it does is use a Comparator in order to find the matched keyword based on that criteria.

    With the added changes (previous code has been update), passing "title3" to the function produces the following output:

    Keyword for title3: mouse
    

    This is because the two associated keywords with "title3" {mouse, wireless} have the same frequency (2) but "mouse" is the first keyword in the set in alphabetical order.