So, I have a large file containing 3 million lines of words, and I need to see if there are any duplicates.
I put the lines into a TreeMap so that they are sorted, using each line as a key and giving it a value of 1. When a duplicate line appears, the value for that key is incremented. Then I have to check whether any value is not 1.
Here is my code:
BufferedReader list = new BufferedReader( new FileReader( args[0] ) );
String line;
TreeMap<String,Integer> map = new TreeMap<String,Integer>();
while ( (line = list.readLine()) != null )
{
    if (!map.containsKey(line))
    {
        map.put(line, 0);
    }
    map.put(line, map.get(line) + 1);
}
if ( !map.containsKey(1) )
{
    System.out.print("NOT UNIQUE");
}
else
{
    System.out.print("UNIQUE");
}
list.close();
Question:
Will using a TreeMap speed up the process, or will a HashMap be just as fast or faster?
The output:
Exception in thread "main" java.lang.ClassCastException: java.lang.String cannot be cast to java.lang.Integer
at java.lang.Integer.compareTo(Integer.java:52)
at java.util.TreeMap.getEntry(TreeMap.java:346)
at java.util.TreeMap.containsKey(TreeMap.java:227)
at Lab10.main(Lab10.java:22)
which points at the line if ( !map.containsKey(1) ), but I don't know what went wrong.
The most efficient implementation really depends on your requirements.
From what you've written: So, I have a large file containing 3 million lines of words, and I need to see if there are any duplicates., I assume you only need to check whether any duplicate line exists.
(As for the exception: your map's keys are Strings, so map.containsKey(1) makes the TreeMap compare the Integer 1 against your String keys, and that cast fails with the ClassCastException. To follow your original plan you would have to inspect the map's values, not its keys.)
In that case you don't need to count how many duplicates there are, and a HashSet with the good old String hash function may be good enough (or even better): a hash lookup is O(1) on average, while a TreeMap lookup is O(log n), and you don't need the sorted order a TreeMap maintains.
Here's the example:
boolean hasDuplicate = false;
String line;
Set<String> lines = new HashSet<String>();
while ( (line = list.readLine()) != null && !hasDuplicate )
{
    if (lines.contains(line)) {
        hasDuplicate = true;
    }
    lines.add(line);
}
if (hasDuplicate) {
    System.out.print("NOT UNIQUE");
} else {
    System.out.print("UNIQUE");
}
list.close();
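
As a design note, Set.add already returns false when the element is present, so the contains check can be folded into the add and the loop can break as soon as a repeat shows up. Here is a minimal, self-contained sketch of that variant (the class name DuplicateCheck is just a placeholder, not from the original post):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

public class DuplicateCheck {
    public static void main(String[] args) throws IOException {
        boolean hasDuplicate = false;
        Set<String> lines = new HashSet<String>();
        BufferedReader reader = new BufferedReader(new FileReader(args[0]));
        String line;
        while ((line = reader.readLine()) != null) {
            // add() returns false if the line was already in the set,
            // so one call both records the line and detects a repeat
            if (!lines.add(line)) {
                hasDuplicate = true;
                break; // stop reading as soon as a duplicate is found
            }
        }
        reader.close();
        System.out.print(hasDuplicate ? "NOT UNIQUE" : "UNIQUE");
    }
}

The early break also means that, in the duplicate case, you may not have to read most of the 3 million lines at all.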