The "MapReduce Design Patterns" book has a pattern for finding distinct records in a dataset. This is the algorithm:
map(key, record):
    emit record, null
reduce(key, records):
    emit key
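To make the pseudocode concrete, here is a minimal stand-alone simulation in plain Python (not the Hadoop API; the function names are illustrative only):

```python
# Toy simulation of the distinct pattern: the record itself becomes the
# key, the value is null (Hadoop's NullWritable), and the shuffle groups
# identical records so the reducer emits each key exactly once.

def map_phase(records):
    # emit (record, null) for every input record
    return [(record, None) for record in records]

def shuffle(pairs):
    # group pairs by key, as the framework does between map and reduce
    groups = {}
    for key, value in pairs:
        groups.setdefault(key, []).append(value)
    return groups

def reduce_phase(groups):
    # emit each key once, ignoring the (null) values
    return [key for key in groups]

records = ["John", "Adam", "John", "John"]
distinct = reduce_phase(shuffle(map_phase(records)))
print(sorted(distinct))  # ['Adam', 'John']
```

The grouping done by the shuffle is what makes the reducer trivial: all duplicates of a record arrive at the same reduce call, so emitting the key once suffices.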
On page 66 it says:
The Combiner can always be utilized in this pattern and can help if there are a large number of duplicates.
The map phase emits the record and a NullWritable
(which is not written on the wire). What does the Combiner
try to reduce? There is no value to reduce.
It tries to reduce the number of duplicates in the map output.
Let's say you have text data with one word on each line:
John
Adam
John
John
There is no point in sending every "John"
to the reducer if you can combine them after the map phase and send only:
John
Adam
This output is already distinct for each mapper, which saves bandwidth if you have a fair amount of non-distinct records in your split.
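To see the saving in numbers, here is a hedged stand-alone sketch in plain Python (not the Hadoop API; names are illustrative). For this pattern the combiner is just the reducer logic run locally on one mapper's output, collapsing duplicates before anything crosses the network:

```python
# Toy sketch: a combiner for the distinct pattern deduplicates a single
# mapper's (record, null) pairs, so only one pair per distinct record
# is shipped to the reducers.

def map_phase(records):
    # emit (record, null) for every input record
    return [(record, None) for record in records]

def combine(pairs):
    # deduplicate this mapper's output, preserving first-seen order
    seen = set()
    out = []
    for key, value in pairs:
        if key not in seen:
            seen.add(key)
            out.append((key, value))
    return out

split = ["John", "Adam", "John", "John"]
without_combiner = map_phase(split)
with_combiner = combine(without_combiner)

print(len(without_combiner))          # 4 pairs shipped without a combiner
print(len(with_combiner))             # 2 pairs shipped with a combiner
print([k for k, _ in with_combiner])  # ['John', 'Adam']
```

In real Hadoop you would register the same reducer class as the combiner; the sketch above only illustrates why that cuts the shuffle volume when a split contains many duplicates.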