Spring Batch: how to filter duplicate items before sending them to the ItemWriter


I read a flat file (for example, a .csv file with one line per User, e.g. UserId;Data1;Date2).

But how do I handle duplicate User items in the reader, where there is no list of previously read users?

stepBuilderFactory.get("createUserStep1")
.<User, User>chunk(1000)
.reader(flatFileItemReader) // FlatFileItemReader
.writer(itemWriter) // For example JDBC Writer
.build();

Solution

  • Filtering is typically done with an ItemProcessor. If the ItemProcessor returns null, the item is filtered and not passed to the ItemWriter. Otherwise, it is. In your case, you could keep a list of previously seen users in the ItemProcessor. If the user hasn't been seen before, pass it on. If it has been seen before, return null. You can read more about filtering with an ItemProcessor in the documentation here: https://docs.spring.io/spring-batch/docs/current/reference/html/processor.html#filteringRecords

    import java.util.HashSet;
    import java.util.Set;

    import org.springframework.batch.item.ItemProcessor;

    /**
     * This implementation assumes that there is enough room in memory to store the
     * Users seen so far. Otherwise, you'd want to store them somewhere you can do a look-up on.
     */
    public class UserFilterItemProcessor implements ItemProcessor<User, User> {

        // This assumes that User.equals() and User.hashCode() identify the duplicates
        private final Set<User> seenUsers = new HashSet<>();

        @Override
        public User process(User user) {
            // Set.add returns false when the user was already in the set
            if (!seenUsers.add(user)) {
                return null; // duplicate: returning null filters the item out
            }
            return user;
        }
    }
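
If User does not override equals() and hashCode(), the same idea should work by remembering only the key that defines a duplicate. The following is a minimal sketch under assumptions: the User class here is a hypothetical stand-in with a getUserId() accessor (the question only shows a UserId column), the name UserIdFilterProcessor is made up for illustration, and the Spring Batch interface is left out so the snippet compiles on its own.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical stand-in for the question's User class; the fields are assumptions.
final class User {
    private final String userId;
    private final String data;

    User(String userId, String data) {
        this.userId = userId;
        this.data = data;
    }

    String getUserId() { return userId; }
    String getData() { return data; }
}

// In a real job this class would also declare
// "implements org.springframework.batch.item.ItemProcessor<User, User>";
// that is omitted here so the sketch compiles without Spring Batch.
final class UserIdFilterProcessor {

    // Only the ids are kept, so memory grows with the number of distinct users.
    private final Set<String> seenUserIds = new HashSet<>();

    // Returns the user the first time its id is seen, and null for later duplicates.
    User process(User user) {
        return seenUserIds.add(user.getUserId()) ? user : null;
    }
}
```

With this approach the first line read for a given UserId wins and later lines with the same id are dropped. Either processor would then be registered on the step with `.processor(...)` between `.reader(...)` and `.writer(...)`.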