
Finding the max length of a word using MapReduce


I need to find the longest word or words in a txt file using MapReduce. I have written the following mapper and reducer, but the result is the entire dictionary, with len(word) as the key and the words as the values. I need help writing the code that shows only the maximum length and the corresponding words. Here is my code:

"""mapper.py"""
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print('%s\t%s' % (len(word), word))



"""reducer.py"""

import sys

results = {}
for line in sys.stdin:
    index, value = line.strip().split('\t')
    if index not in results:
        results[index] = value
    else:
        results[index] += ' '
        results[index] += value

I am stuck at this point: how do I continue the code to get max(key) with the corresponding words?

Input file : How Peace Begins ? Peace begins with saying sorry, Peace begins with not hurting others, Peace begins with honesty ,trust and dedications, Peace begins with showing cooperation and respect. World Peace Begins with Me !

Output expected : The longest word has 11 characters. The words are: dedications cooperation


Solution

  • I am not sure what you are doing with stdin or why you are importing sys. Also, the sample input file doesn't seem to be in CSV format; it is just a plain text file. As I understand your problem, you want to read an input file, measure the length of each word, and report the maximum word length along with the words that have that length. With this in mind, this is how I would proceed:

    inputFile = r'sampleMapperText.txt'
    reslt = dict()  # keys = word lengths, values = list of words of that length
    with open(inputFile, 'r') as f:
        for line in f:
            # split() breaks on whitespace only, so punctuation stays attached
            for w in line.split():
                reslt.setdefault(len(w), []).append(w)
    maxLen = max(reslt)  # largest key, i.e. the longest word length
    print(f"Max Word Length = {maxLen}, Longest words = {', '.join(reslt[maxLen])}")
    

    Running this code produces:

    Max Word Length = 12, Longest words = dedications,
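
The reported length is 12 rather than the expected 11 because split() only breaks on whitespace, so the token "dedications," keeps its trailing comma. A minimal sketch of one way to match the expected output, assuming that trimming leading and trailing punctuation from each token is acceptable (the function name longest_words and the inline sample string are illustrative, not part of the original code):

```python
import string

def longest_words(text):
    """Return (max_length, words) after trimming punctuation from each token."""
    by_length = {}  # word length -> words of that length, in first-seen order
    for token in text.split():
        word = token.strip(string.punctuation)  # "dedications," -> "dedications"
        if not word:
            continue  # token was pure punctuation, e.g. "?" or "!"
        bucket = by_length.setdefault(len(word), [])
        if word not in bucket:  # skip repeated words
            bucket.append(word)
    if not by_length:
        return 0, []
    max_len = max(by_length)
    return max_len, by_length[max_len]

# Sample input copied from the question
sample = ("How Peace Begins ? Peace begins with saying sorry, "
          "Peace begins with not hurting others, "
          "Peace begins with honesty ,trust and dedications, "
          "Peace begins with showing cooperation and respect. "
          "World Peace Begins with Me !")
length, words = longest_words(sample)
print(f"The longest word has {length} characters. The words are: {' '.join(words)}")
```

On the sample text this should print the length 11 with both dedications and cooperation. Note that strip() only removes punctuation at the ends of a token, so words such as "don't" keep their apostrophe.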
    

    If you insist on splitting the process into two files, and assuming both files are in the same directory, I would do it as follows:

    The contents of the reducer.py file would be:

    # reducer.py
    def getData(filepath: str) -> list[str]:  # list[str] needs Python 3.9+
        # Read the whole file and return it as a list of lines
        with open(filepath, 'r') as f:
            text = f.read().split('\n')
        return text
    

    The contents of the mapper.py file would be:

    # mapper.py
    from reducer import getData

    # Note: the annotation must be list[str] (Python 3.9+); list(str) raises a TypeError
    def mapData(text: list[str]):
        reslt = dict()  # keys = word lengths, values = list of words of that length
        for line in text:
            for w in line.split():
                reslt.setdefault(len(w), []).append(w)
        maxLen = max(reslt)  # largest key, i.e. the longest word length
        print(f"Max Word Length = {maxLen}, Longest words = {', '.join(reslt[maxLen])}")

    inputFile = r'sampleMapperText.txt'
    mapData(getData(inputFile))
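
If the goal is to keep the stdin-based mapper and reducer from the question, the missing step could look roughly like the sketch below. It assumes the mapper emits tab-separated "length, word" records, as in the question, and that a single reducer sees all of them. The names map_lines and reduce_lines are hypothetical. The length is converted to int because the question's reducer keeps the keys as strings, and comparing strings would rank '7' above '11'.

```python
def map_lines(lines):
    """Mapper step: emit one 'length<TAB>word' record per whitespace-separated token."""
    for line in lines:
        for word in line.split():
            yield f"{len(word)}\t{word}"

def reduce_lines(records):
    """Reducer step: keep only the words whose length equals the running maximum."""
    max_len, longest = 0, []
    for record in records:
        record = record.rstrip("\n")
        if not record:
            continue  # ignore blank lines
        key, word = record.split("\t", 1)
        length = int(key)  # compare as numbers; as strings, '7' > '11'
        if length > max_len:
            max_len, longest = length, [word]  # new maximum: start a fresh list
        elif length == max_len and word not in longest:
            longest.append(word)
    return max_len, longest

# In-memory demonstration; no stdin is needed to try it out
records = list(map_lines(["Peace begins with showing cooperation and respect"]))
print(reduce_lines(records))
```

In a streaming setup, mapper.py would print each record produced by map_lines(sys.stdin), and reducer.py would call reduce_lines(sys.stdin) and print the result. Because only the current maximum is kept, the reducer does not need to hold the whole dictionary in memory.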