Tags: sorting, apache-spark, pyspark, rdd

PySpark - Sort RDD by Second Column


I have this RDD:

[[u''], [u'E01', u'Lokesh'], [u'E10', u'Venkat'], [u'EO2', u'Bhupesh'], [u'EO3', u'Amit'], [u'EO4', u'Ratan'], [u'EO5', u'Dinesh'], [u'EO6', u'Pavan'], [u'EO7', u'Tejas'], [u'EO8', u'Sheela']]

And I want to sort it by the second column (the name) to get this:

[u'EO3', u'Amit'], 
[u'EO2', u'Bhupesh'], 
[u'EO5', u'Dinesh'], 
[u'E01', u'Lokesh'], 
[u'EO6', u'Pavan'],
[u'EO8', u'Sheela'],
[u'EO7', u'Tejas'],
[u'E10', u'Venkat']

I tried this:

sorted = employee_rows.sortBy(lambda line: line[1])

But it gives me this:

IndexError: list index out of range

How can I sort by the second column?

Thanks!


Solution

  • In general, you should make all of your higher-order RDD functions robust to bad inputs. In this case, the error occurs because at least one record does not have a second column.

    One way is to check the length of line inside the lambda:

    employee_rows.sortBy(lambda line: line[1] if len(line) > 1 else None).collect()
    #[[u''],
    # [u'EO3', u'Amit'],
    # [u'EO2', u'Bhupesh'],
    # [u'EO5', u'Dinesh'],
    # [u'E01', u'Lokesh'],
    # [u'EO6', u'Pavan'],
    # [u'EO4', u'Ratan'],
    # [u'EO8', u'Sheela'],
    # [u'EO7', u'Tejas'],
    # [u'E10', u'Venkat']]
    
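    Note that this relies on Python 2 comparison rules, where None sorts before any string (which is why [u''] comes first). On Python 3, comparing None with a string raises a TypeError, so as a sketch for that case you could use a tuple as the sort key instead:

    # Sketch for Python 3: a tuple key avoids comparing None with strings.
    # Malformed rows get False as the first tuple element, so they still
    # sort first; well-formed rows are then ordered by name.
    employee_rows.sortBy(
        lambda line: (len(line) > 1, line[1] if len(line) > 1 else u'')
    ).collect()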

    Or you could define a custom sort function with try/except. Here's a way to make the "bad" rows sort last:

    def mysort(line):
        try:
            return line[1]
        except IndexError:
            # return a sentinel that sorts after the (capitalized) names
            return 'Z'
    
    employee_rows.sortBy(mysort).collect()
    #[[u'EO3', u'Amit'],
    # [u'EO2', u'Bhupesh'],
    # [u'EO5', u'Dinesh'],
    # [u'E01', u'Lokesh'],
    # [u'EO6', u'Pavan'],
    # [u'EO4', u'Ratan'],
    # [u'EO8', u'Sheela'],
    # [u'EO7', u'Tejas'],
    # [u'E10', u'Venkat'],
    # [u'']]
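
    If you don't need to keep the malformed rows at all, a simpler option (again, just a sketch) is to filter them out before sorting:

    # Drop rows without a second column, then sort the rest by name
    employee_rows.filter(lambda line: len(line) > 1) \
        .sortBy(lambda line: line[1]) \
        .collect()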