
TypeError: tuple indices must be integers, not str using pyspark and RDD


I'm new to Python. I'm also new to pyspark. I'm trying to run a line of code that maps each record to (kv[0], kv[1]) and then runs an ngrams() function on kv[1].

Also here is the sample layout of the mentions data that the code works on:

Out[12]: 
[{'_id': u'en.wikipedia.org/wiki/Kamchatka_Peninsula',
  'source': 'en.wikipedia.org/wiki/Warthead_sculpin',
  'span': (100, 119),
  'text': u' It is native to the northern.'},
 {'_id': u'en.wikipedia.org/wiki/Warthead_sculpin',
  'source': 'en.wikipedia.org/wiki/Warthead_sculpin',
  'span': (4, 20),
  'text': u'The warthead sculpin ("Myoxocephalus niger").'}]

This is the code that I'm working with:

    def build(self, mentions, idfs):
        m = mentions\
            .map(lambda (source, target, span, text): (target, text))\
            .flatMapValues(lambda v: ngrams(v, self.max_ngram))\
            .map(lambda v: (v, 1))\
            .reduceByKey(add)

How should the data from the previous step be formatted to resolve this error? Any help or guidance would be truly appreciated.

I'm using python 2.7 and pyspark 2.3.0.

Thank you,


Solution

  • mapValues can be applied only to an RDD of (key, value) pairs (an RDD where each element is a tuple of length 2, or some object that behaves as one - How to determine if object is a valid key-value pair in PySpark)

    Your data is a dictionary, so it doesn't qualify. It is not clear what you expect there, but I suspect you want:

    from operator import itemgetter
    
    (mentions
      .map(itemgetter("_id", "text"))
      .flatMapValues(lambda v: ngrams(v, self.max_ngram))
      .map(lambda v: (v, 1)))
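    To see what itemgetter produces here, the extraction step can be checked in plain Python on the sample records from the question, with an ordinary list standing in for the RDD (a sketch only; ngrams and the rest of the Spark pipeline are unchanged):

    ```python
    from operator import itemgetter

    # Sample records from the question, in a plain list instead of an RDD.
    mentions = [
        {"_id": "en.wikipedia.org/wiki/Kamchatka_Peninsula",
         "source": "en.wikipedia.org/wiki/Warthead_sculpin",
         "span": (100, 119),
         "text": " It is native to the northern."},
        {"_id": "en.wikipedia.org/wiki/Warthead_sculpin",
         "source": "en.wikipedia.org/wiki/Warthead_sculpin",
         "span": (4, 20),
         "text": 'The warthead sculpin ("Myoxocephalus niger").'},
    ]

    # itemgetter("_id", "text") turns each dict into an ("_id", "text")
    # 2-tuple -- exactly the (key, value) shape flatMapValues requires.
    pairs = list(map(itemgetter("_id", "text"), mentions))
    print(pairs[0])
    ```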