Search code examples
apistreamingpysparksuds

Pyspark: how to streaming Data from a given API Url


I was given an API url, and a method getUserPost() which returns the data needed for my data processing function. I am able to get the data by using Client from suds.client as follow:

from suds.client import Client
from suds.xsd.doctor import ImportDoctor, Import

url = 'url'
imp = Import('http://schemas.xmlsoap.org/soap/encoding/')
imp.filter.add('filter')
d = ImportDoctor(imp)
client = Client(url, doctor=d)
tempResult = client.service.getUserPosts(user_ids = '',date_from='2016-07-01 03:19:57', date_to='2016-08-01 03:19:57', limit=100, offset=0)

Now, each tempResult will contain 100 records. I want to stream the data from given API url to RDD for parallelized processing. However, after reading the pySpark.Streaming documentation I can't find a streaming method for customized data source. Could anyone give me an ideal how to do so?

Thank you.


Solution

  • After a while digging, I found out how to solve the problem. I employed the use of Kafka Streaming. Basically you need to create a producer from given API, specify topic and Port for communication. Then a consumer to listen to that specific topic and Port to start streaming the data.

    Note that the Producer and Consumer must be working as different threads in order to archive real-time streaming.