Tags: amazon-web-services, twitter, streaming, amazon-ecs

Distribute an incoming data stream into separate containers within the same network (Twitter API & AWS ECS)


I am building a data pipeline on AWS that streams data from Twitter's v1.1 POST statuses/filter endpoint. A streamer app runs in an ECS (i.e. Docker) container, from which it sends the initial POST request. The app then forwards the tweets to an AWS Kinesis Firehose delivery stream (so it's possible to send data to the same Firehose stream from different places/agents).
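The forwarding step can be sketched roughly like this (a minimal sketch in Python; the record-formatting and batching helpers are illustrative and not from the question, but they reflect the Firehose PutRecordBatch limit of 500 records per call):

```python
import json

# Firehose PutRecordBatch accepts at most 500 records per call,
# so buffered tweets must be chunked before sending.
MAX_BATCH_RECORDS = 500


def to_firehose_record(tweet: dict) -> dict:
    # Newline-delimited JSON, so the objects Firehose writes to S3
    # can later be split back into individual tweets.
    return {"Data": (json.dumps(tweet) + "\n").encode("utf-8")}


def chunk_records(records, size=MAX_BATCH_RECORDS):
    # Split the buffered records into Firehose-sized batches.
    return [records[i:i + size] for i in range(0, len(records), size)]


def send_all(firehose_client, stream_name, tweets):
    # Serialize and ship all buffered tweets to the delivery stream.
    records = [to_firehose_record(t) for t in tweets]
    for batch in chunk_records(records):
        firehose_client.put_record_batch(
            DeliveryStreamName=stream_name, Records=batch
        )
```

In the actual streamer you would pass `boto3.client("firehose")` as `firehose_client`, with the delivery-stream name coming from the container's environment.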

I am using a VPC, hence it's possible to run multiple containers within the same network.

The question is: Is it possible to distribute an incoming (Twitter) data stream into multiple containers within the same network (VPC)? And if yes, any hints how to do that?

UPD: My pipeline is Twitter API -> [ECS container] streamer app -> S3 -> Lambda (predictions) -> Elasticsearch, and I'm asking about the streamer app part.

The ultimate goal here is to be able to scale depending on the intensity of the stream. E.g., have one small (memory, CPU) container when the traffic from Twitter is low and spin up more of them when the stream is more intense.


Solution

  • That is possible (equating one streamer to one Twitter API connection), but each connection will receive the exact same stream, returning the same Tweets in every streamer instance.

    If you're tracking a static set of keywords, a better approach is to autoscale to a single larger container, instead of having additional containers ingest the exact same stream in parallel.
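    If you did run several identical streamers anyway (e.g. for redundancy), each one would receive every tweet, so the work has to be partitioned downstream to avoid duplicate processing. One hypothetical way is to shard deterministically on the tweet ID, so each container keeps only its slice of the shared stream (a sketch; `worker_index` and `num_workers` are assumed to come from each task's environment):

    ```python
    import hashlib


    def owns_tweet(tweet_id: int, worker_index: int, num_workers: int) -> bool:
        # Hash the tweet ID so the assignment is deterministic across
        # containers and roughly uniform across workers.
        digest = hashlib.sha256(str(tweet_id).encode()).digest()
        return int.from_bytes(digest[:8], "big") % num_workers == worker_index


    def filter_for_worker(tweet_ids, worker_index, num_workers):
        # Each streamer keeps the tweets it owns and drops the rest.
        return [t for t in tweet_ids if owns_tweet(t, worker_index, num_workers)]
    ```

    Every tweet ends up owned by exactly one worker, but note this only spreads the *processing*: each container still pays for ingesting the full stream over its own connection, which is why a single larger container is usually the simpler option.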