
Some general Twitter4J questions


I'm trying to do a write-up of Twitter4J for part of a uni project, but I'm getting hung up on a few things. From the Twitter4J API documentation:

void sample()
Starts listening on random sample of all public statuses. The default access level provides a small proportion of the Firehose. The "Gardenhose" access level provides a proportion more suitable for data mining and research applications that desire a larger proportion to be statistically significant sample.

This implies that a "default access" level is provided to the stream by default, but that another level, "Gardenhose access", is also available. Is this correct? And if so, how do you obtain the higher Gardenhose access level?

I'm asking as I've seen some answers on SO suggest that there is only one level of access - the Gardenhose, and I'm trying to clear this up once and for all.

In addition to this, I would like a reference (if possible) to the number of tweets the sample stream allows access to. I've read lots of people cite 1% for "default access" and 10% for "gardenhose access" - but I can't find this anywhere in the API.

So to sum up, two questions:

  1. Does the sample stream have a "default access" and a "gardenhose access", or just one of those?
  2. How much of the Twitter firehose stream does each of these access levels provide?

If replying, please include links to citable API documentation where possible.


Solution

  • The gardenhose is different from the default sample stream; you would have had to request access from Twitter in order to use it.

    However, I am not sure if Twitter still allows access to the gardenhose, or even if it still exists. It seems the current mechanism may be to use one of Twitter's preferred data partners:

    Using the Streaming API?

    Every Twitter account can connect to a small sampling of the Streaming API. Accounts that need increased access for data gathering or analytical reasons should check out our preferred partners page.

    (source)

    It may be different for students or educational institutions, and the gardenhose may still be available to you. Previously you would have had to either e-mail [email protected] or use the following form, but I have no idea whether these methods still work - the post is quite old.

    As for the percentage of Tweets that the default sample stream allows access to, the best reference I could find was a comment made by a Twitter employee on the developer forums - emphasis mine:

    I would recommend just using the 1% sample stream from https://stream.twitter.com/1/statuses/sample.json that you can connect to with your Twitter account. It's unlikely that you'll be in a situation where you can access all of the data and will have to make do with a sample. At about 230 million tweets a day, you'd still be theoretically getting 2.3 million tweets a day.

    (source)

    Again, though, this is an old post.

    Regarding the firehose stream, the documentation specifies that you need to be granted permission to access it. I believe very few people have full access to this stream:

    GET statuses/firehose

    This endpoint requires special permission to access.

    Returns all public statuses. Few applications require this level of access. Creative use of a combination of other resources and various access levels can satisfy nearly every application use case.

    Overall, documentation is scarce on the different access levels and what they offer. I suggest contacting Twitter directly to discuss your requirements, or contacting one of their data partners.

    Apologies if this wasn't as concrete as you would have liked. Good luck with your research.
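
For a write-up, the arithmetic behind the quoted "1% of about 230 million tweets a day" figure can be reproduced with a small sketch like the one below. The class and method names are made up for illustration. The 1% and 230 million values come from the forum post quoted above; the 10% value is only the commonly cited gardenhose figure mentioned in the question, which I could not confirm in any official documentation, so treat it as an assumption.

```java
public class SampleStreamEstimate {

    // Estimated tweets per day delivered by a stream that carries the given
    // percentage of the full firehose. Integer arithmetic keeps the result exact
    // for whole-number percentages.
    public static long estimateDailyTweets(long firehoseTweetsPerDay, int samplePercent) {
        return firehoseTweetsPerDay * samplePercent / 100;
    }

    public static void main(String[] args) {
        // Daily firehose volume as stated in the quoted forum post (an old figure).
        long firehosePerDay = 230_000_000L;

        // 1% sample stream, as described by the Twitter employee.
        System.out.println("1% sample:  " + estimateDailyTweets(firehosePerDay, 1));

        // 10% is the unconfirmed, commonly cited gardenhose proportion (assumption).
        System.out.println("10% sample: " + estimateDailyTweets(firehosePerDay, 10));
    }
}
```

With those inputs the 1% case comes to 2.3 million tweets a day, matching the number in the quote. The daily firehose volume will have changed since that post, so any real estimate should use a current figure.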