While trying to implement the tutorial from the series "How-to: Analyze Twitter Data with Apache Hadoop", I stumbled upon two issues:
- CDH was installed using parcels, the recommended method, whereas the tutorial assumes an installation performed with packages. As a consequence, most of the libraries and programs are installed in different locations.
- Because of major library upgrades in CDH5, the Maven configuration file has to be updated accordingly before the Flume Twitter source can be compiled.
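For example, with a parcel-based install, the Flume libraries typically live under /opt/cloudera/parcels/CDH/lib/flume-ng/ rather than the /usr/lib/flume-ng/ path used by package installs. A quick way to check that the Twitter source ships with the parcel, assuming the default parcel location (the jar name varies by CDH version):
$> ls /opt/cloudera/parcels/CDH/lib/flume-ng/lib/ | grep -i twitter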
Fortunately, the CDH5 release notes mention:
The Twitter source is now out-of-the-box.
Let’s set up Flume as follows:
The approach is to:
- Create a Twitter application. To establish a connection to Twitter, create an application and retrieve the related security keys (OAuth) that will authenticate your Twitter account.
- Create a Flume agent. Configure the Flume agent with one source, TwitterSource, and one sink (target), HDFS.
- Prepare HDFS. Create and configure the HDFS user and directory where the tweets will be stored.
Create a Twitter Application
Sign in to the Twitter development site.
Hover over your avatar (top right) and select ‘My Applications’ from the menu.
Create the application by going through all the steps, then retrieve the four keys required to set up the source of the Flume agent.
The keys are named Consumer Key, Consumer Secret, Access Token, and Access Token Secret. They can be found on the application's details page once the application is created.
These keys won't change unless you decide to regenerate them. You can always come back to this application screen and copy the keys again.
Create a Flume Agent
In its simplest form, a Flume agent is composed of a Source, a Sink (the destination) and a Channel. Data is collected by the source and transmitted through the channel to the destination:
In our case, TwitterSource (the source) collects the tweets. These tweets are sent to HDFS (the sink) through memoryChannel (the memory channel).
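In a Flume configuration file, these components are declared and wired together by name. Here is a minimal, generic sketch (the agent and component names are illustrative):
agent.sources = src
agent.channels = ch
agent.sinks = snk
# a source can feed one or more channels (note the plural 'channels')
agent.sources.src.channels = ch
# a sink drains exactly one channel (note the singular 'channel')
agent.sinks.snk.channel = ch
The TwitterAgent configuration below follows exactly this pattern, with real component types and settings.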
The Flume service is installed, but no instance of this service has been created yet. To create an instance:
- Go to the main screen of Cloudera Manager.
- From the Cluster drop-down button, select "Add a service".
From the list of available services, select Flume:
Then select the Flume service from the Cluster menu at the top:
To modify the Flume service agent configuration:
- Select the menu Configuration -> View/Edit.
- Select the Agent Base Group.
- Change the agent name to ‘TwitterAgent’.
- Copy the configuration code provided below, replacing the [xxx] placeholders with the Twitter key values.
The configuration of the Flume agent is as follows:
TwitterAgent.sources = twitter
TwitterAgent.channels = memoryChannel
TwitterAgent.sinks = HDFS

TwitterAgent.sources.twitter.type = org.apache.flume.source.twitter.TwitterSource
TwitterAgent.sources.twitter.consumerKey = [xxx] #insert your key here
TwitterAgent.sources.twitter.consumerSecret = [xxx] #insert your key here
TwitterAgent.sources.twitter.accessToken = [xxx] #insert your key here
TwitterAgent.sources.twitter.accessTokenSecret = [xxx] #insert your key here
TwitterAgent.sources.twitter.maxBatchDurationMillis = 200
TwitterAgent.sources.twitter.channels = memoryChannel

TwitterAgent.channels.memoryChannel.type = memory
TwitterAgent.channels.memoryChannel.capacity = 10000
TwitterAgent.channels.memoryChannel.transactionCapacity = 1000

TwitterAgent.sinks.HDFS.type = hdfs
TwitterAgent.sinks.HDFS.channel = memoryChannel
TwitterAgent.sinks.HDFS.hdfs.path = hdfs://hadoop1.example.com:8020/user/flume/tweets/%Y/%m/%d/%H/
TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream
TwitterAgent.sinks.HDFS.hdfs.writeFormat = Text
TwitterAgent.sinks.HDFS.hdfs.batchSize = 1000
TwitterAgent.sinks.HDFS.hdfs.rollSize = 0
TwitterAgent.sinks.HDFS.hdfs.rollCount = 10000
TwitterAgent.sinks.HDFS.hdfs.useLocalTimeStamp = true
Note that the hdfs.useLocalTimeStamp option has been set to true: the HDFS sink then resolves the %Y/%m/%d/%H escape sequences in hdfs.path using the local time rather than a timestamp header from the event, which avoids name conflicts while saving files in the HDFS directory.
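With this path layout, the sink writes one directory per hour. For example, a file could land at (the date is hypothetical; FlumeData is the HDFS sink's default file prefix):
/user/flume/tweets/2014/01/15/10/FlumeData.1389780000000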
Once these modifications are performed, click the “Save Changes” button.
The agent is created, but it needs to be assigned to a cluster node to be able to run.
- Select the “Instances” tab, and
- Click on the “Add” button.
- Select one node and validate.
The last step is to create the flume user in HDFS. In Hue, go to the Admin -> Manage Users menu. Add a flume user (leave the ‘create home directory’ option selected) and a flume group.
From a command prompt, type the following commands:
$> su - hdfs
$> hadoop fs -chmod -R 770 /user/flume
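If the home directory does not exist yet (for instance, if the ‘create home directory’ option was left unchecked in Hue), you can create it and hand it over to the flume user first. Still as the hdfs user:
$> hadoop fs -mkdir /user/flume
$> hadoop fs -chown -R flume:flume /user/flume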
Start the Flume Agent from the Cloudera Manager home screen.
From there, you can monitor the progress and operations of the Flume agent:
In Hue, verify that the tweets are now being written to the HDFS directory.
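The same check can be performed from the command line (the date-based path below is hypothetical):
$> hdfs dfs -ls /user/flume/tweets/2014/01/15/10/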
A Firehose or Garden Hose?
After 24 hours, there are 1.6 GB worth of tweets in HDFS:
$> hdfs dfs -du -h /user/flume
1.6 G  /user/flume/tweets
TwitterSource processed only 4,374,100 tweets, compared to the 500+ million tweets that Twitter claims to generate daily. It looks like my firehose is in fact a garden hose. The reasons are explained, somewhat, on Twitter's developer site: the free public stream is rate limited, and only vendors reselling the Twitter stream APIs are exempt from those restrictions. The three certified providers are DataSift, Gnip and NTT Data. Topsy was also a data reseller, but it was recently acquired by Apple.
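The numbers are consistent with the roughly 1% sample that the free public stream is generally reported to deliver: 4,374,100 / 500,000,000 ≈ 0.9%, or about 50 tweets per second (4,374,100 / 86,400 ≈ 51).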
A Twitter firehose implemented without writing a single line of code: that's quite impressive. It is now possible to analyze all these tweets with other tools such as Hive, Solr, etc.
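One caveat before plugging Hive in: unlike the custom source of the original tutorial, which writes raw JSON, the built-in org.apache.flume.source.twitter.TwitterSource serializes events as Avro, so Hive would read the files through its Avro SerDe rather than a JSON SerDe. You can inspect the schema embedded in the files with avro-tools before defining a table; a sketch, with hypothetical file name and jar version:
$> hdfs dfs -get /user/flume/tweets/2014/01/15/10/FlumeData.1389780000000 .
$> java -jar avro-tools-1.7.4.jar getschema FlumeData.1389780000000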
All comments and remarks are welcome.