Many of our customers receive thousands of mentions per day — far more than can be read and understood in aggregate. The recently launched Trends Report was created to provide an easily consumable digest of this data. The report displays who is talking about your brand and what they are talking about.
To process the incoming Twitter mention data, we created a Finagle/Thrift service that extracts topics and supporting words from tweet text using a natural language processing algorithm. To aggregate the data we leveraged our Hadoop cluster and wrote several Pig scripts that run each night, and to serve the data we added functionality to our existing Django API engine.
The core of the project was parsing tweet text into topics and the words that support them. If you tweeted “@SproutSocial, Your software is awesome,” we wanted “software” to be a topic.
Further, we wanted to know what people were saying about the topics and hashtags, so we wanted to capture “awesome” as a “cloud word” — meaning it appears in a word cloud around the topic. It was important that this algorithm extend to any line of business: if someone is discussing a menu item at a restaurant, or a particular player on an NFL team’s roster, those are captured as topics too.
Our approach (spearheaded by our data scientist @JoelB82) was built on the identification of parts of speech within the tweet. We used a Markov-backed model that analyzed a tweet’s text and output the most likely sequence of tags. It’s a probability-based strategy. It asks questions such as, “Given that the first word is ‘I,’ what part of speech is the second word most likely to be?” This is a simplistic example, and it gets much more complicated.
This strategy works well for tagging text data where the text could be anything. Companies could launch a new product line using a word that had never before existed, and we wanted to make sure our tagging could capture that.
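To make the Markov approach concrete, here is a toy sketch of sequence tagging with the Viterbi algorithm. The tag set, transition probabilities, and emission probabilities below are invented for illustration and are not our actual model; the smoothing mass for unseen words is what lets a brand-new product name be tagged from context alone.

```python
# Toy sketch of Markov-style part-of-speech tagging via the Viterbi
# algorithm. All probabilities below are illustrative, not our real model.

TAGS = ["PRON", "VERB", "ADJ"]

# P(tag | previous tag); "START" precedes the first word.
TRANS = {
    ("START", "PRON"): 0.6, ("START", "VERB"): 0.2, ("START", "ADJ"): 0.2,
    ("PRON", "VERB"): 0.7, ("PRON", "PRON"): 0.1, ("PRON", "ADJ"): 0.2,
    ("VERB", "ADJ"): 0.5, ("VERB", "PRON"): 0.3, ("VERB", "VERB"): 0.2,
    ("ADJ", "VERB"): 0.4, ("ADJ", "ADJ"): 0.3, ("ADJ", "PRON"): 0.3,
}

# P(word | tag); unseen words get a small smoothing mass so that a
# never-before-seen word can still be tagged from its context.
EMIT = {
    ("i", "PRON"): 0.5, ("love", "VERB"): 0.4, ("awesome", "ADJ"): 0.3,
}
SMOOTH = 0.001

def viterbi(words):
    """Return the most likely tag sequence for the given words."""
    # best[tag] = (probability of the best path ending in tag, that path)
    best = {t: (TRANS[("START", t)] * EMIT.get((words[0], t), SMOOTH), [t])
            for t in TAGS}
    for word in words[1:]:
        nxt = {}
        for t in TAGS:
            prob, path = max(
                (best[p][0] * TRANS[(p, t)] * EMIT.get((word, t), SMOOTH),
                 best[p][1])
                for p in TAGS)
            nxt[t] = (prob, path + [t])
        best = nxt
    return max(best.values())[1]

print(viterbi(["i", "love", "awesome"]))  # → ['PRON', 'VERB', 'ADJ']
```

Each step asks exactly the question from above: given the tag of the previous word, which tag is most likely for this one, weighted by how likely this word is under each tag.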
On top of this tagging algorithm, we built in special handling for Twitter idioms, like “RT” and “via.” Nouns (like “software”) and noun phrases (like “social media software”) became topics. Accompanying adjectives and adverbs became “cloud words.” The generated metadata was stored in an HBase table along with the tweet data.
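A simplified version of that pass might look like the following. The tag names and the exact filtering rules are placeholders standing in for our real pipeline (which also handles noun phrases):

```python
# Illustrative pass over tagged tokens: nouns become topics, adjectives
# and adverbs become cloud words, and Twitter idioms are dropped.
# Tag names and rules are simplified from the real pipeline.

TWITTER_IDIOMS = {"rt", "via"}

def extract(tagged_tokens):
    """tagged_tokens: list of (word, tag) pairs from the POS tagger."""
    topics, cloud_words = [], []
    for word, tag in tagged_tokens:
        w = word.lower().lstrip("@#")
        if w in TWITTER_IDIOMS:
            continue  # "RT", "via", etc. carry no topical meaning
        if tag == "NOUN":
            topics.append(w)
        elif tag in ("ADJ", "ADV"):
            cloud_words.append(w)
    return topics, cloud_words

tagged = [("RT", "X"), ("Your", "PRON"), ("software", "NOUN"),
          ("is", "VERB"), ("awesome", "ADJ")]
print(extract(tagged))  # → (['software'], ['awesome'])
```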
Aggregating With Hadoop
Having identified the topics and cloud words (along with hashtags, which are baked into Twitter’s data), we needed a way to calculate the top topics for each individual Twitter handle. A given customer might receive hundreds of thousands of mentions over the course of a month, and it would be difficult, and very slow, to group and then count the top topics in memory on a modest server.
So we leveraged our Hadoop cluster to run a job nightly. This job groups all of the incoming mentions by mentioned user, and then further groups by topic, taking the 15 most common topics and then the 15 most common associated words for each topic. It does similar work with people who mention you and who are mentioned along with you. The data is serialized as JSON and stored in Cassandra. Lastly, for each topic, it stores all tweets that included that topic in another Cassandra column family. This is what drives the ability to drill into a given topic or hashtag.
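The shape of that rollup can be sketched in plain Python (the real job is a Pig script running on Hadoop, and the record fields here are illustrative):

```python
# Sketch of the nightly rollup: group mentions by the mentioned user,
# count topics, and keep the top 15 topics plus the top 15 cloud words
# per topic. The real job is a Pig script; field names are illustrative.
from collections import Counter, defaultdict
import json

def rollup(mentions, top_n=15):
    """mentions: iterable of dicts like
    {"mentioned_user": ..., "topic": ..., "cloud_words": [...]}"""
    topic_counts = defaultdict(Counter)   # user -> Counter of topics
    word_counts = defaultdict(Counter)    # (user, topic) -> Counter of words
    for m in mentions:
        user, topic = m["mentioned_user"], m["topic"]
        topic_counts[user][topic] += 1
        word_counts[(user, topic)].update(m["cloud_words"])
    report = {}
    for user, counts in topic_counts.items():
        report[user] = [
            {"topic": t, "count": n,
             "words": [w for w, _ in word_counts[(user, t)].most_common(top_n)]}
            for t, n in counts.most_common(top_n)]
    return report  # this is what gets serialized as JSON into Cassandra

mentions = [
    {"mentioned_user": "SproutSocial", "topic": "software",
     "cloud_words": ["awesome"]},
    {"mentioned_user": "SproutSocial", "topic": "software",
     "cloud_words": ["great"]},
]
print(json.dumps(rollup(mentions), indent=2))
```

The win of doing this on Hadoop is that the grouping and counting are distributed across the cluster instead of happening in memory on one server.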
This morning, the job processed 27,366,131 @mention tweets.
Serving Via Django
One of the technical challenges of the project was to present data in a way that is as relevant as possible to the end user. If a company promoted a hashtag in a commercial that aired at 9 PM EST, that hashtag would show up on the report for the following day, because 9 PM EST is already the following day in UTC.
So to allow that hashtag to show up on the day that makes sense, we actually run the aggregation Pig job for the UTC month plus twelve hours on either side. When a user requests a report, our API servers read all of the JSON into memory and filter out the data that was not actually part of the month for the user making the request.
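The serving-side filter amounts to computing the user's local month as a UTC window and keeping only tweets inside it. This is a minimal sketch with illustrative field names, not our actual Django code:

```python
# Sketch of the request-time filter: the aggregation covers the UTC
# month plus twelve hours of padding on each side, and the API keeps
# only tweets inside the requesting user's local calendar month.
# Field names are illustrative.
from datetime import datetime, timedelta, timezone

def local_month_window(year, month, utc_offset_hours):
    """UTC start/end of the given calendar month in the user's timezone."""
    tz = timezone(timedelta(hours=utc_offset_hours))
    start = datetime(year, month, 1, tzinfo=tz)
    end = (datetime(year + 1, 1, 1, tzinfo=tz) if month == 12
           else datetime(year, month + 1, 1, tzinfo=tz))
    return start, end

def filter_tweets(tweets, year, month, utc_offset_hours):
    start, end = local_month_window(year, month, utc_offset_hours)
    return [t for t in tweets if start <= t["created_at"] < end]

# A 9 PM EST (UTC-5) tweet on Jan 31 lands on Feb 1 in UTC, but it
# still belongs to January for an EST user.
tweet = {"created_at": datetime(2014, 2, 1, 2, tzinfo=timezone.utc)}
print(filter_tweets([tweet], 2014, 1, -5))  # the tweet is kept
```

The twelve hours of padding on the aggregation side guarantees that every tweet a user's local month can claim is already present in the stored JSON before this filter runs.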