Overview

In this document, we perform sentiment analysis on tweets about Mylan. The tweets were pulled on April 24, 2015, in the midst of a series of bids and hostile takeover attempts among Mylan, Teva, and Perrigo.

We’re going to need several packages to interact with the Twitter REST APIs, perform the sentiment analysis, and visualize the tweets.

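library(sentiment)     # classify_polarity()
library(twitteR)       # setup_twitter_oauth(), searchTwitter()
library(tm)            # Corpus(), removePunctuation(), removeWords(), stopwords()
library(ggplot2)
library(wordcloud)     # wordcloud(), comparison.cloud()
library(ggmap)
library(RColorBrewer)  # brewer.pal()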

Pulling the Tweets

Authorizing the Request

To get the Twitter data, we need to have a developer account set up with Twitter. The twitteR package accesses the API using the credentials provided by Twitter.

consumer.key <- '**********'
consumer.secret <- '**********'
access.token <- '**********'
access.secret <- '**********'
setup_twitter_oauth(consumer.key, consumer.secret,
                    access.token, access.secret)
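Since the credentials are secrets, one option is to keep them out of the script and read them from environment variables. The following is a minimal sketch of that idea; the helper `read_credential` and the variable names in the usage comments are hypothetical, not part of twitteR.

```r
# Hypothetical helper: fetch a credential from an environment variable
# and fail loudly if it has not been set.
read_credential <- function(name) {
  value <- Sys.getenv(name, unset = NA)
  if (is.na(value) || !nzchar(value)) {
    stop("Missing environment variable: ", name)
  }
  value
}

# Usage (the variable names are assumptions; pick any names you like):
# consumer.key    <- read_credential("TWITTER_CONSUMER_KEY")
# consumer.secret <- read_credential("TWITTER_CONSUMER_SECRET")
```

The variables can be set in the shell or in a `.Renviron` file, so the keys never appear in the source.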

Searching

Once we authorize our account, we can search recent and popular tweets using a search string, language, location, and several other parameters.

We’ll search recent English-language tweets for the hashtag #mylan and transform the results into a data frame for easier manipulation.

mylan.tweets <- searchTwitter(searchString = "#mylan", n = 1500,
                                lang = "en")
mylan.df <- twListToDF(mylan.tweets)  # one row per tweet, one column per field

From here, we need to clean and process our data for sentiment analysis.

mylan.text <- unlist(mylan.df$text)

Cleaning the Data

First, we clean the data by converting the encoding to UTF-8 and removing any non-text characters (links, special characters, etc.).

mylan.clean <- iconv(mylan.text, to = "UTF-8", sub = "byte")
# Drop the <xx> byte escapes that iconv() substitutes for unconvertible bytes
mylan.clean <- gsub("<.{2}>", "", mylan.clean)
# Strip possessive 's
mylan.clean <- gsub("'s\\b", "", mylan.clean)
# Remove links; \\S+ stops at the next whitespace, so the rest of the tweet survives
mylan.clean <- gsub("\\bhttps?://\\S+", "", mylan.clean)
mylan.clean <- tolower(mylan.clean)
# Strip possessives written with other punctuation after a company name;
# the names need alternation inside a group, not a bracketed character class
mylan.clean <- gsub("(mylan|teva|perrigo)[[:punct:]]'*s\\b", "\\1", mylan.clean)
# Replace newlines with spaces, then remove the remaining punctuation
mylan.clean <- gsub("\\n", " ", mylan.clean)
mylan.clean <- removePunctuation(mylan.clean)
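To see what these substitutions do to a single tweet, here is a small base-R sketch. The sample tweet and the helper `clean_tweet` are made up for illustration, and the punctuation step uses a plain `gsub` in place of tm's `removePunctuation`.

```r
# Hypothetical sample tweet, not taken from the original data set.
tweet <- "Teva's bid for #Mylan:\nread more at https://example.com/story now"

clean_tweet <- function(x) {
  x <- gsub("https?://\\S+", "", x)      # drop links up to the next whitespace
  x <- tolower(x)
  # Strip possessives after a company name (alternation, not a character class)
  x <- gsub("(mylan|teva|perrigo)[[:punct:]]'*s\\b", "\\1", x)
  x <- gsub("[\r\n]+", " ", x)           # newlines to spaces
  x <- gsub("[[:punct:]]", "", x)        # remove remaining punctuation
  x <- gsub("\\s+", " ", x)              # collapse repeated whitespace
  trimws(x)
}

clean_tweet(tweet)
# "teva bid for mylan read more at now"
```

Note that the hashtag symbol disappears with the rest of the punctuation, so "#mylan" and "mylan" count as the same word afterwards.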

Finally, since we are interested in the sentiment, we remove function words (prepositions, conjunctions, etc.) so we can focus on the content words.

mylan.clean <- sapply(mylan.clean, function(i) removeWords(i, stopwords("SMART")),
                    USE.NAMES = FALSE)
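Stop-word removal amounts to dropping every token that appears in a fixed list. A rough base-R sketch of the idea follows; the short `stop.words` vector and the helper `remove_stop_words` are hypothetical stand-ins, and the SMART list used above is far longer.

```r
# Tiny hypothetical stop list for illustration only.
stop.words <- c("the", "for", "a", "of", "at", "is")

remove_stop_words <- function(x, stops) {
  tokens <- strsplit(x, "\\s+")[[1]]            # split on whitespace
  paste(tokens[!tokens %in% stops], collapse = " ")
}

remove_stop_words("teva raises the offer for mylan", stop.words)
# "teva raises offer mylan"
```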

Visualizing the Tweets

Now that we have a clean data set, we can create a corpus of the words and try to visualize the topics being discussed. For this, we turn to a word cloud. The most frequently used words are plotted in a field, and we color and scale them by how frequent they are.

mylan.corpus <- Corpus(VectorSource(mylan.clean))
# Custom palette; wordcloud() assigns colors from least to most frequent words
color.vec <- rev(c("#00B5E6", "#FFD92F", rep("#000000", 8), "#787980"))
wordcloud(mylan.corpus, scale = c(3, 1), min.freq = 4, colors = color.vec,
          max.words = 120)
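The sizes in a word cloud come from simple term counts. As a sketch of what is being tallied, the base-R snippet below counts words across a few invented documents and applies a minimum-frequency cutoff similar in spirit to `min.freq`; the documents and the threshold are assumptions for illustration.

```r
# Hypothetical cleaned tweets.
docs <- c("teva bid mylan", "mylan offer perrigo", "mylan bid rejected")

tokens    <- unlist(strsplit(docs, "\\s+"))
word.freq <- sort(table(tokens), decreasing = TRUE)

# Keep only words that appear at least min.freq times
min.freq <- 2
frequent <- word.freq[word.freq >= min.freq]
names(frequent)
# "mylan" "bid"
```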

The terms dominating the word cloud center on the bidding war: “Mylan,” “Teva,” “Perrigo,” “bid,” “offer,” etc. We also see words that describe these companies, like “generic/generics” and “drug/drugmaker.”

The next largest group of related terms refers to the Mylan Relay for Hope, which kicked off the same day we pulled the tweets.

Sentiment Analysis

Now that we’ve seen the common words in tweets containing “Mylan,” we can do some basic sentiment analysis. The results should be taken with a grain of salt when we use Twitter data, since each tweet is a rather small sample of words.

Adding to the difficulty is the nature of the news: tweets about the bidding wars are likely to have been written by financial reporters or analysts and to lack emotional content.

Additionally, the topics at hand may add extra difficulties for prediction. For example, a tweet reporting on a hostile takeover might get classified as expressing negative or strong emotions because of the word “hostile.”

Polarity

Our first step is to identify the polarity of the tweets. The sentiment package in R uses a naive Bayes classifier to predict the likelihood of a text being positive, negative, or neutral. A tweet is then assigned to whichever class has the highest likelihood.

mylan.polar <- classify_polarity(mylan.clean)[, 4]  # column 4 holds the best-fit label
mylan.polar[is.na(mylan.polar)] <- "unknown"
# Store polarity as a factor so its levels can be reused later
mylan.df <- data.frame(mylan.df,
                      Polarity = factor(mylan.polar))
ggplot(mylan.df, aes(x = Polarity)) + geom_bar(aes(fill = Polarity)) +
        scale_fill_brewer(palette="Dark2") +
        ggtitle("Polarity Counts in Mylan Tweets")
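To make the classification rule concrete, here is a toy sketch of naive-Bayes-style polarity scoring in base R. It is not the sentiment package's implementation: the `lexicon` counts and the `score_polarity` helper are invented, equal class priors are assumed, and each matched word is counted once.

```r
# Hypothetical word counts per class; a real lexicon is far larger.
lexicon <- data.frame(
  word     = c("good", "strong", "hostile", "reject"),
  positive = c(8, 6, 1, 1),
  negative = c(1, 2, 9, 7)
)

score_polarity <- function(text, lexicon) {
  tokens <- strsplit(tolower(text), "\\s+")[[1]]
  hits   <- lexicon[lexicon$word %in% tokens, ]
  # Sum the log-likelihoods of the matched words under each class;
  # words outside the lexicon are ignored.
  log.pos <- sum(log(hits$positive / sum(lexicon$positive)))
  log.neg <- sum(log(hits$negative / sum(lexicon$negative)))
  if (log.pos > log.neg) {
    "positive"
  } else if (log.pos < log.neg) {
    "negative"
  } else {
    "neutral"
  }
}

score_polarity("teva launches hostile bid", lexicon)  # "negative"
score_polarity("strong offer", lexicon)               # "positive"
score_polarity("mylan relay", lexicon)                # "neutral"
```

The first call shows the concern raised earlier: a neutral news report is scored as negative purely because it contains the word "hostile."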

Most of the tweets express a positive sentiment, and we see very few neutral cases. Given the nature of the data, we’ll want to take a closer look and see what words are associated with each class.

l.polar <- levels(factor(mylan.polar))  # distinct polarity classes
n.polar <- length(l.polar)
# Concatenate all tweets of each class into one document per class
by.polar <- sapply(l.polar, function(polar)
   paste(mylan.clean[mylan.polar == polar], collapse = " ")
)
by.polar.corpus <- Corpus(VectorSource(by.polar))
# Rows are terms, columns are polarity classes
polar.doc.mat <- as.matrix(TermDocumentMatrix(by.polar.corpus))
colnames(polar.doc.mat) <- l.polar

comparison.cloud(polar.doc.mat, colors = brewer.pal(3, "Dark2")[c(2,3,1)],
  random.order = FALSE, scale = c(3, 1), title.size = 2,
                 max.words = 150)
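The comparison cloud is driven by a term-by-class count matrix. The following base-R sketch builds such a matrix from a handful of invented tweets and labels, as an illustration of the structure only; the real matrix above comes from `TermDocumentMatrix`.

```r
# Hypothetical cleaned tweets and their polarity labels.
texts  <- c("strong offer", "hostile bid", "offer rejected", "strong bid")
labels <- c("positive", "negative", "negative", "positive")

# One concatenated document per class
by.class <- sapply(split(texts, labels), paste, collapse = " ")
tokens   <- strsplit(by.class, "\\s+")

# Cross-tabulate each term against the class it came from
term.class <- table(
  term  = unlist(tokens),
  class = rep(names(tokens), lengths(tokens))
)

term.class["strong", "positive"]  # 2
term.class["offer", "negative"]   # 1
```

Words with a large count in one column and a small count in the others are the ones a comparison cloud emphasizes for that class.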