U.S. study shows if worldwide tweets reflect votes, Jessica Sanchez won American Idol worldwide
“Beating the news using Social Media: the case study of American Idol” By: Fabio Ciulla, Delia Mocanu, Andrea Baronchelli, Bruno Goncalves, Nicola Perra (all from the Department of Physics, College of Computer and Information Sciences, Department of Health Sciences, Northeastern University, Boston MA 02115 USA) and Alessandro Vespignani (Department of Physics, College of Computer and Information Sciences, Department of Health Sciences, Northeastern University, Boston MA 02115 USA; Institute for Scientific Interchange Foundation, Turin 10133, Italy; Institute for Quantitative Social Sciences, Harvard University, Cambridge, MA, 02138). May 23, 2012.
(Blog admin’s note: Actually, this study is about more than the “tweet votes” for Jessica Sanchez. It posits that a systematic survey of tweets can be used as a source of data to gauge public opinion and therefore predict the outcome of electoral events, thereby “beating” news reporters and news analysts in making a prognosis, or even scooping journalists by forecasting how social events unfold or turn due to the outpouring of sentiments shown by massive Twitter activity… Do you see the implications? I think that big picture requires more study.)
(Excerpted by blog admin)
(from http://www.mobs-lab.org/uploads/6/7/8/7/6787877/american_idol_finale.pdf as storified & linked by InterAksyon.com, TV Channel 5 online news)
(Some excerpts by blog admin: “Jessica Sanchez related Tweets are 45% of the total if only U.S. is considered, while it rises to 64% if the whole World is considered. xxx Filipino-restricted Twitter activity concerning Jessica is strongly peaked in the two voting sessions of American Idol for the East and West timezones, and that numerous websites explicitly address the issue of “voting tunnels”: “How to Vote for Jessica Sanchez from the Philippines and Other Non-US Countries”. xxx Our fundamental, and somehow naive, assumption is that the number of votes each contestant receives is proportional to the number of tweets that mention her. xxx It is important to note that this is a very simple measure, and that we deliberately choose not to take into account many of the factors that in principle might affect the results, such as the presence of negative or neutral tweets, or attempts to directly affect the counts by spamming the system with automatically generated tweets. xxx If we consider the whole of our dataset, as we have done in the previous analysis, Jessica turns out to have been the most popular in Twitter in our time window. xxx Jessica is the only contestant that has a strong Twitter signal originating from outside of the U.S. (and in particular from the Philippines), with an increasing trend after the show on April 19. xxx However, the data show that the advantage of Phillip in the U.S. is remarkably smaller than the one of Jessica in the aggregated dataset xxx”)
Abstract: “We present a contribution to the debate on the predictability of social events using big data analytics. We focus on the elimination of contestants in the American Idol TV shows as an example of a well defined electoral phenomenon that each week draws millions of votes in the USA. We provide evidence that Twitter activity during the time span defined by the TV show airing and the voting period following it, correlates with the contestants ranking and allows the anticipation of the voting outcome. Twitter data from the show and the voting period of the season finale have been analyzed to attempt the winner prediction at 10.00 am of May the 23rd ahead of the airing of the official result. Furthermore, the fraction of Tweets that contain geolocation information allows us to map the fanbase of each contestant, both within the US and abroad, showing that strong regional polarizations occur. Although American Idol voting is just a minimal and simplified version of complex societal phenomena such as political elections, this work shows that the volume of information available in online systems permits the real time gathering of quantitative indicators anticipating the future unfolding of opinion formation events.”
Excerpts: “xxx (S)earch engine queries or posts on microblogging systems such as Twitter have been used to forecast epidemics spreading, stock market behavior and election outcomes [3–6] with varying degrees of success. However, as many authors have pointed out, there are several challenges one must face when dealing with data of this nature: intrinsic biases, uneven sampling across location of interest etc. [7–10]. In this paper we intend to assess the usefulness of open source data by analyzing in depth the microblogging activity surrounding the voting behavior on the contestants in American Idol, one of the most viewed American TV Shows.
“xxx The xxx time frame (xxx a few hours) and frequency (every week) over an extended period (an entire TV Season) provides a close to ideal test ground for the study of electoral outcomes xxx In particular, we assume that: 1) The demographics of users tweeting about American Idol are representative of the voting pool; 2) The self-selection bias, according to which the people discussing politics on Twitter are likely to be activists scarcely representative of the average voter, seems to become almost a positive discrimination factor in the case of a TV show where the voters are by definition self-selected; 3) Voting fans are the most motivated subset xxx, they are allowed to vote multiple times; 4) Users are not malicious, and engage only in conversations they have a particular interest in; 5) The influence of incumbency, which strongly affects the outcome of political elections, is not a factor determining the outcome of American Idol.
“For the above reasons we can consider TV show competitions as a case study for the use of open source indicators to achieve predictive power, or simply beating the news, about social phenomena. xxx
“I. RULES AND VOTING SYSTEM. “xxx Voting can take one of three forms: toll-free phone calls, texting and online voting. The rules of the competition only allow for votes cast by the residents of the U.S., Puerto Rico and U.S. Virgin Islands. There is no limit to the number of messages or calls each person can make, while the online votes are limited to 50 per computer as identified by its unique IP address. xxx”
“II. DATA. Our fundamental assumption is that the attention received by each contestant in Twitter is a proxy of the general preference of the audience. To validate this assumption, we collected tweets containing a list of 51 #tags, usernames and strings related to the show. The main dataset was obtained by extracting matching tweets from the raw Twitter feed used by Truthy for the entire duration of the current season of American Idol. The feed is a sample of about 10% of the entire number of tweets that provides a statistically significant, real-time view of the topics discussed within the Twitter ecosystem. This allowed us to make a post-event analysis of the last 9 eliminations. This dataset was further complemented by the results of automatically querying the Twitter search API every 10 minutes for tweets containing one or more of the keywords we identified as related to American Idol. The search API data cover the period since May 16, giving us a more detailed view of the last elimination before the season’s finale.
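(Blog admin’s illustration, not from the paper: the keyword matching described above can be sketched in a few lines of Python. The keyword list here is hypothetical, since the 51 #tags, usernames and strings the authors used are not given in this excerpt, and the function names are made up for this example. The paper’s actual extraction code may well differ.)

```python
# Hypothetical keywords for illustration only; the paper's real list of
# 51 hashtags, usernames and strings is not reproduced in this excerpt.
KEYWORDS = ["#americanidol", "#idol", "jessica sanchez", "phillip phillips"]


def matches_show(text, keywords=KEYWORDS):
    """Return True if the tweet text contains at least one keyword.

    Matching is a case-insensitive substring test, which is an assumption
    made for this sketch.
    """
    lowered = text.lower()
    return any(keyword.lower() in lowered for keyword in keywords)


def filter_tweets(texts, keywords=KEYWORDS):
    """Keep only the tweet texts that mention the show or a contestant."""
    return [text for text in texts if matches_show(text, keywords)]


# Made-up sample tweets.
sample_texts = [
    "Go Jessica Sanchez!",
    "Lunch time",
    "Watching #AmericanIdol tonight",
]
kept = filter_tweets(sample_texts)
print(kept)
```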
“III. A CARTOGRAPHY OF THE FANBASE. Tweets in our dataset often contain georeferenced location information that allows us to analyze the spatial patterns in voting behavior. Figure 1
shows a strong geographical polarization in the U.S. towards different candidates. In the weeks preceding the Top 3 show [panels (B) and (C)], for example, Phillip Phillips gathers most of the attention in the Midwest and South, while Jessica Sanchez appears to be popular particularly on the West Coast as well as in the large metropolitan areas across all of the country, and Joshua Ledet is strong in Louisiana. The Top 3 week analysis [panel (A)] shows a disturbance from the previous geographical distribution, perhaps due to the performance of the candidates. As expected, the audience reacts to the events occurring on Wednesday night. On the other hand, and perhaps not surprisingly, the attention basins of each of the three participants always include their origin city (Phillips was born and raised in Georgia, Sanchez is from Chula Vista, California, and Ledet from the Lake Charles metropolitan area in Louisiana). The geolocalized data also allows for a unique view of the attention devoted to American Idol in the rest of the world. Although one might naively expect interest to be limited to the US, Figure 2
shows that the show is also popular in several foreign countries and particularly in the Philippines. This can be understood by noting that one of the contestants is of Filipino origin. Jessica Sanchez’s mother is originally Filipino, having been born in the Bataan province. Participation in American Idol has made Sanchez so popular in her mother’s native country that on May 16 the Philippine President Benigno Aquino III congratulated the singer for her performance and stated, “Hopefully she really reaches the top.” Table I quantifies this intuition.
Contestant    U.S.A. (%)    World (%)       Philippines (%)
Jessica       45 +/- 4      64.2 +/- 2.2    92.8 +/- 1.9
Joshua        15 +/- 3      9.8 +/- 1.3     1.4 +/- 0.9
Phillip       40 +/- 4      26 +/- 2.0      5.8 +/- 1.7
“TABLE I: Popularity basins. Data concerns the entire American Idol season up to the morning of May 17 (before the two finalists were announced), and refers to the percentage (%) of popularity within U.S., the whole World and the Philippines. The geo-localized database for the three candidates contains 3251 data points. Errors represent the normal confidence interval with a confidence level of 99%.
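(Blog admin’s illustration, not from the paper: the caption says the errors are normal confidence intervals at the 99% level. Assuming this means the usual normal approximation for a proportion, z * sqrt(p(1 - p) / n), the half-width can be sketched as below. Using all 3251 geo-localized data points as the sample size for the “World” column is my assumption; the excerpt does not give the sample size for each region.)

```python
from math import sqrt
from statistics import NormalDist


def normal_ci_halfwidth(share, n, confidence=0.99):
    """Half-width of the normal-approximation interval for a proportion.

    share: observed fraction (0..1); n: number of data points.
    """
    # Two-sided critical value, about 2.576 for 99% confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return z * sqrt(share * (1 - share) / n)


# Jessica's worldwide share from Table I; n = 3251 is an assumed sample size.
half = normal_ci_halfwidth(0.642, 3251)
print(f"64.2% +/- {100 * half:.1f}%")
```

(Under these assumptions the result is close to the +/- 2.2 reported for Jessica in the World column.)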
xxx “Jessica Sanchez related Tweets are 45% of the total if only U.S. is considered, while it rises to 64% if the whole World is considered. Officially, Sanchez’s popularity abroad should not have any impact on voting, since, as mentioned above, only the U.S. based audience is allowed to take part in the election procedure. However, it is interesting to note that the Filipino-restricted Twitter activity concerning Jessica is strongly peaked in the two voting sessions of American Idol for the East and West timezones, and that numerous websites explicitly address the issue of “voting tunnels”: “How to Vote for Jessica Sanchez from the Philippines and Other Non-US Countries”. Although we have no proof of any irregular voting activity, tweet analysis clearly points to a possible anomaly that may be a concern.
“IV. POST-EVENT ANALYSIS. Our fundamental, and somewhat naive, assumption is that the number of votes each contestant receives is proportional to the number of tweets that mention her. In other words, the larger the number of tweets referring to a contestant (the Twitter volume), the larger the number of votes she will get. This gives a natural measure to rank each contestant. It is important to note that this is a very simple measure, and that we deliberately choose not to take into account many of the factors that in principle might affect the results, such as the presence of negative or neutral tweets, or attempts to directly affect the counts by spamming the system with automatically generated tweets. In fact, one of the goals of this paper is to test whether or not a minimal set of measures applied to Twitter data can be good indicators of the actual voting outcome. Past attempts have met with ambivalent results and we are interested in testing the limits of this naive approach by building an unsophisticated prediction system assembled in less than one week. xxx In order to minimize the noise that might be introduced by discussions after the voting time and especially after the elimination, we considered the number of tweets generated in a specific time window: 8:00 PM to 3:00 AM EST each Wednesday. The show airs at 8 PM EST. Votes can be submitted until midnight on the West Coast, which translates to 3:00 AM in the East. xxx For each of the last 9 weeks, we have integrated the number of tweets related to each user in the show+voting time window. We then ranked the contestants in decreasing order. The last 3 count as the bottom three and the last contestant is the most likely to be eliminated. We compare our predictions with the real outcomes. xxx Twitter data serves as a correct indicator for the last three eliminations and identifies correctly most of the bottom three/two contestants.
The Twitter signal was wrong twice, and there are four other cases in which the confidence intervals in the ranking did not allow a prediction (too close to call). In Figure 4
it is possible to notice that, as expected, when the number of contestants reduces and the fan base solidifies, the differences between ranks become much clearer and separated.
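(Blog admin’s illustration, not from the paper: the ranking step described in this section, counting tweets per contestant inside the show+voting window and sorting in decreasing order, can be sketched as follows. The function name, the placeholder contestants A, B and C, and the sample timestamps are all made up. The sketch also leaves out the confidence intervals the authors use to flag rankings that are too close to call.)

```python
from collections import Counter
from datetime import datetime


def rank_by_volume(tweets, start, end):
    """Rank contestants by number of tweets with start <= timestamp < end.

    tweets: iterable of (timestamp, contestant) pairs.
    Returns a list of (contestant, count) in decreasing order of count.
    Contestants with no tweets inside the window do not appear at all.
    """
    counts = Counter(name for ts, name in tweets if start <= ts < end)
    return counts.most_common()


# An illustrative 8:00 PM to 3:00 AM window (timestamps assumed to be EST).
start = datetime(2012, 5, 9, 20, 0)
end = datetime(2012, 5, 10, 3, 0)

# Made-up tweets mentioning placeholder contestants.
sample = [
    (datetime(2012, 5, 9, 20, 15), "A"),
    (datetime(2012, 5, 9, 21, 0), "A"),
    (datetime(2012, 5, 9, 22, 30), "A"),
    (datetime(2012, 5, 9, 23, 0), "B"),
    (datetime(2012, 5, 10, 1, 0), "B"),
    (datetime(2012, 5, 10, 2, 59), "C"),
    (datetime(2012, 5, 10, 9, 0), "C"),  # after the window, so not counted
]

ranking = rank_by_volume(sample, start, end)
# Under the paper's assumption, the last-ranked contestant is the one
# most likely to be eliminated.
predicted_elimination = ranking[-1][0]
print(ranking, predicted_elimination)
```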
“V. AND THE WINNER IS…The analysis of the season finale is based on the data collected between the beginning of the show in the East at 8:00 P.M. EST and the end of the voting period in the west, at 4:00 A.M. EST. The histogram of Figure 5
has a twofold interpretation. If we consider the whole of our dataset, as we have done in the previous analysis, Jessica turns out to have been the most popular in Twitter in our time window. Hence, the analysis used for the elimination shows leads us to predict that Jessica will be the winner of the show.
“However, there is an important caveat. As we pointed out before, Jessica is the only contestant that has a strong Twitter signal originating from outside of the U.S. (and in particular from the Philippines), with an increasing trend after the show on April 19. Given that the voting is restricted to the U.S. only, it is helpful to have a closer look at the data, and consider the subset of Tweets that come with geographical metadata. Although the geolocalized data are a much smaller subset of the total signal, this dataset allows us to provide the contestants’ standing restricted to the USA Twitter population. In the US, Phillip appears to have the largest fanbase of the two contestants (see also the cartogram of Figure 6).
If the possibility of votes coming from abroad is discarded, using the available data, we could then claim that Phillip is going to be the winner of the 11th edition of American Idol. However, the data show that the advantage of Phillip in the U.S. is remarkably smaller than the one of Jessica in the aggregated dataset, and the voting coming from abroad might have a crucial role in determining the outcome of the finale.
“VI. CONCLUSION. We have shown that the open source data available on the web can be used to make educated guesses on the outcome of societal events. Specifically, we have shown that extremely simple measures quantifying the popularity of the American Idol participants on Twitter strongly correlate with their performances in terms of votes. A post-event analysis shows that the competitors with the fewest votes can be identified with reasonable accuracy xxx looking at the Twitter data collected during the airing of the show and in the immediately following hours. It is worth noting that our analysis aims to be extremely simple in order to establish a valid baseline on what it is possible to deduce from social media. As such, we purposefully do not consider a number of refinements and techniques that could improve the accuracy of our predictions. Distortions due to overactive users can be controlled by evaluating the number of unique users tweeting on each contestant. The text of the tweets could be scrutinized by using sentiment analysis techniques to select and compare only specific positive or negative tweets as a proxy for success/failure. Corrections to the demographic representations of Twitter users could be considered. All these techniques have been or are being developed in the analysis of a wealth of social phenomena and could be tested in a very clear and simple setting such as those of American Idol or similar shows. Furthermore, we have illustrated that open source data can provide a deeper insight into the composition of the audience, with the eventual possibility of pointing out possible sources of anomalous behaviors. A geographical projection of the data reveals a non-uniform distribution of the basins of fans, and likely of voters, for the different participants. Interestingly, the same inspection also highlights that strong activity concerning some of the candidates may come from non-U.S. countries, whose audiences are officially forbidden to vote.
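(Blog admin’s illustration, not from the paper: one refinement mentioned above is to count unique users per contestant instead of raw tweets, so that one overactive account cannot inflate the totals. The authors say they purposefully did not apply it. A minimal sketch, with made-up user IDs and placeholder contestants, could look like this.)

```python
from collections import defaultdict


def unique_user_counts(tweets):
    """Count distinct users tweeting about each contestant.

    tweets: iterable of (user_id, contestant) pairs.
    Returns a dict mapping contestant -> number of distinct user IDs.
    """
    users = defaultdict(set)
    for user_id, contestant in tweets:
        users[contestant].add(user_id)
    return {contestant: len(ids) for contestant, ids in users.items()}


# Made-up data: user "u1" tweets about A four times, which would count as
# four in a raw tweet tally but only once here.
sample_tweets = [
    ("u1", "A"), ("u1", "A"), ("u1", "A"), ("u1", "A"),
    ("u2", "B"), ("u3", "B"),
]
by_users = unique_user_counts(sample_tweets)
print(by_users)
```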
Finally, our work casts a word of warning on the possible feedback between competitive TV shows and social media. Indeed, while the former rely more and more on the online voting of the audience, and the votes are kept secret and revealed only at the end of the show, all of the data necessary to monitor and even forecast the outcome of these shows is publicly available on the web. Given the large economic interests that lie behind such programs, such as the revenues of betting agencies and the major contracts of the show participants, it is obvious that this situation can lead to a number of undesirable outcomes. For example, the audience could be induced to alter their behavior as a function of the situation they observe, and the job of betting agencies could be dramatically simplified. On a more general basis, our results highlight that the aggregate preferences and behaviors of large numbers of people can nowadays be observed in real time, or even forecast, through open source data freely available on the web. The task of keeping them private, even for a short time, has therefore become extremely hard (if not impossible), and this trend is likely to become more and more evident in future years.”
(Citations removed by blog admin due to space constraints.)