Fluke or Flu: Making Sense of Big Social Data

The promise of Big Data is that we can capitalize on the ever-increasing amount of data. Perhaps the biggest pot of gold is hidden within the social data universe. The use of the web, and in particular of Google, Facebook and Twitter, has created a massive and still growing pool of user-generated content. This content could help create a better understanding of individual and consumer behavior. Every click and swipe on the web, every tweet launched, every comment made and every picture posted reveals more about an individual. This is exactly why integrating social data into a Big Data environment is important to organizations: it will help them understand what makes clients and prospects tick.

But analyzing social data is easier said than done. It is still largely uncharted territory. The analytic tools that should help us understand social data are still under development, and research into social data remains a trial-and-error process for which a manual is missing. Although capabilities to analyze social data are progressing quickly, it is wise to keep a critical eye on the outcomes of social data analysis: the field is still in its infancy and can easily lead to wrong conclusions.

The most famous fluke so far this year was the overinflated flu prediction of Google Flu Trends in January. Google Flu Trends mines Google search data for flu-related search terms. The basic assumption is that when people are not feeling well and think they have the flu, they will search for confirmation of their symptoms. Most people will do this even before consulting a physician. As a result, Google Flu Trends has a time advantage over the official flu monitors, which are based on networks of physicians reporting real cases of influenza. Over a period of four years it proved to be a reliable and accurate forecaster of flu outbreaks. But Google's flu predictions earlier this year were way off: its estimates were almost twice as high as the official numbers from the Centers for Disease Control and Prevention. Although Google will surely refine its algorithms to avoid future glitches and become even more accurate and reliable in predicting influenza outbreaks, flu-related searches on the web are not the same thing as diagnosed flu cases. They are and will remain a proxy at most, albeit a powerful one. Its weakness is that the data used is not context aware. It does not distinguish between people who are not feeling well, people whose parents are not feeling well, and people who are interested in influenza because they read articles on flu pandemics.
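To make the context problem concrete, here is a minimal Python sketch of keyword-based query counting. It is not Google's actual method: the term list, the function name `count_flu_queries` and the sample queries are all made up for illustration. It only shows that a counter that looks at words alone treats very different searchers as the same signal.

```python
# Hypothetical flu-related terms; NOT the term list Google actually uses.
FLU_TERMS = ("flu", "influenza", "fever")


def count_flu_queries(queries):
    """Count queries that mention any flu-related term, ignoring context."""
    count = 0
    for query in queries:
        words = query.lower().split()
        if any(term in words for term in FLU_TERMS):
            count += 1
    return count


# Made-up sample queries, one per kind of searcher described above.
queries = [
    "flu symptoms sore throat",          # the searcher feels ill
    "my mother has the flu what to do",  # a relative feels ill
    "news flu pandemic prediction",      # a reader following the news
    "cheap flights to lisbon",           # unrelated
]

flu_query_count = count_flu_queries(queries)
```

In this toy sample only the first query comes from someone who may actually be ill, yet the second and third are counted as well. A surge in news coverage would raise the count without a single extra flu case.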

But it is not only Google that is trying to capitalize on web content. Millions of tweets are sent every day, containing potentially valuable information. As such, Twitter represents an uncharted but high-potential territory that calls for new types of research. Sentiment analysis is a method that is quickly gaining popularity: it measures the sentiment a person expresses, which in turn can help explain a lot about people's behavior. An interesting outcome of one of these tweet studies is that Twitter users' expressed happiness increases logarithmically with distance from an individual's average location. This is the conclusion of a University of Vermont study of 37 million geolocated tweets from 180,000 Twitter users in the US. The tweets were analyzed using sentiment analysis in which changes in word usage were characterized as a function of movement. Although the methodology of the study and the amount of data give its outcome enough credibility, it remains difficult to see what the data really tells us, whether the conclusion is correct and, maybe even more important, why people become happier the further they are from home.
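What "increases logarithmically" means can be sketched in a few lines of Python. The snippet below fits the model happiness = a + b * ln(distance) with ordinary least squares. The function name `fit_log_trend` and all numbers are invented for illustration; they are not the study's data, results or fitting procedure.

```python
import math


def fit_log_trend(distances_km, happiness):
    """Least-squares fit of happiness = intercept + slope * ln(distance)."""
    xs = [math.log(d) for d in distances_km]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(happiness) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, happiness))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Synthetic example data generated from a known curve (NOT from the study):
# a baseline score of 6.0 plus 0.02 per unit of ln(distance in km).
distances_km = [1, 10, 100, 1000]
happiness = [6.0 + 0.02 * math.log(d) for d in distances_km]

intercept, slope = fit_log_trend(distances_km, happiness)
```

Under a logarithmic trend, each tenfold increase in distance adds the same small amount of happiness, so going from 1 km to 10 km counts as much as going from 100 km to 1,000 km. The fit describes a pattern; it says nothing about why the pattern exists.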

First there is the translation from the words used in tweets to happiness. Using the hedonometer to measure people's happiness is increasingly accepted. But it remains a bit tricky, since words can have different meanings to different individuals. It certainly leaves enough room for interpretation.
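A word-based happiness score of this kind can be sketched as follows. This is a simplified illustration, not the hedonometer itself: the lexicon `WORD_HAPPINESS`, its scores on an assumed 1-to-9 scale, and the function `tweet_happiness` are all hypothetical.

```python
# Hypothetical word scores on an assumed 1 (sad) to 9 (happy) scale.
# These values are invented for illustration, not taken from a real lexicon.
WORD_HAPPINESS = {
    "love": 8.0,
    "holiday": 7.0,
    "happy": 8.0,
    "sad": 2.0,
    "alone": 3.0,
    "sick": 2.0,  # scored as "ill", although some people use it as praise
}


def tweet_happiness(text):
    """Average the scores of the scored words in a tweet.

    Returns None when the tweet contains no word from the lexicon.
    """
    words = [w.strip(".,!?#@").lower() for w in text.split()]
    scores = [WORD_HAPPINESS[w] for w in words if w in WORD_HAPPINESS]
    if not scores:
        return None
    return sum(scores) / len(scores)


holiday_score = tweet_happiness("Love this holiday!")
concert_score = tweet_happiness("That concert was sick")
```

With these made-up scores the holiday tweet averages 7.5, while the concert tweet scores 2.0 even though its author probably meant it as praise. That is the room for interpretation in a nutshell: the score depends on one fixed meaning per word.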

There is also the question of whether a tweet really reflects the state of mind of an individual. Maybe people just tend to sob alone in silence but share their happiness out loud; nobody wants to look or sound like a loser.

But even if we can translate tweets into happiness and gauge the correct state of mind of an individual, the question remains why people are happier the further they are from their 'average location'. We could make an educated guess that people who go on holiday send positive tweets, but we can't be sure; the analysis doesn't give us any clue.

The explosion of social data is a blessing for organizations that seek to analyze individual and consumer behavior. But the tools to analyze this type of data are basic and still under development. The key issue with analyzing social data, and with sentiment analysis in particular, is the absence of user context. Without this context it is extremely difficult to draw far-reaching conclusions and take appropriate action. Until methods, tools and our understanding of social media analysis improve, we should not get carried away by overhyped results. There is definitely value in social data; we just haven't found the right key yet.

About Marcel Warmerdam