I started this project under the working title “hate on twitter”. The examples are easy to find: German politicians who have to manually issue takedowns of false statements (example: Renate Künast), school students who receive offensive messages on Instagram because they have the “wrong” religion whenever a conflict escalates somewhere in the world, and people of public interest who become victims of cancel culture.
Examples of this unmoderated toxicity within social media sparked my disgust, and I started wondering whether I could make this negativity visible with numbers. Thus the idea of the Twitter monitor came about.
In this project I want to take Twitter's most popular hashtags, apply sentiment analysis to them and visualise the results to get an idea of how toxic the conversation is.
The Dashboard
The following dashboard displays the sentiments and the number of tweets per hashtag. All measures are aggregated per hour. Hint: Change the view from “Mobile” to “Desktop View” with the second icon from the right at the bottom of the dashboard.
The upper plot shows the overall hourly tweet activity over the past two weeks. The middle plot shows the average hourly polarity from the sentiment analysis of the received tweets. Two graphs are displayed, since a German and an English corpus were used.
The heart of the dashboard is the lower half. Again, the two plots represent the average polarity from the German and the English corpus. But this time each dot represents a hashtag, and its size represents the number of tweets that were obtained from the Twitter stream API. Mousing over the dots reveals some interesting additional information, such as the name of the hashtag or the standard deviation that belongs to the average.
All time periods can be adjusted using the boxes at the top of the plots.
Interlude: Messi
Check out the data from 5 August 2021 at 22:00: a gigantic spike of tweets containing the word ‘Messi’. This football player changed the club he plays for, and this obviously had quite an impact on Germans on Twitter =D
The Results
Surprisingly, for me at least, the overall polarity of the tweets seems rather positive. Especially with the English corpus, a clear trend towards positivity can be seen as of the 22nd of June 2021. Negative hashtags are mostly outliers with only a few tweets, but from time to time truly negative topics do appear, like ‘#ITASWI’ on the 17th of June 2021 at 3 o’clock.
It always feels bad to be proven wrong, but this time, it seems, I fell victim to the effect of vocal minorities. Negativity and toxicity in social media do exist and are as disgusting as ever, but they do not seem to be the norm.
Averages do not tell the whole story
One critical type of information is not displayed properly in the above dashboard: the standard deviation. I was not able to implement this variable in a meaningful way. Yes, one can mouse over a dot and get the stddev information for each data point, but this is a far cry from an understandable visual representation. There are still some insights to be gained from this data.
The dashboard below plots the average polarity against its standard deviation per hashtag. The size again corresponds to the number of tweets received. Hint: Change the view from “Mobile” to “Desktop View” with the second icon from the right at the bottom of the dashboard.
First let us take a look at the plot with the German polarity. It clearly shows that most data points center around a polarity of zero. This was to be expected, since zero is the default value in case no polarity can be calculated. A slight shift into the positive can be seen in the DE polarity, while in the EN polarity this trend is even more prominent, with fewer scattered data points far away from this visible correlation. This leads to the following interpretations:
- The underlying sentiment of tweets for popular German hashtags has a bias towards the positive, and this bias is bigger in the English polarity than in the German one. Hence English tweets seem to be more positive in nature than German tweets.
- German tweets generally have a higher standard deviation in their polarity, which in turn means that German tweets are more volatile and tend to take more extreme positions than English tweets.
Whether these statements are the result of differences in quality between the German and English textblob corpora or whether they say something about the German Twitter culture is up to you to decide.
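To illustrate how much the zero default alone can shape such a plot, here is a minimal sketch with made-up numbers (not data from the dashboard). It assumes one hashtag with three tweets that could be scored and three tweets that fell back to a polarity of zero; the function name `summarize` is hypothetical.

```python
from statistics import mean, pstdev


def summarize(polarities):
    """Return (average, population standard deviation) of polarity scores."""
    return mean(polarities), pstdev(polarities)


# Made-up scores for one hashtag: three tweets the analyzer could score ...
scored = [0.5, 0.3, 0.4]
# ... plus three tweets that received the default polarity of zero.
with_defaults = scored + [0.0, 0.0, 0.0]

mean_scored, std_scored = summarize(scored)
mean_all, std_all = summarize(with_defaults)

print(f"scored only:   mean={mean_scored:.2f} stddev={std_scored:.2f}")
print(f"with defaults: mean={mean_all:.2f} stddev={std_all:.2f}")
```

With these numbers the average drops from 0.4 to 0.2, while the standard deviation rises from about 0.08 to about 0.21. A corpus that fails to score more tweets could therefore pull averages towards zero and widen the spread at the same time, which is one possible reading of the difference between the two plots, not a proven one.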
How this all works
Hashtags are crawled from here at the start of every hour and used to obtain tweets from the Twitter stream API. In this step the sentiment analysis is performed using the textblob package, and the tweets are sent to an Azure SQL DB. In this SQL DB a procedure runs hourly (triggered by Azure Data Factory) and aggregates the results into hourly time slices. Afterwards these aggregated results are sent to Google Sheets and then used as input for the above Tableau Public dashboard. This way the data in the dashboard is updated whenever the Google Sheet is updated.
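The scoring and aggregation steps can be sketched roughly as follows. This is not the project's actual code: an in-memory SQLite database stands in for the Azure SQL DB, a toy word list stands in for textblob, and the table name, column names and function names are all assumptions made for illustration.

```python
import math
import sqlite3


def toy_polarity(text):
    """Hypothetical stand-in for a sentiment analyzer; returns a score in [-1, 1]."""
    positive = {"great", "love", "good"}
    negative = {"awful", "hate", "bad"}
    words = text.lower().split()
    hits = [1.0 for w in words if w in positive] + [-1.0 for w in words if w in negative]
    # Zero is the default when no word can be scored.
    return sum(hits) / len(hits) if hits else 0.0


def load_tweets(conn, tweets, polarity_fn=toy_polarity):
    """Score each (timestamp, hashtag, text) tuple and store the result."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS tweets (created_at TEXT, hashtag TEXT, polarity REAL)"
    )
    conn.executemany(
        "INSERT INTO tweets VALUES (?, ?, ?)",
        [(ts, tag, polarity_fn(text)) for ts, tag, text in tweets],
    )


# SQLite has no built-in standard deviation, so the population variance
# is computed as E[x^2] - E[x]^2 and the square root is taken in Python.
HOURLY_SQL = """
SELECT strftime('%Y-%m-%d %H:00', created_at) AS hour,
       hashtag,
       COUNT(*) AS tweet_count,
       AVG(polarity) AS avg_polarity,
       AVG(polarity * polarity) - AVG(polarity) * AVG(polarity) AS var_polarity
FROM tweets
GROUP BY hour, hashtag
ORDER BY hour, hashtag
"""


def aggregate_hourly(conn):
    """Return (hour, hashtag, tweet_count, avg_polarity, stddev) per time slice."""
    rows = conn.execute(HOURLY_SQL).fetchall()
    return [
        (hour, tag, n, avg, math.sqrt(max(var, 0.0)))  # clamp tiny negative rounding errors
        for hour, tag, n, avg, var in rows
    ]


conn = sqlite3.connect(":memory:")
sample = [  # made-up tweets, for illustration only
    ("2021-08-05 22:05:00", "#example", "love this great transfer"),
    ("2021-08-05 22:40:00", "#example", "awful news"),
    ("2021-08-05 23:10:00", "#example", "just reading about it"),
]
load_tweets(conn, sample)
hourly = aggregate_hourly(conn)
for row in hourly:
    print(row)
```

Passing the scorer in as `polarity_fn` keeps the sketch runnable without extra packages; in a real setup a function wrapping textblob would take the place of `toy_polarity`.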
All the code can be found here.
Afterwards I packaged everything into a Docker container, automated all remaining processes using cron jobs and uploaded the result as a private image to Docker Hub. With Kamatera I found a low-cost way of hosting this image, and my solution is now hosted and running.