Portfolio

Sonntagsfrage

The “Sonntagsfrage” is a German survey in which people answer one crucial question:

If you had to elect the German government this Sunday, whom would you choose?

https://www.infratest-dimap.de/umfragen-analysen/bundesweit/sonntagsfrage/
(translated freely)

It is conducted by the infratest dimap institute, and generally around 1,000–1,500 people take part. The results are often used as a barometer for the political mood in the country and are regularly presented in one of Germany's main news programmes, the “Tagesschau”.

In this post the historical results of the Sonntagsfrage are used as input for a machine learning algorithm that produces a forecast for the upcoming Sunday. The result is an AI that predicts the answers to the next Sonntagsfrage and thus gives an indication of the current political climate in Germany.

The results for next Sunday can be seen in the following dashboard. Predictions can be viewed per model, or the average over all models can be displayed.

Check out my Medium Posts, where I explain how I used Google and Azure Cloud services to set this whole thing up.

Solution architecture

For the realisation of this project, as well as for automation and data consumption, a combination of services from the Azure cloud, Google Sheets and Tableau Public was used. The architecture can be seen in the image below.

Historical data is pulled from
https://www.wahlrecht.de/umfragen/dimap.htm
via a web crawler written in Python. For automation purposes the Microsoft cloud service Azure was chosen: crawling and cleaning are each realised through Azure Functions, while orchestration is performed by Azure Durable Functions. The Azure SQL Database serves as the single point of truth for storage.
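To illustrate the crawling-and-cleaning step, here is a minimal sketch in plain Python. It is not the actual Azure Function code: the table snippet and the helper names are made up, and the real crawler fetches the live page instead of a hard-coded string.

```python
from html.parser import HTMLParser

class TableExtractor(HTMLParser):
    """Collect the text of every <td>/<th> cell, row by row."""
    def __init__(self):
        super().__init__()
        self.rows, self._row, self._in_cell = [], [], False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th"):
            self._in_cell = True
            self._cell = ""

    def handle_endtag(self, tag):
        if tag == "tr" and self._row:
            self.rows.append(self._row)
        elif tag in ("td", "th"):
            self._in_cell = False
            self._row.append(self._cell.strip())

    def handle_data(self, data):
        if self._in_cell:
            self._cell += data

def clean_percentage(cell: str) -> float:
    """Turn a cell like '24,5 %' into a float (German decimal comma)."""
    return float(cell.replace("%", "").replace(",", ".").strip())

# Example with an inline snippet shaped like the survey table:
html = ("<table><tr><th>CDU/CSU</th><th>SPD</th></tr>"
        "<tr><td>24,5 %</td><td>18 %</td></tr></table>")
parser = TableExtractor()
parser.feed(html)
header = parser.rows[0]
values = [clean_percentage(c) for c in parser.rows[1]]
```

In the real pipeline, one Azure Function does the fetching and one does the cleaning, with the Durable Function orchestrating the two.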

For more information on this topic, check out my Medium Articles about Azure Functions and Durable Functions. The corresponding code can be found in my GitHub account.

Model training and prediction are performed with Python through the use of the Azure Machine Learning service. Pipelines and compute clusters create an automated and reproducible data science product.

Data consumption is performed via dashboards from the Tableau Public service, using Google Sheets as the technical backbone.

Model evaluation

The models I use for machine learning are as follows:

  • DecisionTreeRegressor (sklearn)
  • SGDRegressor (sklearn)
  • GradientBoostingRegressor (sklearn)
  • XGBoost Regressor

As input parameters, only temporal features are used right now. The cyclical passage of the years is encoded into radial coordinates, and additional features like “number of days since the last survey” are used.
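The temporal encoding can be sketched as follows. This is an illustrative snippet, not the project code; the function names are my own, and the gap feature is the “days since the last survey” mentioned above.

```python
import math
from datetime import date

def cyclical_features(d: date) -> tuple:
    """Map the day of the year onto the unit circle so the
    year-end wrap-around does not create an artificial jump."""
    day_of_year = d.timetuple().tm_yday          # 1 .. 365/366
    angle = 2 * math.pi * (day_of_year - 1) / 365.25
    return math.sin(angle), math.cos(angle)

def days_since_last_survey(current: date, previous: date) -> int:
    """Simple gap feature: number of days since the previous survey."""
    return (current - previous).days

# 31 December and 1 January end up as nearly identical points
# on the circle, which a plain day-of-year number would not give:
a = cyclical_features(date(2020, 12, 31))
b = cyclical_features(date(2021, 1, 1))
```

Without this encoding, a model would see day 365 and day 1 as maximally far apart, although politically they are a day apart.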

In order to compare the different models I defined the following common performance metrics, each computed over the most recent 12 weeks:

  • MAE
  • MSE
  • RMSE
  • R²
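For reference, the four metrics can be computed by hand as follows. This is a minimal sketch assuming plain Python lists of actual and predicted party shares; `sklearn.metrics` provides the same functions ready-made.

```python
import math

def mae(y_true, y_pred):
    """Mean absolute error: average magnitude of the residuals."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def mse(y_true, y_pred):
    """Mean squared error: penalises large residuals more heavily."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean squared error: MSE back on the scale of the target."""
    return math.sqrt(mse(y_true, y_pred))

def r2(y_true, y_pred):
    """Coefficient of determination: 1 minus residual over total variance."""
    mean_t = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Made-up example: surveyed vs predicted party shares (in percent)
y_true = [24.0, 18.0, 20.0, 11.0]
y_pred = [25.0, 17.0, 19.0, 12.0]
```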

The following dashboard shows these metrics, computed per party and model, as a heat map.

As can be seen, all models perform similarly poorly. Likely reasons are the lack of seasonal time series modelling (like ARIMA, Prophet, etc.) and the lack of useful features.

Data consumption

Gathering data and feeding it to an algorithm only get you two thirds of the way. In order to have an impact, the results of all this backend magic have to be consumed by people. Hence the last step of this project: an easy-to-read and easy-to-access visualisation. This is the part of such data science projects that is actually visible to other people, and by which the quality of the project is often judged.

Visualisation of the historical answers to the Sonntagsfrage, as well as of the prediction for the next survey, is done with the Tableau Public service. Google Sheets is used as the data source for the dashboard. This way new calculations are automatically uploaded into a Google Sheets document, and the dashboard refreshes the displayed data each time the document is updated. Check out my article on Medium for more information on that topic.

The dashboard with the current predictions can be found at the top of this article, and a dashboard with historical values at the bottom.

Conclusion

The Sonntagsfrage is now predicted weekly, and the prediction is visualised permanently without the need for any further input: we are provided with a glimpse into the possible future political climate of Germany. Look forward to my posts on Medium, where I go into detail about the essential steps. Also follow me on Twitter and take a look at the well-documented Git repository of this Sonntagsfrage project.

Now enjoy the dashboards!

Twitter Monitor

I started this project under the working title “hate on Twitter”. The triggers ranged from German politicians having to manually issue takedowns of false statements (example: Renate Künast), to school students receiving offensive messages on Instagram because they have the “wrong” religion whenever a conflict escalates somewhere in the world, to people of public interest becoming victims of cancel culture.

These examples of unmoderated toxicity within social media sparked my disgust, and I started wondering whether I could make this negativity visible with numbers. Thus the idea of the Twitter monitor came about.

In this project I want to take Twitter's most popular hashtags, apply sentiment analysis to them and visualise the results to get an idea of how toxic the conversation is.

The Dashboard

In the following dashboard the sentiments and the number of tweets per hashtag are displayed. All measures are aggregated per hour.

The upper plot shows the overall hourly tweet activity over the past two weeks. The middle plot shows the average hourly polarity from the sentiment analysis of the received tweets. Two graphs are displayed, since both a German and an English corpus were used.

The heart of the dashboard is the lower half. Again, the two plots represent the average polarity from the German and the English corpus. But this time each dot represents a hashtag, and its size the number of tweets that were obtained from the Twitter stream API. Mousing over the dots reveals some interesting additional information, like the standard deviation corresponding to the average, or the name of the hashtag.

All time periods can be adjusted using the boxes at the top of the plots.

The Results

Surprisingly – for me at least – the overall polarity of the tweets seems rather positive, and especially with the English corpus a clear trend towards positivity can be seen as of the 22nd of June 2021. Negative hashtags are mostly outliers with only a few tweets, but from time to time truly negative topics do appear, like ‘#ITASWI’ on the 17th of June 2021 at 3 o’clock.

On the one hand it always feels bad to be proven wrong, but this time, it seems, I fell victim to the effect of vocal minorities. Negativity and toxicity in social media do exist and are as disgusting as ever, but they do not seem to be the norm.

How this all works

Hashtags are crawled from here at the start of every hour and used to obtain tweets from the Twitter stream API. In this step the sentiment analysis using the textblob package is performed, and the tweets are sent to an Azure SQL DB. In this SQL DB a procedure runs hourly (triggered by Azure Data Factory) that aggregates the results into hourly time slices. Afterwards these aggregated results are sent to Google Sheets and then used as input for the Tableau Public dashboard above. This way the data in the dashboard is updated once the Google Sheet is updated.
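The hourly aggregation that the SQL procedure performs can be sketched in plain Python like this. The tuples and polarity values are made up for illustration; in the real pipeline the grouping happens inside the Azure SQL DB.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean, pstdev

# (timestamp, hashtag, polarity) triples as they would come out of
# the sentiment-analysis step; the values here are invented.
tweets = [
    (datetime(2021, 6, 17, 3, 5),  "#ITASWI",   -0.8),
    (datetime(2021, 6, 17, 3, 40), "#ITASWI",   -0.6),
    (datetime(2021, 6, 17, 3, 55), "#Euro2020",  0.4),
    (datetime(2021, 6, 17, 4, 10), "#Euro2020",  0.5),
]

def aggregate_hourly(rows):
    """Group tweets into (hour, hashtag) buckets and compute the
    tweet count, average polarity and its standard deviation."""
    buckets = defaultdict(list)
    for ts, tag, polarity in rows:
        hour = ts.replace(minute=0, second=0, microsecond=0)
        buckets[(hour, tag)].append(polarity)
    return {
        key: {"count": len(vals), "avg": mean(vals), "std": pstdev(vals)}
        for key, vals in buckets.items()
    }

hourly = aggregate_hourly(tweets)
```

The count, average and standard deviation per bucket are exactly the measures the dashboard displays per dot.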

All the code can be found here.

Afterwards I packaged everything into a Docker container, automated all remaining processes using cron jobs and uploaded the container as a private image on Docker Hub. With Kamatera I found a low-cost way of hosting this image, so my solution is now hosted and running.
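Roughly, the deployment looks like this; image and account names are placeholders, not the real ones.

```shell
# Build the image and push it to a private repository on Docker Hub.
docker build -t myaccount/twitter-monitor:latest .
docker push myaccount/twitter-monitor:latest

# On the Kamatera server: pull and start the container.
docker pull myaccount/twitter-monitor:latest
docker run -d --name twitter-monitor --restart unless-stopped \
    myaccount/twitter-monitor:latest

# Inside the container, a crontab entry triggers the pipeline at the
# start of every hour (cron itself must be running, see below):
# 0 * * * * /usr/local/bin/python /app/run_pipeline.py >> /var/log/pipeline.log 2>&1
```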

CSV-file to Mailchimp

The NGO

This NGO is an international movement that is committed to safe escape routes, unhindered sea rescue and an end to dying at European borders.

I came into contact with them via the DSSG Berlin organisation.

The Task

Mailing lists can be obtained from a variety of sources. The central hub used by this NGO for newsletters and campaigning is the web service MailChimp.

Hence the task is set: Use csv-file exports from other services (like FundraisingBox or twingle) to automatically import mailing lists into MailChimp.

The Solution

For this solution I created a Python project with a clickable shell script to execute the code. Non-technical users have to work with my solution, so I tried to make it user-friendly.

I first created individual preprocessing steps for each source in Python. The goal was to homogenise the different inputs into a consistent format.
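A minimal sketch of such a preprocessing step, assuming hypothetical column names for the two exports (the real FundraisingBox and twingle exports use different fields):

```python
import csv
import io

# Hypothetical column mappings from each export format to one
# common schema; illustrative only.
SOURCE_COLUMNS = {
    "fundraisingbox": {"E-Mail": "email", "Vorname": "first_name", "Nachname": "last_name"},
    "twingle": {"email_address": "email", "firstname": "first_name", "lastname": "last_name"},
}

def homogenise(raw_csv: str, source: str) -> list:
    """Read one source-specific CSV export and return its rows
    in the common schema used for the MailChimp import."""
    mapping = SOURCE_COLUMNS[source]
    reader = csv.DictReader(io.StringIO(raw_csv))
    return [
        {target: row[src].strip() for src, target in mapping.items()}
        for row in reader
    ]

example = "email_address,firstname,lastname\nanna@example.org,Anna,Muster\n"
rows = homogenise(example, "twingle")
```

Keeping the per-source logic in one mapping table makes adding another export format a one-line change.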

The next step takes these files and sends their contents via the MailChimp API to the MailChimp account.
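A sketch of that step against the MailChimp Marketing API v3 “add list member” endpoint; API key, server prefix and list id are placeholders, and error handling is omitted.

```python
import base64
import json
import urllib.request

def build_member_payload(row: dict) -> dict:
    """Translate one homogenised contact row into the body of a
    MailChimp 'add list member' request (Marketing API v3)."""
    return {
        "email_address": row["email"],
        "status": "subscribed",
        "merge_fields": {"FNAME": row["first_name"], "LNAME": row["last_name"]},
    }

def add_member(api_key: str, server_prefix: str, list_id: str, row: dict):
    """POST the contact to /3.0/lists/{list_id}/members.
    server_prefix is the data centre suffix of the API key, e.g. 'us1'."""
    url = f"https://{server_prefix}.api.mailchimp.com/3.0/lists/{list_id}/members"
    # MailChimp uses HTTP basic auth; the username can be any string.
    auth = base64.b64encode(f"anystring:{api_key}".encode()).decode()
    req = urllib.request.Request(
        url,
        data=json.dumps(build_member_payload(row)).encode(),
        headers={"Authorization": f"Basic {auth}", "Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```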

After running tests, the code was successfully deployed and used in a real-world scenario. Now it helps alleviate some of the manual work and receives small updates from time to time.

The complete code can be found here on GitHub.

Week 24: End of the Week Update

What a week. Breakthroughs. Finally!

Portfolio Project: Twitter Analysis

I managed to find an affordable host for Docker containers: Kamatera. At $4 per month I get my own mini-server with Docker installed. So I got my Twitter stream working from inside a Docker container with cron, pushed it to a private repo on Docker Hub and deployed the container on my Kamatera server. A minor annoyance is that I had to manually start cron in the container. Maybe I have to add a startup script?

Also created a first dashboard on Tableau Public. A few more will follow, but only a few steps are left until I am ready to post a new project to my portfolio!

Insta stuff

Settled on a theme for my channel: Art & AI.

Read up on Skillshare, created some posts, posted twice. Things are starting; I am kind of nervous. Took some pictures from Memo Akten and posted them. Sent Memo a mail asking for permission, but got no answer since he is on holiday. Let’s hope I don’t get into trouble.

Settled on Plannthat as my feed-planning app for the time being.

Style Transfer

Also got some pictures going using style transfer with TensorFlow Lite. Here are some examples:



Initially I planned on using them for some Insta posts, but maybe I can also create a portfolio page from them. Let’s see how much I can milk that cow =D

Finishing Words

It was finally a nicely productive week. Sometimes one just has to push through the tough times to reap some rewards. Keep it up guys, results are just around the corner!

Week 19: End of the Week Update

This week’s learnings come from starting the second use case for the portfolio page, working for Seebrücke and putting down some bigger-picture plans.

Learnings

  • Requirements engineering: While assessing the requirements and performing a first evaluation of the solution space, I realised that BI solutions are quite expensive overall. The cheapest solutions with a complete feature set came in at about 70 euros per month per user.
    This also made way for the next learning: for small-scope problems there exist far easier solutions than enterprise-grade ones.
  • Connecting Twitter to Azure requires a running server somewhere. Even with a configured Twitter app from the developer account, there still needs to be a process somewhere that extracts (consumes) the tweets and sends them onwards. This process needs to run permanently for aggregation. Maybe this site can be used? Further investigation is required.
  • Sentiment analysis in German is tough to find, but possible. I did find 2–3 sites and Python packages that offer this feature and aren’t too outdated (at most about two years old).
  • In order to extract information about hate in tweets, sentiment analysis is not enough. Azure’s content moderation API would be more useful, but it is limited in the number of free monthly requests (ca. 167 per day), and paid requests would end up very expensive.
    Python packages performing similar tasks were not to be found for German.
  • I need to get my plans clear: What do I want, broad-strokes-wise? I want to build portfolio projects to learn, I want to somehow convert them into income streams, and I want to get away from working for money towards making my money work for me.
    I need to find a management solution for keeping an overview. Using small notebooks is not a working solution for me.