About

Chapter last updated: 10 February 2025

This website serves as slides and script for the GESIS workshop Applied Machine Learning (2024, 2025), for the seminar Künstliche Intelligenz (KI) für die Sozial- und Politikwissenschaften (Universität Freiburg, 2024), and for workshops in shorter formats taught by Paul C. Bauer.1 Original material is licensed under an Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) license. Where I draw on other authors' material (I do so extensively), other licenses may apply. Please note the references in the syllabus as well as the citations and links in the script. For potential future versions of this material see www.paulcbauer.de. If you have feedback or discover errors or dead links, please let me know (mail@paulcbauer.de). Before you use the script, please make sure to install all the necessary packages (see Section 9).

1 Versions: Workshops & Seminars

  • GESIS: 4 days (18h in total): The GESIS version lasts 18 hours across 4 days and covers almost all of the content in the script. Different foci are set depending on participants' demands.
    • Hours: 10:00 - 12:00 and 13:00 - 16:00
    • Dates, Contact and Outline: See syllabus (received link via email).
  • Seminar at University of Freiburg (18 teaching hours in total): See syllabus.

2 Instructor: About me

3 Your turn

  • Let’s check out the survey results…
  1. Name?
  2. Affiliation? Country?
  3. What are your interests? What do you want to use machine learning for? (research questions? outcomes?)
  4. What prior knowledge do you have? (seminars?)

4 Contact & Outline & Dates

  • See syllabus.

5 Script & material

  • Literature: See syllabus.
  • Website/script: https://paulcbauer.github.io/workshop_applied_machine_learning/
    • Find it: Google “paul bauer applied machine learning”
    • Document = slides + script (Zoom in/out with Ctrl + mousewheel)
    • Code: can all be found in the script
    • Data: can usually be downloaded via links in the script. If not, we’ll share the files.
    • Full screen: F11
    • Navigation: TOCs on left and right
    • Search document (upper left)
    • Lots of information in footnotes!2
    • Document generated with quarto
  • Motivation: Have a go-to script for participants (and ourselves!)
  • Content: Mixture of theory, lab sessions, exercises and pure code examples for discussion

6 Strategy & Goals/Learning outcomes

  • Strategy: From the simple to the complex, slowly diving into the logic of machine learning using building blocks that we already know

  • Goals: By the end of the course participants will..

    • ..understand key concepts underlying machine learning.
    • ..be able to interpret and evaluate machine learning models.
    • ..be able to critically assess model performance.
    • ..be able to use various machine learning models for predictive and classification purposes.
    • ..have learned how to use the tidymodels framework for machine learning in R.
    • ..have learned how to evaluate and visualize model performance using packages like ggplot2.
  • Difficulty: How to choose out of many methods & models!

7 Online vs. offline

  • Negative
    • Screen fatigue
    • Can’t run around to check your code
    • Less engaging, less social
    • Voice
    • Screen sharing & less screen space than classroom
  • Positive
    • Participation from everywhere
    • That’s how we interact more and more
  • Rule(s):
    • Please keep your camera on!
      • Distracting animals/children/partners are a welcome distraction!
      • Yawning, leaving, looking bored etc. allowed!
      • Use a virtual background if you like!
      • Leaving, returning allowed!
    • There are no stupid questions3

9 Software we will use

  • Open-source software! (Q: Why?)
  • R (R Core Team 2023)4
    • only viable competitor is Python
    • Install the necessary packages using the code below.
# Install pacman once if it is not yet available, then load it
if (!requireNamespace('pacman', quietly = TRUE)) install.packages('pacman')
library(pacman)
# p_load() installs any missing packages from the list and then attaches them
p_load('reticulate', 'keras',
'tidyverse', 'lemon', 'knitr', 'kableExtra', 'dplyr', 'plotly', 'randomNames',
'stargazer', 'tidymodels', 'gghighlight', 'gt', 'colorspace', 'patchwork',
'latex2exp', 'skimr', 'modelsummary', 'ggplot2', 'DataExplorer', 'visdat',
'haven', 'probably', 'DALEX', 'DALEXtra', 'fairmodels', 'glmnet', 'naniar',
'rpart', 'rattle', 'magick', 'vip', 'xgboost', 'tm', 'rsample', 'explore',
'ExPanDaR', 'sjlabelled', 'profvis', 'rsconnect', 'whereami', 'DT', 'tidytext')
  • ggplot2 (Wickham 2016)5
  • Maybe: plotly (Sievert 2020)6
  • Note: Ideally, cite the software you use in your research, especially when it is open source (e.g., run citation("ggplot2"))
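The note on citing software can be tried out directly in the console. The lines below are a minimal sketch: they use citation() for R itself and for the stats package, which ships with R, so that the example runs even before the workshop packages are installed. The object names (r_cite, pkg_cite, bib) are chosen here for illustration only.

```r
# Citation for R itself (no package name needed)
r_cite <- citation()
print(r_cite)

# Citation for a package; "stats" is used because it ships with R.
# Replace it with "ggplot2" once that package is installed.
pkg_cite <- citation("stats")
print(pkg_cite)

# BibTeX version of the R citation, e.g. for a reference manager
bib <- toBibtex(r_cite)
print(bib)
```

The BibTeX output can be pasted into a .bib file and cited like any other reference.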

10 Data we will use

  • Can be downloaded here
  • European Social Survey (ESS) [Round 10 - 2020. Democracy, Digital social contacts]
    • Outcome: Life satisfaction [0-10]
    • The ESS contains different outcomes amenable to both classification and regression as well as a lot of variables that could be used as features (~580 variables).
    • Research: Shen, Yin, and Jiao (2023); Collins et al. (2015); Kaiser, Otterbach, and Sousa-Poza (2022); Çiftci and Yıldız (2023); Pan and Cutumisu (2023); Prati (2022)
  • COMPAS
    • Outcome: Reoffense/Recidivism [0,1]
    • We will be using the dataset published by ProPublica (GitHub) that is described by Angwin et al. (2016; see also Lee, Du, and Guerzhoy 2020; James et al. 2013)
    • The data is based on the COMPAS risk assessment tool (RAT). RATs are increasingly being used to assess a criminal defendant’s probability of re-offending.
  • Datasets were prepared/adapted for the workshop7
    • Workshop leaves out tedious data management tasks (recoding, renaming, etc.)
    • I added missing values on the outcomes (life satisfaction, recidivism) to have some unseen data
  • The self-learning material introduced those datasets.
Overview of COMPAS dataset variables
  • id: ID of prisoner, numeric
  • name: Name of prisoner, factor
  • compas_screening_date: Date of COMPAS screening, date
  • decile_score: the decile of the COMPAS score, numeric
  • is_recid: whether someone reoffended/recidivated (=1) or not (=0), numeric
  • is_recid_factor: same but factor variable
  • age: a continuous variable containing the age (in years) of the person, numeric
  • age_cat: age categorized
  • priors_count: number of prior crimes committed, numeric
  • sex: gender with levels “Female” and “Male”, factor
  • race: race of the person, factor
  • juv_fel_count: number of juvenile felonies, numeric
  • juv_misd_count: number of juvenile misdemeanors, numeric
  • juv_other_count: number of prior juvenile convictions that are not considered either felonies or misdemeanors, numeric
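Because missing values were added on the outcomes, the rows fall into two groups: rows with an observed outcome (usable for training and testing) and rows without one (the unseen data to predict). The sketch below shows that split on a small toy data frame. The values are invented for illustration, only a few of the variables listed above are included, and the object names (compas_toy, data_labelled, data_unseen) are hypothetical rather than the names used in the script.

```r
# Toy stand-in for the COMPAS data (values invented for illustration)
compas_toy <- data.frame(
  id           = 1:6,
  age          = c(23, 35, 41, 29, 52, 33),
  priors_count = c(0, 3, 1, 5, 0, 2),
  is_recid     = c(0, 1, NA, 1, NA, 0)
)

# Rows with an observed outcome: available for model building
data_labelled <- compas_toy[!is.na(compas_toy$is_recid), ]

# Rows with a missing outcome: the unseen data we want predictions for
data_unseen <- compas_toy[is.na(compas_toy$is_recid), ]

nrow(data_labelled)  # 4 rows with an outcome
nrow(data_unseen)    # 2 rows without an outcome
```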

11 Tools & software

11.0.1 R: Why use it?

  • Free and open source (think of science in developing countries)
  • Good online-documentation
  • Lively community of users (forums etc.)
  • Pioneering role
  • Visualization capabilities
  • Intuitive
  • Cooperates with other programs
  • Used across wide variety of disciplines
  • Object-oriented programming language
  • Popularity (See popularity statistics on books, blogs, forums)
  • RStudio as powerful integrated development environment (IDE) for R
    • Evolves into a scientific work suite optimizing workflow (replication, reproducibility etc.)
  • Institutions/people (Gary King, Andrew Gelman etc.)
  • Economic power (Revolution Analytics, Microsoft R Open)
  • Python is the only real “competitor”; it can be used from R (e.g., via the reticulate package!)8

11.0.2 R: Where/how to study?

If you haven’t used R so far, it’s necessary that you learn some basics in R. As a participant of the seminar you get 6 months of access to all the courses on DataCamp. DataCamp has become the go-to site for self-studying various data science skills (mostly software).

11.0.3 R: Installation and setup

Below some notes on the installation and setup of R and relevant packages on your own computer:

  1. On Windows machines, install Rtools from CRAN (https://cran.r-project.org/bin/windows/Rtools/). If you are using macOS (OS X), you will need to install Xcode, available for free from the App Store. This installs a compiler (if you don’t have one yet), which is needed when installing packages from GitHub that require compilation from C++ source code.
  2. Install the latest version of R from CRAN (https://cran.r-project.org/).
  3. Install the latest version of RStudio (https://www.rstudio.com/products/RStudio/). RStudio is the editor we’ll rely on, i.e. we’ll write code in RStudio which is subsequently sent to and run within R.
  4. Start RStudio and install & load the latest versions of various packages that we need (see Section 9 for code on how to install the packages).
  5. You may also read up on how to create and “knit” Quarto files. Essentially, such files allow you to integrate the analyses you conduct with the text you write, which is ideal for reproducibility. Here is an intro to the concept and a simple example: https://quarto.org/docs/get-started/hello/rstudio.html.
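After working through the steps above, a quick check in the R console can confirm that the setup worked. The lines below are a rough sketch using base R only. The three package names are just a sample from the list in Section 9, and looking for make on the search path is only a crude indicator that build tools are present, not a full compiler test.

```r
# Which R version is running?
R.version.string

# Are some of the workshop packages installed? (sample of the full list)
pkgs <- c("tidymodels", "tidyverse", "glmnet")
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
print(installed)

# Packages that still need to be installed
missing_pkgs <- pkgs[!installed]
print(missing_pkgs)

# Crude check for build tools: is 'make' found on the search path?
has_make <- nzchar(Sys.which("make"))
print(has_make)
```

If missing_pkgs is not empty, rerunning the installation code from Section 9 should install the remaining packages.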

11.0.4 DataCamp

References

Angwin, Julia, Jeff Larson, Lauren Kirchner, and Surya Mattu. 2016. “Machine Bias.” https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing.
Bauer, Paul C, and Bernhard Clemm von Hohenberg. 2021. “Believing and Sharing Information by Fake Sources: An Experiment.” Political Communication 38 (6): 647–71.
Çiftci, Necmettin, and Metin Yıldız. 2023. “The Relationship Between Social Media Addiction, Happiness, and Life Satisfaction in Adults: Analysis with Machine Learning Approach.” International Journal of Mental Health and Addiction 21 (5): 3500–3516.
Collins, Susan, Yizhou Sun, Michal Kosinski, David Stillwell, and Natasha Markuzon. 2015. “Are You Satisfied with Life?: Predicting Satisfaction with Life from Facebook.” In Social Computing, Behavioral-Cultural Modeling, and Prediction, 24–33. Springer International Publishing.
James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2013. An Introduction to Statistical Learning: With Applications in R. Springer.
Kaiser, Micha, Steffen Otterbach, and Alfonso Sousa-Poza. 2022. “Using Machine Learning to Uncover the Relation Between Age and Life Satisfaction.” Scientific Reports 12 (1): 5263.
Landesvatter, Camille, and Paul C Bauer. 2024. “How Valid Are Trust Survey Measures? New Insights from Open-Ended Probing Data and Supervised Machine Learning.” Sociological Methods & Research.
Landesvatter, Camille, Jan Behnert, and Paul C Bauer. 2023. “Comparing Speech-to-Text Algorithms for Transcribing Voice Data from Surveys.” SocArXiv.
Lee, Claire S, Jeremy Du, and Michael Guerzhoy. 2020. “Auditing the COMPAS Recidivism Risk Assessment Tool: Predictive Modelling and Algorithmic Fairness in CS1.” In Proceedings of the 2020 ACM Conference on Innovation and Technology in Computer Science Education, 535–36. Association for Computing Machinery.
Pan, Zexuan, and Maria Cutumisu. 2023. “Using Machine Learning to Predict UK and Japanese Secondary Students’ Life Satisfaction in PISA 2018.” The British Journal of Educational Psychology.
Prati, Gabriele. 2022. “Correlates of Quality of Life, Happiness and Life Satisfaction Among European Adults Older Than 50 Years: A Machine‐learning Approach.” Archives of Gerontology and Geriatrics 103: 104791.
R Core Team. 2023. R: A Language and Environment for Statistical Computing.
Schuette, Lukas, and Paul Cornelius Bauer. 2025. “Predicting Populism with Machine Learning: A Cautionary Tale.” Working Paper.
Shen, Xiaofang, Fei Yin, and Can Jiao. 2023. “Predictive Models of Life Satisfaction in Older People: A Machine Learning Approach.” International Journal of Environmental Research and Public Health 20 (3).
Sievert, Carson. 2020. Interactive Web-Based Data Visualization with R, Plotly, and Shiny. Chapman and Hall/CRC.
Wickham, Hadley. 2016. ggplot2: Elegant Graphics for Data Analysis. Springer.

Footnotes

  1. The first iteration of the workshop was taught in May 2020.↩︎

  2. The popup on mouseover!↩︎

  3. Please ask no matter if you think the question is stupid or you have missed something in the course. Any question is valuable and any repetition of content is useful!↩︎

  4. Creators: Core contributors and thousands of package authors.↩︎

  5. Creators: https://github.com/tidyverse/ggplot2↩︎

  6. Creators: https://github.com/plotly/plotly.js; https://github.com/ropensci/plotly↩︎

  7. e.g., rename variables, data preparation↩︎

  8. The seminar consists of a mix of theoretical and applied sessions. For the applied sessions we will rely on the software R. While there are various programs one could use, the reasons mentioned above speak for R (my personal view). The only real contender for data science is Python. See here for a nice overview of the differences between the two.↩︎