class: center, middle, inverse, title-slide

.title[
# Lecture: Research Design
]
.subtitle[
## Session 1: Introduction and statistical foundations
]
.author[
### Dr. Paul C. Bauer
.font75[Chair of Statistics and Methodology, Prof. Florian Keusch
Mannheim Centre for European Research (MZES), University of Mannheim]
mail@paulcbauer.de / @p_c_bauer / www.paulcbauer.eu
License for original material
]
.date[
### Mannheim, September 8, 2022
]

---
class: small-font

# Introduction to the lecture (1)

* Paul Bauer
    + BA/MA in Konstanz/Barcelona (Political Science)
    + PhD in Bern (2015)
    + Postdoc Fellow at the European University Institute, Florence (2015-2017)
    + Since 2017 at the MZES
    + Research: trust; polarization; fake news; social media; quantitative methods (causal inference, ML)
* Daria Szafran & Danielle Martin will introduce themselves next week.

---

# Introduction to the lecture (2)

* When, where, requirements, contact, grading: See syllabus
    + Week 3/22.9: Replace with Friday (23.9) or Monday (26.9)?
    + [Uni Mannheim COVID policy](https://www.uni-mannheim.de/en/academics/during-your-studies/coronavirus-information-for-students/)
* Practice groups (PGs): Scheduling conflicts? 5 need to switch (13/13)!
- Link: https://paulcbauer.github.io/research_design_2022/lecture_*.html
    - e.g., https://paulcbauer.github.io/research_design_2022/lecture_1.html (printable!)
- Content is a fusion of...
    - ...book project *Applied Causal Analysis (with R)* (with Denis Cohen)
    - ...past lectures & seminars (Kreuter, Bach, Bauer etc.)
    - ...different books, articles and material from different disciplines (*see citations*)
- Sociology of research methodology (Where did you study?)

---

# Who are you?

* Let's check the survey results!
* Course of studies (Sociology, Data Science, ...)?
* Why Mannheim? (because the city is so beautiful?)
* Any previous experience with experiments/causal inference?
* Programming experience/statistical software?

---

# Today's objectives

* Discuss components of a research design
    * Definition
    * Research questions, hypotheses, population & sampling method, conceptualization & measurement, observations & data

---

# Research design: Definition

* [Wikipedia](https://en.wikipedia.org/wiki/Research_design) (careful!)
on research design (RD)
    + A framework that has been created to find answers to research questions
    + A set of methods and procedures used in collecting and analyzing measures of the variables specified in the research question
* Wrong RD → wrong answer to RQ → potentially drastic consequences (e.g., medicine)
* Criteria for "good research design" change over years/decades
* Replication crisis (Q: [Retraction Watch](https://retractionwatch.com/)?) (Open Science Collaboration 2015)
    + Reasons: Research design errors, fraud and coding errors (e.g., [Psychoticism](https://www.thecut.com/2016/07/why-it-took-social-science-years-to-correct-a-simple-error-about-psychoticism.html))
    + [What a massive database of retracted papers reveals about science publishing’s ‘death penalty’](https://www.sciencemag.org/news/2018/10/what-massive-database-retracted-papers-reveals-about-science-publishing-s-death-penalty#sidebar2)
    + Increasing calls for [open science](https://en.wikipedia.org/wiki/Open_science) to foster replicability and reproducibility ([terminology!](http://thomasleeper.com/2015/05/open-science-language/))

---
class: small-font

# Research design: Components by Babbie (2015, 114)

<center><img src="lecture_1-babbie-2015-114.jpg" alt="Babbie (2015, 114)" width= "45%"></center>

---

# Research design (RD): Steps

1. **Formulate a research question and hypotheses**
2. Specify target population (e.g., humans) and sampling method (e.g., random sample)
3. Specify concepts (conceptualization) and their measures (operationalization)
4. Choose a research method, e.g., a randomized experiment
5. Collect data or use data that has been collected (observations)
6. Analyze data (statistical modelling)

* Important: Steps may overlap and their order may change during the research process (see Babbie's graph)

---

# Research questions: Types

* **Empirical analytical (positive) vs. normative**
    + **Should** men and women be paid equally? **Are** men and women paid equally (and why?)?
    + Q: Which one is empirical-analytical, which one normative? Can we derive hypotheses for normative questions?
* **Y-based, X-based and y = f(x)-based** (Plümper 2014, 22)
    + What causes differences in income (Y)?
    + What are the consequences of differences in education (X), i.e., how do they affect other outcome variables?
    + Do differences in education (X) cause differences in income (Y)? (Gerring 2012, 646-648)
* **What? vs. Why?** (Gerring 2012, 722-723)
    + Descriptions of an aspect of the world vs. causal arguments that hold that one or more phenomena generate change in some outcome (imply a counterfactual)
    + My personal preference: **descriptive** vs. **causal questions**

---

# Research questions: What? (descriptive)

* Measure: ‘*Would you say that most people can be trusted or that you can’t be too careful in dealing with people, if 0 means "**Can’t be too careful**" and 10 means "**Most people can be trusted**"?*’
* RQ: What is the average level of trust (Y)? How are individuals distributed? (univariate)

<table style='width:95%;margin: auto;'>
<caption>Univariate distribution of trust (2006)</caption>
<thead>
<tr> <th style="text-align:right;"> 0 </th> <th style="text-align:right;"> 1 </th> <th style="text-align:right;"> 2 </th> <th style="text-align:right;"> 3 </th> <th style="text-align:right;"> 4 </th> <th style="text-align:right;"> 5 </th> <th style="text-align:right;"> 6 </th> <th style="text-align:right;"> 7 </th> <th style="text-align:right;"> 8 </th> <th style="text-align:right;"> 9 </th> <th style="text-align:right;"> 10 </th> </tr>
</thead>
<tbody>
<tr> <td style="text-align:right;"> 303 </td> <td style="text-align:right;"> 42 </td> <td style="text-align:right;"> 172 </td> <td style="text-align:right;"> 270 </td> <td style="text-align:right;"> 369 </td> <td style="text-align:right;"> 1281 </td> <td style="text-align:right;"> 853 </td> <td style="text-align:right;"> 1344 </td> <td style="text-align:right;"> 1295 </td> <td style="text-align:right;"> 353 </td> <td
style="text-align:right;"> 356 </td> </tr>
</tbody>
</table>

* We can add as many variables/dimensions as we like (e.g., gender, time) → multivariate
    + Q: What would the table above look like when we add gender as a second dimension?
* **Descriptive questions** (multivariate)
    + RQ: Do females have more trust than males? (multivariate)
    + RQ: Did trust rise over time? (multivariate)

---

# Research questions: Why? (causal)

<table style='width:95%;margin: auto;'>
<caption>Joint distribution of trust and victimization (2006, N = 6633)</caption>
<thead>
<tr> <th style="text-align:left;"> </th> <th style="text-align:right;"> 0 </th> <th style="text-align:right;"> 1 </th> <th style="text-align:right;"> 2 </th> <th style="text-align:right;"> 3 </th> <th style="text-align:right;"> 4 </th> <th style="text-align:right;"> 5 </th> <th style="text-align:right;"> 6 </th> <th style="text-align:right;"> 7 </th> <th style="text-align:right;"> 8 </th> <th style="text-align:right;"> 9 </th> <th style="text-align:right;"> 10 </th> </tr>
</thead>
<tbody>
<tr> <td style="text-align:left;"> no victim </td> <td style="text-align:right;"> 259 </td> <td style="text-align:right;"> 36 </td> <td style="text-align:right;"> 135 </td> <td style="text-align:right;"> 214 </td> <td style="text-align:right;"> 320 </td> <td style="text-align:right;"> 1142 </td> <td style="text-align:right;"> 782 </td> <td style="text-align:right;"> 1228 </td> <td style="text-align:right;"> 1193 </td> <td style="text-align:right;"> 326 </td> <td style="text-align:right;"> 331 </td> </tr>
<tr> <td style="text-align:left;"> victim </td> <td style="text-align:right;"> 44 </td> <td style="text-align:right;"> 6 </td> <td style="text-align:right;"> 37 </td> <td style="text-align:right;"> 56 </td> <td style="text-align:right;"> 48 </td> <td style="text-align:right;"> 139 </td> <td style="text-align:right;"> 70 </td> <td style="text-align:right;"> 114 </td> <td style="text-align:right;"> 101 </td> <td style="text-align:right;">
27 </td> <td style="text-align:right;"> 25 </td> </tr>
</tbody>
</table>

* **Descriptive RQs**: Do victims have a different/lower level of trust from/than non-victims?
    + Mean Non-victims: 6.2; Mean Victims: 5.48
* **Why?-questions** start with difference(s) and then seek to explain why those difference(s) occurred
    + Why does this group of people have a higher level of trust?
* **Causal questions**: Is there a *causal effect* of victimization on trust? (We'll define *causal effect* later)
* Insights
    + Data underlying descriptive & causal questions is the same
    + Causal questions link Y to one (or more) explanatory causes X (or D)

---

# Research questions: Precision

* Q: Below you find different versions of the same research question. What is the most precise question and why is it precise?
    1. Is there a causal effect of victimization on trust in a *sample of Swiss citizens*?
    2. What is the impact of negative experiences on trust?
    3. Is there a causal effect of victimization on generalized trust in a sample of Swiss *aged from 18 to 98 in 2010*?
    4. What is the impact of *victimization* on *generalized* trust?

---

# Research questions → hypotheses

* Hypotheses = expectations we have for the answers to our research question (descriptive or causal)
* RQ: Does smoking increase the probability of/cause cancer?
    + Q: What hypotheses could we formulate?

--

* Hypotheses:
    + Smoking has an/no effect on the probability of getting cancer!
    + Smoking has a positive effect on (increases the probability of) getting cancer! (the higher X, the higher Y)
    + Smoking has a negative effect on (decreases the probability of) getting cancer! (the higher X, the lower Y)

---

# Hypotheses: Precision (1)

<center><img src="lecture_1-Taagepera2008-73.png" alt="Intersubjective agreement" height="45%"></center>

* Hypotheses are not created equal (Taagepera 2008, 71-81)
    1. **Null hypothesis**: Focuses on "disproving that something is due to chance"
    2. **Directional hypothesis**: Holds that increasing X will increase or decrease Y (we are here...)
    3. **Emp. based quant. hypothesis**: Focuses on the shape of the function y = f(x) [Ceteris paribus]
    4. **Log. based quant. hypothesis**: Formal theoretical model that makes a prediction, i.e., generates a hypothesis

---

# Hypotheses: Precision (2)

* Q: Which of the hypotheses below is a null/directional/emp. based quant. hypothesis?
    * Smoking increases the probability of getting cancer by 1% per 100 cigarettes
    * Smoking has an effect on the probability of getting cancer
    * Smoking increases the probability of getting cancer

--

* **Most sociological research studies *null* or *directional hypotheses***.

---

# Research design (RD): Steps

1. Formulate a research question and hypotheses
2. **Specify target population (e.g., humans) and sampling method (e.g., random sample)**
3. Specify concepts (conceptualization) and their measures (operationalization)
4. Choose a research method, e.g., a randomized experiment
5. Collect data/sample or use data that has been collected (observations)
6. Analyze data (statistical modelling)

---

# Population & sampling method (1)

<center><img src="lecture_1-population-sample.png" alt="Quantification"></center>

* Q: What do the concepts *internal/external validity* describe?

--

* Internal validity: Validity of a study's empirical findings (within its own context)
    + e.g., Estimate true effect of treatment among students in this classroom
* External validity: Generalizability of a study's empirical findings to new environments, settings or populations (Pearl 2014)

---

# Population & sampling method (2)

* Sampling: Select a subset of units from the population to estimate characteristics of that population
* Steps: Researcher (we)...
    + ...define *(target) population*.
    + ...create a *sampling frame* = list of population members to sample from
    + Q: Can you imagine a situation where population `\(\neq / =\)` sampling frame?
    + ...choose sampling method & units that should be in the sample.
* Q: Are the above steps necessary when we work with secondary data (e.g., ALLBUS)?

--

* No, but we should still evaluate *representativeness* (statistical inference).

---

# Population & sampling method (3)

* Q: Imagine we are interested in estimating population averages/proportions. Below you find pairs of (target) population and sample: Are these good or bad samples? Any bias?
    + **Income**: Population: Mannheim university students; Sample: Students in this course
    + **Income**: Population: Immigrants in Germany; Sample: Turkish immigrants
    + **Age**: Population: WhatsApp users; Sample: Random sample of WhatsApp users
    + **Racist comments**: Population: Tweets; Sample: Random sample of tweets provided by Twitter
* Q: What might be the problem with secondary data as opposed to data that you collect yourself?

---

# Population & sampling method (4)

* Q: What is the difference between the following sampling methods (strengths? weaknesses?)?
    + Simple random, stratified, quota, and snowball sampling
    + See Sudman (1986) and Salganik (2004) on sampling special/hidden populations

---
class: small-font

# Population & sampling method (5)

* Sampling techniques (Cochran 2007)
    * **Simple random sampling**: (1) Units in the population are numbered from 1 to N; (2) Series of random numbers between 1 and N is drawn; (3) Units which bear these numbers constitute the sample (ibid, 11-12) → Each unit has the same probability of being chosen
    * **Stratified random sampling**: (1) Population divided into non-overlapping, exhaustive subpopulations (*strata*); (2) Simple random sample is taken in each stratum (ibid, 65f)
    * **Quota sampling**: Decide on the number of units wanted from each stratum (e.g., age, gender, state) and continue sampling until the necessary "quota" has been obtained in each stratum (ibid, 105)
    * **Snowball sampling**: (1) Locate members of special population (e.g., drug addicts); (2) Ask them to name other members of population and repeat this step (Sudman 1986, 413) → use snowballing to create sampling frame, then sample
        + "Relaxed" version: Interview those named until sample size is reached
        + Q: Does this create weaker or stronger bias? (ibid, 413)

---

# Research design (RD): Steps

1. Formulate a research question and hypotheses
2. Specify target population (e.g., humans) and sampling method (e.g., random sample)
3. **Specify concepts (conceptualization) and their measures (operationalization)**
4. Choose a research method, e.g., a randomized experiment
5. Collect data or use data that has been collected (observations)
6. Analyze data (statistical modelling)

---

# Measurement & variables (1)

* "[the most important thing in statistics that’s not in the textbooks](https://statmodeling.stat.columbia.edu/2015/04/28/whats-important-thing-statistics-thats-not-textbooks/)" (Andrew Gelman, April 2015)
* Theories (and the **hypotheses** they imply) (Moore 2013, 3-4)
    + Concern **relationships among abstract concepts**
    + **Variables** are the **indicators** we use to measure our concepts
* A **[theoretical]** variable has different "levels" or "values" (Jaccard and Jacoby 2019, 13)
    + e.g., gender can be conceptualized as a variable that has two levels or values (or not)
* **Value of a variable** for a given unit u (u<sub>i</sub>): the **number assigned by some measurement process to u** (Holland 1986, 945), e.g., male (0) or female (1)
* **Random variables**: "If we have beliefs (i.e., probabilities) attached to the possible values that a variable may attain, we will call that variable a random variable." (Pearl 2009, 8)

---

# Measurement & variables (2): Quantifying the world

- Albert Einstein: "the whole of science is nothing more than an extension of everyday thinking" (Jaccard and Jacoby 2010, ch. 2)
- Concepts (gender, education, race etc.) are the foundation stones of thinking in daily life & science

.pull-left[
<center><img src="lecture_1-meme2.jpg" alt="Quantification"></center>
]

.pull-right[
* Quantification = assignment of variable values to real-world objects

| id | gender | age | degree | subject |
|:-------:|:------:|:------:|:------:|:------:|
| John | M | 25 | bachelor | sociology |
| Petra | F | 30 | master | physics |
| Hans | M | 29 | master | biology |
]

* Social scientists identify and classify individuals, countries etc.
drawing on different concepts

---

# Measurement & variables (3)

| names | id | gender | income | education | happiness | age |
| ------|:---:|:------:|:------:|:---------:|:----------:|:-------:|
| Hans | 1 | male | 1000 | 0 | 5 | 30 |
| Peter | 2 | male | 5000 | 3 | 10 | 30 |
| Julia | 3 | female | 500 | 1 | 3 | 30 |
| Andrea| 4 | female | 1600 | 3 | 7 | 30 |
| Feli | 5 | female | 1600 | 3 | 7 | 30 |

* Columns = variables
* Rows = observations (often *observations* = *units* but not always)
* Q: Which type of dataset has more observations (rows) than units? (Tip: Pa...)
* Q: What are the *theoretical* and *observed (empirical) values* of *happiness* and *age*?
* Q: Which are constants and which are variables in the above data frame? What is the difference?
    + The idea of a constant becomes relevant later on (**holding things constant!**)

---

<br><br><br><br>
<center><h1>Week/session 2</h1></center>

---

# Research design (RD): Steps

1. Formulate a research question and hypotheses
2. Specify target population (e.g., humans) and sampling method (e.g., random sample)
3. Specify concepts (conceptualization) and their measures (operationalization)
4. *Choose a research method, e.g., a randomized experiment (next weeks!)*
5. **Collect data or use data that has been collected (observations)**
6. *Analyze data (statistical modelling) (next weeks!)*

---

# Data collection (1): Measurement error

* Q: What is measurement (or observational) error? Can you give an example?

--

* Difference between a measured value of a variable and its true value
    + e.g., difference between Peter's measured and his real income

--

* Q: What is random and systematic measurement error? Can you give an example?
--

* **Systematic**:
    + If men systematically provide income values that are above the real values
    + If victims systematically under-report their victimization (Me Too movement)
* **Random**: In repeated measures, a scale randomly deviates from your true weight
* There is some conceptual confusion around terms such as reliability and repeatability (Bartlett 2008) (Q: Validity? Reliability?)

<!-- American actress Alyssa Milano posted on Twitter, "If all the women who have been sexually harassed or assaulted wrote 'Me too' as a status, we might give people a sense of the magnitude of the problem," saying that she got the idea from a friend -->

---

# Data collection (2): Meas. error example

.pull-left[
<center><img src="lecture_1-The_Dress_(viral_phenomenon).png" alt="Intersubjective agreement" width="100%" height="100%"></center>
]

.pull-right[
<br><br>
* Q: Which colors does this dress have?
    + Please give an answer: https://www.menti.com/msg5qk5bb4
    + After the vote: What do you think will be the result?
]

---

# Data collection (3): Meas. error example

<iframe src="https://www.mentimeter.com/embed/2bbb30846b1237e738c669723944a592/8b87c07837b7" width="80%" height="80%" frameborder="0" marginheight="0" marginwidth="0">Loading...</iframe>

* Q: What can we learn from this example? (intersubjective agreement)

---

# Data collection (4): Meas. equivalence

* **Concept**: Left-right ideology; **Survey measure**: "In politics people sometimes talk of 'left' and 'right'. Using this card, where would you place yourself on this scale, where 0 means the left and 10 means the right?" (ESS 2012)
    + Q: What does the graph below illustrate? (Bauer et al.
2017, 558)

<center><img src="lecture_1-measurement-equivalence.png" alt="Quantification" height = "40%" width = "40%"></center>

--

* Measurement inequivalence: When a measure provides different values for units with the same underlying true value (= form of measurement error)

---

# Data (1)

* After decisions about *research question*, *population*, *sample*, *concepts*, and *measures* we finally choose a *research method* and collect *data*
* Lecture focuses on causal inference using experimental/observational data, so let's quickly recap what data is!
* **Data**: Units' observed values on different variables (observations)
    + Time is just another *variable*
* **Variables**: Dimensions of the data space
* **Empirical observations** are distributed across those dimensions, i.e., across (theoretical) values of those variables

---

# Data (2)

* Data example: *'Negative Experiences and Trust: A Causal Analysis of the Effects of Victimization on Generalized Trust'* (Bauer 2015)
    + What is the causal effect of being victimized (= threatened) on trust?
    + **Population**: People living in Switzerland
    + **Units**: Individuals
    + **Data**:
        + **Sample**: 6633 individuals (Switzerland!) ([Sampling method](https://forscenter.ch/wp-content/uploads/2018/08/shp_user_guide_w18.pdf))
        + **Variables**: Victimization/Threat (0,1); Education (0-10) → Trust (0-10); Age (0-94)
        + **Observations**: Here observed once in 2006 (+ more years)
* Q: How do we normally show/look at data?
---

# Data: Table format

<table style='width:95%;margin: auto;'>
<caption>Data in table format</caption>
<thead>
<tr> <th style="text-align:left;"> Name </th> <th style="text-align:left;"> trust2006 </th> <th style="text-align:left;"> threat2006 </th> <th style="text-align:left;"> education2006 </th> </tr>
</thead>
<tbody>
<tr> <td style="text-align:left;"> Aseela </td> <td style="text-align:left;"> 4 </td> <td style="text-align:left;"> 0 </td> <td style="text-align:left;"> 8 </td> </tr>
<tr> <td style="text-align:left;"> Dominic </td> <td style="text-align:left;"> 5 </td> <td style="text-align:left;"> 1 </td> <td style="text-align:left;"> 1 </td> </tr>
<tr> <td style="text-align:left;"> Elshaday </td> <td style="text-align:left;"> 0 </td> <td style="text-align:left;"> 0 </td> <td style="text-align:left;"> 0 </td> </tr>
<tr> <td style="text-align:left;"> Daniel </td> <td style="text-align:left;"> 5 </td> <td style="text-align:left;"> 0 </td> <td style="text-align:left;"> 9 </td> </tr>
<tr> <td style="text-align:left;"> Sulaimaan </td> <td style="text-align:left;"> 7 </td> <td style="text-align:left;"> 0 </td> <td style="text-align:left;"> 4 </td> </tr>
<tr> <td style="text-align:left;"> Peyton </td> <td style="text-align:left;"> 5 </td> <td style="text-align:left;"> 0 </td> <td style="text-align:left;"> 1 </td> </tr>
<tr> <td style="text-align:left;"> Mudrik </td> <td style="text-align:left;"> 2 </td> <td style="text-align:left;"> 0 </td> <td style="text-align:left;"> 4 </td> </tr>
<tr> <td style="text-align:left;"> Alexander </td> <td style="text-align:left;"> 7 </td> <td style="text-align:left;"> 0 </td> <td style="text-align:left;"> 5 </td> </tr>
<tr> <td style="text-align:left;"> .. </td> <td style="text-align:left;"> .. </td> <td style="text-align:left;"> .. </td> <td style="text-align:left;"> .. </td> </tr>
</tbody>
</table>

* Q: How many rows should this table have if N = 6633? How many dimensions?
* Aseela[4, 0, 8]: Position of Aseela in the multi-dimensional space

<!-- These are at the same time vectors = sequences of numbers -->

---
class: my-one-page-font

# Data: Univariate distribution(s)

.pull-left[
<table style='width:95%;margin: auto;'>
<caption>Univariate distribution of trust (2006, N = 6633)</caption>
<thead>
<tr> <th style="text-align:right;"> 0 </th> <th style="text-align:right;"> 1 </th> <th style="text-align:right;"> 2 </th> <th style="text-align:right;"> 3 </th> <th style="text-align:right;"> 4 </th> <th style="text-align:right;"> 5 </th> <th style="text-align:right;"> 6 </th> <th style="text-align:right;"> 7 </th> <th style="text-align:right;"> 8 </th> <th style="text-align:right;"> 9 </th> <th style="text-align:right;"> 10 </th> </tr>
</thead>
<tbody>
<tr> <td style="text-align:right;"> 303 </td> <td style="text-align:right;"> 42 </td> <td style="text-align:right;"> 172 </td> <td style="text-align:right;"> 270 </td> <td style="text-align:right;"> 368 </td> <td style="text-align:right;"> 1281 </td> <td style="text-align:right;"> 852 </td> <td style="text-align:right;"> 1342 </td> <td style="text-align:right;"> 1294 </td> <td style="text-align:right;"> 353 </td> <td style="text-align:right;"> 356 </td> </tr>
</tbody>
</table>
<br>
<table style='width:95%;margin: auto;'>
<caption>Univariate distribution of threat (2006, N = 6633)</caption>
<thead>
<tr> <th style="text-align:right;"> 0 </th> <th style="text-align:right;"> 1 </th> </tr>
</thead>
<tbody>
<tr> <td style="text-align:right;"> 5966 </td> <td style="text-align:right;"> 667 </td> </tr>
</tbody>
</table>
<br>
<table style='width:95%;margin: auto;'>
<caption>Univariate distribution of education (2006, N = 6633)</caption>
<thead>
<tr> <th style="text-align:right;"> 0 </th> <th style="text-align:right;"> 1 </th> <th style="text-align:right;"> 2 </th> <th style="text-align:right;"> 3 </th> <th style="text-align:right;"> 4 </th> <th style="text-align:right;"> 5 </th> <th style="text-align:right;"> 6 </th>
<th style="text-align:right;"> 7 </th> <th style="text-align:right;"> 8 </th> <th style="text-align:right;"> 9 </th> <th style="text-align:right;"> 10 </th> </tr>
</thead>
<tbody>
<tr> <td style="text-align:right;"> 380 </td> <td style="text-align:right;"> 806 </td> <td style="text-align:right;"> 194 </td> <td style="text-align:right;"> 89 </td> <td style="text-align:right;"> 2182 </td> <td style="text-align:right;"> 324 </td> <td style="text-align:right;"> 687 </td> <td style="text-align:right;"> 474 </td> <td style="text-align:right;"> 195 </td> <td style="text-align:right;"> 425 </td> <td style="text-align:right;"> 877 </td> </tr>
</tbody>
</table>
]

.pull-right[
<br>
<br>
]

<br>

* Q: Where are most individuals located on the trust2006 and threat2006 scales?

---

# Data: Joint distribution(s) [1]

.pull-left[
<center>
</center>
]

.pull-right[
* Measures on several variables → multivariate joint distribution
* 3 variables: Victimization/Threat (0,1); Education (0-10) → Trust (0-10)
* Q: How many dimensions? How many theoretical value combinations?

<!-- Answer: 3 dimensions; 2*11*11 = 242 theoretical value combinations-->
]

---

# Data: Joint distribution(s) [2]

.pull-left[
<center>
</center>
]

.pull-right[
* Units are grouped on three variables: **Trust (Y)**, **Threat (D)** and **Education (X)**
* Q: What would the corresponding dataset/dataframe look like?
* Q: What would a joint distribution with 4 variables look like?
* Q: What is a conditional distribution?
* Q: What does the joint distribution of two perfectly correlated variables look like?
* **Important**: Often we can only make causal claims for a subset of our data (e.g., education = 4)
]

---

# Data: One more joint distribution
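
As a minimal sketch, the joint distribution of trust and victimization shown earlier (2006, N = 6633) can be written down as counts and then summarized into a marginal distribution, a conditional distribution and conditional means. The counts are copied from that table; the variable and function names are illustrative assumptions, and Python is used here only for illustration.

```python
# Joint distribution of trust (0-10) and victimization, with counts copied
# from the table "Joint distribution of trust and victimization (2006, N = 6633)".
# All names below are hypothetical and chosen for illustration only.
TRUST_VALUES = list(range(11))

joint_counts = {
    "no victim": [259, 36, 135, 214, 320, 1142, 782, 1228, 1193, 326, 331],
    "victim": [44, 6, 37, 56, 48, 139, 70, 114, 101, 27, 25],
}


def marginal_trust(counts):
    """Sum over victimization to obtain the univariate distribution of trust."""
    return [sum(group[t] for group in counts.values()) for t in TRUST_VALUES]


def conditional_distribution(counts, group):
    """Relative frequencies of trust within one victimization group."""
    n_group = sum(counts[group])
    return [c / n_group for c in counts[group]]


def conditional_mean(counts, group):
    """Mean trust within one victimization group."""
    n_group = sum(counts[group])
    return sum(t * c for t, c in zip(TRUST_VALUES, counts[group])) / n_group


n_total = sum(sum(group) for group in joint_counts.values())
print("N:", n_total)
print("Marginal distribution of trust:", marginal_trust(joint_counts))
for group in joint_counts:
    print(group, "mean trust:", round(conditional_mean(joint_counts, group), 2))
```

The two conditional means match the values reported on the "Research questions: Why? (causal)" slide (6.2 for non-victims, 5.48 for victims). The comparison is purely descriptive: it summarizes the joint distribution and does not by itself identify a causal effect.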
---

# Associational vs. causal inference

* **Joint distribution** of the variables used in the design/analysis is the basis for any quantitative analysis (Holland 1986, 948)
* **Associational inference (descriptive questions, what?)**:
    + Summarize joint distribution with statistical model (e.g., regression model)
    + Does not tell us anything about causality, e.g., a coefficient is compatible with an effect in either direction (Trust ↔ Threat)
* **Causal inference (causal questions, why?)**:
    + Summarize joint distribution with statistical model (e.g., regression model)
    + Add assumptions
    + Give causal interpretation to coefficients!

---

# Research design (RD): Steps with examples

1. Formulate a research question and hypotheses
    + RQ: Is there a causal effect of victimization on trust (Bauer 2015)?
    + Hypothesis: Yes, there is a positive effect!
2. Specify target population (e.g., humans) and sampling method (e.g., random sample)
    + Target population: Swiss population; Sampling method: Random sample of households
3. Specify concepts (conceptualization) and their measures (operationalization)
    + Define victimization & trust and choose survey questions
4. *Choose a research method, e.g., a randomized experiment (next weeks!)*
    + Use survey with repeated observations (panel data)
5. Collect data or use data that has been collected (observations)
    + Take data from the Swiss Household Panel (SHP)
6. Analyze data (statistical modelling)
    + Use matching + difference-in-differences

---

# Quiz

* Please take the quiz at the following link: https://forms.gle/eLL4FCfSg2Y14m2x5

---
class: my-one-page-font-supersmall

# References

Bartlett, J. W. and C. Frost (2008). "Reliability, repeatability and reproducibility: analysis of measurement errors in continuous variables". In: Ultrasound Obstet. Gynecol. 31.4, pp. 466-475.

Bauer, P. C., P. Barberá, K. Ackermann, et al. (2017). "Is the Left-Right Scale a Valid Measure of Ideology?" In: Political Behavior 39.3, pp. 553-583.

Cochran, W. G. (2007).
Sampling techniques. John Wiley & Sons.

Gerring, J. (2012). "Mere Description". In: Br. J. Polit. Sci. 42.4, pp. 721-746.

Holland, P. W. (1986). "Statistics and causal inference". In: J. Am. Stat. Assoc. 81.396, pp. 945-960.

Jaccard, J. and J. Jacoby (2019). Theory Construction and Model-Building Skills, Second Edition: A Practical Guide for Social Scientists. Guilford Publications.

Moore, W. H. and D. A. Siegel (2013). A Mathematics Course for Political and Social Research. Princeton University Press.

Pearl, J. and E. Bareinboim (2014). "External Validity: From Do-Calculus to Transportability Across Populations". In: Stat. Sci. 29.4, pp. 579-595.

Plümper, T. (2014). Effizient schreiben: Leitfaden zum Verfassen von Qualifizierungsarbeiten und wissenschaftlichen Texten. Walter de Gruyter GmbH & Co KG.

Salganik, M. J. and D. D. Heckathorn (2004). "Sampling and Estimation in Hidden Populations Using Respondent-Driven Sampling". In: Sociol. Methodol. 34.1, pp. 193-240.

Sudman, S. and G. Kalton (1986). "New Developments in the Sampling of Special Populations". In: Annu. Rev. Sociol. 12.1, pp. 401-429.

Taagepera, R. (2008). Making Social Sciences More Scientific: The Need for Predictive Models. OUP Oxford.