Appendix C: Data examples & more stuff

Chapter last updated: 19 March 2025

1 Data examples

1.1 Long-format for visualization

Long format lends itself to ggplot2 visualization:
- Each mapping (x, y, shape, color) corresponds to exactly one variable
Discuss example here..
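As a minimal sketch of the point above, the following reshapes a small wide table into long format so that every aesthetic maps to a single column. The data and the names data_wide, party_a, party_b and share are made up for illustration and do not come from the course data.

```r
library(tidyr)
library(ggplot2)

# Hypothetical wide-format toy data: one row per year, one column per party
data_wide <- data.frame(
  year    = c(2019, 2020, 2021),
  party_a = c(30, 32, 35),
  party_b = c(25, 24, 22)
)

# Long format: one row per year x party combination
data_long <- pivot_longer(data_wide,
                          cols = c(party_a, party_b),
                          names_to = "party",
                          values_to = "share")

# Each aesthetic (x, y, color) now maps to exactly one variable
p <- ggplot(data_long, aes(x = year, y = share, color = party)) +
  geom_line() +
  geom_point()
p
```

In the wide table there is no single column that could be mapped to color; after pivot_longer() the party column fills that role.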

2 More stuff

2.1 Summary tables with sparklines (CHECK!)

Fascinating more recent stuff¹: Sparklines
Sparklines in tables
- The datasummary_skim() function in the modelsummary package

library(tidyverse)
library(modelsummary)
# data_twitter_influence.csv
# data <- read_csv(sprintf("https://docs.google.com/uc?id=%s&export=download",
#                                 "1dLSTUJ5KA-BmAdS-CHmmxzqDFm2xVfv6"))

data <- readr::read_csv("data/data_twitter_influence.csv",
                 col_types = cols())
# Numeric data
datasummary_skim(data, type = "numeric", output = "html")

	Unique	Mean	SD	Min	Median	Max
n_retweets	237	429.4	2332.7	0.0	36.0	48568.0
followers_count	485	13184.0	51672.6	12.0	2647.5	693125.0
account_age_months	504	84.1	40.7	5.0	87.3	143.6
account_age_years	504	7.0	3.4	0.4	7.3	12.0
female	2	0.3	0.5	0.0	0.0	1.0

# Categorical data (party and female first have to be converted to factors)
datasummary_skim(data %>% mutate(party = factor(party),
                                 female = factor(female)), 
                 type = "categorical", output = "html")

		N	%
party	AfD	76	15.1
	CDU_CSU	131	26.0
	DieLinke	58	11.5
	FDP	73	14.5
	Greens	61	12.1
	SPD	105	20.8
party_color	black	131	26.0
	blue	76	15.1
	deeppink	58	11.5
	gold	73	14.5
	green	61	12.1
	red	105	20.8
female	0	351	69.6
	1	153	30.4

Sparklines in the text
See Tufte in R and Tufte Handouts on how to realize Tufte’s suggestions in R

2.2 SKIP - Escaping ordering hell!

Load the data:

# data_sharing_frequency_summarized.csv
# data_plot <- read_csv(sprintf("https://docs.google.com/uc?id=%s&export=download",
#                               "17EJuNwAwIhReW4j6v1w6FowzMjWJXbxH"))
data_plot <- read_csv("data/data_sharing_frequency_summarized.csv")

The old plot:

ggplot(data_plot, aes(x = variable, y = value)) +
  geom_bar(
    stat = "identity",
    width = 0.7,
    position = position_dodge(width = 0.8),
    aes(fill = variable, alpha = category)) +
  geom_text(
    position = position_dodge(width = 0.8),
    aes(alpha = category,
        label = paste(value, "%", sep = "")
    ),
    vjust = 1.6,
    color = "black",
    size = 2
  ) +
  scale_fill_discrete(name = "Platform") +
  scale_alpha_discrete(name = "Sharing frequency",
    range = c(1, 0.5)) +
  xlab("Platforms") +
  ylab("Percentage (%)") +
  theme_light()

Figure 1: Distribution of four categorical variables

Rename the category names:

data_plot <- data_plot %>%
  mutate(variable = recode(variable,
                           "sharing_email" = "Email",
                           "sharing_fb" = "Facebook",
                           "sharing_twitter" = "Twitter",
                           "sharing_whatsapp" = "Whatsapp"),
         category = recode(category,
                           "pct.Seltener" = "Rarer",
                           "pct.EinpaarMalimJahr" = "A few times a year",
                           "pct.EinpaarMalimMonat" = "A few times a month",
                           "pct.EinmalproWoche" = "Once a week",
                           "pct.Taeglich" = "Daily"))

str(data_plot)

tibble [20 × 3] (S3: tbl_df/tbl/data.frame)
 $ variable: chr [1:20] "Email" "Facebook" "Twitter" "Whatsapp" ...
 $ category: chr [1:20] "Daily" "Daily" "Daily" "Daily" ...
 $ value   : num [1:20] 7 7 7 17 67 53 52 46 8 11 ...

Because the data is already aggregated, we always use stat = "identity" from here on.

The data contains no factor variables, so ggplot2 decides on the order itself: character categories are sorted alphabetically.

ggplot(data_plot, aes(x = variable, y = value)) +
  geom_bar(
    stat = "identity",
    width = 0.7,
    position = position_dodge(width = 0.8),
    aes(fill = variable, alpha = category)) +
  geom_text(
    position = position_dodge(width = 0.8),
    aes(alpha = category,
        label = paste(value, "%", sep = "")
    ),
    vjust = 1.6,
    color = "black",
    size = 2
  ) +
  scale_fill_discrete(name = "Platform") +
  scale_alpha_discrete(name = "Sharing frequency",
    range = c(1, 0.5)) +
  xlab("Platforms") +
  ylab("Percentage (%)") +
  theme_light()

Figure 2: Distribution of four categorical variables

Now we want to change the order of “Platform” to c("Facebook", "Twitter", "Email", "Whatsapp")
and the order of “Sharing frequency” to c("Daily", "Once a week", "A few times a month", "A few times a year", "Rarer").
- It is sufficient to convert the variables to factors and to assign the levels in the desired order. The factors do not have to be ordered (ordered = TRUE); ggplot2 follows the level order.

data_plot$variable <- factor(data_plot$variable, 
                             levels = c("Facebook",
                                        "Twitter",
                                        "Email",
                                        "Whatsapp"))
levels(data_plot$variable)

[1] "Facebook" "Twitter"  "Email"    "Whatsapp"

data_plot$category <- factor(data_plot$category, 
                             levels = c("Daily", 
                                        "Once a week", 
                                        "A few times a month", 
                                        "A few times a year", 
                                        "Rarer"))
levels(data_plot$category)

[1] "Daily"               "Once a week"         "A few times a month"
[4] "A few times a year"  "Rarer"

ggplot(data_plot, aes(x = variable, y = value)) +
  geom_bar(
    stat = "identity",
    width = 0.7,
    position = position_dodge(width = 0.8),
    aes(fill = variable, alpha = category)) +
  geom_text(
    position = position_dodge(width = 0.8),
    aes(alpha = category,
        label = paste(value, "%", sep = "")
    ),
    vjust = 1.6,
    color = "black",
    size = 2
  ) +
  scale_fill_discrete(name = "Platform") +
  scale_alpha_discrete(name = "Sharing frequency",
    range = c(1, 0.5)) +
  xlab("Platforms") +
  ylab("Percentage (%)") +
  theme_light()

Figure 3: Distribution of four categorical variables

Insights
- Do all the labeling in the data that you create for the plot (here data_plot)
- Unordered factors are sufficient to tell ggplot about a non-alphabetical order
  - If you want to reorder the data do so in the data that you feed to the plot (and inspect the data)
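The same reordering can also be written with forcats::fct_relevel(), which is not used in this section and is shown here only as an alternative sketch. The toy vector platform stands in for data_plot$variable.

```r
library(forcats)

# Toy vector standing in for data_plot$variable (values taken from the text)
platform <- c("Email", "Facebook", "Twitter", "Whatsapp")

# Default: factor() sorts the levels alphabetically
levels(factor(platform))

# fct_relevel() moves the named levels to the front, in the order given
platform_f <- fct_relevel(platform, "Facebook", "Twitter", "Email", "Whatsapp")
levels(platform_f)

# The factor is still unordered; only the level order has changed
is.ordered(platform_f)
```

Either approach works for ggplot2, because it reads the order from levels() and does not require an ordered factor.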

2.3 SKIP: Time: Wave participation & time-point presence

2.3.1 Data & Packages & functions

Plot type: Stacked bar plot
tidyr::expand(): To create observations/rows for non-observed variable combinations

2.3.2 Graph

Here we’ll reproduce, and maybe criticize and improve, Figure 4 (Bauer et al. 2020).
Questions:
- What does the graph show? What are the underlying variables (and data)?
- How many scales/mappings does it use? Could we reduce them?
- What do you like, what do you dislike about the figure? What is good, what is bad?
- What kind of information could we add to the graph (if any)?
- How would you approach a replication of the graph?

Figure 4: Presence/participation at/in different time points/waves

2.3.3 Lab: Data & Code

The code for Figure 4 is shown below (and creates Figure 5).
Learning objectives
- How to make stacked barplots
- How to expand data

We’ll start by preparing the data for our plot. As you can see below, the data is already in long format and contains an individual identifier pid as well as two variables that hold the same information, namely the wave identifier, in different formats: wave.num and wave.

If you want to move directly to the plot…

# data_wave_participation.csv
# data <- read_csv(sprintf("https://docs.google.com/uc?id=%s&export=download",
#                          "1Y9z1shAjyaHgqpOxwt2T8uSoe-3RW-WI"),
#                  col_types = cols())
data <- read_csv("data/data_wave_participation.csv")
head(data)

# A tibble: 6 × 3
        pid wave.num wave  
      <dbl>    <dbl> <chr> 
1 421518540        1 Wave 1
2 441620046        1 Wave 1
3 454072144        1 Wave 1
4 477478244        1 Wave 1
5 481214044        1 Wave 1
6 453648542        1 Wave 1

nrow(data)

[1] 6258

We expand the data, creating a new data frame that we then join with the original one. This way we end up with a data frame that marks unobserved respondent \(\times\) wave combinations as missing.

# Expand to get dataset with rows for non observations
data.expand <- data %>% tidyr::expand(pid, wave.num)
head(data.expand)

# A tibble: 6 × 2
        pid wave.num
      <dbl>    <dbl>
1 401008246        1
2 401008246        2
3 401008246        3
4 401008443        1
5 401008443        2
6 401008443        3

nrow(data.expand)

[1] 10269

# Right-join with the long-format data to recover the actual presence of respondents
data.expand <- data %>%
  right_join(data.expand, by = c("pid", "wave.num")) %>%
  arrange(pid, wave.num)
head(data.expand)

# A tibble: 6 × 3
        pid wave.num wave  
      <dbl>    <dbl> <chr> 
1 401008246        1 <NA>  
2 401008246        2 <NA>  
3 401008246        3 Wave 3
4 401008443        1 Wave 1
5 401008443        2 <NA>  
6 401008443        3 <NA>

nrow(data.expand)

[1] 10269
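The expand-and-join step is easier to follow on a tiny example. The sketch below uses a made-up three-row panel (the object names toy, toy_grid and toy_full are hypothetical) and mirrors the two calls used above.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy panel: respondent 1 took part in waves 1 and 2,
# respondent 2 only in wave 3
toy <- tibble(
  pid      = c(1, 1, 2),
  wave.num = c(1, 2, 3),
  wave     = c("Wave 1", "Wave 2", "Wave 3")
)

# All pid x wave.num combinations, observed or not (2 x 3 = 6 rows)
toy_grid <- toy %>% tidyr::expand(pid, wave.num)

# Joining the observed rows back in leaves NA in `wave` for missed waves
toy_full <- toy %>%
  right_join(toy_grid, by = c("pid", "wave.num")) %>%
  arrange(pid, wave.num)

toy_full
```

Three of the six rows in toy_full have NA in wave, one for each wave a respondent missed. tidyr::complete(toy, pid, wave.num) would produce the same rows in a single call; the two-step version is kept here because it matches the lab code.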

Next, several steps are needed to summarize the data across waves and to delete the categories with the smallest numbers (participants only in W2/W3 (N = 3) and only in W1/W3 (N = 2)). If you like, you can skip this whole part and go directly to the function below.

# Subset variables
#data.expand <- data.expand %>% select(pid, wave.num, wave) # %>% distinct()

# Spread dataset and arrange
data.expand <- data.expand %>% 
               pivot_wider(names_from = wave.num, values_from = wave) %>% 
               arrange(pid)

# Rename wave variables
data.expand <- rename(data.expand, 
                      wave1 = "1", 
                      wave2 = "2", 
                      wave3 = "3")

# Create "across_waves" with information of presence in single waves
data.expand <- unite(data.expand, across_waves, -pid)

# Aggregate to get observations per presence in different waves
data.expand <- data.expand %>% group_by(across_waves) %>% summarize(n = n())

# Separate united variable
data.expand <- data.expand %>% separate(across_waves, c("Wave1", "Wave2", "Wave3"), sep = "_")

# Replace values of wave variables with N values
data.expand$Wave1[data.expand$Wave1 != "NA"] <- data.expand$n[data.expand$Wave1 != "NA"]
data.expand$Wave2[data.expand$Wave2 != "NA"] <- data.expand$n[data.expand$Wave2 != "NA"]
data.expand$Wave3[data.expand$Wave3 != "NA"] <- data.expand$n[data.expand$Wave3 != "NA"]

# Delete groups only W2/W3 (N = 3) and only W1/W3 (N = 2)
data.expand <- data.expand %>% filter(n > 5) %>% select(-n)

# Reshape to long format for the barplot illustrating the samples across waves
data.expand <- pivot_longer(data.expand, Wave1:Wave3, names_to = "wave", values_to = "samples")
data.expand$samples <- as.numeric(data.expand$samples)

data.expand$samples_labels <- dplyr::recode(data.expand$samples,
  "532" = "Only W1 (N = 532)",
  "292" = "W1 and W2 (N = 292)",
  "1269" = "W1, W2 and W3 (N = 1269)",
  "482" = "Only W2 (N = 482)",
  "843" = "Only W3 (N = 843)"
)

data.expand <- data.expand %>% filter(!is.na(samples))
data.expand <- data.expand %>% arrange(wave)
data_plot <- data.expand
data_plot

# A tibble: 8 × 3
  wave  samples samples_labels          
  <chr>   <dbl> <chr>                   
1 Wave1     532 Only W1 (N = 532)       
2 Wave1     292 W1 and W2 (N = 292)     
3 Wave1    1269 W1, W2 and W3 (N = 1269)
4 Wave2     482 Only W2 (N = 482)       
5 Wave2     292 W1 and W2 (N = 292)     
6 Wave2    1269 W1, W2 and W3 (N = 1269)
7 Wave3     843 Only W3 (N = 843)       
8 Wave3    1269 W1, W2 and W3 (N = 1269)

Finally, we plot the participation across waves in Figure 5.

Figure 5: Presence/participation at/in different time points/waves
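The plotting code itself is not reproduced in this section. The following is a minimal sketch of a stacked bar plot built from the data_plot tibble printed above; the axis labels, legend title and theme are assumptions and need not match the original figure.

```r
library(ggplot2)

# data_plot rebuilt by hand from the tibble printed above
data_plot <- data.frame(
  wave = c("Wave1", "Wave1", "Wave1", "Wave2", "Wave2", "Wave2",
           "Wave3", "Wave3"),
  samples = c(532, 292, 1269, 482, 292, 1269, 843, 1269),
  samples_labels = c("Only W1 (N = 532)", "W1 and W2 (N = 292)",
                     "W1, W2 and W3 (N = 1269)", "Only W2 (N = 482)",
                     "W1 and W2 (N = 292)", "W1, W2 and W3 (N = 1269)",
                     "Only W3 (N = 843)", "W1, W2 and W3 (N = 1269)")
)

# geom_col() stacks by default: one bar per wave,
# one segment per participation pattern
p <- ggplot(data_plot, aes(x = wave, y = samples, fill = samples_labels)) +
  geom_col() +
  labs(x = "Wave", y = "Number of respondents", fill = "Participation") +
  theme_light()
p
```

Because the counts are already aggregated, geom_col() is used here; it is equivalent to geom_bar(stat = "identity").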

2.3.4 Exercise

Try to produce such a graph with a panel survey that you are currently using. Store the panel data in long format, keep only the participant ID and the wave number, rename them to pid and wave.num, and then start with the code.
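A possible starting point for the preparation step is sketched below. The panel data and the column names respondent_id, survey_wave and trust are made up. Note that the lab code also reads a wave label column in its pivot_wider() call, so the sketch derives one from wave.num; this is an assumption about how you would want to feed your own data into that code.

```r
library(dplyr)

# Hypothetical panel data in long format; all column names are made up
panel <- tibble(
  respondent_id = c(101, 101, 102, 103, 103),
  survey_wave   = c(1, 2, 1, 2, 3),
  trust         = c(4, 5, 3, 2, 4)
)

# Keep only ID and wave number, renamed to what the lab code expects,
# and add the `wave` label column used later by pivot_wider()
data <- panel %>%
  select(pid = respondent_id, wave.num = survey_wave) %>%
  distinct() %>%
  mutate(wave = paste("Wave", wave.num))

data
```

The lab code hard-codes three waves when renaming columns, so a panel with a different number of waves needs the rename() and separate() calls adjusted accordingly.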

2.4 Another animation

Bauer, Paul C, Frederic Gerdon, Florian Keusch, and Frauke Kreuter. 2020. “The Impact of the GDPR Policy on Data Sharing/Privacy Attitudes.” Preliminary Draft, 1–22.

Footnotes

¹ See the sparkline package and http://rstudio-pubs-static.s3.amazonaws.com/237078_159e5997707f44d69e63f71efb38481d.html.