Appendix C: Data examples & more stuff

1 Data examples

1.1 Long-format for visualization

  • Long format lends itself to ggplot2 visualization
    • Each aesthetic mapping (x, y, shape, color) then corresponds to exactly one variable
  • Discuss example here.
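
A minimal sketch of the idea, assuming a small made-up data frame (the name `wide` and its columns `year`, `email` and `twitter` are hypothetical and not part of the course data). After reshaping to long format, every aesthetic maps to a single column:

```r
library(tidyr)
library(ggplot2)

# Hypothetical wide-format data: one column per platform (made-up values)
wide <- data.frame(
  year    = c(2018, 2019, 2020),
  email   = c(10, 12, 9),
  twitter = c(5, 8, 14)
)

# Long format: one row per year x platform combination
long <- pivot_longer(wide,
                     cols = c(email, twitter),
                     names_to = "platform",
                     values_to = "share")

# Each aesthetic (x, y, color, shape) maps to exactly one variable
p <- ggplot(long, aes(x = year, y = share,
                      color = platform, shape = platform)) +
  geom_point() +
  geom_line()
p
```

In the wide version, `color` and `shape` would have no single variable to map to, because the platform is encoded in the column names rather than in the data.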

2 More stuff

2.1 Summary tables with sparklines (CHECK!)

  • Fascinating more recent stuff: Sparklines

  • Sparklines in tables

    • The datasummary_skim in the modelsummary package
# data_twitter_influence.csv
# data <- read_csv(sprintf("",
#                                 "1dLSTUJ5KA-BmAdS-CHmmxzqDFm2xVfv6"))

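library(tidyverse)     # read_csv(), %>%, mutate(), ggplot() used below
library(modelsummary)  # datasummary_skim()

data <- readr::read_csv("data/data_twitter_influence.csv",
                        col_types = readr::cols())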
# Numeric data
datasummary_skim(data, type = "numeric", output = "html")
                    Unique  Missing Pct.     Mean       SD   Min  Median       Max  Histogram
n_retweets             237             0    429.4   2332.7   0.0    36.0   48568.0  [sparkline]
followers_count        485             0  13184.0  51672.6  12.0  2647.5  693125.0  [sparkline]
account_age_months     504             0     84.1     40.7   5.0    87.3     143.6  [sparkline]
account_age_years      504             0      7.0      3.4   0.4     7.3      12.0  [sparkline]
female                   2             0      0.3      0.5   0.0     0.0       1.0  [sparkline]
# Categorical data (factor variables that we had to create first)
datasummary_skim(data %>% mutate(party = factor(party),
                                 female = factor(female)), 
                 type = "categorical", output = "html")
                           N      %
party        AfD          76   15.1
             CDU_CSU     131   26.0
             DieLinke     58   11.5
             FDP          73   14.5
             Greens       61   12.1
             SPD         105   20.8
party_color  black       131   26.0
             blue         76   15.1
             deeppink     58   11.5
             gold         73   14.5
             green        61   12.1
             red         105   20.8
female       0           351   69.6
             1           153   30.4

2.2 SKIP - Escaping ordering hell!

Load the data:

# data_sharing_frequency_summarized.csv
# data_plot <- read_csv(sprintf("",
#                               "17EJuNwAwIhReW4j6v1w6FowzMjWJXbxH"))
data_plot <- read_csv("data/data_sharing_frequency_summarized.csv")

The old plot:

ggplot(data_plot, aes(x = variable, y = value)) +
  # Bars: the values are already percentages, hence stat = "identity"
  geom_bar(
    stat = "identity",
    width = 0.7,
    position = position_dodge(width = 0.8),
    aes(fill = variable, alpha = category)) +
  # Percentage labels, dodged in the same way as the bars
  geom_text(
    position = position_dodge(width = 0.8),
    aes(alpha = category,
        label = paste(value, "%", sep = "")),
    vjust = 1.6,
    color = "black",
    size = 2
  ) +
  scale_fill_discrete(name = "Platform") +
  scale_alpha_discrete(name = "Sharing frequency",
    range = c(1, 0.5)) +
  xlab("Platforms") +
  ylab("Percentage (%)")
Figure 1: Distribution of four categorical variables

Rename the category names:

data_plot <- data_plot %>%
  mutate(variable = recode(variable,
                           "sharing_email" = "Email",
                           "sharing_fb" = "Facebook",
                           "sharing_twitter" = "Twitter",
                           "sharing_whatsapp" = "Whatsapp"),
         category = recode(category,
                           "pct.Seltener" = "Rarer",
                           "pct.EinpaarMalimJahr" = "A few times a year",
                           "pct.EinpaarMalimMonat" = "A few times a month",
                           "pct.EinmalproWoche" = "Once a week",
                           "pct.Taeglich" = "Daily"))

tibble [20 × 3] (S3: tbl_df/tbl/data.frame)
 $ variable: chr [1:20] "Email" "Facebook" "Twitter" "Whatsapp" ...
 $ category: chr [1:20] "Daily" "Daily" "Daily" "Daily" ...
 $ value   : num [1:20] 7 7 7 17 67 53 52 46 8 11 ...

Because the data is already aggregated, we now always work with stat = "identity".
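
As a small illustration of what stat = "identity" does, here is a sketch with made-up aggregated values (the data frame `agg` and its columns are hypothetical). It also shows `geom_col()`, which ggplot2 provides as a shorthand for the same behavior:

```r
library(ggplot2)

# Made-up, already aggregated data: one row per bar
agg <- data.frame(platform = c("Email", "Facebook", "Twitter"),
                  pct      = c(7, 17, 52))

# stat = "identity" plots the values as they are instead of counting rows
p_bar <- ggplot(agg, aes(x = platform, y = pct)) +
  geom_bar(stat = "identity")

# geom_col() is the shorthand for geom_bar(stat = "identity")
p_col <- ggplot(agg, aes(x = platform, y = pct)) +
  geom_col()

# Both layers end up with the same bar heights
heights_bar <- layer_data(p_bar)$y
heights_col <- layer_data(p_col)$y
```

Without stat = "identity", `geom_bar()` would count the rows per platform and refuse the `y` aesthetic.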

  • Data without factor variables (ggplot2 decides how to interpret them): categories are ordered alphabetically!
ggplot(data_plot, aes(x = variable, y = value)) +
  # Bars: the values are already percentages, hence stat = "identity"
  geom_bar(
    stat = "identity",
    width = 0.7,
    position = position_dodge(width = 0.8),
    aes(fill = variable, alpha = category)) +
  # Percentage labels, dodged in the same way as the bars
  geom_text(
    position = position_dodge(width = 0.8),
    aes(alpha = category,
        label = paste(value, "%", sep = "")),
    vjust = 1.6,
    color = "black",
    size = 2
  ) +
  scale_fill_discrete(name = "Platform") +
  scale_alpha_discrete(name = "Sharing frequency",
    range = c(1, 0.5)) +
  xlab("Platforms") +
  ylab("Percentage (%)")
Figure 2: Distribution of four categorical variables
  • Now we want to change the order of “Platform” to c("Facebook", "Twitter", "Email", "Whatsapp").
  • And “Sharing Frequency” to c("Daily", "Once a week", "A few times a month","A few times a year","Rarer")
    • It is sufficient to convert the variables to factors and to assign the levels in the desired order. The factor does not have to be ordered; ggplot2 uses the level order as it is.
data_plot$variable <- factor(data_plot$variable, 
                             levels = c("Facebook",
                                        "Twitter",
                                        "Email",
                                        "Whatsapp"))
[1] "Facebook" "Twitter"  "Email"    "Whatsapp"
data_plot$category <- factor(data_plot$category, 
                             levels = c("Daily", 
                                        "Once a week", 
                                        "A few times a month", 
                                        "A few times a year", 
                                        "Rarer"))
[1] "Daily"               "Once a week"         "A few times a month"
[4] "A few times a year"  "Rarer"              
ggplot(data_plot, aes(x = variable, y = value)) +
  # Bars: the values are already percentages, hence stat = "identity"
  geom_bar(
    stat = "identity",
    width = 0.7,
    position = position_dodge(width = 0.8),
    aes(fill = variable, alpha = category)) +
  # Percentage labels, dodged in the same way as the bars
  geom_text(
    position = position_dodge(width = 0.8),
    aes(alpha = category,
        label = paste(value, "%", sep = "")),
    vjust = 1.6,
    color = "black",
    size = 2
  ) +
  scale_fill_discrete(name = "Platform") +
  scale_alpha_discrete(name = "Sharing frequency",
    range = c(1, 0.5)) +
  xlab("Platforms") +
  ylab("Percentage (%)")
Figure 3: Distribution of four categorical variables
  • Insights
    • Do all the labeling in the data that you create for the plot (data_plot)
    • Unordered factors are sufficient to tell ggplot2 about a non-alphabetical order
      • If you want to reorder the data, do so in the data that you feed to the plot (and inspect the data)
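
The point about unordered factors can be checked in base R without any plot. The following sketch uses a made-up character vector (`platform` is a hypothetical name) and compares the default, alphabetical levels with explicitly assigned ones:

```r
# Made-up platform labels, stored as character
platform <- c("Email", "Facebook", "Twitter", "Whatsapp")

# Default: levels are sorted alphabetically
levels_default <- levels(factor(platform))

# Explicit levels fix the display order; ordered = TRUE is not needed
platform_f <- factor(platform,
                     levels = c("Facebook", "Twitter", "Email", "Whatsapp"))
levels_custom <- levels(platform_f)

# The factor is still an unordered (nominal) factor
is.ordered(platform_f)
```

ggplot2 places discrete categories on axes and in legends in the order of the factor levels, which is why assigning the levels is enough.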

2.3 SKIP: Time: Wave participation & time-point presence

2.3.1 Data & Packages & functions

  • Plot type: Stacked bar plot
  • tidyr::expand(): To create observations/rows for non-observed variable combinations

2.3.2 Graph

  • Here we’ll reproduce, and maybe criticize and improve, Figure 4 (Bauer et al. 2020)
  • Questions:
    • What does the graph show? What are the underlying variables (and data)?
    • How many scales/mappings does it use? Could we reduce them?
    • What do you like, what do you dislike about the figure? What is good, what is bad?
    • What kind of information could we add to the graph (if any)?
    • How would you approach a replication of the graph?

Figure 4: Presence/participation at/in different time points/waves

2.3.3 Lab: Data & Code

  • The code for Figure 4 is shown below (and creates Figure 5).

  • Learning objectives

    • How to make stacked barplots
    • How to expand data

We’ll start by preparing the data for our plot. As you can see below, the data is already in long format and contains an individual identifier, pid, as well as two variables that carry the same information, the wave identifier, in different formats: wave.num and wave.

If you want to move directly to the plot…

# data_wave_participation.csv
# data <- read_csv(sprintf("",
#                          "1Y9z1shAjyaHgqpOxwt2T8uSoe-3RW-WI"),
#                  col_types = cols())
data <- read_csv("data/data_wave_participation.csv")
# A tibble: 6 × 3
        pid wave.num wave  
      <dbl>    <dbl> <chr> 
1 421518540        1 Wave 1
2 441620046        1 Wave 1
3 454072144        1 Wave 1
4 477478244        1 Wave 1
5 481214044        1 Wave 1
6 453648542        1 Wave 1
[1] 6258

We expand the data, creating a new data frame that we join with the original one. That way we end up with a data frame that indicates missing values for unobserved respondent \(\times\) wave combinations.

# Expand to get dataset with rows for non observations
data.expand <- data %>% tidyr::expand(pid, wave.num)
# A tibble: 6 × 2
        pid wave.num
      <dbl>    <dbl>
1 401008246        1
2 401008246        2
3 401008246        3
4 401008443        1
5 401008443        2
6 401008443        3
[1] 10269
# Right join with the long-format data to recover the actual presence of respondents
data.expand <- data %>% right_join(data.expand, by = c("pid", "wave.num")) %>% arrange(pid, wave.num)
# A tibble: 6 × 3
        pid wave.num wave  
      <dbl>    <dbl> <chr> 
1 401008246        1 <NA>  
2 401008246        2 <NA>  
3 401008246        3 Wave 3
4 401008443        1 Wave 1
5 401008443        2 <NA>  
6 401008443        3 <NA>  
[1] 10269
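
To see the expand-and-join step in isolation, here is a sketch on a made-up three-row panel (the object names `toy`, `grid` and `toy_full` are hypothetical; the column names follow the data above):

```r
library(dplyr)
library(tidyr)

# Made-up panel: respondent 1 took part in waves 1 and 2,
# respondent 2 only in wave 2
toy <- tibble(pid      = c(1, 1, 2),
              wave.num = c(1, 2, 2),
              wave     = c("Wave 1", "Wave 2", "Wave 2"))

# All pid x wave.num combinations, observed or not
grid <- toy %>% tidyr::expand(pid, wave.num)

# Joining back keeps every combination; unobserved ones get wave = NA
toy_full <- toy %>%
  right_join(grid, by = c("pid", "wave.num")) %>%
  arrange(pid, wave.num)
toy_full
```

The row for respondent 2 in wave 1 now exists with `wave` set to NA. `tidyr::complete(toy, pid, wave.num)` is an alternative that performs the expansion and the join in one call.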

Subsequently, we have to take several steps to summarize the data across waves and to delete the categories with the smallest numbers (participants only in W2/W3 (N = 3) and only in W1/W3 (N = 2)). If you like, you can skip this whole part and go directly to the function below.

# Subset variables
#data.expand <- data.expand %>% select(pid, wave.num, wave) # %>% distinct()

# Spread dataset to wide format (rows are already arranged by pid)
data.expand <- data.expand %>% 
               pivot_wider(names_from = wave.num, values_from = wave)

# Rename wave variables
data.expand <- rename(data.expand, 
                      wave1 = "1", 
                      wave2 = "2", 
                      wave3 = "3")

# Create "across_waves" with information of presence in single waves
data.expand <- unite(data.expand, across_waves, -pid)

# Aggregate to get observations per presence in different waves
data.expand <- data.expand %>% group_by(across_waves) %>% summarize(n = n())

# Separate united variable
data.expand <- data.expand %>% separate(across_waves, c("Wave1", "Wave2", "Wave3"), sep = "_")

# Replace values of wave variables with N values
data.expand$Wave1[data.expand$Wave1 != "NA"] <- data.expand$n[data.expand$Wave1 != "NA"]
data.expand$Wave2[data.expand$Wave2 != "NA"] <- data.expand$n[data.expand$Wave2 != "NA"]
data.expand$Wave3[data.expand$Wave3 != "NA"] <- data.expand$n[data.expand$Wave3 != "NA"]

# Delete groups only W2/W3 (N = 3) and only W1/W3 (N = 2)
data.expand <- data.expand %>% filter(n > 5) %>% select(-n)

# Reshape to long format for the barplot illustrating the samples across waves
data.expand <- pivot_longer(data.expand, Wave1:Wave3, names_to = "wave", values_to = "samples")
data.expand$samples <- as.numeric(data.expand$samples)

data.expand$samples_labels <- dplyr::recode(data.expand$samples,
  "532" = "Only W1 (N = 532)",
  "292" = "W1 and W2 (N = 292)",
  "1269" = "W1, W2 and W3 (N = 1269)",
  "482" = "Only W2 (N = 482)",
  "843" = "Only W3 (N = 843)"
)

# Drop the rows for waves in which a group did not participate (samples is NA)
data.expand <- data.expand %>% filter(!is.na(samples))
data.expand <- data.expand %>% arrange(wave)
data_plot <- data.expand
# A tibble: 8 × 3
  wave  samples samples_labels          
  <chr>   <dbl> <chr>                   
1 Wave1     532 Only W1 (N = 532)       
2 Wave1     292 W1 and W2 (N = 292)     
3 Wave1    1269 W1, W2 and W3 (N = 1269)
4 Wave2     482 Only W2 (N = 482)       
5 Wave2     292 W1 and W2 (N = 292)     
6 Wave2    1269 W1, W2 and W3 (N = 1269)
7 Wave3     843 Only W3 (N = 843)       
8 Wave3    1269 W1, W2 and W3 (N = 1269)

Finally, we plot the participation across waves in Figure 5.

Figure 5: Presence/participation at/in different time points/waves
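
A minimal sketch of how such a stacked bar plot could be drawn from the plot data printed above. This is not necessarily the code behind the original figure: the object name `data_plot_sketch`, the axis and legend titles, and the label placement are assumptions made for illustration.

```r
library(ggplot2)

# Plot data, copied from the tibble printed above
data_plot_sketch <- data.frame(
  wave    = c("Wave1", "Wave1", "Wave1",
              "Wave2", "Wave2", "Wave2",
              "Wave3", "Wave3"),
  samples = c(532, 292, 1269,
              482, 292, 1269,
              843, 1269),
  samples_labels = c("Only W1 (N = 532)", "W1 and W2 (N = 292)",
                     "W1, W2 and W3 (N = 1269)",
                     "Only W2 (N = 482)", "W1 and W2 (N = 292)",
                     "W1, W2 and W3 (N = 1269)",
                     "Only W3 (N = 843)", "W1, W2 and W3 (N = 1269)")
)

# Total number of participants per wave (the height of each stacked bar)
wave_totals <- tapply(data_plot_sketch$samples, data_plot_sketch$wave, sum)

# One bar per wave, stacked by participation pattern
p_waves <- ggplot(data_plot_sketch,
                  aes(x = wave, y = samples, fill = samples_labels)) +
  geom_col(position = "stack") +
  # Put the group size in the middle of each segment
  geom_text(aes(label = samples),
            position = position_stack(vjust = 0.5),
            size = 3) +
  labs(x = "Wave", y = "Number of participants",
       fill = "Participation pattern")
p_waves
```

Because each participation pattern appears once per wave it covers, the same fill color recurs across bars, which makes it easy to follow a group from wave to wave.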

2.3.4 Exercise

  • Try to produce such a graph with a panel survey that you are currently using. Store the panel data in long format, keep only the participant ID and the wave number, rename them pid and wave.num, and then start with the code.

2.4 Another animation

