Visualizing text data

Learning outcomes/objective: Learn…

Sources: Original material; Camille Landesvatter’s topic model lecture; Silge (2017)

1 Text as Data

  • Many sources of text data for social scientists
    • open ended survey responses, social media data, interview transcripts, news articles, official documents (public records, etc.), research publications, etc.
  • text can be obtained even if the data of interest are not (yet) in textual form
    • speech recognition, text recognition, machine translation etc.
  • “Past”: text data often ignored (by quants), selectively read, anecdotally used or manually labeled by researchers
  • Today: wide variety of text-analytic methods (supervised + unsupervised) and increasing adoption of these methods by social scientists (Wilkerson and Casas 2017)

2 Language in NLP

  • corpus: a collection of documents
  • documents: single tweets, single statements, single text files, etc.
  • tokenization: “the process of splitting text into tokens” (Silge 2017)
    • tokens = single words, sequences of words or entire sentences
    • Often defines the unit of analysis
  • bag of words (method): approach where all tokens are put together in a “bag” without considering their order¹
    • possible issues with a simple bag-of-words approach: “I’m not happy and I don’t like it!” (the link between the negation and “happy”/“like” is lost)
  • stop words: very common but uninformative terms such as “the”, “and”, “they”, etc.
    • Q: Are stop words really uninformative?
  • document-term/feature matrix (DTM/DFM): common format to store text data (examples later)
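
To make the bag-of-words idea concrete, here is a minimal base R sketch. The two toy documents and the helper `tokenize_words()` are invented for illustration and are not part of the lab data or of any package. It shows that two sentences with opposite meanings end up with the same bag once word order is ignored.

```r
# Two toy documents (invented for illustration)
docs <- c(d1 = "The dog bites the man",
          d2 = "The man bites the dog")

# Hypothetical helper: lowercase the text, then split on anything
# that is not a letter or an apostrophe
tokenize_words <- function(x) {
  tokens <- strsplit(tolower(x), "[^a-z']+")[[1]]
  tokens[tokens != ""]
}

# A "bag" keeps the tokens but not their order; sorting mimics that here
bags <- lapply(docs, function(d) sort(tokenize_words(d)))

# Both documents yield the same bag although they mean different things
identical(bags$d1, bags$d2)
```
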

3 (R-)Workflow for Text Analysis

  1. Data collection (“Obtaining Text”*)
  2. Data manipulation
    • Corpus pre-processing (“From Text to Data”*)
    • Vectorization: Turning Text into a Matrix (DTM/DFM²) (“From Text to Data”*)
  3. Analysis (“Quantitative Analysis of Text”*)
  4. Validation and Model Selection (“Evaluating Performance”³)
  5. Visualization and Model Interpretation
  • Visualization at descriptive and/or modelling stage

4 Data collection

  • use existing corpora
    • R: {gutenbergr}: contains more than 60k book transcripts
    • R: {quanteda.corpora}: provides multiple corpora; see here for an overview
    • R: {topicmodels}: contains Associated Press data (2246 news articles mostly from around 1988)
    • search for datasets, see e.g. this list
  • collect new corpora
    • electronic sources: application programming interfaces (APIs, e.g. Facebook, Twitter), web scraping, Wikipedia, transcripts of all German electoral programs
    • undigitized text, e.g. scans of documents
    • data from interviews, surveys and/or experiments (speech → text)
  • consider relevant applications to turn your data into text format (speech-to-text recognition, pdf-to-text, OCR, Mechanical Turk and Crowdflower)




5 Data manipulation

5.1 Data manipulation: Basics (1)

  • Text data is different from “structured” data (e.g., a set of rows and columns)
  • Most often not “clean” but rather messy
    • shortcuts, dialect, incorrect grammar, missing words, spelling issues, ambiguous language, humor
    • web context: emojis, # (twitter), etc.
  • Preprocessing
    • much more important than for structured data & a crucial determinant of successful text analysis!

5.2 Data manipulation: Basics (2)

Common steps in pre-processing text data:

  • stemming (removal of word suffixes), e.g., computation, computational, computer \(\rightarrow\) comput (a stem need not be a dictionary word)

  • lemmatisation (reduce a term to its lemma, i.e., its base form), e.g., “better” \(\rightarrow\) “good”

  • transformation to lower cases

  • removal of punctuation (e.g., ,;.-) / numbers / white spaces / URLs / stopwords / very infrequent words

  • \(\rightarrow\) Always choose your preprocessing steps carefully!

    • e.g., removing punctuation: “I enjoy: eating, my cat and leaving out commas” vs. “I enjoy: eating my cat and leaving out commas”
  • Choosing unit of analysis?! (sentence vs. unigram vs. bigram etc.)
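
The difference between unigrams and bigrams can be sketched in a few lines of base R. The token vector is an invented example. The point is only that a bigram keeps a negation next to the word it modifies, which a unigram bag does not.

```r
# Unigram tokens of a short, invented sentence
tokens <- c("i", "am", "not", "happy")

# Bigrams: pair every token with its successor
bigrams <- paste(head(tokens, -1), tail(tokens, -1))

bigrams  # "i am" "am not" "not happy"
```

With bigrams as the unit of analysis, “not happy” survives as one feature, at the cost of a much larger vocabulary.
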

5.3 Data manipulation: Basics (3)

  • In principle, all those transformations can be achieved by using base R

  • Other packages however provide ready-to-apply functions, such as {tidytext}, {tm} or {quanteda}

  • Important

    • transform data to corpus object or tidy text object (examples on the next slides)
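
As a rough illustration of the base R route, the sketch below chains a few of the pre-processing steps listed earlier. The example text and the tiny stop word list `stop_list` are assumptions made up for this sketch. Real analyses would typically use a full stop word list such as the one shipped with {tidytext}.

```r
# Invented example text and a tiny, hypothetical stop word list
text <- "Check THIS out!!  https://example.org/abc 3 reasons why the cat is great"
stop_list <- c("the", "is", "this", "why")

clean <- tolower(text)                       # transform to lower case
clean <- gsub("https?://\\S+", " ", clean)   # remove URLs (before punctuation!)
clean <- gsub("[[:punct:]]+", " ", clean)    # remove punctuation
clean <- gsub("[[:digit:]]+", " ", clean)    # remove numbers
clean <- trimws(gsub("\\s+", " ", clean))    # collapse repeated white space

tokens <- strsplit(clean, " ")[[1]]          # split into unigram tokens
tokens <- tokens[!tokens %in% stop_list]     # remove stop words

tokens  # "check" "out" "reasons" "cat" "great"
```

Note that the order matters: if punctuation were removed before URLs, the URL would be broken into fragments that the URL pattern no longer matches.
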




6 Lab: Basic data manipulation & visualization using tidytext

6.1 Functions & packages

  • unnest_tokens(): split a column into tokens, flattening the table into one-token-per-row.
    • By default unnest_tokens() removes punctuation and makes all terms lowercase automatically
    • ?unnest_tokens
      • to_lower = TRUE: Specify whether to convert tokens to lowercase
      • drop = TRUE: Specify whether original input column should get dropped
      • token = "words": Specify unit for tokenizing, or a custom tokenizing function
  • anti_join(): Filtering joins filter rows from x based on the presence or absence of matches in y
    • e.g., anti_join(stop_words): Filter out stopwords
  • cast_dtm(): Turns a “tidy” one-term-per-document-per-row data frame into a DocumentTermMatrix from the tm package
    • See also cast_tdm() and cast_dfm()
    • Usage:
data_tidy %>% # Use tidy text format data
    count(text_id,word) %>% # Count words
    cast_dtm(document = text_id, # Spread into matrix
           term = word,
           value = n) %>%
  as.matrix() # Store as matrix
  • Tidytext format lends itself to using dplyr functions
  • Filter out particular tokens:
data_tidy %>% 
    filter(!word %in% c("t.co", "https", "rt", "http"))
  • Filter out 5000 rarest tokens:
tokens_rare <- 
    data_tidy %>% 
    count(word) %>% # Count frequency of tokens
    arrange(n) %>% # Order dataframe (rarest tokens first)
    slice(1:5000) %>% # Take first 5000 rows (rarest tokens)
    pull(word) # Extract tokens

# Filter out tokens defined above
data_tidy %>% 
    filter(!word %in% tokens_rare)

6.2 Importing data & tidy text format & stop words

  • Pre-processing with tidytext requires your data to be stored in a tidy text object
  • Main characteristics of a tidy text dataset
    • one-token-per-row
    • “long format” (Row: Document \(\times\) token)
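
The one-token-per-row layout can be mimicked in base R to see what the format looks like. This is only a sketch with two invented documents. It is not how unnest_tokens() is implemented, and it skips the punctuation handling that function provides.

```r
# Two invented documents with an identifier
docs <- data.frame(text_id = 1:2,
                   text = c("Anti worker agenda", "Tickets on sale"),
                   stringsAsFactors = FALSE)

# Split each document into lowercase tokens
tokens <- strsplit(tolower(docs$text), "\\s+")

# Long format: repeat the document id once per token
tidy <- data.frame(text_id = rep(docs$text_id, lengths(tokens)),
                   word = unlist(tokens),
                   stringsAsFactors = FALSE)

tidy  # 6 rows: one row per document-token combination
```
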

First, we have to retrieve some data. We’ll use tweet data from Russian trolls (Roeder 2018) (these are not real people anyways). The data below is edited data (variables & observations subsampled, language == English, account_category == LeftTroll or RightTroll) based on the file IRAhandle_tweets_1.csv. The variables are explained here: https://github.com/fivethirtyeight/russian-troll-tweets/.



Below we start by installing/loading the necessary packages:

# load and install packages if necessary
# install.packages("pacman")
pacman::p_load(tidyverse,
               rvest,
               xml2,
               tidytext,
               tm,
               ggwordcloud,
               knitr)

Then we load the data into R (use the link on your computer):

# Load the data
# data_IRAhandle_tweets_1_sampled.csv
# data <- read_csv(sprintf("https://docs.google.com/uc?id=%s&export=download",
#                          "1GNrZfF3itKxUtbngQhP6lCyOM3ApbgTD"))


data <- read_csv("data/data_IRAhandle_tweets_1_sampled.csv",
                 col_types = cols())
dim(data) # What dimensions do we have?
[1] 4000    9
# View(data)



We start by adding an identifier text_id to the documents/tweets.

# Add id
data <- data %>%
                mutate(text_id = row_number()) %>%
                select(text_id, everything()) # What happens here?

# head(data)
kable(head(data))
| text_id | author | text | region | language | publish_date | following | followers | account_type | account_category |
|---|---|---|---|---|---|---|---|---|---|
| 1 | AMELIEBALDWIN | '@GenFlynn @realDonaldTrump @mike_pence after what I witnessed today... wis needs serious help! Anti worker anti liberty agenda. Elite B.S. https://t.co/URL3FrNfqT' | United States | English | 10/9/2016 4:45 | 1944 | 2472 | Right | RightTroll |
| 2 | ACEJINEV | Trans-Siberian Orchestra Tickets On Sale. Buy #TSO Christmas Concert #Tickets @eCityTickets https://t.co/FWMaOWxEh7 https://t.co/J2ZzUe87Fl | United States | English | 11/7/2016 14:54 | 803 | 908 | Left | LeftTroll |
| 3 | ANDEERLWR | #anderr GRAPHIC VIDEO : Did ANTIFA Violence Cause the Tragedy at #Charlottesville? https://t.co/y2QeK4AC0J https://t.co/OEXcWQdiL4 | United States | English | 8/13/2017 16:47 | 21 | 6 | Right | RightTroll |
| 4 | AMELIEBALDWIN | Remember all the fuss when the US wrongly accused #Russia of destroying hospitals in #Syria? Well this is for real https://t.co/U5LHrZLgnB | United States | English | 3/18/2017 20:22 | 2303 | 2744 | Right | RightTroll |
| 5 | ALBERTMORENMORE | RT @2AFight: More people die from alcohol than guns. READ> https://t.co/7H6OlJJtEC #2A #NRA #tcot #tgdn #PJNET #ccot #teaparty https://t.c… | United States | English | 2/18/2016 9:02 | 1136 | 729 | Right | RightTroll |
| 6 | AMELIEBALDWIN | '@RealAlexJones @BarackObama @POTUS @theDemocrats Obama is the Divider-In-Chief. His mission from Soros was destroy America from within.' | United States | English | 11/30/2016 8:25 | 2366 | 2578 | Right | RightTroll |
dim(data)
[1] 4000   10




Then, by using the unnest_tokens() function from tidytext we transform this data to a tidy text format, where the words (tokens) of each text/document are written into their own rows.

# Create tidy text format and remove stopwords
data_tidy <- data %>%
  unnest_tokens(word, text, drop = FALSE) %>% # unnest & keep orig. documents
  anti_join(stop_words) %>%
    select(text_id, author, text, word, everything())
    
# head(data_tidy)
kable(head(data_tidy))
| text_id | author | text | word | region | language | publish_date | following | followers | account_type | account_category |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | AMELIEBALDWIN | '@GenFlynn @realDonaldTrump @mike_pence after what I witnessed today... wis needs serious help! Anti worker anti liberty agenda. Elite B.S. https://t.co/URL3FrNfqT' | genflynn | United States | English | 10/9/2016 4:45 | 1944 | 2472 | Right | RightTroll |
| 1 | AMELIEBALDWIN | '@GenFlynn @realDonaldTrump @mike_pence after what I witnessed today... wis needs serious help! Anti worker anti liberty agenda. Elite B.S. https://t.co/URL3FrNfqT' | realdonaldtrump | United States | English | 10/9/2016 4:45 | 1944 | 2472 | Right | RightTroll |
| 1 | AMELIEBALDWIN | '@GenFlynn @realDonaldTrump @mike_pence after what I witnessed today... wis needs serious help! Anti worker anti liberty agenda. Elite B.S. https://t.co/URL3FrNfqT' | mike_pence | United States | English | 10/9/2016 4:45 | 1944 | 2472 | Right | RightTroll |
| 1 | AMELIEBALDWIN | '@GenFlynn @realDonaldTrump @mike_pence after what I witnessed today... wis needs serious help! Anti worker anti liberty agenda. Elite B.S. https://t.co/URL3FrNfqT' | witnessed | United States | English | 10/9/2016 4:45 | 1944 | 2472 | Right | RightTroll |
| 1 | AMELIEBALDWIN | '@GenFlynn @realDonaldTrump @mike_pence after what I witnessed today... wis needs serious help! Anti worker anti liberty agenda. Elite B.S. https://t.co/URL3FrNfqT' | wis | United States | English | 10/9/2016 4:45 | 1944 | 2472 | Right | RightTroll |
| 1 | AMELIEBALDWIN | '@GenFlynn @realDonaldTrump @mike_pence after what I witnessed today... wis needs serious help! Anti worker anti liberty agenda. Elite B.S. https://t.co/URL3FrNfqT' | anti | United States | English | 10/9/2016 4:45 | 1944 | 2472 | Right | RightTroll |
dim(data_tidy)
[1] 44001    11
  • Questions:
    • How does our dataset change after tokenization and removing stopwords? How many observations do we now have? And what do the variable text_id and word identify/store?
    • Also, have a closer look at the single words. Do you notice anything else that has changed, e.g., is something missing from the original text?



We can use str() and data_tidy$word[1:15] to inspect the resulting data/tokens.

str(data_tidy,10)
tibble [44,001 × 11] (S3: tbl_df/tbl/data.frame)
 $ text_id         : int [1:44001] 1 1 1 1 1 1 1 1 1 1 ...
 $ author          : chr [1:44001] "AMELIEBALDWIN" "AMELIEBALDWIN" "AMELIEBALDWIN" "AMELIEBALDWIN" ...
 $ text            : chr [1:44001] "'@GenFlynn @realDonaldTrump @mike_pence after what I witnessed today... wis needs serious help! Anti worker ant"| __truncated__ "'@GenFlynn @realDonaldTrump @mike_pence after what I witnessed today... wis needs serious help! Anti worker ant"| __truncated__ "'@GenFlynn @realDonaldTrump @mike_pence after what I witnessed today... wis needs serious help! Anti worker ant"| __truncated__ "'@GenFlynn @realDonaldTrump @mike_pence after what I witnessed today... wis needs serious help! Anti worker ant"| __truncated__ ...
 $ word            : chr [1:44001] "genflynn" "realdonaldtrump" "mike_pence" "witnessed" ...
 $ region          : chr [1:44001] "United States" "United States" "United States" "United States" ...
 $ language        : chr [1:44001] "English" "English" "English" "English" ...
 $ publish_date    : chr [1:44001] "10/9/2016 4:45" "10/9/2016 4:45" "10/9/2016 4:45" "10/9/2016 4:45" ...
 $ following       : num [1:44001] 1944 1944 1944 1944 1944 ...
 $ followers       : num [1:44001] 2472 2472 2472 2472 2472 ...
 $ account_type    : chr [1:44001] "Right" "Right" "Right" "Right" ...
 $ account_category: chr [1:44001] "RightTroll" "RightTroll" "RightTroll" "RightTroll" ...
data_tidy$word[1:15]
 [1] "genflynn"        "realdonaldtrump" "mike_pence"      "witnessed"      
 [5] "wis"             "anti"            "worker"          "anti"           
 [9] "liberty"         "agenda"          "elite"           "b.s"            
[13] "https"           "t.co"            "url3frnfqt"     

6.3 Data manipulation: Tidytext Example (2)

  • Other transformations may need some dealing with regular expressions
    • e.g., to remove white space with base R’s gsub() (the regular expression \s+ matches one or more whitespace characters):
example_white_space <- gsub("\\s+","",data$text)
example_white_space[1:5]
[1] "'@GenFlynn@realDonaldTrump@mike_penceafterwhatIwitnessedtoday...wisneedsserioushelp!Antiworkerantilibertyagenda.EliteB.S.https://t.co/URL3FrNfqT'"
[2] "Trans-SiberianOrchestraTicketsOnSale.Buy#TSOChristmasConcert#Tickets@eCityTicketshttps://t.co/FWMaOWxEh7https://t.co/J2ZzUe87Fl"                  
[3] "#anderrGRAPHICVIDEO:DidANTIFAViolenceCausetheTragedyat#Charlottesville?https://t.co/y2QeK4AC0Jhttps://t.co/OEXcWQdiL4"                            
[4] "RememberallthefusswhentheUSwronglyaccused#Russiaofdestroyinghospitalsin#Syria?Wellthisisforrealhttps://t.co/U5LHrZLgnB"                           
[5] "RT@2AFight:Morepeoplediefromalcoholthanguns.READ>https://t.co/7H6OlJJtEC#2A#NRA#tcot#tgdn#PJNET#ccot#teapartyhttps://t.c…"                        
  • Advantage: tidy text format → regular R functions can be used
    • …instead of functions specialized to analyze a corpus object
  • e.g., use dplyr workflow to count the most popular words in your text data:
data_tidy %>% 
    count(word) %>% 
    arrange(desc(n))



Below is an example where we first identify the rarest tokens and then filter them out:

# Identify rare tokens
tokens_rare <- 
    data_tidy %>% 
    count(word) %>% # Count frequency of tokens
    arrange(n) %>% # Order dataframe (rarest tokens first)
    slice(1:5000) %>% # Take first 5000 rows (rarest tokens)
    pull(word) # Extract tokens

# Filter out tokens defined above
data_tidy_filtered <- 
    data_tidy %>% 
    filter(!word %in% tokens_rare)

dim(data_tidy_filtered)
[1] 39001    11
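
One caveat of taking the first 5000 rows after sorting: many tokens share the same low count (e.g., they occur exactly once), so the cut-off between row 5000 and row 5001 is arbitrary among tied tokens. A common alternative is a minimum-frequency threshold. The base R sketch below uses an invented token vector and an assumed threshold `min_count`. In the tidy workflow the same idea would be a filter on the counted column.

```r
# Invented token vector standing in for data_tidy$word
words <- c("trump", "trump", "obama", "vote", "vote", "vote", "maga", "rally")

# Count how often each token occurs
freq <- table(words)

# Keep only tokens that occur at least min_count times (assumed threshold)
min_count <- 2
keep <- names(freq)[freq >= min_count]

words_filtered <- words[words %in% keep]
words_filtered  # "trump" "trump" "vote" "vote" "vote"
```
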



  • Tidytext is a good starting point (in my opinion), because we (can) carry out these steps individually
    • other packages combine many steps into one single function (e.g. quanteda combines pre-processing and DFM casting in one step)
  • R (as usual) offers many ways to achieve similar or same results
    • e.g. you could also import, filter and pre-process using dplyr and tidytext, further pre-process and vectorize with tm or quanteda (tm has simpler grammar but slightly fewer features), use machine learning applications and eventually re-convert to tidy format for interpretation and visualization (ggplot2)




7 Vectorization: Basics

  • Text analytical models (e.g., topic models) often require the input data to be stored in a certain format
  • Typically: document-term matrix (DTM), sometimes also called document-feature matrix (DFM)
    • turn raw text into a vector-space representation
    • matrix where each row represents a document and each column represents a word
      • term-frequency (tf): the number within each cell describes the number of times the word appears in the document
      • term frequency–inverse document frequency (tf-idf): weights term counts by how rare a term is across documents, so that terms appearing in (almost) every document are down-weighted, e.g., the word “education” in a corpus of articles on educational inequality
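
The following base R sketch builds a tiny DTM by hand and computes tf-idf for it. The three toy documents are invented. The weighting assumes tf = count divided by the number of tokens in the document and idf = ln(number of documents / number of documents containing the term), which to our knowledge matches what {tidytext} computes, but other variants exist.

```r
# Three invented, already tokenized documents
docs <- list(
  d1 = c("education", "inequality", "school"),
  d2 = c("education", "policy", "education"),
  d3 = c("education", "income")
)

# Vocabulary = all distinct terms, used as columns of the DTM
vocab <- sort(unique(unlist(docs)))

# DTM: rows = documents, columns = terms, cells = term counts
dtm <- t(sapply(docs, function(tokens) {
  tabulate(match(tokens, vocab), nbins = length(vocab))
}))
colnames(dtm) <- vocab

# Term frequency: share of each term within its document
tf <- dtm / rowSums(dtm)

# Inverse document frequency: log(N docs / docs containing the term)
idf <- log(nrow(dtm) / colSums(dtm > 0))

# tf-idf: multiply every column of tf by the idf of that term
tf_idf <- sweep(tf, 2, idf, `*`)

round(tf_idf, 3)
```

“education” occurs in every document, so its idf is log(1) = 0 and its tf-idf is zero everywhere, while “inequality” (only in d1) receives a positive weight.
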

8 Lab: Vectorization with Tidytext

Repeat all the steps from above…

# load and install packages if necessary
# install.packages("pacman")
pacman::p_load(tidyverse,
               rvest,
               xml2,
               tidytext,
               tm,
               ggwordcloud,
               knitr)



# Load the data
# data_IRAhandle_tweets_1_sampled.csv
# data <- read_csv(sprintf("https://docs.google.com/uc?id=%s&export=download",
#                          "1GNrZfF3itKxUtbngQhP6lCyOM3ApbgTD"))


data <- read_csv("data/data_IRAhandle_tweets_1_sampled.csv",
                 col_types = cols())
dim(data) # What dimensions do we have?
[1] 4000    9
# Add id
data <- data %>%
                mutate(text_id = row_number()) %>%
                select(text_id, everything()) # What happens here?


# Create tidy text format and remove stopwords
data_tidy <- data %>%
  unnest_tokens(word, text, drop = FALSE) %>% 
  anti_join(stop_words) %>%
    select(text_id, author, text, word, everything())

kable(head(data_tidy))
| text_id | author | text | word | region | language | publish_date | following | followers | account_type | account_category |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | AMELIEBALDWIN | '@GenFlynn @realDonaldTrump @mike_pence after what I witnessed today... wis needs serious help! Anti worker anti liberty agenda. Elite B.S. https://t.co/URL3FrNfqT' | genflynn | United States | English | 10/9/2016 4:45 | 1944 | 2472 | Right | RightTroll |
| 1 | AMELIEBALDWIN | '@GenFlynn @realDonaldTrump @mike_pence after what I witnessed today... wis needs serious help! Anti worker anti liberty agenda. Elite B.S. https://t.co/URL3FrNfqT' | realdonaldtrump | United States | English | 10/9/2016 4:45 | 1944 | 2472 | Right | RightTroll |
| 1 | AMELIEBALDWIN | '@GenFlynn @realDonaldTrump @mike_pence after what I witnessed today... wis needs serious help! Anti worker anti liberty agenda. Elite B.S. https://t.co/URL3FrNfqT' | mike_pence | United States | English | 10/9/2016 4:45 | 1944 | 2472 | Right | RightTroll |
| 1 | AMELIEBALDWIN | '@GenFlynn @realDonaldTrump @mike_pence after what I witnessed today... wis needs serious help! Anti worker anti liberty agenda. Elite B.S. https://t.co/URL3FrNfqT' | witnessed | United States | English | 10/9/2016 4:45 | 1944 | 2472 | Right | RightTroll |
| 1 | AMELIEBALDWIN | '@GenFlynn @realDonaldTrump @mike_pence after what I witnessed today... wis needs serious help! Anti worker anti liberty agenda. Elite B.S. https://t.co/URL3FrNfqT' | wis | United States | English | 10/9/2016 4:45 | 1944 | 2472 | Right | RightTroll |
| 1 | AMELIEBALDWIN | '@GenFlynn @realDonaldTrump @mike_pence after what I witnessed today... wis needs serious help! Anti worker anti liberty agenda. Elite B.S. https://t.co/URL3FrNfqT' | anti | United States | English | 10/9/2016 4:45 | 1944 | 2472 | Right | RightTroll |




With the cast_dtm function from the tidytext package, we can now transform it to a DTM.

# Cast tidy text data into DTM format
dtm <- data_tidy %>% 
  count(text_id,word) %>%
  cast_dtm(document = text_id,
                 term = word,
                 value = n) %>%
        as.matrix()

# Check the dimensions and a subset of the DTM
dim(dtm)
[1]  4000 15151
print(dtm[1:6,1:6]) # important: this is only a snippet of the DTM (first 6 docs/rows and 6 terms/cols)
    Terms
Docs agenda anti b.s elite genflynn https
   1      1    2   1     1        1     1
   2      0    0   0     0        0     2
   3      0    0   0     0        0     2
   4      0    0   0     0        0     1
   5      0    0   0     0        0     2
   6      0    0   0     0        0     0
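
The snippet above is mostly zeros, which is typical for a DTM. A back-of-the-envelope calculation with the dimensions reported above shows why converting with as.matrix() can become costly for larger corpora. The sketch assumes 8 bytes per cell (a numeric matrix); the actual memory use may differ.

```r
n_docs   <- 4000    # rows of the DTM reported above
n_terms  <- 15151   # columns of the DTM reported above
n_tokens <- 44001   # rows of data_tidy after stop word removal

# Number of cells in the dense matrix
cells <- n_docs * n_terms

# Approximate size of the dense matrix, assuming 8 bytes per cell
dense_mb <- cells * 8 / 1e6

# Upper bound on the share of non-zero cells:
# every row of data_tidy can fill at most one cell
max_nonzero_share <- n_tokens / cells

dense_mb           # roughly 485 MB
max_nonzero_share  # well below 0.1 percent
```

This is why packages such as tm store the DTM in a sparse format by default, and why the dense conversion is best reserved for small examples like this one.
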




9 Lab: Text visualization

Repeat all the steps from above…

# load and install packages if necessary
# install.packages("pacman")
pacman::p_load(tidyverse,
               rvest,
               xml2,
               tidytext,
               tm,
               ggwordcloud)

# Load the data
# data_IRAhandle_tweets_1_sampled.csv
# data <- read_csv(sprintf("https://docs.google.com/uc?id=%s&export=download",
#                          "1GNrZfF3itKxUtbngQhP6lCyOM3ApbgTD"))


data <- read_csv("data/data_IRAhandle_tweets_1_sampled.csv",
                 col_types = cols())


# Add id
data <- data %>%
                mutate(text_id = row_number()) %>%
                select(text_id, everything()) # What happens here?


# Create tidy text format and remove stopwords
data_tidy <- data %>%
  unnest_tokens(word, text, drop = FALSE) %>% 
  anti_join(stop_words) %>%
    select(text_id, author, text, word, everything())

9.1 Wordclouds

set.seed(42)


# Aggregate by word
data_plot <- 
    data_tidy %>% 
    filter(!word %in% c("t.co", "https", "rt", "http")) %>% # ?
    count(word) %>% # ?
    arrange(desc(n)) %>%
    slice(1:20)
    

ggplot(data_plot, aes(label = word, size = n)) +
  geom_text_wordcloud() +
  scale_size_area(max_size = 20) +
  theme_minimal()

9.2 Wordclouds across subsets (grouped)

set.seed(42)


# Aggregate by word
data_plot <- data_tidy %>% 
    filter(!word %in% c("t.co", "https", "rt", "http")) %>%
    count(word, account_category) %>% 
    group_by(account_category) %>%
    arrange(desc(n)) %>%
    slice(1:20) %>%
    ungroup()
    
# Wordcloud: Coloring different groups
ggplot(data_plot, aes(label = word, size = n, color = account_category)) +
  geom_text_wordcloud() +
  scale_size_area(max_size = 20) +
  theme_minimal()

# Wordcloud: Faceting different groups
ggplot(data_plot, aes(label = word, size = n)) +
  geom_text_wordcloud() +
  scale_size_area(max_size = 20) +
  theme_minimal() +
  facet_wrap(~account_category)

9.3 Barplots (Frequency)

set.seed(42)


# Data for vertical barplot
data_plot <- 
    data_tidy %>% 
    filter(!word %in% c("t.co", "https", "rt", "http")) %>%
    group_by(word) %>% 
  summarize(n= n()) %>%
    arrange(desc(n)) %>%
    slice(1:10) %>%
    mutate(word = factor(word, # Convert to factor for ordering
                                             levels = as.character(.$word),
                                             ordered = TRUE))
    
    
# Create barplot
ggplot(data_plot, aes(x = word, y = n)) +
  geom_bar(stat="identity") +
  theme_minimal() +
    theme(axis.text.x = element_text(angle = 45, hjust = 1))

# Data for horizontal barplot
data_plot <- data_plot %>%
    arrange(n) %>%
    mutate(word = factor(word,
                                             levels = as.character(.$word),
                                             ordered = TRUE))

# Create horizontal barplot
ggplot(data_plot, aes(x = n, y = word)) +
  geom_bar(stat="identity") +
  theme_minimal() +
    theme(axis.text.y = element_text(angle = 45, hjust = 1))
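
Why the conversion to a factor? A character variable is placed on a plot axis in alphabetical order, whereas a factor is placed in the order of its levels. The base R sketch below, with invented counts, shows how setting the levels by frequency fixes the bar order.

```r
# Invented word counts
counts <- data.frame(word = c("vote", "america", "news"),
                     n = c(5, 12, 8),
                     stringsAsFactors = FALSE)

# Alphabetical order is what a plot would use for a character variable
sort(counts$word)  # "america" "news" "vote"

# Set the factor levels in order of increasing frequency instead
counts$word <- factor(counts$word, levels = counts$word[order(counts$n)])

levels(counts$word)  # "vote" "news" "america"
```
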

# Data for faceted horizontal barplot
data_plot <- 
    data_tidy %>% 
    filter(!word %in% c("t.co", "https", "rt", "http")) %>%
    count(word, account_category) %>% # Count tokens per account category
    group_by(account_category) %>%
    arrange(desc(n)) %>%
    slice(1:10) %>%
    ungroup()


# Create horizontal barplot
ggplot(data_plot, aes(x = n, y = word)) +
  geom_bar(stat="identity") +
  theme_minimal() +
    theme(axis.text.y = element_text(angle = 45, hjust = 1)) +
  facet_wrap(~account_category)




10 TM example: Text pre-processing and vectorization

  • \(\rightarrow\) consider alternative packages (e.g., tm, quanteda)
  • Example: tm package
    • input: corpus not tidytext object
    • What is a corpus in R? \(\rightarrow\) group of documents with associated metadata
# Load the data
# data_IRAhandle_tweets_1_sampled.csv
# data <- read_csv(sprintf("https://docs.google.com/uc?id=%s&export=download",
#                          "1GNrZfF3itKxUtbngQhP6lCyOM3ApbgTD"))


data <- read_csv("data/data_IRAhandle_tweets_1_sampled.csv",
                 col_types = cols()) 

data <- data %>%
                mutate(text_id = row_number()) %>%
                select(text_id, everything())
dim(data)
# Clean corpus
corpus_clean <- VCorpus(VectorSource(data$text)) %>%
  tm_map(removePunctuation, preserve_intra_word_dashes = TRUE) %>%
  tm_map(removeNumbers) %>% 
  tm_map(content_transformer(tolower)) %>% 
  tm_map(removeWords, words = c(stopwords("en"))) %>% 
  tm_map(stripWhitespace) %>% 
  tm_map(stemDocument)

# Check exemplary document
corpus_clean[["1"]][["content"]]
[1] "genflynn realdonaldtrump mikep wit today wis need serious help anti worker anti liberti agenda elit bs httpstcourlfrnfqt"
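
The order of the cleaning steps matters. In the pipeline above punctuation is removed before stop words, so contractions lose their apostrophe and may no longer match entries of the stop word list. The base R sketch below illustrates the effect with an invented sentence and a tiny, hypothetical stop word list `stop_list`. It assumes the real list contains contractions written with an apostrophe.

```r
# Hypothetical stop word list containing a contraction
stop_list <- c("i", "don't", "it")
text <- "i don't like it"

# Order A: remove punctuation first, then stop words
a <- gsub("[[:punct:]]", "", text)            # "i dont like it"
tokens_a <- strsplit(a, " ")[[1]]
tokens_a <- tokens_a[!tokens_a %in% stop_list]

# Order B: remove stop words first, then punctuation
tokens_b <- strsplit(text, " ")[[1]]
tokens_b <- tokens_b[!tokens_b %in% stop_list]
tokens_b <- gsub("[[:punct:]]", "", tokens_b)

tokens_a  # "dont" "like"  (the contraction slips through)
tokens_b  # "like"
```
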



  • In case you pre-processed your data with the tm package, remember we ended with a pre-processed corpus object
  • Now, simply apply the DocumentTermMatrix function to this corpus object
# Pass your "clean" corpus object to the DocumentTermMatrix function
# The control argument keeps only terms that are at least two characters long
dtm_tm <- DocumentTermMatrix(corpus_clean, control = list(wordLengths = c(2, Inf)))

# Check a subset of the DTM
inspect(dtm_tm[,1:6])
<<DocumentTermMatrix (documents: 4000, terms: 6)>>
Non-/sparse entries: 9/23991
Sparsity           : 100%
Maximal term length: 12
Weighting          : term frequency (tf)
Sample             :
      Terms
Docs   ����� ������ ꮮꭺꮖꭼ ꮑꭵꮆꮋꮖ ꭶꮎ� ������������
  1        0      0    0     0   0            0
  1640     1      0    0     0   0            0
  2        0      0    0     0   0            0
  2869     0      0    0     0   0            1
  2889     0      1    0     0   0            0
  3        0      0    0     0   0            0
  3509     0      0    1     1   1            0
  3534     1      0    0     0   0            0
  640      1      0    0     0   0            0
  884      0      1    0     0   0            0

Q: How do the terms between the DTM we created with tidytext and the one created with tm differ? Why?

References

Roeder, Oliver. 2018. “Why We’re Sharing 3 Million Russian Troll Tweets.” https://fivethirtyeight.com/features/why-were-sharing-3-million-russian-troll-tweets/.
Silge, Julia. 2017. Text Mining with R: A Tidy Approach. First edition. Beijing, China.
Wilkerson, John, and Andreu Casas. 2017. “Large-Scale Computerized Text Analysis in Political Science: Opportunities and Challenges.” Annu. Rev. Polit. Sci. 20 (1): 529–44.

Footnotes

  1. Alternatives: bigrams/word pairs, word embeddings

  2. A document-term matrix (DTM) is a mathematical representation of text data where rows correspond to documents in the corpus and columns correspond to terms (words or phrases); the cells typically hold term counts. A document-feature matrix (DFM) has the same structure; “feature” is the more general label (used, e.g., by quanteda), since the columns need not be single words (e.g., stems or n-grams).

  3. Wilkerson and Casas (2017)