Chapter 9 Google Natural Language API

Paul C. Bauer, Camille Landesvatter, Malte Söhren

You will need to install the following packages for this chapter (run the code):

# install.packages('pacman')
library(pacman)
p_load('googleLanguageR',
'tidyverse', 'tm', 'ggwordcloud')

9.1 Provided services/data

  • What data/service is provided by the API?

The API is provided by Google.

Google Cloud offers two Natural Language products: AutoML Natural Language and the Natural Language API. See here to read about which of the two products is more useful to you. In short, option 1, AutoML Natural Language, allows you to train a new, custom model to classify text, extract entities, or detect sentiment. For instance, you could provide an already pre-labeled subset of your data, which the API then uses to train a custom classifier. With this classifier at hand, you could then classify and analyze further, similar data. This API review focuses on option 2, the Natural Language API. This API uses pre-trained models to analyze your data. Put differently, instead of providing only a pre-labeled subset of your data, here you normally provide the API with your complete (unlabeled) data, which it then analyzes.

The following requests are available:

  • Analyzing Sentiment (analyzeSentiment)
  • Analyzing Entities (analyzeEntities)
  • Analyzing Syntax (analyzeSyntax)
  • Analyzing Entity Sentiment (analyzeEntitySentiment)
  • Classifying Content (classifyText)

A demo of the API that allows you to input text and explore the different classification capabilities can be found here.

9.2 Prerequisites

  • What are the prerequisites to access the API (authentication)?

The prerequisite for accessing the Google Natural Language API is a Google Cloud project. To create one, you will need a Google account to log into the Google Cloud Platform (GCP). Within the GCP, you must enable the Natural Language API for your respective Google Cloud project here. Additionally, if you are planning to call the Natural Language API from outside a Google Cloud environment (e.g., from R), you will need a private (service account) key. This can be achieved by creating a service account, which in turn allows you to download your private key as a JSON file. To create an API key for authentication from within the GCP, go to APIs & Services > Credentials. Below we provide an example of how to authenticate from within the Google Cloud Platform (Cloud Shell + API key) and how to authenticate from within R (authentication via JSON key file).

9.3 Simple API call

  • What does a simple API call look like?

Here we describe how a simple API call can be made from within the Google Cloud Platform environment via the Google Cloud Shell:

  • To activate your Cloud Shell, inspect the upper right-hand corner of your Google Cloud Platform Console and click the icon called “Activate Shell”. Google Cloud Shell is a command line environment running in the cloud.
  • Via the Cloud Shell command line, store your individual API key in an environment variable so that you do not have to paste it into each request.
export API_KEY=<YOUR_API_KEY>
  • Via the built-in editor in Cloud Shell, create a JSON file (call it, for instance, ‘request.json’) with the text that you would like to analyze. Note that text can be uploaded directly in the request (shown below) or integrated with Cloud Storage. Supported text types are PLAIN_TEXT (shown below) and HTML.
{
  "document":{
    "type":"PLAIN_TEXT",
    "content":"Enjoy your vacation!"
  },
  "encodingType": "UTF8"
}
  • To send your data, pass a curl command to the Cloud Shell command line, referring (via @) to the request.json file from the previous step.
curl "https://language.googleapis.com/v1/documents:analyzeEntities?key=${API_KEY}" -s -X POST -H "Content-Type: application/json" --data-binary @request.json
  • Depending on which endpoint you send the request to (here: analyzeEntities), you will receive a response with different insights into your text data. Below we sketch how the same request could be sent directly from R.
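
A minimal sketch with the httr package, assuming the API key is stored in an environment variable called API_KEY (as in the Cloud Shell example above; from R it could be set via Sys.setenv() or an .Renviron entry):

library(httr)

# Build the same request body as in request.json above
body <- list(
  document = list(
    type = "PLAIN_TEXT",
    content = "Enjoy your vacation!"
  ),
  encodingType = "UTF8"
)

# Send the POST request to the analyzeEntities endpoint
res <- POST(
  "https://language.googleapis.com/v1/documents:analyzeEntities",
  query = list(key = Sys.getenv("API_KEY")),
  body = body,
  encode = "json"
)

# Parse the JSON response into an R list
content(res)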

9.4 API access in R

  • How can we access the API from R (httr + other packages)?

The input (i.e., text data) one provides to the API will most often go beyond a single word or sentence. The most convenient way to access the API, which also produces the most insightful and structured results (on which you can directly perform further analysis), is the ‘googleLanguageR’ R package - a package which, among other options (there are other examples in this review), allows calling the Natural Language API:

In this small example we demonstrate how to

  • authenticate with your Google Cloud account within R,

  • analyze the syntax of example text data (here: Guardian newspaper articles),

  • extract terms that are nouns only, and

  • plot these nouns in a word cloud.

Step 1: Load package

library(googleLanguageR)
library(tidyverse)
library(tm) # stopwords

Step 2: Authentication

gl_auth("./your-key.json")

Step 3: Analysis

Start with loading your text data. For this example, we retrieve data from the quanteda.corpora R package, which is associated with the well-known quanteda package.

The data we choose to download (‘data_corpus_guardian’) contains Guardian newspaper articles from the politics, economy, society and international sections from 2012 to 2016. See here for a list of further publicly available text corpora from the quanteda.corpora package.

# Download and store corpus
guardian_corpus <- quanteda.corpora::download("data_corpus_guardian")

# Keep text only from the corpus
text <- guardian_corpus[["documents"]][["texts"]]

# For demonstration purposes, subset the text data to 20 observations only
text <- text[1:20]

# Turn text into a data frame and add an identifier
df <- as.data.frame(text)
df <- tibble::rowid_to_column(df, "ID")

Note: Whenever you work with textual data, a very common procedure is to pre-process the data via a set of standard transformations. For instance, you might convert all letters to lower case, remove numbers and punctuation, trim words to their word stem, and remove so-called stopwords. There are many tutorials on this (for example here or here); a minimal sketch with the tm package is shown below.
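
A minimal sketch of such transformations with the tm package (already loaded above; stemming additionally requires the SnowballC package). Note that we do not apply these steps here, since the syntax analysis below works on the raw articles:

# Hypothetical pre-processing of the article texts (not run for this example)
clean <- tolower(df$text)                             # convert to lower case
clean <- tm::removeNumbers(clean)                     # remove numbers
clean <- tm::removePunctuation(clean)                 # remove punctuation
clean <- tm::removeWords(clean, tm::stopwords("en"))  # remove English stopwords
clean <- tm::stemDocument(clean)                      # trim words to their stem (needs SnowballC)
clean <- tm::stripWhitespace(clean)                   # collapse redundant whitespace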

After having retrieved and prepared our text data, we can now call the API via the function gl_nlp(). Here you have to specify the type of analysis you are interested in (here: analyzeSyntax). Depending on which analysis you request (e.g., analyzeSyntax, analyzeSentiment, etc.), a list with information on different characteristics of your text is returned, e.g., sentences, tokens, and tags of tokens.

syntax_analysis <- gl_nlp(df$text, nlp_type = "analyzeSyntax")

Importantly, the large list syntax_analysis contains a list called tokens, with one entry per document. Among other variables, it stores content (the token itself) and tag (the part-of-speech tag, e.g., verb or noun). Let’s have a look at the first document.

head(syntax_analysis[["tokens"]][[1]][,1:3])
      content beginOffset   tag
1      London           0  NOUN
2 masterclass           7  NOUN
3          on          19   ADP
4     climate          22  NOUN
5      change          30  NOUN
6           |          37 PUNCT
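
To get a quick overview of which part-of-speech tags the API assigned in the first document, we can tabulate the tag variable:

# Frequency of part-of-speech tags in the first document
table(syntax_analysis[["tokens"]][[1]][["tag"]])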

Now imagine you are interested in all the nouns that were used in the Guardian articles while removing all other types of words (e.g., adjectives, verbs, etc.). We can simply filter for those using the tag variable.

# Add tokens from syntax analysis to original dataframe
df$tokens <- syntax_analysis[["tokens"]]

# Keep nouns only
df <- df %>% dplyr::mutate(nouns = map(tokens, 
                            ~ dplyr::filter(., tag == "NOUN"))) 
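
As a quick sanity check, we can count the most frequent nouns in, say, the first article:

# Most frequent nouns in the first document
df$nouns[[1]] %>%
  dplyr::count(content, sort = TRUE) %>%
  head()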

Step 4: Visualization

Finally, we can also plot our nouns in a wordcloud using the ggwordcloud package.

# Load package
library(ggwordcloud)

# Create the data for the plot
data_plot <- df %>% 
  # Only keep the content variable within each nouns data frame
  mutate(nouns = map(nouns, ~ select(., content))) %>% 
  # Unnest the list column so that each noun gets its own row
  unnest(nouns) %>%
  # Split content into individual lower-case words
  tidytext::unnest_tokens(output = word, input = content) %>%
  # Remove stopwords
  anti_join(tidytext::stop_words) %>%
  # Count how often each word occurs
  dplyr::count(word) %>%
  # Only keep words that appear more than 10 times
  filter(n > 10)

# Visualize in a word cloud
data_plot %>%
  ggplot(aes(label = word, 
             size = n)) +
  geom_text_wordcloud() +
  scale_size_area(max_size = 10) +
  theme_minimal()

Figure 9.1: Wordcloud of nouns found within Guardian articles

9.5 Social science examples

  • Are there social science research examples using the API?

Text-as-data has become quite a common approach in the social sciences (see, e.g., Grimmer and Stewart (2013) for an overview). However, we have the impression that Google’s Natural Language API is still relatively unknown among social scientists working with NLP. Hence, we want to emphasize the usefulness the API could have in many research projects.

However, if you are considering making use of it, keep two things in mind:

  • some might interpret using the API as a “black box” approach (see Dobbrick et al. (2021) for recent developments of “glass-box” machine learning approaches for text analysis), potentially standing in the way of transparency and replication (two important criteria of good research). However, it is always possible to perform robustness and sensitivity analyses and to report the version of the API one used.

  • depending on how large your corpus of text data is, Google might charge you money. However, for up to 5,000 units per month (a unit corresponds to a text block of up to 1,000 characters), the different variants of sentiment and syntax analysis are free. Check this overview by Google to learn about prices for larger volumes here. A rough way of estimating the number of units for a given corpus is sketched below. Generally, also consider that if you are pursuing a CSS-related project in which GCP products would come in useful, there is the possibility to obtain Google Cloud research credits (see here).
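
For the unit estimate mentioned above, a rough approximation in R (assuming the 1,000-character unit definition) is to count the characters per document:

# Approximate number of billable units for the 20 Guardian articles used above
sum(ceiling(nchar(text) / 1000))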

References

Dobbrick, Timo, Julia Jakob, Chung-Hong Chan, and Hartmut Wessler. 2021. “Enhancing Theory-Informed Dictionary Approaches with ‘Glass-Box’ Machine Learning: The Case of Integrative Complexity in Social Media Comments.” Commun. Methods Meas., November, 1–18.
Grimmer, Justin, and Brandon M Stewart. 2013. “Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts.” Polit. Anal. 21 (3): 267–97.