Chapter 11 Google Speech-to-Text API

Camille Landesvatter

You will need to install the following packages for this chapter (run the code):

# install.packages('pacman')
library(pacman)
p_load('tidyverse',
       'googleLanguageR')

11.1 Provided services/data

  • What data/service is provided by the API?

Google’s Speech-to-Text API allows you to convert audio files to text by applying powerful neural network models. Audio content can be transcribed in real time or (possibly of higher relevance for social science research) from stored files.

The API currently recognizes more than 125 languages and variants. It supports multiple audio formats; audio files of up to 60 seconds can be transcribed with a direct (synchronous) request, while files longer than 60 seconds require an asynchronous request, as sketched below.
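As a hedged illustration (using the ‘googleLanguageR’ package introduced in Section 11.4; the gs:// URI and file name are placeholders, not from the original example), an asynchronous request for a longer recording could look like this:

library(googleLanguageR)

# Audio longer than 60 seconds must be transcribed asynchronously and,
# per Google's documentation, should live in Cloud Storage (placeholder URI)
async_job <- gl_speech("gs://your-bucket/long-interview.wav",
                       languageCode = "en-US",
                       asynch = TRUE)

# Retrieve the transcript once the long-running operation has finished
result <- gl_speech_op(async_job)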

A demo of the API that allows you to record speech via your microphone (or to upload an audio file) and explore the resulting transcript can be found here.

Also note that Google offers a Text-to-Speech API, which simply performs the same operation the other way around.

11.2 Prerequisites

  • What are the prerequisites to access the API (authentication)?

To access and use the API, the following steps are necessary:

  • Create a Google account (if you do not already have one).

  • With this Google account, log in to the Google Cloud Platform and create a Google Cloud Project.

  • Within this Google Cloud Project, enable the Speech-to-Text API.

  • For authentication you will need to create an API key (which you should additionally restrict to the Speech-to-Text API). If, however, you are planning to call the Speech-to-Text API from outside a Google Cloud environment (e.g., from R), you will be required to use a private (service account) key. This can be achieved by creating a service account, which in turn allows you to download your private key as a JSON file (we show an example below).
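Once you have downloaded the JSON key, a convenient option is to point the ‘googleLanguageR’ package (used in Section 11.4) to it via the GL_AUTH environment variable. This is a minimal sketch, and the file path is a placeholder:

# Point googleLanguageR to the service account key before loading the package
Sys.setenv(GL_AUTH = "./your-key.json")

# The package reads GL_AUTH on load and authenticates automatically
library(googleLanguageR)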

11.3 Simple API call

  • What does a simple API call look like?

Note. For both Google’s Translation API and Google’s Natural Language API, this review demonstrates a simple API call via the Google Cloud Shell. In principle (and in a very similar procedure) this can also be done for the Speech-to-Text API. However, your audio file needs some pre-processing first: audio data (such as our example file in WAV format) is binary data, whereas the REST request sent via the Google Cloud Shell uses JSON, which does not support binary data. You therefore have to convert your binary audio file into text using Base64 encoding (also refer to this documentation on the Google website for more information). If you submit audio data that is not Base64 encoded, the Google Cloud Shell will return a 400 error stating that Base64 decoding failed for your (WAV) file. In the box below we provide the basic structure of the request.

  • To activate your Cloud Shell, inspect the upper right-hand corner of your Google Cloud Platform Console and click the icon called “Activate Cloud Shell”. Google Cloud Shell is a command-line environment running in the cloud.

  • Via the built-in editor in Cloud Shell, create a JSON file (call it, for instance, ‘request.json’). You can either upload your audio file directly via the Google Cloud Shell (look for the three-dotted “More” menu in the Shell and select “Upload file”) or integrate the audio content via Cloud Storage.

  • The WAV file we uploaded for this example (woman1_wb.wav) ships with the ‘googleLanguageR’ R package; its Base64-encoded content belongs in the “content” field of the request below.

{
  "audio": {
    "content": "woman1_wb"
  },
  "config": {
    "enableAutomaticPunctuation": true,
    "encoding": "LINEAR16",
    "languageCode": "en-US",
    "model": "default"
  }
}
  • For sending your data, pass a curl command to your Cloud Shell command line, referring (via @) to your request.json file from the previous step.

  • Don’t forget to insert your individual API key (alternatively, you could define it beforehand via a variable in your environment; see the example in the API call for Google’s NLP API later in this document).

curl "https://speech.googleapis.com/v1p1beta1/speech:recognize?key=APIKEY" -s -X POST -H "Content-Type: application/json" --data-binary @request.json

11.4 API access in R

  • How can we access the API from R (httr + other packages)?

Example using R-Package ‘googleLanguageR’

In this small example we demonstrate how to:

  • authenticate with your Google Cloud account within R,

  • import an exemplary audio file from the ‘googleLanguageR’ package,

  • transcribe this audio file and inspect its confidence score.

For further arguments, also read the gl_speech() documentation and this vignette.

Step 1: Load packages

library(tidyverse)
library(googleLanguageR)

Step 2: Authentication

gl_auth("./your-key.json")

Step 3: Analysis

We will now get a sample source file that ships with the googleLanguageR package. The transcript of this file is: “To administer medicine to animals is frequently a very difficult matter, and yet sometimes it’s necessary to do so” - which, according to Edmondson (2017) (one of the authors of the ‘googleLanguageR’ R package), is a fairly difficult sentence for computers to parse.

exemplary_audio <- system.file("woman1_wb.wav", package = "googleLanguageR")

We can now call the API via the function gl_speech(). Here you have to specify the audio_source (either a local file or a Google Cloud Storage URI) as well as the languageCode (the language spoken in your audio file).

exemplary_audio_analysis <- gl_speech(audio_source = exemplary_audio, languageCode = "en-GB")

The result is a list containing two data frames: transcript and timings.

dimnames(exemplary_audio_analysis$transcript)
## [[1]]
## [1] "1"
## 
## [[2]]
## [1] "transcript"   "confidence"   "languageCode" "channelTag"

The timings data frame stores timestamps telling us when each specific term was recognised. The transcript data frame, importantly, provides the transcript itself as well as a confidence score. We can see that the transcript misses one term (“a”) yet reports a confidence score close to 1.0 (about 0.92).

# Show transcript
exemplary_audio_analysis$transcript$transcript
## [1] "to administer medicine to animals is frequently very difficult matter and yet sometimes it's necessary to do so"
# Show confidence
exemplary_audio_analysis$transcript$confidence #0.92
## [1] "0.9151854"

11.5 Social science examples

  • Are there social science research examples using the API?

Gavras et al. (2022) published research on using voice recordings in smartphone surveys, i.e., they asked respondents to record their answers via microphone. For pre-processing and ultimately analyzing these data, they explicitly state that they used Google’s Speech-to-Text API (Gavras et al. 2022, 9). They also refer to another source, namely Proksch, Wratil, and Wäckerle (2019), who found the API’s output to be highly similar to manual transcriptions carried out by humans (e.g., r > 0.9 between Google-transcribed and human-transcribed political speeches in German). More generally, collecting survey data in audio rather than text format has received increasing attention in the recent past (see, e.g., Schober et al. 2015; Revilla and Couper 2021), as part of a larger “text-as-data” movement.

References

Edmondson, Mark. 2017. “Google Cloud Speech API.” https://mran.microsoft.com/snapshot/2017-09-27/web/packages/googleLanguageR/vignettes/speech.html.
Gavras, Konstantin, Jan Karem Höhne, Annelies G Blom, and Harald Schoen. 2022. “Innovating the Collection of Open-Ended Answers: The Linguistic and Content Characteristics of Written and Oral Answers to Political Attitude Questions.” J. R. Stat. Soc. Ser. A Stat. Soc. tba (tba): 1–19.
Proksch, Sven-Oliver, Christopher Wratil, and Jens Wäckerle. 2019. “Testing the Validity of Automatic Speech Recognition for Political Text Analysis.” Polit. Anal. 27 (3): 339–59.
Revilla, Melanie, and Mick P Couper. 2021. “Improving the Use of Voice Recording in a Smartphone Survey.” Soc. Sci. Comput. Rev. 39 (6): 1159–78.
Schober, Michael F, Frederick G Conrad, Christopher Antoun, Patrick Ehlen, Stefanie Fail, Andrew L Hupp, Michael Johnston, Lucas Vickers, H Yanna Yan, and Chan Zhang. 2015. “Precision and Disclosure in Text and Voice Interviews on Smartphones.” PLoS One 10 (6): e0128337.