Chapter 18 Media Cloud API

Chung-hong Chan

You will need to install the following packages for this chapter (run the code):

# install.packages('pacman')
library(pacman)
p_load('httr', 'stringr',
'mediacloud', 'tidytext', 'quanteda')

18.1 Provided services/data

  • What data/service is provided by the API?

According to the official FAQ, Media Cloud is “an open source and open data platform for storing, retrieving, visualizing, and analyzing online news.” It is a consortium project across multiple institutions, including the University of Massachusetts Amherst, Northeastern University, and the Berkman Klein Center for Internet & Society at Harvard University. Full technical details about the project and the data it provides are available in Roberts et al. (2021). In short, the system continuously crawls RSS and similar feeds from a large collection of media sources (more than 25,000 at the time of writing). Based on this large corpus of media content, the system provides three services: Topic Mapper, Media Explorer, and Source Manager.

The services are accessible through the web interface, which also supports CSV export. For programmatic access, Media Cloud also provides several APIs. I will focus on the main v2.0 API, because it is currently the only public API.

The main API provides functions to retrieve stories, tags, and sentences. Probably for copyright reasons, the API does not provide full-text stories. It is, however, possible to pull document-term matrices from the API.

18.2 Prerequisites

  • What are the prerequisites to access the API (authentication)?

An API key is required. You need to register for an account on the official Media Cloud website. Once you have access, click on your profile to obtain the API key.

It is recommended to set the API key as the environment variable MEDIACLOUD_API_KEY. Please consult Chapter 2 on how to do that in the section on Environment Variables.
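Before making any request, it can help to confirm that the key is actually visible to the R session. The following is a minimal sketch: the helper get_mc_key() is hypothetical (it is not part of httr or mediacloud), and the variable DEMO_KEY with its placeholder value is used only so that the demonstration does not overwrite a real key.

```r
# Hypothetical helper (not part of any package): return the key stored in an
# environment variable, or stop with an informative error if it is missing.
get_mc_key <- function(var = "MEDIACLOUD_API_KEY") {
  key <- Sys.getenv(var, unset = "")
  if (!nzchar(key)) {
    stop("Environment variable ", var, " is not set.", call. = FALSE)
  }
  key
}

# Demonstration with a throwaway variable and a placeholder value.
# Sys.setenv() only affects the current R session.
Sys.setenv(DEMO_KEY = "placeholder")
get_mc_key("DEMO_KEY")
```

Calling get_mc_key() without arguments checks MEDIACLOUD_API_KEY itself. Failing early with a clear message is usually easier to debug than an authentication error returned by the API.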

18.3 Simple API call

  • What does a simple API call look like?

The API documentation is available here. Please note the request limit.

The most important endpoints are:

  1. GET api/v2/media/list/
  2. GET api/v2/stories_public/list
  3. GET api/v2/stories_public/count
  4. GET api/v2/stories_public/word_matrix

It is also important to learn how to write a Solr query. Solr queries are used in both the q (“query”) and fq (“filter query”) parameters of many endpoints. For example, to search for stories containing both “mannheim” and “university” in the New York Times (media_id = 1), the Solr query should be: text:mannheim+AND+text:university+AND+media_id:1.
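Writing such query strings by hand becomes error-prone once several terms are involved. As a minimal sketch, the hypothetical helper build_solr_query() below (not part of any package) assembles the query from a vector of terms and an optional media ID, using the same "+AND+" form as the example above.

```r
# Hypothetical helper: combine search terms and an optional media ID into a
# Solr query string in the "+AND+"-separated form used in this chapter.
build_solr_query <- function(terms, media_id = NULL, field = "text") {
  clauses <- paste0(field, ":", terms)
  if (!is.null(media_id)) {
    clauses <- c(clauses, paste0("media_id:", media_id))
  }
  paste(clauses, collapse = "+AND+")
}

# Reproduces the query string given in the text
build_solr_query(c("mannheim", "university"), media_id = 1)
```

The helper only covers simple conjunctions of single terms; phrases, OR clauses, and ranges would still need to be written out manually.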

In this example, we search for stories in the New York Times (Media ID: 1) mentioning “mannheim AND university”. No rows parameter is set, so the API returns 20 stories.

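library(httr)
library(stringr)

# Start from the endpoint URL and attach the query parameters
url <- parse_url("https://api.mediacloud.org/api/v2/stories_public/list")
params <- list(q = "text:mannheim+AND+text:university+AND+media_id:1",
               key = Sys.getenv("MEDIACLOUD_API_KEY"))
url$query <- params
# build_url() percent-encodes ":" and "+"; restore them so that the
# Solr query is sent as written
final_url <- str_replace_all(build_url(url), c("%3A" = ":", "%2B" = "+"))

# Send the request and parse the JSON response into an R list
res <- GET(final_url)
httr::content(res)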
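The parsed response is a nested list, which is awkward to inspect. The sketch below flattens such a list into a data frame. It runs on a small mocked list with invented values rather than a real response, and it assumes that each story is a list with the fields stories_id, media_id, publish_date, title, and url. The helper stories_to_df() is hypothetical. In a real session, one would typically call httr::stop_for_status(res) first so that failed requests raise an error.

```r
# Mocked stand-in for httr::content(res); the values are invented for
# illustration and do not come from the API.
parsed <- list(
  list(stories_id = 1L, media_id = 1L, title = "Example story A",
       publish_date = "2011-11-23 15:20:21", url = "http://example.com/a"),
  list(stories_id = 2L, media_id = 1L, title = "Example story B",
       publish_date = "2012-04-08 21:47:48", url = "http://example.com/b")
)

# Hypothetical helper: keep selected fields of each story and stack the
# stories into one data frame; missing fields become NA.
stories_to_df <- function(stories,
                          fields = c("stories_id", "media_id", "publish_date",
                                     "title", "url")) {
  rows <- lapply(stories, function(s) {
    vals <- lapply(fields, function(f) if (is.null(s[[f]])) NA else s[[f]])
    as.data.frame(setNames(vals, fields), stringsAsFactors = FALSE)
  })
  do.call(rbind, rows)
}

stories_df <- stories_to_df(parsed)
stories_df[, c("stories_id", "title")]
```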

18.4 API access in R

  • How can we access the API from R (httr + other packages)?

As of writing, there are (at least) four R packages for accessing the Media Cloud API. Although the mediacloudr package by Jan Dix is available on CRAN, I recommend using the mediacloud package by Julian Unkel (LMU). It is available on Dr Unkel’s GitHub. By default, the package always returns “tidy” objects. The package can be installed with:

devtools::install_github("joon-e/mediacloud")

The above “mannheim” example can be replaced by (the package looks for the environment variable MEDIACLOUD_API_KEY automatically):

library(mediacloud)
mc_mannheim <- search_stories(text = "mannheim AND university", media_id = 1, n = 20)
mc_mannheim
## # A tibble: 20 × 9
##    stories_id media_id publish_date        title                                                                                           url                                                                                                    processed_stories_id media_name collect_date        tags  
##         <int>    <int> <dttm>              <chr>                                                                                           <chr>                                                                                                                 <int> <chr>      <dttm>              <name>
##  1   44790862        1 2011-11-23 15:20:21 Special Report: Education: A Scholarly Role for Consumer Technology                             http://feeds.nytimes.com/click.phdo?i=f14ddd7045ae5899bdaed53548ea0fac                                             99320532 New York … 2011-11-23 23:02:06 <list>
##  2   78373861        1 2012-04-08 21:47:48 Germany Is the Place to Go to Block a Rival's Technology                                        http://feeds.nytimes.com/click.phdo?i=0b033a29b0a1cfd89e7d94303e90af5b                                            102657410 New York … 2012-04-08 23:57:08 <list>
##  3   82249892        1 2012-05-29 20:45:00 European Official Calls for an Economic Road Map                                                http://feeds.nytimes.com/click.phdo?i=b2370b329b6db682100a36312c9bc154                                            103821112 New York … 2012-05-29 20:57:57 <list>
##  4   28218544        1 2010-12-09 03:43:41 Back Injury Sidelines Klitschko                                                                 http://feeds1.nytimes.com/~r/nyt/rss/Sports/~3/ctnUdsJRHzk/sports-uk-boxing-klitschko-injury.html                 338790634 New York … 2010-12-09 17:55:43 <list>
##  5  314056973        1 2015-02-01 00:00:00 As New York Moves People With Developmental Disabilities to Group Homes, Some Families Struggle http://www.nytimes.com/2015/02/01/nyregion/as-new-york-moves-people-with-developmental-disabilities-t…            402084602 New York … 2015-01-29 20:55:31 <list>
##  6   85744480        1 2012-07-02 06:12:03 M.B.A. Candidates Learn Leadership, in the Mud                                                  http://feeds.nytimes.com/click.phdo?i=e758f43cc4e99833540e6d909491b96a                                            419439006 New York … 2012-07-02 08:52:45 <list>
##  7  225224567        1 2014-04-23 21:47:59 Personal Journeys: All Aboard the Orient Local                                                  http://rss.nytimes.com/c/34625/f/642561/s/39ae85bd/sc/38/l/0L0Snytimes0N0C20A140C0A40C270Ctravel0Call…            419552946 New York … 2014-04-24 03:30:12 <list>
##  8   94259735        1 2012-11-24 05:03:13 Arts | New Jersey: New Jersey Holiday Concerts Offer a Wide Range of Styles                     http://www.nytimes.com/2012/11/25/nyregion/new-jersey-holiday-concerts-offer-a-wide-range-of-styles.h…            420309253 New York … 2012-11-24 15:01:11 <list>
##  9   93835753        1 2012-11-18 05:15:34 Alexandra Gutowski, Zachary Allen &#x2014; Weddings                                             http://www.nytimes.com/2012/11/18/fashion/weddings/alexandra-gutowski-zachary-allen-weddings.html?par…            420359845 New York … 2012-11-18 12:53:46 <list>
## 10   87016471        1 2012-06-05 02:31:14 Germany Open to Deal on Pooling Euro Debt, With Limits                                          http://feeds.nytimes.com/click.phdo?i=5d966d2d43aaad1a056e5ae0e1d3e559                                            420417459 New York … 2012-08-16 00:42:29 <list>
## 11   85768000        1 2012-07-02 08:57:47 IHT Rendezvous: The Zany Business of Teaching Management                                        http://feeds.nytimes.com/click.phdo?i=6ab66f3da93480f3e56d6c286b75723b                                            420549716 New York … 2012-07-02 19:37:32 <list>
## 12   82322640        1 2012-06-01 15:14:14 European Crisis Bolsters Illegal Sales of Body Parts                                            http://feeds.nytimes.com/click.phdo?i=71ef4d87201aca8be1194d44b2cf6012                                            420757923 New York … 2012-06-12 20:13:56 <list>
## 13  298730990        1 2014-12-05 05:32:31 Calendar: Events in Connecticut for Dec. 7-13, 2014                                             http://rss.nytimes.com/c/34625/f/640367/s/41240614/sc/10/l/0L0Snytimes0N0C20A140C120C0A70Cnyregion0Ce…            428490928 New York … 2014-12-05 06:17:41 <list>
## 14   23790844        1 2010-09-09 16:24:59 In Land of Fast Cars and Trains, Buses Try to Make Inroads                                      http://feeds.nytimes.com/click.phdo?i=bb0541910fe17957f2cb52a2b205ae58                                            630486887 New York … 2010-09-09 21:30:45 <list>
## 15  354583774        1 2015-06-19 11:02:56 The Saturday Profile: A German Writer Translates a Puzzling Illness Into a Best-Selling Book    http://rss.nytimes.com/c/34625/f/642565/s/47654dfb/sc/33/l/0L0Snytimes0N0C20A150C0A60C20A0Cworld0Ceur…            778925402 New York … 2015-06-19 11:21:31 <list>
## 16  385106169        1 2015-10-07 05:42:02 Well: Homing In on the Source of Runner’s High                                                  http://rss.nytimes.com/c/34625/f/640347/s/4a79477a/sc/14/l/0Lwell0Bblogs0Bnytimes0N0C20A150C10A0C0A70…            833493930 New York … 2015-10-07 07:04:59 <list>
## 17  371289843        1 2015-08-20 07:26:16 News: Morning Agenda: Slow Inflation Makes Fed Cautious Over Rates                              http://news.blogs.nytimes.com/2015/08/20/morning-agenda-slow-inflation-makes-fed-cautious-over-rates/…            869026798 New York … 2015-08-20 08:57:27 <list>
## 18  601772076        1 2011-10-20 08:00:00 Gauging the Value of Your M.B.A.                                                                http://www.nytimes.com/2011/10/20/education/20iht-SReducEmploy20.html?pagewanted=all                              898178364 New York … 2017-03-30 20:11:59 <list>
## 19  651215537        1 2017-07-02 00:25:00 Nashonme Johnson, Pasquale Di Stasio                                                            https://www.nytimes.com/2017/07/02/fashion/weddings/nashonme-johnson-pasquale-di-stasio.html?partner=…            948038684 New York … 2017-07-02 02:00:29 <list>
## 20  656857074        1 2006-09-10 08:00:00 On Self                                                                                         http://www.nytimes.com/2006/09/10/magazine/10sontag.html?_r=1&oref=slogin&pagewanted=all                          954374793 New York … 2017-07-13 13:55:39 <list>

18.4.1 Media keywords of “Universität Mannheim”

In the following slightly more sophisticated example, we are going to first search for a list of all national German media outlets, search for a bunch of (German) articles mentioning “Universität Mannheim”, and then extract keywords using term frequency-inverse document frequency (TF-IDF). There are three steps.

18.4.1.1 Search for all national German media outlets

All major German media outlets are tagged with Germany___National. The function search_media() is used to retrieve information about all national German media outlets.

de_media <- search_media(tag = "Germany___National", n = 100)
de_media
## # A tibble: 50 × 5
##    media_id name                 url                             start_date          tags        
##       <int> <chr>                <chr>                           <dttm>              <named list>
##  1    19831 Spiegel              http://www.spiegel.de           2013-06-10 00:00:00 <list [19]> 
##  2    20001 taz.de               http://www.taz.de               2017-04-03 00:00:00 <list [17]> 
##  3    21558 neues-deutschland.de http://www.neues-deutschland.de 2017-04-03 00:00:00 <list [12]> 
##  4    21854 jungewelt.de         http://www.jungewelt.de         2018-06-04 00:00:00 <list [21]> 
##  5    21917 berlinerumschau.com  http://www.berlinerumschau.com  2014-12-29 00:00:00 <list [7]>  
##  6    22009 bild.de              http://www.bild.de              2013-06-10 00:00:00 <list [21]> 
##  7    23037 manager-magazin.de   http://www.manager-magazin.de   2017-04-10 00:00:00 <list [9]>  
##  8    23538 n-tv.de              http://www.n-tv.de              2017-04-03 00:00:00 <list [9]>  
##  9    38697 zeit                 http://www.zeit.de/index        2013-03-11 00:00:00 <list [20]> 
## 10    39206 Tagespiegel          http://www.tagesspiegel.de      2013-03-18 00:00:00 <list [20]> 
## # ℹ 40 more rows

18.4.1.2 Pull a list of articles

The following query gets a list of up to 100 articles mentioning “universität mannheim” that were published in a specific date range by any of the national German media outlets. Unlike a query with the AND operator, the quoted query searches for the exact phrase. Queries are also case insensitive. The function search_stories() can be used for this.

unima_articles <- search_stories(text = "\"universität mannheim\"",
                                 media_id = de_media$media_id,
                                 n = 100,
                                 after_date = "2021-01-01",
                                 before_date = "2021-12-01")
unima_articles
## # A tibble: 88 × 9
##    stories_id media_id publish_date        title                                                                                                                                         url                                                      processed_stories_id media_name collect_date        tags  
##         <int>    <int> <dttm>              <chr>                                                                                                                                         <chr>                                                                   <dbl> <chr>      <dttm>              <name>
##  1 1815576509   385413 2021-01-05 10:57:12 Studie: Erneuter Lockdown belastete Unternehmensgewinne nicht spürbar                                                                         https://www.boersen-zeitung.de/index.php?l=5&isin=&dpas…           2225420413 Börsen-Ze… 2021-01-05 11:14:23 <list>
##  2 1815575525   282241 2021-01-05 10:56:38 Studie: Erneuter Lockdown belastete Unternehmensgewinne nicht spürbar                                                                         https://www.boerse-online.de/nachrichten/aktien/studie-…           2225420496 boerse-on… 2021-01-05 11:14:17 <list>
##  3 1815573995    42022 2021-01-05 11:05:00 Studie: Erneuter Lockdown belastete Unternehmensgewinne nicht spürbar                                                                         https://www.finanznachrichten.de/nachrichten-2021-01/51…           2225423365 finanznac… 2021-01-05 11:14:05 <list>
##  4 1815653090    39727 2021-01-05 12:44:00 Bund-Länder-Beratungen: Jetzt live: Kanzlerin Merkel und Ministerpräsidenten stellen die neuen Beschlüsse vor                                 https://www.stern.de/gesundheit/corona-gipfel-im-livebl…           2225488400 stern      2021-01-05 12:45:59 <list>
##  5 1815772594    21558 2021-01-05 11:07:16 Das Prinzip Links                                                                                                                             https://www.neues-deutschland.de/artikel/1146618.das-pr…           2225615666 neues-deu… 2021-01-05 15:43:01 <list>
##  6 1821852038    42022 2021-01-12 13:11:00 Medizinjurist: Impfpflicht für Pflegekräfte rechtlich vertretbar                                                                              https://www.finanznachrichten.de/nachrichten-2021-01/51…           2231370715 finanznac… 2021-01-12 13:29:47 <list>
##  7 1822816931    41519 2021-01-13 10:54:24 Staatsverschuldung: Rückhalt für Schuldenbremse nimmt ab – Landespolitiker für Infrastrukturinvestitionen                                     https://www.handelsblatt.com/politik/deutschland/staats…           2232460589 Handelsbl… 2021-01-13 11:04:27 <list>
##  8 1822860428    39206 2021-01-13 11:32:47 Schützen Konzerne ihre Beschäftigten ausreichend vor Corona-Ausbrüchen?                                                                       https://www.tagesspiegel.de/politik/weniger-homeoffice-…           2232506700 Tagespieg… 2021-01-13 11:41:40 <list>
##  9 1823395678    42022 2021-01-14 01:07:00 PRESSESPIEGEL/Zinsen, Konjunktur, Kapitalmärkte, Branchen                                                                                     https://www.finanznachrichten.de/nachrichten-2021-01/51…           2233108356 finanznac… 2021-01-14 01:32:49 <list>
## 10 1829126009    41519 2021-01-20 09:57:59 Homeoffice und Hygienevorschriften: „Wir können die Zentrale nicht einfach abschließen“: Wirtschaft reagiert gespalten auf neue Corona-Regeln https://www.handelsblatt.com/karriere/homeoffice-und-hy…           2239054366 Handelsbl… 2021-01-20 10:17:51 <list>
## # ℹ 78 more rows

18.4.1.3 Pull word matrices

With the list of stories_id, we can then use the function get_word_matrices() to obtain word matrices (see note 1).

unima_mat <- get_word_matrices(stories_id = unima_articles$stories_id, n = 100)
unima_mat
## # A tibble: 30,353 × 4
##    stories_id word_counts word_stem          full_word          
##    <chr>            <int> <chr>              <chr>              
##  1 1815573995           1 deutschland        deutschland        
##  2 1815573995           1 befragt            befragt            
##  3 1815573995           1 irgendeinem        irgendeinem        
##  4 1815573995           1 vorgelegten        vorgelegten        
##  5 1815573995           1 solo-selbstständig solo-selbstständige
##  6 1815573995           1 wert               wert               
##  7 1815573995           1 zufolg             zufolge            
##  8 1815573995           1 dpa-afx            dpa-afx            
##  9 1815573995           1 universität        universität        
## 10 1815573995           1 gewinnsitu         gewinnsituation    
## # ℹ 30,343 more rows

The data frame unima_mat is in the so-called “tidytext” format (Silge and Robinson 2016). It can be used directly for analysis if one is fond of tidytext. For users of quanteda (Benoit et al. 2018), it is also possible to cast the data frame into a Document-Feature Matrix (DFM) (see note 2).

library(tidytext)
library(quanteda)
unima_dfm <- cast_dfm(unima_mat, stories_id, word_stem, word_counts)
unima_dfm
## Document-feature matrix of: 88 documents, 14,614 features (97.64% sparse) and 0 docvars.
##             features
## docs         deutschland befragt irgendeinem vorgelegten solo-selbstständig wert zufolg dpa-afx universität gewinnsitu
##   1815573995           1       1           1           1                  1    1      1       1           1          1
##   1815575525           1       1           1           1                  1    1      1       1           1          1
##   1815576509           0       0           0           0                  0    0      0       1           1          0
##   1815653090           1       1           0           1                  0    1      0       0           1          0
##   1815772594           0       0           0           0                  0    0      0       0           1          0
##   1821852038           0       0           0           0                  0    0      0       0           1          0
## [ reached max_ndoc ... 82 more documents, reached max_nfeat ... 14,604 more features ]
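In the output above, the first two documents have identical counts for the features shown, which suggests near-duplicate articles. One common way to flag such cases is pairwise cosine similarity between document vectors. The following sketch uses a toy count matrix with invented values and base R only; the threshold of 0.95 is an arbitrary choice for illustration. With real data, one could presumably start from as.matrix(unima_dfm).

```r
# Toy document-feature counts (invented); rows are documents, columns features.
m <- rbind(
  doc_a = c(1, 1, 1, 0, 2),
  doc_b = c(1, 1, 1, 0, 2),  # identical to doc_a
  doc_c = c(0, 0, 1, 3, 0)
)

# Cosine similarity between all pairs of rows
cosine_sim <- function(x) {
  norms <- sqrt(rowSums(x^2))
  (x %*% t(x)) / (norms %o% norms)
}

sim <- cosine_sim(m)

# Flag each document that is (nearly) identical to an earlier one, so that
# the first copy of every duplicate group is kept.
threshold <- 0.95  # arbitrary cut-off chosen for this sketch
dup <- vapply(seq_len(nrow(sim)), function(i) {
  i > 1 && any(sim[i, seq_len(i - 1)] >= threshold)
}, logical(1))
names(dup) <- rownames(m)
dup

# Keep only the documents that are not flagged
m_dedup <- m[!dup, , drop = FALSE]
```

Here doc_b is flagged and dropped, while doc_a and doc_c are kept. For large corpora, a dense similarity matrix becomes expensive, and a dedicated package is likely the better choice.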

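Standard quanteda operations can then be applied. For example, the following weights the matrix by TF-IDF and lists the 20 highest-scoring features.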

unima_dfm %>% dfm_tfidf() %>% topfeatures(n = 20)
##     aufsichtsrat hauptversammlung     gesellschaft            westw            group         vorstand         aktionär        vergütung           aktien               ab             aktg               se   verwaltungsrat            gemäß vergütungssystem         mitglied    geschäftsjahr 
##         706.2057         619.3220         583.4058         494.6793         400.0420         379.0577         363.6897         333.5708         333.1135         332.9208         280.5663         260.5542         252.3862         242.9785         209.4179         201.7753         200.6090 
##            board      supervisori     ermächtigung 
##         187.9392         184.7259         167.2792

The faculties of BWL (Business Administration) and Jura (Law) would be happy with this finding.
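To make the weighting less of a black box, the following sketch computes TF-IDF by hand on a toy count matrix with invented values. It assumes the scheme that, to my understanding, dfm_tfidf() uses by default: the raw count multiplied by log10 of the number of documents divided by the document frequency. Summing the weighted columns mimics what topfeatures() reports.

```r
# Toy counts (invented): three documents, three features
counts <- rbind(
  doc1 = c(aktie = 3, studie = 1, mannheim = 1),
  doc2 = c(aktie = 0, studie = 2, mannheim = 1),
  doc3 = c(aktie = 0, studie = 0, mannheim = 1)
)

n_docs   <- nrow(counts)
doc_freq <- colSums(counts > 0)        # number of documents containing each feature
idf      <- log10(n_docs / doc_freq)   # inverse document frequency

# Multiply every column by the IDF of its feature
tfidf <- sweep(counts, 2, idf, `*`)

# Total TF-IDF weight per feature, highest first
sort(colSums(tfidf), decreasing = TRUE)
```

A feature that occurs in every document, such as mannheim here, has an IDF of log10(1) = 0 and therefore a weight of zero. This is why a term that every retrieved article contains by construction cannot appear among the top features, whereas terms concentrated in a few long documents dominate the list.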

18.5 Social science examples

  • Are there social science research examples using the API?

According to the paper by the official Media Cloud Team (Roberts et al. 2021), there are over 100 papers mentioning Media Cloud. Many papers use the counting endpoint to generate a time series of media attention to specific keywords (e.g. Benkler et al. 2015; Huckins et al. 2020). This function is also widely used in data journalism pieces. The URLs collected from Media Cloud can also be used for further crawling (e.g. Huckins et al. 2020).
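A time series of this kind can be produced by sending one count request per time interval. The sketch below only prepares the filter queries for monthly intervals and does not contact the API. The field name publish_date and the timestamp format are assumptions that should be checked against the API documentation, and make_fq() is a hypothetical helper.

```r
# Interval boundaries: the first day of each month
month_starts <- seq(as.Date("2021-01-01"), as.Date("2021-04-01"), by = "month")

# Hypothetical helper: Solr range filter from `start` to `end`.
# The field name and timestamp format are assumptions, not confirmed here.
make_fq <- function(start, end) {
  sprintf("publish_date:[%sT00:00:00Z TO %sT00:00:00Z]",
          format(start, "%Y-%m-%d"), format(end, "%Y-%m-%d"))
}

# One row per month, pairing each start with the following month's start
queries <- data.frame(
  month = format(head(month_starts, -1), "%Y-%m"),
  fq = make_fq(head(month_starts, -1), tail(month_starts, -1)),
  stringsAsFactors = FALSE
)
queries
```

Each fq value would then be sent together with a keyword query q to api/v2/stories_public/count, and the returned counts collected per month.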

It is perhaps worth mentioning that the openly available useNews dataset (Puschmann and Haim 2021) provides a large collection of content from Media Cloud together with metadata from other data sources.

References

Benkler, Yochai, Hal Roberts, Robert Faris, Alicia Solow-Niederman, and Bruce Etling. 2015. “Social Mobilization and the Networked Public Sphere: Mapping the SOPA-PIPA Debate.” Political Communication 32 (4): 594–624. https://doi.org/10.1080/10584609.2014.986349.
Benoit, Kenneth, Kohei Watanabe, Haiyan Wang, Paul Nulty, Adam Obeng, Stefan Müller, and Akitaka Matsuo. 2018. “quanteda: An R package for the quantitative analysis of textual data.” Journal of Open Source Software 3 (30): 774. https://doi.org/10.21105/joss.00774.
Huckins, Jeremy F, Alex W daSilva, Weichen Wang, Elin Hedlund, Courtney Rogers, Subigya K Nepal, Jialing Wu, et al. 2020. “Mental Health and Behavior of College Students During the Early Phases of the COVID-19 Pandemic: Longitudinal Smartphone and Ecological Momentary Assessment Study.” Journal of Medical Internet Research 22 (6): e20185. https://doi.org/10.2196/20185.
Puschmann, Cornelius, and Mario Haim. 2021. “useNews.” https://doi.org/10.17605/OSF.IO/UZCA3.
Roberts, Hal, Rahul Bhargava, Linas Valiukas, Dennis Jen, Momin M Malik, Cindy Bishop, Emily Ndulue, et al. 2021. “Media Cloud: Massive Open Source Collection of Global News on the Open Web.” arXiv Preprint arXiv:2104.03702.
Silge, Julia, and David Robinson. 2016. “tidytext: Text Mining and Analysis Using Tidy Data Principles in R.” Journal of Open Source Software 1 (3). https://doi.org/10.21105/joss.00037.

  1. For teaching purposes, steps 2 and 3 are presented separately. In practice, they can be merged into a single call: get_word_matrices(text = "\"universität mannheim\"")

  2. It is quite obvious that there are (many) duplicates in the retrieved data. For example, the first few documents are almost identical in the feature space. Packages such as textsdc might be useful for deduplication.