Chapter 18 Media Cloud API
You will need to install the following packages for this chapter (run the code):
# install.packages('pacman')
library(pacman)
p_load('httr', 'stringr',
       'mediacloud', 'tidytext', 'quanteda')
18.1 Provided services/data
- What data/service is provided by the API?
According to the official FAQ, Media Cloud is “an open source and open data platform for storing, retrieving, visualizing, and analyzing online news.” It is a consortium project across multiple institutions, including the University of Massachusetts Amherst, Northeastern University, and the Berkman Klein Center for Internet & Society at Harvard University. Full technical information about the project and the data it provides is available in Roberts et al. (2021). In short, the system continuously crawls RSS and similar feeds from a large collection of media sources (as of writing: > 25,000 media sources). Based on this large corpus of media content, the system provides three services: Topic Mapper, Media Explorer, and Source Manager.
The services are accessible through the web interface, which also supports CSV export. For programmatic access, Media Cloud provides several APIs. I will focus on the main v2.0 API, because it is currently the only public API.
The main API provides functions to retrieve stories, tags, and sentences. Probably for copyright reasons, the API does not provide full-text stories. It is, however, possible to pull document-term matrices from the API.
18.2 Prerequisites
- What are the prerequisites to access the API (authentication)?
An API key is required. You need to register for an account on the official Media Cloud website. Once you have access, click on your profile to obtain the API key.
It is recommended to set the API key as the environment variable MEDIACLOUD_API_KEY. Please consult the section on Environment Variables in Chapter 2 on how to do that.
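As a minimal sketch of that setup, the variable can be set for the current session and read back as shown below. The key value is a made-up placeholder; in practice the key would normally be stored outside the script (for example in an `.Renviron` file) rather than typed into code.

```r
# Set the key for the current R session only.
# "my-secret-key" is a placeholder, not a real key.
Sys.setenv(MEDIACLOUD_API_KEY = "my-secret-key")

# Read it back; Sys.getenv() returns "" when the variable is unset.
key <- Sys.getenv("MEDIACLOUD_API_KEY")

# Fail early with a clear message if the key is missing.
if (!nzchar(key)) {
  stop("MEDIACLOUD_API_KEY is not set")
}
```

Checking for an empty string up front avoids sending a request that is bound to be rejected.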
18.3 Simple API call
- What does a simple API call look like?
The API documentation is available here. Please note the request limit.
The most important endpoints are:
- GET api/v2/media/list/
- GET api/v2/stories_public/list
- GET api/v2/stories_public/count
- GET api/v2/stories_public/word_matrix
It is also important to learn how to write a Solr query. It is used in the q (“query”) and fq (“filter query”) parameters of many endpoint requests. For example, to search for stories with both “mannheim” and “university” in the New York Times (media_id = 1), the Solr query should be: text:mannheim+AND+text:university+AND+media_id:1.
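To make the structure of such a query explicit, here is a small sketch that assembles it from its parts. The function `build_solr_query()` is a hypothetical helper written for this illustration; it is not part of httr or of any Media Cloud package, and it only covers the simple AND case used in this chapter.

```r
# Hypothetical helper (not part of any package): combine search terms and
# an optional media ID into a Solr query string of the form used above.
build_solr_query <- function(terms, media_id = NULL) {
  parts <- paste0("text:", terms)
  if (!is.null(media_id)) {
    parts <- c(parts, paste0("media_id:", media_id))
  }
  # "+" stands in for a space in the URL, so "+AND+" reads as " AND ".
  paste(parts, collapse = "+AND+")
}

build_solr_query(c("mannheim", "university"), media_id = 1)
```

The call reproduces the query string from the example, which makes it easier to swap in other terms or another media ID without retyping the whole query.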
In this example, we are going to search for 20 stories in the New York Times (Media ID: 1) mentioning “mannheim AND university”.
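library(httr)
library(stringr)
# Base URL of the stories_public/list endpoint
url <- parse_url("https://api.mediacloud.org/api/v2/stories_public/list")
# Solr query plus the API key read from the environment
params <- list(q = "text:mannheim+AND+text:university+AND+media_id:1",
               key = Sys.getenv("MEDIACLOUD_API_KEY"))
url$query <- params
# build_url() percent-encodes ":" and "+"; restore them so the Solr query stays intact
final_url <- str_replace_all(build_url(url), c("%3A" = ":", "%2B" = "+"))
res <- GET(final_url)
# Parse the JSON response into an R list
httr::content(res)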
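The parsed response is a nested list, which is awkward to work with directly. The sketch below shows one way to flatten such a list into a data frame using base R only. The `stories` object is a made-up stand-in for the parsed response; the field names are taken from the columns shown in the package output later in this chapter. A real response has more fields and may contain missing values, which this sketch does not handle.

```r
# Made-up stand-in for what httr::content(res) might return:
# a list with one named list per story.
stories <- list(
  list(stories_id = 1L, media_id = 1L, title = "First toy story"),
  list(stories_id = 2L, media_id = 1L, title = "Second toy story")
)

# Turn each story into a one-row data frame, then stack the rows.
rows <- lapply(stories, function(s) {
  data.frame(stories_id = s$stories_id,
             media_id   = s$media_id,
             title      = s$title,
             stringsAsFactors = FALSE)
})
stories_df <- do.call(rbind, rows)
stories_df
```

This kind of manual flattening is exactly the work that the dedicated R packages discussed in the next section do for you.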
18.4 API access in R
- How can we access the API from R (httr + other packages)?
As of writing, there are (at least) four R packages for accessing the Media Cloud API. Although the mediacloudr package by Jan Dix is available on CRAN, I recommend using the mediacloud package by Julian Unkel (LMU). It is available on Dr Unkel’s GitHub and can be installed from there (for example with remotes::install_github()). By default, the package always returns “tidy” objects.
The above “mannheim” example can be rewritten as follows (the package looks for the environment variable MEDIACLOUD_API_KEY automatically):
library(mediacloud)
mc_mannheim <- search_stories(text = "mannheim AND university", media_id = 1, n = 20)
mc_mannheim
## # A tibble: 20 × 9
## stories_id media_id publish_date title url processed_stories_id media_name collect_date tags
## <int> <int> <dttm> <chr> <chr> <int> <chr> <dttm> <name>
## 1 44790862 1 2011-11-23 15:20:21 Special Report: Education: A Scholarly Role for Consumer Technology http://feeds.nytimes.com/click.phdo?i=f14ddd7045ae5899bdaed53548ea0fac 99320532 New York … 2011-11-23 23:02:06 <list>
## 2 78373861 1 2012-04-08 21:47:48 Germany Is the Place to Go to Block a Rival's Technology http://feeds.nytimes.com/click.phdo?i=0b033a29b0a1cfd89e7d94303e90af5b 102657410 New York … 2012-04-08 23:57:08 <list>
## 3 82249892 1 2012-05-29 20:45:00 European Official Calls for an Economic Road Map http://feeds.nytimes.com/click.phdo?i=b2370b329b6db682100a36312c9bc154 103821112 New York … 2012-05-29 20:57:57 <list>
## 4 28218544 1 2010-12-09 03:43:41 Back Injury Sidelines Klitschko http://feeds1.nytimes.com/~r/nyt/rss/Sports/~3/ctnUdsJRHzk/sports-uk-boxing-klitschko-injury.html 338790634 New York … 2010-12-09 17:55:43 <list>
## 5 314056973 1 2015-02-01 00:00:00 As New York Moves People With Developmental Disabilities to Group Homes, Some Families Struggle http://www.nytimes.com/2015/02/01/nyregion/as-new-york-moves-people-with-developmental-disabilities-t… 402084602 New York … 2015-01-29 20:55:31 <list>
## 6 85744480 1 2012-07-02 06:12:03 M.B.A. Candidates Learn Leadership, in the Mud http://feeds.nytimes.com/click.phdo?i=e758f43cc4e99833540e6d909491b96a 419439006 New York … 2012-07-02 08:52:45 <list>
## 7 225224567 1 2014-04-23 21:47:59 Personal Journeys: All Aboard the Orient Local http://rss.nytimes.com/c/34625/f/642561/s/39ae85bd/sc/38/l/0L0Snytimes0N0C20A140C0A40C270Ctravel0Call… 419552946 New York … 2014-04-24 03:30:12 <list>
## 8 94259735 1 2012-11-24 05:03:13 Arts | New Jersey: New Jersey Holiday Concerts Offer a Wide Range of Styles http://www.nytimes.com/2012/11/25/nyregion/new-jersey-holiday-concerts-offer-a-wide-range-of-styles.h… 420309253 New York … 2012-11-24 15:01:11 <list>
## 9 93835753 1 2012-11-18 05:15:34 Alexandra Gutowski, Zachary Allen — Weddings http://www.nytimes.com/2012/11/18/fashion/weddings/alexandra-gutowski-zachary-allen-weddings.html?par… 420359845 New York … 2012-11-18 12:53:46 <list>
## 10 87016471 1 2012-06-05 02:31:14 Germany Open to Deal on Pooling Euro Debt, With Limits http://feeds.nytimes.com/click.phdo?i=5d966d2d43aaad1a056e5ae0e1d3e559 420417459 New York … 2012-08-16 00:42:29 <list>
## 11 85768000 1 2012-07-02 08:57:47 IHT Rendezvous: The Zany Business of Teaching Management http://feeds.nytimes.com/click.phdo?i=6ab66f3da93480f3e56d6c286b75723b 420549716 New York … 2012-07-02 19:37:32 <list>
## 12 82322640 1 2012-06-01 15:14:14 European Crisis Bolsters Illegal Sales of Body Parts http://feeds.nytimes.com/click.phdo?i=71ef4d87201aca8be1194d44b2cf6012 420757923 New York … 2012-06-12 20:13:56 <list>
## 13 298730990 1 2014-12-05 05:32:31 Calendar: Events in Connecticut for Dec. 7-13, 2014 http://rss.nytimes.com/c/34625/f/640367/s/41240614/sc/10/l/0L0Snytimes0N0C20A140C120C0A70Cnyregion0Ce… 428490928 New York … 2014-12-05 06:17:41 <list>
## 14 23790844 1 2010-09-09 16:24:59 In Land of Fast Cars and Trains, Buses Try to Make Inroads http://feeds.nytimes.com/click.phdo?i=bb0541910fe17957f2cb52a2b205ae58 630486887 New York … 2010-09-09 21:30:45 <list>
## 15 354583774 1 2015-06-19 11:02:56 The Saturday Profile: A German Writer Translates a Puzzling Illness Into a Best-Selling Book http://rss.nytimes.com/c/34625/f/642565/s/47654dfb/sc/33/l/0L0Snytimes0N0C20A150C0A60C20A0Cworld0Ceur… 778925402 New York … 2015-06-19 11:21:31 <list>
## 16 385106169 1 2015-10-07 05:42:02 Well: Homing In on the Source of Runner’s High http://rss.nytimes.com/c/34625/f/640347/s/4a79477a/sc/14/l/0Lwell0Bblogs0Bnytimes0N0C20A150C10A0C0A70… 833493930 New York … 2015-10-07 07:04:59 <list>
## 17 371289843 1 2015-08-20 07:26:16 News: Morning Agenda: Slow Inflation Makes Fed Cautious Over Rates http://news.blogs.nytimes.com/2015/08/20/morning-agenda-slow-inflation-makes-fed-cautious-over-rates/… 869026798 New York … 2015-08-20 08:57:27 <list>
## 18 601772076 1 2011-10-20 08:00:00 Gauging the Value of Your M.B.A. http://www.nytimes.com/2011/10/20/education/20iht-SReducEmploy20.html?pagewanted=all 898178364 New York … 2017-03-30 20:11:59 <list>
## 19 651215537 1 2017-07-02 00:25:00 Nashonme Johnson, Pasquale Di Stasio https://www.nytimes.com/2017/07/02/fashion/weddings/nashonme-johnson-pasquale-di-stasio.html?partner=… 948038684 New York … 2017-07-02 02:00:29 <list>
## 20 656857074 1 2006-09-10 08:00:00 On Self http://www.nytimes.com/2006/09/10/magazine/10sontag.html?_r=1&oref=slogin&pagewanted=all 954374793 New York … 2017-07-13 13:55:39 <list>
18.4.1 Media keywords of “Universität Mannheim”
In the following slightly more sophisticated example, we are going to first search for a list of all national German media outlets, search for a bunch of (German) articles mentioning “Universität Mannheim”, and then extract keywords using term frequency-inverse document frequency (TF-IDF). There are three steps.
18.4.1.1 Search for all national German media outlets
All major German media outlets are tagged with Germany___National. The function search_media() is used to retrieve information about all national German media outlets. The result is stored as de_media, which is used in the next step.
## # A tibble: 50 × 5
## media_id name url start_date tags
## <int> <chr> <chr> <dttm> <named list>
## 1 19831 Spiegel http://www.spiegel.de 2013-06-10 00:00:00 <list [19]>
## 2 20001 taz.de http://www.taz.de 2017-04-03 00:00:00 <list [17]>
## 3 21558 neues-deutschland.de http://www.neues-deutschland.de 2017-04-03 00:00:00 <list [12]>
## 4 21854 jungewelt.de http://www.jungewelt.de 2018-06-04 00:00:00 <list [21]>
## 5 21917 berlinerumschau.com http://www.berlinerumschau.com 2014-12-29 00:00:00 <list [7]>
## 6 22009 bild.de http://www.bild.de 2013-06-10 00:00:00 <list [21]>
## 7 23037 manager-magazin.de http://www.manager-magazin.de 2017-04-10 00:00:00 <list [9]>
## 8 23538 n-tv.de http://www.n-tv.de 2017-04-03 00:00:00 <list [9]>
## 9 38697 zeit http://www.zeit.de/index 2013-03-11 00:00:00 <list [20]>
## 10 39206 Tagespiegel http://www.tagesspiegel.de 2013-03-18 00:00:00 <list [20]>
## # ℹ 40 more rows
18.4.1.2 Pull a list of articles
The following query gets a list of up to 100 articles mentioning “universität mannheim” published in a specific date range from all national German media outlets. Unlike the AND operator, the quoted query searches for the exact phrase. Queries are also case insensitive. The function search_stories() can be used for this.
unima_articles <- search_stories(text = "\"universität mannheim\"",
media_id = de_media$media_id,
n = 100,
after_date = "2021-01-01",
before_date = "2021-12-01")
unima_articles
## # A tibble: 88 × 9
## stories_id media_id publish_date title url processed_stories_id media_name collect_date tags
## <int> <int> <dttm> <chr> <chr> <dbl> <chr> <dttm> <name>
## 1 1815576509 385413 2021-01-05 10:57:12 Studie: Erneuter Lockdown belastete Unternehmensgewinne nicht spürbar https://www.boersen-zeitung.de/index.php?l=5&isin=&dpas… 2225420413 Börsen-Ze… 2021-01-05 11:14:23 <list>
## 2 1815575525 282241 2021-01-05 10:56:38 Studie: Erneuter Lockdown belastete Unternehmensgewinne nicht spürbar https://www.boerse-online.de/nachrichten/aktien/studie-… 2225420496 boerse-on… 2021-01-05 11:14:17 <list>
## 3 1815573995 42022 2021-01-05 11:05:00 Studie: Erneuter Lockdown belastete Unternehmensgewinne nicht spürbar https://www.finanznachrichten.de/nachrichten-2021-01/51… 2225423365 finanznac… 2021-01-05 11:14:05 <list>
## 4 1815653090 39727 2021-01-05 12:44:00 Bund-Länder-Beratungen: Jetzt live: Kanzlerin Merkel und Ministerpräsidenten stellen die neuen Beschlüsse vor https://www.stern.de/gesundheit/corona-gipfel-im-livebl… 2225488400 stern 2021-01-05 12:45:59 <list>
## 5 1815772594 21558 2021-01-05 11:07:16 Das Prinzip Links https://www.neues-deutschland.de/artikel/1146618.das-pr… 2225615666 neues-deu… 2021-01-05 15:43:01 <list>
## 6 1821852038 42022 2021-01-12 13:11:00 Medizinjurist: Impfpflicht für Pflegekräfte rechtlich vertretbar https://www.finanznachrichten.de/nachrichten-2021-01/51… 2231370715 finanznac… 2021-01-12 13:29:47 <list>
## 7 1822816931 41519 2021-01-13 10:54:24 Staatsverschuldung: Rückhalt für Schuldenbremse nimmt ab – Landespolitiker für Infrastrukturinvestitionen https://www.handelsblatt.com/politik/deutschland/staats… 2232460589 Handelsbl… 2021-01-13 11:04:27 <list>
## 8 1822860428 39206 2021-01-13 11:32:47 Schützen Konzerne ihre Beschäftigten ausreichend vor Corona-Ausbrüchen? https://www.tagesspiegel.de/politik/weniger-homeoffice-… 2232506700 Tagespieg… 2021-01-13 11:41:40 <list>
## 9 1823395678 42022 2021-01-14 01:07:00 PRESSESPIEGEL/Zinsen, Konjunktur, Kapitalmärkte, Branchen https://www.finanznachrichten.de/nachrichten-2021-01/51… 2233108356 finanznac… 2021-01-14 01:32:49 <list>
## 10 1829126009 41519 2021-01-20 09:57:59 Homeoffice und Hygienevorschriften: „Wir können die Zentrale nicht einfach abschließen“: Wirtschaft reagiert gespalten auf neue Corona-Regeln https://www.handelsblatt.com/karriere/homeoffice-und-hy… 2239054366 Handelsbl… 2021-01-20 10:17:51 <list>
## # ℹ 78 more rows
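The first three rows above carry the same title from three different outlets, which suggests syndicated copy. As a rough sketch of how such repeats could be flagged, the following base R code keeps only the first story per title. The toy data frame reuses titles from the output above with shortened placeholder IDs. Exact title matching is a crude assumption: reprints with edited headlines would not be caught.

```r
# Toy data: the titles are copied from the output above;
# the IDs are shortened placeholders.
toy <- data.frame(
  stories_id = c(1L, 2L, 3L, 4L),
  title = c(
    "Studie: Erneuter Lockdown belastete Unternehmensgewinne nicht spürbar",
    "Studie: Erneuter Lockdown belastete Unternehmensgewinne nicht spürbar",
    "Studie: Erneuter Lockdown belastete Unternehmensgewinne nicht spürbar",
    "Das Prinzip Links"
  ),
  stringsAsFactors = FALSE
)

# Normalise case and whitespace, then keep the first story per title.
norm_title <- tolower(trimws(toy$title))
deduped <- toy[!duplicated(norm_title), ]
deduped$stories_id
```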
18.4.1.3 Pull word matrices
With the list of stories_id, we can then use the function get_word_matrices() to obtain word matrices (see footnote 11).
## # A tibble: 30,353 × 4
## stories_id word_counts word_stem full_word
## <chr> <int> <chr> <chr>
## 1 1815573995 1 deutschland deutschland
## 2 1815573995 1 befragt befragt
## 3 1815573995 1 irgendeinem irgendeinem
## 4 1815573995 1 vorgelegten vorgelegten
## 5 1815573995 1 solo-selbstständig solo-selbstständige
## 6 1815573995 1 wert wert
## 7 1815573995 1 zufolg zufolge
## 8 1815573995 1 dpa-afx dpa-afx
## 9 1815573995 1 universität universität
## 10 1815573995 1 gewinnsitu gewinnsituation
## # ℹ 30,343 more rows
The data frame unima_mat is in the so-called “tidytext” format (Silge and Robinson 2016). It can be used directly for analysis if one is fond of tidytext. For users of quanteda (Benoit et al. 2018), it is also possible to cast the data frame into a Document-Feature Matrix (DFM) (see footnote 12).
library(tidytext)
library(quanteda)
unima_dfm <- cast_dfm(unima_mat, stories_id, word_stem, word_counts)
unima_dfm
## Document-feature matrix of: 88 documents, 14,614 features (97.64% sparse) and 0 docvars.
## features
## docs deutschland befragt irgendeinem vorgelegten solo-selbstständig wert zufolg dpa-afx universität gewinnsitu
## 1815573995 1 1 1 1 1 1 1 1 1 1
## 1815575525 1 1 1 1 1 1 1 1 1 1
## 1815576509 0 0 0 0 0 0 0 1 1 0
## 1815653090 1 1 0 1 0 1 0 0 1 0
## 1815772594 0 0 0 0 0 0 0 0 1 0
## 1821852038 0 0 0 0 0 0 0 0 1 0
## [ reached max_ndoc ... 82 more documents, reached max_nfeat ... 14,604 more features ]
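To illustrate what “casting” means here, the sketch below performs the same long-to-wide reshaping on a toy table with base R's `xtabs()`. The values are made up and only the column names follow unima_mat. This is a conceptual illustration, not a replacement for `cast_dfm()`, which returns a sparse quanteda object.

```r
# Toy long-format counts; values are made up, column names follow unima_mat.
toy_mat <- data.frame(
  stories_id  = c("a", "a", "b", "b", "c"),
  word_stem   = c("wert", "befragt", "wert", "befragt", "wert"),
  word_counts = c(1L, 1L, 1L, 2L, 3L),
  stringsAsFactors = FALSE
)

# Rows become documents, columns become features, cells hold the counts.
# Combinations that never occur in the long table are filled with 0.
toy_wide <- xtabs(word_counts ~ stories_id + word_stem, data = toy_mat)
toy_wide
```

The zero in the cell for document "c" and feature "befragt" is the kind of entry that makes real document-feature matrices so sparse.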
Standard operations can then be applied to the DFM, such as weighting it by TF-IDF and listing the 20 top-scoring features:
## aufsichtsrat hauptversammlung gesellschaft westw group vorstand aktionär vergütung aktien ab aktg se verwaltungsrat gemäß vergütungssystem mitglied geschäftsjahr
## 706.2057 619.3220 583.4058 494.6793 400.0420 379.0577 363.6897 333.5708 333.1135 332.9208 280.5663 260.5542 252.3862 242.9785 209.4179 201.7753 200.6090
## board supervisori ermächtigung
## 187.9392 184.7259 167.2792
The faculties of BWL (Business Administration) and Jura (Law) would be happy with this finding.
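To see why such terms rise to the top, it helps to compute TF-IDF by hand on a toy matrix. The sketch below uses raw counts multiplied by log10(N / document frequency). I understand this to be the default weighting of quanteda's `dfm_tfidf()`, but please treat that as an assumption. All numbers are made up.

```r
# Toy count matrix (documents x features); all numbers are made up.
counts <- matrix(
  c(10, 1, 1,
     0, 2, 1,
     0, 0, 1),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("d1", "d2", "d3"),
                  c("aufsichtsrat", "studie", "mannheim"))
)

n_docs   <- nrow(counts)
doc_freq <- colSums(counts > 0)          # documents containing each feature
idf      <- log10(n_docs / doc_freq)     # inverse document frequency
tfidf    <- sweep(counts, 2, idf, `*`)   # weight each column by its idf

# Sum the weights per feature and rank them.
scores <- colSums(tfidf)
sort(scores, decreasing = TRUE)
```

A term that appears in every document ("mannheim" in this toy case) gets a weight of zero, while a term used heavily in a single document dominates the ranking. This is consistent with the corporate vocabulary above, which may well stem from a few long documents.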
Footnotes
11. For the sake of education, I split steps 2 and 3 into two separate steps. It is actually possible to merge them with a single call:
get_word_matrices(text = "\"universität mannheim\"")
12. It is quite obvious that there are (many) duplicates in the retrieved data. For example, the first few documents are almost identical in the feature space. Packages such as textsdc might be useful for deduplication.
18.5 Social science examples
According to the paper by the official Media Cloud team (Roberts et al. 2021), there are over 100 papers mentioning Media Cloud. Many papers use the counting endpoint to generate a time series of media attention to specific keywords (e.g. Benkler et al. 2015; Huckins et al. 2020). This function is also widely used in data journalism pieces. The URLs collected from Media Cloud can also be used for further crawling (e.g. Huckins et al. 2020).
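As a rough local illustration of such an attention time series, the sketch below counts stories per month from their publication dates. The dates are made up. In practice the counting endpoint would return the counts directly, so this only approximates the idea on a set of stories that has already been retrieved.

```r
# Made-up publication dates standing in for a set of retrieved stories.
publish_date <- as.Date(c("2021-01-05", "2021-01-12", "2021-01-13",
                          "2021-02-02", "2021-03-20", "2021-03-21"))

# Count stories per calendar month.
month <- format(publish_date, "%Y-%m")
attention <- as.data.frame(table(month), stringsAsFactors = FALSE)
attention
```

The resulting data frame has one row per month and a Freq column with the number of stories, which can be plotted as a simple line chart.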
It is perhaps worth mentioning that the openly available useNews dataset (Puschmann and Haim 2021) provides a large collection of content from Media Cloud together with metadata from other data sources.