Chapter 7 GitHub.com API
You will need to install the following packages for this chapter (run the code):
# install.packages('pacman')
library(pacman)
p_load('jsonlite', 'httr')
7.1 Provided services/data
- What data/service is provided by the API?
The GitHub.com API provides a service for interacting with the social coding platform GitHub. A social coding platform is a website that allows users to work on software projects collaboratively. Users can share their work, engage in discussions and track activities of other users and projects. GitHub is currently the largest such platform with more than 50M user accounts (as of January 2022). GitHub’s API allows for retrieving user-generated data from its platform, which is probably of main interest for social scientists. It also provides tools for controlling your account, organizations and projects on the platform in order to automate workflows. This is more of interest for professional software development.
From a social scientist’s perspective, the API may be of interest when studying online communities, working methods, organizational structures, communication and discussions, etc. with a focus on (open-source) software development. Many projects that are hosted on GitHub are open-source projects with a transparent development process and communications. For private projects, which can also be hosted on GitHub, there are understandably only a few aggregate data available.
When collecting data on GitHub, you should follow GitHub’s terms of service, especially the API terms. GitHub users agree that, when using the service, anything they post publicly may be viewed and used by other users (see ToS section D5). Still, you should be cautious when collecting and analyzing such user-generated data, as the user was not informed about contributing to a research project. If research data should later be published for replication, only aggregated and/or anonymized data should be made public. Depending on the data you collect, an ethics review should also be considered.
Special care should be taken about sampling techniques and replicability when working with this API. The sheer amount of data shared on GitHub requires sampling in most cases, but this is often hard to implement with the API, since results are usually sorted in a way that introduces bias when you can only fetch a certain number of top results (e.g. searching for users gives you the “most popular” users matching your criteria). Cosentino, Izquierdo, and Cabot (2016) show that many studies that use the GitHub API fail to implement proper sampling.
There are several competing platforms to GitHub. Many of them also provide an API that allows you to retrieve data.
7.2 Prerequisites
The API can be used for free and you can send up to 60 requests per hour if you’re not authenticated (i.e. if you don’t provide an API key). For serious data collection, this is not much, so it is recommended to sign up on GitHub and generate a personal access token that acts as an API key. This token can then be used to authenticate your API requests, which raises your quota to 5000 requests per hour.
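To keep an eye on the remaining quota, the API reports the current limits at the /rate_limit endpoint. The following is a minimal sketch, assuming the response has the shape of the sample string below. The sample values are made up for illustration and the helper name summarize_quota is hypothetical.

```r
library(jsonlite)

# Illustrative sample of a /rate_limit response (values are made up)
sample_json <- '{
  "resources": {
    "core":   {"limit": 60, "used": 12, "remaining": 48, "reset": 1700000000},
    "search": {"limit": 10, "used": 0,  "remaining": 10, "reset": 1700000000}
  }
}'

# Hypothetical helper: extract the quota for one resource category
summarize_quota <- function(rate_limit, resource = "core") {
  r <- rate_limit$resources[[resource]]
  list(limit     = r$limit,
       remaining = r$remaining,
       # "reset" is given in seconds since the Unix epoch (UTC)
       resets_at = as.POSIXct(r$reset, origin = "1970-01-01", tz = "UTC"))
}

quota <- summarize_quota(fromJSON(sample_json))
quota$remaining

# For live data, replace the sample string with the endpoint URL:
# quota <- summarize_quota(fromJSON('https://api.github.com/rate_limit'))
```

Checking the remaining quota before a larger collection run helps to avoid requests that fail halfway through.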
7.3 Simple API call
- What does a simple API call look like?
First of all, it’s important to note that GitHub actually provides two APIs: A REST API and a GraphQL API. They differ mainly in how requests are performed and how the responses are formatted, but also in what can be requested. The REST API is the default interface and provides a way of access akin to most other APIs in this book. The GraphQL API provides a quite different interface which requires you to more specifically describe what kind of information you want to retrieve. Though some very detailed information can only be retrieved via the GraphQL API, we will stick to the REST API for this chapter due to the complexity of the GraphQL API.
We can perform a request using the curl command in a terminal or by simply visiting the URL in a browser. We will start by retrieving public profile data from the GitHub account of software developer and hacker Lilith Wittmann:[6]
curl https://api.github.com/users/LilithWittmann
The result is an HTTP response with JSON formatted data which contains user profile information that can also be found on the respective public profile page:
{
"login": "LilithWittmann",
"name": "Lilith Wittmann",
"location": "Berlin, Germany",
"bio": "freelance developer & consultant",
"twitter_username": "LilithWittmann",
"public_repos": 37,
"public_gists": 6,
"followers": 440,
"following": 52,
// [ more data ... ]
}
The GitHub API offers extensive search capabilities. You can search for users, repositories, discussions and more by constructing a search query. Here, we search for users that use R and report being located in Berlin:
curl "https://api.github.com/search/users?q=language:r+location:berlin"
{
"total_count": 541,
"incomplete_results": false,
"items": [
{
"login": "IndrajeetPatil",
"id": 11330453,
// [...]
},
{
"login": "christophergandrud",
"id": 1285805,
// [...]
},
{
"login": "RobertTLange",
"id": 20374662,
// [...]
},
// [ more data ... ]
]
}
We could then continue fetching profile information for each account in the result set.
Note that by default, the results are ordered by “best match”. This can be changed with an additional “sort” parameter. The GitHub API doesn’t provide a way of sampling search results: they are always sorted in some way and hence may introduce bias when you only fetch a limited number of results (which is often necessary since the result set is too large). You may consider narrowing the criteria for your search requests, e.g. by sampling geographical locations, so that you can always obtain the complete search results for a place.
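The sorting and narrowing described here can be sketched in R as follows. This is only an illustration under some assumptions: the helper build_user_search and the list of places are hypothetical, and the sort values accepted for user searches (followers, repositories, joined) are taken from GitHub’s documentation at the time of writing.

```r
# Hypothetical helper: build a user search URL with optional sorting
build_user_search <- function(language, location, sort = NULL,
                              order = "desc", per_page = 30) {
  # multi-word locations need to be quoted inside the query
  if (grepl(" ", location)) location <- paste0('"', location, '"')
  q <- paste0("language:", language, " location:", location)
  url <- paste0("https://api.github.com/search/users?q=",
                URLencode(q, reserved = TRUE),   # percent-encode the query
                "&per_page=", per_page)
  # without "sort", the API falls back to "best match"
  if (!is.null(sort)) url <- paste0(url, "&sort=", sort, "&order=", order)
  url
}

# Geographic narrowing: draw a random sample of places (illustrative list)
set.seed(42)
places <- c("berlin", "hamburg", "munich", "cologne", "leipzig")
sampled_places <- sample(places, 2)

search_urls <- vapply(sampled_places,
                      function(p) build_user_search("r", p, sort = "joined"),
                      character(1))
search_urls
```

Sampling the places rather than the users moves the random draw to a level that we control, while each individual search can still be fetched completely.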
Another example would be retrieving data about a project repository. We will use the GitHub repository for the popular R package dplyr as an example:
curl https://api.github.com/repos/tidyverse/dplyr
{
"id": 6427813,
"name": "dplyr",
"full_name": "tidyverse/dplyr",
"description": "dplyr: A grammar of data manipulation",
"homepage": "https://dplyr.tidyverse.org",
"size": 50767,
"stargazers_count": 3959,
"watchers_count": 3959,
"language": "R",
// [ more data ... ]
}
Finally, let’s fetch data about an issue in the dplyr repository. On GitHub, issues are code or documentation problems as well as development tasks that users and collaborators bring up and discuss. Here, we collect all data on dplyr issue #5958:
curl https://api.github.com/repos/tidyverse/dplyr/issues/5958
{
"url": "https://api.github.com/repos/tidyverse/dplyr/issues/5958",
"number": 5958,
"title": "Inaccurate documentation for `name` argument of `count()`",
"user": {
"login": "sfirke",
"id": 7569808,
// [ more data ... ]
},
"comments": 13,
"body": "The documentation for `count()` says of the [...]", // truncated
// [ more data ... ]
}
As you can see in the example above, a response may contain nested data (the “user” entry shows information about the author of the issue as a nested structure).
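To illustrate how such nested data looks once it is parsed, here is a small sketch in R using the jsonlite package. The JSON string is an abridged copy of the issue response (only a few of the fields shown are kept), so no request is sent.

```r
library(jsonlite)

# Abridged version of the issue response (only a few fields kept)
issue_json <- '{
  "number": 5958,
  "user": {"login": "sfirke", "id": 7569808},
  "comments": 13
}'

issue <- fromJSON(issue_json)

# nested entries are reached by chaining `$`
issue$user$login

# unlist() flattens the nested list into a named character vector;
# nested names are joined with a dot, e.g. "user.login"
flat <- unlist(issue)
names(flat)
```

Flattening is convenient for a quick look, but note that unlist() converts all values to character strings here.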
7.4 API access in R
- How can we access the API from R?
There are packages for many programming languages that provide convenient access to the GitHub API. Unfortunately, there are no such packages for R at the time of writing.[7] This means we can only access the API directly, e.g. by using the jsonlite package to fetch the data and convert it to an R list or dataframe.
We start with translating the first API call from the previous section to R:
library(jsonlite)
profile_data <- fromJSON('https://api.github.com/users/LilithWittmann')
# this gives a list
head(profile_data, 3)
## $login
## [1] "LilithWittmann"
##
## $id
## [1] 891030
##
## $node_id
## [1] "MDQ6VXNlcjg5MTAzMA=="
The JSON response from the GitHub API was directly converted to an R list object. When a JSON result is an array of objects that share the same fields, fromJSON() automatically converts the result to a dataframe by default. We can observe that when fetching the repositories of a user:
profile_repos <- fromJSON('https://api.github.com/users/LilithWittmann/repos')
# this gives a dataframe
head(profile_repos[,1:3]) # selecting only the first 3 columns
## id node_id name
## 1 42730077 MDEwOlJlcG9zaXRvcnk0MjczMDA3Nw== airflow
## 2 481880572 R_kgDOHLjp_A ant-design
## 3 410145768 R_kgDOGHJT6A aries-staticagent-python
## 4 319779724 MDEwOlJlcG9zaXRvcnkzMTk3Nzk3MjQ= awesome-behoerden-floss
## 5 817707951 R_kgDOML07rw bund-id-documentation-backup
## 6 811975324 R_kgDOMGXCnA bundid-scam
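The difference between the two conversions can be reproduced offline with small JSON strings. The following sketch uses made-up account names and shows that a single JSON object becomes a list, an array of objects becomes a dataframe, and that the latter can be switched off with the simplifyDataFrame argument of fromJSON().

```r
library(jsonlite)

# a single JSON object is converted to a named list
obj <- fromJSON('{"login": "user_a", "id": 1}')
class(obj)

# an array of objects with the same fields is converted to a dataframe
arr_json <- '[{"login": "user_a", "id": 1}, {"login": "user_b", "id": 2}]'
arr <- fromJSON(arr_json)
class(arr)

# the same array kept as a list of lists
arr_list <- fromJSON(arr_json, simplifyDataFrame = FALSE)
length(arr_list)
```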
Let’s also repeat the search query from the previous section. We want to obtain accounts that use R and entered “Berlin” in their profile’s location field. This time, the result is a list that reports the number of search results as search_results$total_count and the details of the search results as a dataframe in search_results$items:
search_results <- fromJSON('https://api.github.com/search/users?q=language:r+location:berlin')
# this gives a list with a dataframe in "items"
str(search_results, list.len = 3, vec.len = 3)
## List of 3
## $ total_count : int 694
## $ incomplete_results: logi FALSE
## $ items :'data.frame': 30 obs. of 19 variables:
## ..$ login : chr [1:30] "RobertTLange" "christophergandrud" "lisacharlotterost" ...
## ..$ id : int [1:30] 20374662 1285805 9379282 1569647 638225 18618077 1563902 42495969 ...
## ..$ node_id : chr [1:30] "MDQ6VXNlcjIwMzc0NjYy" "MDQ6VXNlcjEyODU4MDU=" "MDQ6VXNlcjkzNzkyODI=" ...
## .. [list output truncated]
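Note that the dataframe contains only 30 of the 694 results, because the API returns results page by page. A sketch of how the remaining pages could be addressed is shown below. It assumes that the per_page parameter accepts up to 100 items and that the search endpoints return at most 1000 results per query, as documented by GitHub at the time of writing. The helper name page_urls is hypothetical.

```r
# Hypothetical helper: build one URL per result page
page_urls <- function(base_url, total_count, per_page = 100,
                      max_results = 1000) {
  # the search API does not return more than `max_results` items
  n <- min(total_count, max_results)
  pages <- seq_len(ceiling(n / per_page))
  paste0(base_url, "&per_page=", per_page, "&page=", pages)
}

base_url <- 'https://api.github.com/search/users?q=language:r+location:berlin'
urls <- page_urls(base_url, total_count = 694)
length(urls)

# Fetching and combining the pages (not run here, requires network access):
# pages <- lapply(urls, function(u) jsonlite::fromJSON(u)$items)
# all_items <- jsonlite::rbind_pages(pages)
```

Every page counts as a separate request, so larger result sets use up the request quota quickly.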
Let’s suppose we want to collect profile data for each account in the result set. First, let’s investigate the search results in search_results$items:
head(search_results$items[, 1:3]) # selecting only the first 3 columns
## login id node_id
## 1 RobertTLange 20374662 MDQ6VXNlcjIwMzc0NjYy
## 2 christophergandrud 1285805 MDQ6VXNlcjEyODU4MDU=
## 3 lisacharlotterost 9379282 MDQ6VXNlcjkzNzkyODI=
## 4 dirkschumacher 1569647 MDQ6VXNlcjE1Njk2NDc=
## 5 al2na 638225 MDQ6VXNlcjYzODIyNQ==
## 6 kozodoi 18618077 MDQ6VXNlcjE4NjE4MDc3
To retrieve the profile details of each user, we only need the account name, which we can then feed into the https://api.github.com/users/<account> API query. For demonstration purposes, let’s only select the first ten users in the result set:
r_users_berlin <- search_results$items$login[1:10]
# gives a character vector of 10 account names
head(r_users_berlin)
## [1] "RobertTLange" "christophergandrud" "lisacharlotterost" "dirkschumacher" "al2na" "kozodoi"
We can now build the API queries to fetch the profile information for each user:
users_query <- paste0('https://api.github.com/users/', r_users_berlin)
# gives a character vector of 10 API queries
head(users_query)
## [1] "https://api.github.com/users/RobertTLange" "https://api.github.com/users/christophergandrud" "https://api.github.com/users/lisacharlotterost" "https://api.github.com/users/dirkschumacher" "https://api.github.com/users/al2na"
## [6] "https://api.github.com/users/kozodoi"
The fromJSON() function only accepts a single character string as query, but we have a vector of ten queries for the ten users. Hence, we resort to lapply(), which passes each query to fromJSON() and stores the results in users_details as a list of ten lists (i.e. a nested structure). Each item in users_details is then a list that contains the profile information for the respective user.
users_details <- lapply(users_query, fromJSON)
# gives a list of 10 lists (one for each user)
str(users_details, list.len = 3)
## List of 10
## $ :List of 32
## ..$ login : chr "RobertTLange"
## ..$ id : int 20374662
## ..$ node_id : chr "MDQ6VXNlcjIwMzc0NjYy"
## .. [list output truncated]
## $ :List of 32
## ..$ login : chr "christophergandrud"
## ..$ id : int 1285805
## ..$ node_id : chr "MDQ6VXNlcjEyODU4MDU="
## .. [list output truncated]
## $ :List of 32
## ..$ login : chr "lisacharlotterost"
## ..$ id : int 9379282
## ..$ node_id : chr "MDQ6VXNlcjkzNzkyODI="
## .. [list output truncated]
## [list output truncated]
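When many requests are sent in a row, a single failed request (e.g. because an account was deleted or the quota is exhausted) would abort the whole lapply() call. The following sketch shows one possible way to guard against that. The helper fetch_all and the stand-in function fake_fetch are hypothetical; fake_fetch only simulates responses so that the example runs without network access.

```r
# Hypothetical helper: fetch several URLs, pause between requests
# and skip requests that fail instead of aborting
fetch_all <- function(urls, fetch = jsonlite::fromJSON, pause = 1) {
  results <- lapply(urls, function(u) {
    Sys.sleep(pause)  # be gentle with the API
    tryCatch(fetch(u), error = function(e) {
      message("Request failed for ", u, ": ", conditionMessage(e))
      NULL
    })
  })
  names(results) <- urls
  # drop the failed requests
  Filter(Negate(is.null), results)
}

# Stand-in for fromJSON() that simulates one failing request
fake_fetch <- function(u) {
  if (grepl("missing", u)) stop("HTTP error 404")
  list(login = basename(u))
}

demo_urls <- c("https://api.github.com/users/al2na",
               "https://api.github.com/users/missing")
demo <- fetch_all(demo_urls, fetch = fake_fetch, pause = 0)
length(demo)
```

With real data, the call would simply be fetch_all(users_query), which uses fromJSON() and a pause of one second by default.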
A nested list is hard to work with, so we convert the result to a dataframe. On the way, we select only the information that we actually want to work with. For demonstration purposes, we only select the account name, the real name, the location, the number of public repositories and the number of followers of each user.
users_rows <- lapply(users_details, function(d) {
# select information that we need and generate a dataframe with a single row
data.frame(login = d$login,
name = d$name,
loc = d$location,
repos = d$public_repos,
followers = d$followers,
stringsAsFactors = FALSE)
})
# combine all dataframe rows to form the final dataframe
users_df <- do.call(rbind, users_rows)
# gives a dataframe with login, name, location, number of repos, number of followers
users_df
## login name loc repos followers
## 1 RobertTLange Robert Tjarko Lange Tokyo, Berlin, Barcelona, London 36 630
## 2 christophergandrud Christopher Gandrud Berlin 147 443
## 3 lisacharlotterost Lisa Charlotte Muth Berlin, Germany 11 368
## 4 dirkschumacher Dirk Schumacher Berlin 157 329
## 5 al2na Altuna Akalin Berlin 73 266
## 6 kozodoi Nikita Kozodoi Berlin, Germany 26 166
## 7 tholor Malte Pietsch Berlin, Germany 19 156
## 8 MathiasHarrer Mathias Harrer Munich & Berlin 30 153
## 9 joachim-gassen Joachim Gassen Berlin, Germany 34 150
## 10 saraswatmks Manish Saraswat Berlin, Germany 62 149
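One caveat of this approach: profile fields such as the name or the location are optional, and an empty field arrives in R as NULL. Since data.frame() silently drops arguments that are NULL, the resulting row would lack a column and rbind() would fail. A minimal sketch of a workaround, using the hypothetical helpers null_to_na and user_row and two made-up profiles:

```r
# Hypothetical helper: replace a missing (NULL) entry by NA
null_to_na <- function(x) if (is.null(x)) NA else x

# Build a dataframe with a single row from one profile list
user_row <- function(d) {
  data.frame(login     = d$login,
             name      = null_to_na(d$name),
             loc       = null_to_na(d$location),
             repos     = d$public_repos,
             followers = d$followers,
             stringsAsFactors = FALSE)
}

# Made-up profiles; the second one has no name and no location
demo_users <- list(
  list(login = "user_a", name = "A. Example", location = "Berlin",
       public_repos = 3, followers = 10),
  list(login = "user_b", name = NULL, location = NULL,
       public_repos = 1, followers = 0)
)

demo_df <- do.call(rbind, lapply(demo_users, user_row))
demo_df
```

With the real data, lapply(users_details, user_row) could take the place of the anonymous function used above.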
As we can see, working with the results from the API requires some effort in R, since we have to deal with the often complex structure of the provided data. There’s no “convenience wrapper” package for the GitHub API in R so far. If you’re familiar with other programming languages, you may consider using one of the recommended packages.[8]
7.4.1 Provide API key
Authenticating with the GitHub API via an API key allows you to send many more requests to the API (5000 requests per hour instead of 60 at the time of writing). API access keys for the GitHub API are called personal access tokens (PAT) and the documentation explains how to generate a PAT once you’ve logged into your GitHub account. Please be careful with your PATs and never publish them.
When you want to use authenticated requests, you need to pass along authentication information (your GitHub user account and your PAT) with every request. This is not possible with fromJSON(); you have to send HTTP requests directly, e.g. via GET() from the httr package, and then handle the HTTP response yourself. The following example shows this. Here, we have the secret access token stored in an environment variable and fetch it into the variable PAT via Sys.getenv(). Next, we use GET() to make an authenticated HTTP request to the GitHub API. We use the /user API endpoint, which simply reports information about the currently authenticated user (which is your own account). If we were to use this API endpoint in an unauthenticated manner, we’d receive the error message “Requires authentication”.[9] We then need to get the content of the response as text (content(response, as = 'text')) and pass this to fromJSON() in order to get the result as a list object.
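library(httr)
# read the personal access token from an environment variable
PAT <- Sys.getenv("GitHub_token")
# send an authenticated request (basic authentication with account name and PAT)
response <- GET('https://api.github.com/user', authenticate('<account name>', PAT))
# extract the response body as text and parse the JSON
account_details <- fromJSON(httr::content(response, as = 'text'))
account_details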
You can use this code in the same way for the other API requests. The only difference is that authenticated requests have a much higher request quota.
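As an alternative sketch, the token can also be sent in an Authorization header, which avoids passing the account name. The helper github_get below is hypothetical. It assumes that the token is stored in the environment variable GitHub_token, as in the example above, and it additionally checks the HTTP status and reports the remaining quota from the response headers.

```r
library(httr)
library(jsonlite)

# Hypothetical helper: send an authenticated GET request and parse the JSON body
github_get <- function(url, token = Sys.getenv("GitHub_token")) {
  response <- GET(url,
                  add_headers(Authorization = paste("Bearer", token)),
                  accept("application/vnd.github+json"))
  # turn HTTP errors (status 4xx or 5xx) into R errors
  stop_for_status(response)
  # remaining quota as reported in the response headers
  message("Requests left: ", headers(response)[["x-ratelimit-remaining"]])
  fromJSON(content(response, as = "text", encoding = "UTF-8"))
}

# Only send the request when a token is actually set
if (nzchar(Sys.getenv("GitHub_token"))) {
  account_details <- github_get("https://api.github.com/user")
}
```

Wrapping the request in a function keeps the token handling in one place, so the same helper can be passed to lapply() for a vector of queries.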
References
[6] The author asked for permission to fetch and display this data.
[7] Please note that the git2r package does not provide access to the GitHub API. It provides access to git repositories, which is something completely different from accessing the web API of a social coding platform.
[8] For example, the PyGithub package for Python provides a comprehensive set of tools for interacting with the GitHub API, which usually requires much less programming effort than with R and jsonlite.
[9] Try that out yourself via curl https://api.github.com/user in the terminal.
7.5 Social science examples
When the GitHub API is used in a research context, this is mainly done in the fields of computer science, (IT) business economics and social media studies. Lima, Rossi, and Musolesi (2014) analyze social ties, collaboration patterns and the geographical distribution of users on the platform. Chatziasimidis and Stamelos (2015) try to find “success rules” for software projects by analyzing GitHub data and relating it to software download numbers. Importantly, Cosentino, Izquierdo, and Cabot (2016) provide a meta-analysis of 93 research papers and find “concerns regarding the dataset collection process and size, the low level of replicability, poor sampling techniques, lack of longitudinal studies and scarce variety of methodologies.” Hipp and Konrad (2021) used the GitHub API to collect data about open-source software development contributions before and during the COVID-19 pandemic. They then used this data to analyze how the pandemic affected the productivity of female and male software developers differently. They tackled the user sampling issue by employing geographic sampling and narrowing search results so that the full result set of a user search for a place could always be obtained.