Intro
Maintaining an updated list of academic publications on a website can be tedious, especially if you’re using Google Scholar as your primary source. To streamline this process, we can build an automated workflow using GitHub Actions. In this post, I’ll show you how to create a system that fetches publication details from Google Scholar, formats them, and integrates them into a Hugo-based website.
Scraping Scholar Publications using JavaScript
We start by writing a Node.js script to fetch publication details from Google Scholar using node-fetch and cheerio. These libraries help us make HTTP requests and parse the HTML of the response:
- node-fetch: makes HTTP requests to the Google Scholar page.
- cheerio: parses the HTML response to extract publication data such as title, authors, year, and venue.
- fs and path: handle file and directory operations for saving the output.
- fileURLToPath and dirname: convert import.meta.url to a usable directory path, ensuring compatibility with ES modules.
import fetch from 'node-fetch';
import { load } from 'cheerio';
import fs from 'fs';
import path from 'path';
import { fileURLToPath } from 'url';
import { dirname } from 'path';
const __filename = fileURLToPath(import.meta.url);
const __dirname = dirname(__filename);
We next implement getSourceLinks, a function that retrieves external links for a publication. This is useful for sourcing PDFs, arXiv links, or other scholarly resources:
async function getSourceLinks(citationUrl, title) {
  try {
    console.log(`\nFetching details for: ${title}`);
    const response = await fetch('https://scholar.google.com' + citationUrl, {
      headers: {
        'Accept-Charset': 'UTF-8',
        'Accept-Language': 'de-DE,de;q=0.9',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
      }
    });
    const html = await response.text();
    const $ = load(html, { decodeEntities: true });

    let sourceLink = '';
    $('.gsc_oci_value_ext a').each((i, elem) => {
      const href = $(elem).attr('href');
      if (href) {
        // Prepend Google Scholar base URL if the link is relative
        const fullHref = href.startsWith('http') ? href : 'https://scholar.google.com' + href;
        console.log(`Found link: ${fullHref}`);
        if (fullHref.includes('arxiv.org') || !sourceLink) {
          sourceLink = fullHref;
        }
      }
    });

    // Also look for alternative sources
    if (!sourceLink) {
      $('.gsc_oci_value a').each((i, elem) => {
        const href = $(elem).attr('href');
        if (href && !href.includes('google.com/scholar')) {
          const fullHref = href.startsWith('http') ? href : 'https://scholar.google.com' + href;
          console.log(`Found alternative link: ${fullHref}`);
          sourceLink = fullHref;
          return false; // break each loop
        }
      });
    }

    console.log(`Final source link for "${title}": ${sourceLink}`);
    return { source: sourceLink };
  } catch (error) {
    console.error('Error fetching source links:', error);
    return { source: '' };
  }
}
getSourceLinks navigates to the detailed page for each publication on Google Scholar, enabling it to extract specific information. It prioritizes links to external sources, such as PDFs or repositories like arXiv, ensuring readers have direct access to the publication. If no primary link is available, fallback logic searches alternative sections of the page for additional sources, maximizing the likelihood of retrieving a useful link.
The main function, fetchPublications, iterates through a user’s Google Scholar profile and compiles a list of publications:
async function fetchPublications() {
  try {
    const scholarId = process.env.SCHOLAR_ID;
    const response = await fetch(
      `https://scholar.google.com/citations?user=${scholarId}&hl=de`,
      {
        headers: {
          'Accept-Charset': 'UTF-8',
          'Accept-Language': 'de-DE,de;q=0.9',
          'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
        }
      }
    );
    const html = await response.text();
    const $ = load(html, { decodeEntities: true });
    const publications = [];

    // Add delay between requests to avoid rate limiting
    const delay = ms => new Promise(resolve => setTimeout(resolve, ms));

    for (const elem of $('#gsc_a_b .gsc_a_tr').get()) {
      const titleElem = $(elem).find('.gsc_a_t a');
      const grayDivs = $(elem).find('.gsc_a_t .gs_gray');
      const yearElem = $(elem).find('.gsc_a_y');

      let year = yearElem.text().trim();
      const yearInVenue = $(elem).find('.gs_oph').text().trim();
      if (yearInVenue) {
        year = yearInVenue.replace(/,\s*/, '');
      }

      const title = titleElem.text().trim();
      const citationUrl = titleElem.attr('href');

      await delay(2000);
      const links = await getSourceLinks(citationUrl, title);

      let authors = $(grayDivs[0]).text().trim();
      authors = authors.split(', ').map(author => author.trim()).join(', ');

      // Clean up venue: remove trailing comma and [Google Scholar] text
      let venue = $(grayDivs[1]).text().trim()
        .replace(year, '')                  // Remove year
        .replace(/\[Google Scholar\]/g, '') // Remove [Google Scholar] text
        .replace(/,\s*$/, '')               // Remove trailing comma and spaces
        .trim();

      const publication = {
        title: title,
        scholarLink: 'https://scholar.google.com' + citationUrl,
        sourceLink: links.source,
        authors: authors,
        venue: venue,
        year: year,
      };

      publications.push(publication);
      console.log(`Processed: ${title} (Source: ${links.source})`);
    }

    const projectRoot = path.join(__dirname, '..', '..');
    const dataDir = path.join(projectRoot, 'data');
    if (!fs.existsSync(dataDir)) {
      fs.mkdirSync(dataDir, { recursive: true });
    }

    const jsonContent = JSON.stringify(publications, null, 2);
    fs.writeFileSync(
      path.join(dataDir, 'publications.json'),
      jsonContent,
      'utf8'
    );

    console.log(`Successfully fetched ${publications.length} publications`);
  } catch (error) {
    console.error('Error fetching publications:', error);
    process.exit(1);
  }
}

fetchPublications();
- fetchPublications fetches the user’s public profile page (as given by the workflow environment variable SCHOLAR_ID). It then parses the page to identify each publication entry, extracting details such as the title, authors, venue, and publication year (these components are identified by their HTML/CSS selectors).
- For each publication, we navigate to its “details” page with getSourceLinks to retrieve additional information, such as external source links.
- A delay is introduced between requests to prevent triggering Google’s rate-limiting measures.
- Each publication is then cleaned and formatted, ensuring metadata consistency by removing redundant or irrelevant text.
- Finally, the structured data is saved as a JSON file. It should look like this:
[ { "title": "A Comprehensive Review of Neural Networks", "authors": "John Doe, Jane Smith", "venue": "Journal of AI Research", "year": "2023", "scholarLink": "https://scholar.google.com/citations?view_op=view_citation&citation_for_view=...", "sourceLink": "https://arxiv.org/abs/1234.5678" }, ... ]
All the above JavaScript chunks should be placed in a single file, fetch-publications.js.
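If you want to test the script locally before wiring it into the workflow, something along these lines should work. This is just a sketch: it assumes a package.json with "type": "module" at the project root (like the one the workflow creates below) and uses a placeholder for the Scholar ID:
# install the scraping dependencies
npm install node-fetch cheerio
# run the script with your own Scholar ID in place of the placeholder
SCHOLAR_ID=YOUR_SCHOLAR_ID node .github/scripts/fetch-publications.js
The script then writes its output to data/publications.json relative to the project root.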
Automating the Script with GitHub Actions
To run the above JavaScript automatically in our GitHub repository, we’ll use GitHub Actions. A suitable workflow can be defined in a .yaml file.
The workflow is triggered in two ways: a daily scheduled run at midnight (UTC) or manually through GitHub’s interface. By setting these triggers, the workflow maintains flexibility while ensuring consistent updates without manual intervention.
on:
  schedule:
    - cron: '0 0 * * *'  # run daily at midnight
  workflow_dispatch:     # and allow manual trigger
The process starts by checking out the repository to the runner, allowing access to all its files. It then sets up a Node.js environment (version 18), which is required for the dependencies used in the scraping script fetch-publications.js.
jobs:
  fetch-publications:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-node@v3
        with:
          node-version: '18'
Next, we use npm to install the required libraries, node-fetch and cheerio.
      - name: Create package.json
        run: |
          echo '{
            "name": "scholar-fetch",
            "type": "module",
            "dependencies": {
              "node-fetch": "^3.3.2",
              "cheerio": "^1.0.0-rc.12"
            }
          }' > package.json
      - name: Install dependencies
        run: npm install
Once the environment is ready, we run our fetch-publications.js script.
      - name: Fetch publications
        run: node .github/scripts/fetch-publications.js
        env:
          SCHOLAR_ID: ${{ secrets.SCHOLAR_ID }}
Finally, our workflow stages the updated JSON file and commits it back to the repository only if changes are detected. This avoids unnecessary commits and ensures a clean version history.
      - name: Commit and push if changed
        run: |
          git config --global user.name 'GitHub Action'
          git config --global user.email 'action@github.com'
          git add data/publications.json
          git diff --quiet && git diff --staged --quiet || (git commit -m "Update publications" && git push)
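One caveat: in newer repositories the default GITHUB_TOKEN often has read-only permissions, in which case the git push in this step will be rejected. If that happens, granting write access at the top level of the workflow file (next to name: and on:) should fix it:
permissions:
  contents: write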
Here’s the complete yaml file (hopefully with correct indents!):
name: Fetch Google Scholar Publications

on:
  schedule:
    - cron: '0 0 * * *'
  workflow_dispatch:

jobs:
  fetch-publications:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-node@v3
        with:
          node-version: '18'
      - name: Create package.json
        run: |
          echo '{
            "name": "scholar-fetch",
            "type": "module",
            "dependencies": {
              "node-fetch": "^3.3.2",
              "cheerio": "^1.0.0-rc.12"
            }
          }' > package.json
      - name: Install dependencies
        run: npm install
      - name: Fetch publications
        run: node .github/scripts/fetch-publications.js
        env:
          SCHOLAR_ID: ${{ secrets.SCHOLAR_ID }}
      - name: Commit and push if changed
        run: |
          git config --global user.name 'GitHub Action'
          git config --global user.email 'action@github.com'
          git add data/publications.json
          git diff --quiet && git diff --staged --quiet || (git commit -m "Update publications" && git push)
The file structure we need for the workflow to work, once we push these files to GitHub, looks like this:
project-directory/
├── .github/
│ ├── scripts/
│ │ └── fetch-publications.js
│ └── workflows/
│ └── fetch-publications.yml
Another requirement for the above to work is that the SCHOLAR_ID environment variable needs to be stored as a GitHub Secret in the repository.
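You can add it under Settings → Secrets and variables → Actions in the GitHub web interface, or, if you have the GitHub CLI installed, something like this should do (the value is a placeholder for your own Scholar ID):
# store the Scholar ID as an Actions secret in the current repository
gh secret set SCHOLAR_ID --body "YOUR_SCHOLAR_ID"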
Integration in Hugo
Now that we have the JSON file updated automatically, let’s integrate it into our Hugo project. We’ll create a Hugo partial. This partial is a template that reads the info from the JSON file and displays each publication dynamically using HTML formatting, rendering it as a list of references.
{{/* layouts/partials/scholar-publications.html */}}
{{ $publications := slice }}
{{ with os.ReadFile "data/publications.json" }}
  {{ $publications = . | unmarshal }}
{{ end }}

<div class="publications-list">
  {{ if and $publications (gt (len $publications) 0) }}
    {{ range $publications }}
      <div class="publication">
        <p class="citation">
          {{ .authors }}
          {{ with .year }}({{ . }}).{{ end }}
          <em class="publication-title">{{ .title }}</em>.
          {{ with .venue }}{{ . }}{{ end }}
          <span class="publication-links">
            {{ if .sourceLink }}
              <br>
              <a href="{{ .sourceLink }}" target="_blank" rel="noopener" class="source-link">[Source/PDF]</a>
            {{ end }}
            <a href="{{ .scholarLink }}" target="_blank" rel="noopener" class="scholar-link">[Google Scholar]</a>
          </span>
        </p>
      </div>
    {{ end }}
  {{ else }}
    <p>No publications available yet. Please check back later.</p>
  {{ end }}
</div>
Finally, include the partial in your desired Hugo layout using:
{{ partial "scholar-publications.html" . }}
It’s also possible to include the references list wherever you need it, for example in a post (see below!), if you put the above partial call in a shortcode HTML file, for example layouts/shortcodes/publications.html.
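For completeness, here is a minimal sketch of such a shortcode (the filename publications.html is just an example); it simply wraps the partial:
{{/* layouts/shortcodes/publications.html */}}
{{ partial "scholar-publications.html" . }}
In a post’s markdown, the list can then be rendered with {{< publications >}}.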
With the Hugo integration, the file structure should now look like this:1
hugo-project/
├── data/
│   └── publications.json
├── layouts/
│   ├── shortcodes/
│   │   └── publications.html
│   └── partials/
│       └── scholar-publications.html
├── .github/
│   ├── scripts/
│   │   └── fetch-publications.js
│   └── workflows/
│       └── fetch-publications.yml
And here’s what it looks like for my Google Scholar ID:
C Hanck, M Arnold, A Gerber, M Schmelzer (2019). Introduction to Econometrics with R. Essen: University of Duisburg-Essen. [Source/PDF] [Google Scholar]

C Hanck, MC Arnold (2022). Hierarchical Bayes modelling of penalty conversion rates of Bundesliga players. AStA Advances in Statistical Analysis, 1-28 [Source/PDF] [Google Scholar]

MC Arnold, T Reinschlüssel (2024). Adaptive Unit Root Inference in Autoregressions using the Lasso Solution Path. arXiv preprint arXiv:2404.06205 [Source/PDF] [Google Scholar]

MC Arnold, C Hanck (2019). On combining evidence from heteroskedasticity robust panel unit root tests in pooled regressions. Journal of Risk and Financial Management 12 (3), 117 [Source/PDF] [Google Scholar]

MC Arnold, T Reinschlüssel (2024). Bootstrap Adaptive Lasso Solution Path Unit Root Tests. arXiv preprint arXiv:2409.07859 [Source/PDF] [Google Scholar]

T Reinschlüssel, MC Arnold (2024). Information-Enriched Selection of Stationary and Non-Stationary Autoregressions using the Adaptive Lasso. arXiv preprint arXiv:2402.16580 [Source/PDF] [Google Scholar]

MC Arnold (2024). Information-Enriched Selection of Stationary and Non-Stationary Autoregressions using the Adaptive Lasso. arXiv. org [Source/PDF] [Google Scholar]
And that’s it for now 😊.
1. Note that publications.json only exists (locally) after the workflow has successfully run on GitHub and you have pulled from your remote repository. ↩︎