Intro
Maintaining an updated list of academic publications on a website can be tedious, especially if you’re using Google Scholar as your primary source. To streamline this process, we can build an automated workflow using GitHub Actions. In this post, I’ll show you how to create a system that fetches publication details from Google Scholar, formats them, and integrates them into a Hugo-based website.
Scraping Scholar Publications using JavaScript
We start by writing a Node.js script to fetch publication details from Google Scholar using node-fetch and cheerio. These libraries help us make HTTP requests and parse the HTML of the response:
- node-fetch: makes HTTP requests to the Google Scholar page.
- cheerio: parses the HTML response to extract publication data such as title, authors, year, and venue.
- fs and path: handle file and directory operations for saving the output.
- fileURLToPath and dirname: convert import.meta.url to a usable directory path, ensuring compatibility with ES modules.
import fetch from 'node-fetch';
import { load } from 'cheerio';
import fs from 'fs';
import path from 'path';
import { fileURLToPath } from 'url';
import { dirname } from 'path';
const __filename = fileURLToPath(import.meta.url);
const __dirname = dirname(__filename);
We next implement getSourceLinks, a function that retrieves external links for a publication. This is useful for sourcing PDFs, arXiv links, or other scholarly resources:
async function getSourceLinks(citationUrl, title) {
  try {
    console.log(`\nFetching details for: ${title}`);
    const response = await fetch('https://scholar.google.com' + citationUrl, {
      headers: {
        'Accept-Charset': 'UTF-8',
        'Accept-Language': 'de-DE,de;q=0.9',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
      }
    });
    const html = await response.text();
    const $ = load(html, { decodeEntities: true });

    let sourceLink = '';
    $('.gsc_oci_value_ext a').each((i, elem) => {
      const href = $(elem).attr('href');
      if (href) {
        // Prepend Google Scholar base URL if the link is relative
        const fullHref = href.startsWith('http') ? href : 'https://scholar.google.com' + href;
        console.log(`Found link: ${fullHref}`);
        if (fullHref.includes('arxiv.org') || !sourceLink) {
          sourceLink = fullHref;
        }
      }
    });

    // Also look for alternative sources
    if (!sourceLink) {
      $('.gsc_oci_value a').each((i, elem) => {
        const href = $(elem).attr('href');
        if (href && !href.includes('google.com/scholar')) {
          const fullHref = href.startsWith('http') ? href : 'https://scholar.google.com' + href;
          console.log(`Found alternative link: ${fullHref}`);
          sourceLink = fullHref;
          return false; // break each loop
        }
      });
    }

    console.log(`Final source link for "${title}": ${sourceLink}`);
    return { source: sourceLink };
  } catch (error) {
    console.error('Error fetching source links:', error);
    return { source: '' };
  }
}
getSourceLinks navigates to the detailed page for each publication on Google Scholar, enabling it to extract specific information. It prioritizes links to external sources, such as PDFs or repositories like arXiv, ensuring readers have direct access to the publication. If no primary link is available, fallback logic searches alternative sections of the page for additional sources, maximizing the likelihood of retrieving a useful link.
The main function, fetchPublications, iterates through a user’s Google Scholar profile and compiles a list of publications:
async function fetchPublications() {
  try {
    const scholarId = process.env.SCHOLAR_ID;
    const response = await fetch(
      `https://scholar.google.com/citations?user=${scholarId}&hl=de`,
      {
        headers: {
          'Accept-Charset': 'UTF-8',
          'Accept-Language': 'de-DE,de;q=0.9',
          'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
        }
      }
    );
    const html = await response.text();
    const $ = load(html, { decodeEntities: true });
    const publications = [];

    // Add delay between requests to avoid rate limiting
    const delay = ms => new Promise(resolve => setTimeout(resolve, ms));

    for (const elem of $('#gsc_a_b .gsc_a_tr').get()) {
      const titleElem = $(elem).find('.gsc_a_t a');
      const grayDivs = $(elem).find('.gsc_a_t .gs_gray');
      const yearElem = $(elem).find('.gsc_a_y');

      let year = yearElem.text().trim();
      const yearInVenue = $(elem).find('.gs_oph').text().trim();
      if (yearInVenue) {
        year = yearInVenue.replace(/,\s*/, '');
      }

      const title = titleElem.text().trim();
      const citationUrl = titleElem.attr('href');

      await delay(2000);
      const links = await getSourceLinks(citationUrl, title);

      let authors = $(grayDivs[0]).text().trim();
      authors = authors.split(', ').map(author => author.trim()).join(', ');

      // Clean up venue: remove trailing comma and [Google Scholar] text
      let venue = $(grayDivs[1]).text().trim()
        .replace(year, '')                  // Remove year
        .replace(/\[Google Scholar\]/g, '') // Remove [Google Scholar] text
        .replace(/,\s*$/, '')               // Remove trailing comma and spaces
        .trim();

      const publication = {
        title: title,
        scholarLink: 'https://scholar.google.com' + citationUrl,
        sourceLink: links.source,
        authors: authors,
        venue: venue,
        year: year,
      };

      publications.push(publication);
      console.log(`Processed: ${title} (Source: ${links.source})`);
    }

    const projectRoot = path.join(__dirname, '..', '..');
    const dataDir = path.join(projectRoot, 'data');
    if (!fs.existsSync(dataDir)) {
      fs.mkdirSync(dataDir, { recursive: true });
    }

    const jsonContent = JSON.stringify(publications, null, 2);
    fs.writeFileSync(
      path.join(dataDir, 'publications.json'),
      jsonContent,
      'utf8'
    );

    console.log(`Successfully fetched ${publications.length} publications`);
  } catch (error) {
    console.error('Error fetching publications:', error);
    process.exit(1);
  }
}

fetchPublications();
- fetchPublications fetches the user’s public profile page (as given by the workflow environment variable SCHOLAR_ID). It then parses the page to identify each publication entry, extracting details such as the title, authors, venue, and publication year (these components are identified by their HTML/CSS selectors).
- For each publication, we navigate to its “details” page with getSourceLinks to retrieve additional information, such as external source links.
- A delay is introduced between requests to prevent triggering Google’s rate-limiting measures.
- Each publication is then cleaned and formatted, ensuring metadata consistency by removing redundant or irrelevant text.
- Finally, the structured data is saved as a JSON file. It should look like this:
[ { "title": "A Comprehensive Review of Neural Networks", "authors": "John Doe, Jane Smith", "venue": "Journal of AI Research", "year": "2023", "scholarLink": "https://scholar.google.com/citations?view_op=view_citation&citation_for_view=...", "sourceLink": "https://arxiv.org/abs/1234.5678" }, ... ]
All the above JavaScript chunks should be placed in a single file, fetch-publications.js.
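If you want to test the script locally before wiring it into the workflow, something along these lines should work. This is just a sketch: it assumes a package.json with "type": "module" at the project root (like the one the workflow creates below) and uses a placeholder for the Scholar ID:
# install the scraping dependencies
npm install node-fetch cheerio
# run the script with your own Scholar ID in place of the placeholder
SCHOLAR_ID=YOUR_SCHOLAR_ID node .github/scripts/fetch-publications.js
The script then writes its output to data/publications.json relative to the project root.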
Automating the Script with GitHub Actions
To run the above JavaScript automatically in our GitHub repository, we’ll use GitHub Actions. A suitable workflow can be defined in a .yaml file.
The workflow is triggered in two ways: a daily scheduled run at midnight (UTC) or manually through GitHub’s interface. By setting these triggers, the workflow maintains flexibility while ensuring consistent updates without manual intervention.
on:
  schedule:
    - cron: '0 0 * * *'  # run daily at midnight
  workflow_dispatch:     # and allow manual trigger
The process starts by checking out the repository to the runner, allowing access to all its files. It then sets up a Node.js environment (version 18), which is required for the dependencies used in the scraping script fetch-publications.js.
jobs:
  fetch-publications:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-node@v3
        with:
          node-version: '18'
Next, we use npm to install the required libraries, node-fetch and cheerio.
      - name: Create package.json
        run: |
          echo '{
            "name": "scholar-fetch",
            "type": "module",
            "dependencies": {
              "node-fetch": "^3.3.2",
              "cheerio": "^1.0.0-rc.12"
            }
          }' > package.json
      - name: Install dependencies
        run: npm install
Once the environment is ready, we run our fetch-publications.js script.
      - name: Fetch publications
        run: node .github/scripts/fetch-publications.js
        env:
          SCHOLAR_ID: ${{ secrets.SCHOLAR_ID }}
Finally, our workflow stages the updated JSON file and commits it back to the repository only if changes are detected. This avoids unnecessary commits and ensures a clean version history.
      - name: Commit and push if changed
        run: |
          git config --global user.name 'GitHub Action'
          git config --global user.email 'action@github.com'
          git add data/publications.json
          git diff --quiet && git diff --staged --quiet || (git commit -m "Update publications" && git push)
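One caveat: in newer repositories the default GITHUB_TOKEN often has read-only permissions, in which case the git push in this step will be rejected. If that happens, granting write access at the top level of the workflow file (next to name: and on:) should fix it:
permissions:
  contents: write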
Here’s the complete yaml file (hopefully with correct indents!):
name: Fetch Google Scholar Publications

on:
  schedule:
    - cron: '0 0 * * *'
  workflow_dispatch:

jobs:
  fetch-publications:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-node@v3
        with:
          node-version: '18'
      - name: Create package.json
        run: |
          echo '{
            "name": "scholar-fetch",
            "type": "module",
            "dependencies": {
              "node-fetch": "^3.3.2",
              "cheerio": "^1.0.0-rc.12"
            }
          }' > package.json
      - name: Install dependencies
        run: npm install
      - name: Fetch publications
        run: node .github/scripts/fetch-publications.js
        env:
          SCHOLAR_ID: ${{ secrets.SCHOLAR_ID }}
      - name: Commit and push if changed
        run: |
          git config --global user.name 'GitHub Action'
          git config --global user.email 'action@github.com'
          git add data/publications.json
          git diff --quiet && git diff --staged --quiet || (git commit -m "Update publications" && git push)
The file structure we need for the workflow to work, once we push these files to GitHub, looks like this:
project-directory/
├── .github/
│ ├── scripts/
│ │ └── fetch-publications.js
│ └── workflows/
│ └── fetch-publications.yml
Another requirement for the above to work is that the SCHOLAR_ID environment variable needs to be stored as a GitHub Secret in the repository.
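You can add it under Settings → Secrets and variables → Actions in the GitHub web interface, or, if you have the GitHub CLI installed, something like this should do (the value is a placeholder for your own Scholar ID):
# store the Scholar ID as an Actions secret in the current repository
gh secret set SCHOLAR_ID --body "YOUR_SCHOLAR_ID"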
Integration in Hugo
Now that we have the JSON file updated automatically, let’s integrate it into our Hugo project. We’ll create a Hugo partial. This partial is a template that reads the info from the JSON file and displays each publication dynamically using HTML formatting, rendering it as a list of references.
{{/* layouts/partials/scholar-publications.html */}}
{{ $publications := slice }}
{{ with os.ReadFile "data/publications.json" }}
  {{ $publications = . | unmarshal }}
{{ end }}

<div class="publications-list">
  {{ if and $publications (gt (len $publications) 0) }}
    {{ range $publications }}
      <div class="publication">
        <p class="citation">
          {{ .authors }}
          {{ with .year }}({{ . }}).{{ end }}
          <em class="publication-title">{{ .title }}</em>.
          {{ with .venue }}{{ . }}{{ end }}
          <span class="publication-links">
            {{ if .sourceLink }}
              <br>
              <a href="{{ .sourceLink }}" target="_blank" rel="noopener" class="source-link">[Source/PDF]</a>
            {{ end }}
            <a href="{{ .scholarLink }}" target="_blank" rel="noopener" class="scholar-link">[Google Scholar]</a>
          </span>
        </p>
      </div>
    {{ end }}
  {{ else }}
    <p>No publications available yet. Please check back later.</p>
  {{ end }}
</div>
Finally, include the partial in your desired Hugo layout using:
{{ partial "scholar-publications.html" . }}
It’s also possible to include the references list wherever you need it, for example in a post (see below!), if you put the above partial call in a shortcode HTML file, for example layouts/shortcodes/publications.html.
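For completeness, here is a minimal sketch of such a shortcode (the filename publications.html is just an example); it simply wraps the partial:
{{/* layouts/shortcodes/publications.html */}}
{{ partial "scholar-publications.html" . }}
In a post’s markdown, the list can then be rendered with {{< publications >}}.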
With the Hugo integration, the file structure should now look like this:1
hugo-project/
├── data/
│   └── publications.json
├── layouts/
│   ├── shortcodes/
│   │   └── publications.html
│   └── partials/
│       └── scholar-publications.html
├── .github/
│   ├── scripts/
│   │   └── fetch-publications.js
│   └── workflows/
│       └── fetch-publications.yml
And here’s what it looks like for my Google Scholar ID:
C Hanck, M Arnold, A Gerber, M Schmelzer (2019). Introduction to Econometrics with R. Essen: University of Duisburg-Essen. [Source/PDF] [Google Scholar]

C Hanck, MC Arnold (2022). Hierarchical Bayes modelling of penalty conversion rates of Bundesliga players. AStA Advances in Statistical Analysis, 1-28 [Source/PDF] [Google Scholar]

MC Arnold, T Reinschlüssel (2024). Adaptive Unit Root Inference in Autoregressions using the Lasso Solution Path. arXiv preprint arXiv:2404.06205 [Source/PDF] [Google Scholar]

MC Arnold, C Hanck (2019). On combining evidence from heteroskedasticity robust panel unit root tests in pooled regressions. Journal of Risk and Financial Management 12 (3), 117 [Source/PDF] [Google Scholar]

MC Arnold, T Reinschlüssel (2024). Bootstrap Adaptive Lasso Solution Path Unit Root Tests. arXiv preprint arXiv:2409.07859 [Source/PDF] [Google Scholar]

T Reinschlüssel, MC Arnold (2024). Information-Enriched Selection of Stationary and Non-Stationary Autoregressions using the Adaptive Lasso. arXiv preprint arXiv:2402.16580 [Source/PDF] [Google Scholar]

MC Arnold (2024). Information-Enriched Selection of Stationary and Non-Stationary Autoregressions using the Adaptive Lasso. arXiv. org [Source/PDF] [Google Scholar]
And that’s it for now 😊.
1. Note that publications.json only exists (locally) after the workflow has successfully run on GitHub and you have pulled from your remote repository. ↩︎