Scraping pages using JavaScript#
The shot-scraper javascript command can be used to execute JavaScript directly against a page and return the result as JSON.
This command doesn’t produce a screenshot, but has interesting applications for scraping.
To retrieve the title of a document as a string:
shot-scraper javascript https://datasette.io/ "document.title"
This returns a JSON string:
"Datasette: An open source multi-tool for exploring and publishing data"
To return a raw string instead, use the -r or --raw option:
shot-scraper javascript https://datasette.io/ "document.title" -r
Output:
Datasette: An open source multi-tool for exploring and publishing data
To return a JSON object, wrap an object literal in parentheses:
shot-scraper javascript https://datasette.io/ "({
  title: document.title,
  tagline: document.querySelector('.tagline').innerText
})"
This returns:
{
  "title": "Datasette: An open source multi-tool for exploring and publishing data",
  "tagline": "An open source multi-tool for exploring and publishing data"
}
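The parentheses matter because a `{` at the start of a statement is parsed by JavaScript as the beginning of a block, not an object literal. The sketch below uses plain `eval()` in Node.js to illustrate that grammar rule. It is an illustration only and makes no claim about how shot-scraper evaluates the expression internally; the property values are made up.

```javascript
// Illustration of the JavaScript grammar rule only; this does not
// claim to mirror how shot-scraper evaluates code internally.

// With parentheses the braces are parsed as an object literal.
const wrapped = eval("({ title: 'Example', tagline: 'Demo' })");

// Without parentheses the braces start a block statement, so
// "title:" is read as a label and the second property is a syntax error.
let bareError = null;
try {
  eval("{ title: 'Example', tagline: 'Demo' }");
} catch (err) {
  bareError = err;
}

console.log(wrapped.title);                    // Example
console.log(bareError instanceof SyntaxError); // true
```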
Running more than one statement#
You can use () => { ... } function syntax to run multiple statements, returning a result at the end of your function.
This example raises an error if no paragraphs are found.
shot-scraper javascript https://www.example.com/ "
() => {
  var paragraphs = document.querySelectorAll('p');
  if (paragraphs.length == 0) {
    throw 'No paragraphs found';
  }
  return Array.from(paragraphs, el => el.innerText);
}"
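When a function like this grows, it can help to try the logic outside a browser first. The following is a minimal sketch, assuming a hypothetical helper named `paragraphTexts` that takes the document as a parameter and a hand-written stand-in object in place of a real DOM. Neither is part of shot-scraper.

```javascript
// Hypothetical helper (not part of shot-scraper): the same logic as the
// example above, but taking the document as a parameter so it can be
// exercised with a stand-in object outside a browser.
function paragraphTexts(doc) {
  const paragraphs = doc.querySelectorAll('p');
  if (paragraphs.length === 0) {
    throw new Error('No paragraphs found');
  }
  return Array.from(paragraphs, el => el.innerText);
}

// Minimal stand-in for a DOM document, for illustration only.
const fakeDocument = {
  querySelectorAll: selector =>
    selector === 'p' ? [{ innerText: 'First' }, { innerText: 'Second' }] : [],
};

console.log(paragraphTexts(fakeDocument)); // [ 'First', 'Second' ]
```

To use the helper with shot-scraper, its definition could be placed inside the `() => { ... }` wrapper, ending with `return paragraphTexts(document);`.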
Using async/await#
You can pass an async function if you want to use await, including to import modules from external URLs. This example loads the Readability.js library from Skypack and uses it to extract the core content of a page:
shot-scraper javascript \
https://simonwillison.net/2022/Mar/14/scraping-web-pages-shot-scraper/ "
async () => {
  const readability = await import('https://cdn.skypack.dev/@mozilla/readability');
  return (new readability.Readability(document)).parse();
}"
To use functions such as setInterval(), for example if you need to delay the result for a second to allow an animation to finish running, return a promise:
shot-scraper javascript datasette.io "
new Promise(done => setInterval(
  () => {
    done({
      title: document.title,
      tagline: document.querySelector('.tagline').innerText
    });
  }, 1000
));"
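A promise settles only once, so the later ticks of `setInterval()` in the example above have no further effect on the result. When a single delay is all that is needed, `setTimeout()` is a possible alternative. The sketch below assumes a hypothetical `delay` helper and a stand-in object in place of `document`, so that it can run in Node.js.

```javascript
// Hypothetical helper: resolves once after `ms` milliseconds.
// setTimeout fires a single time, unlike setInterval, which repeats.
const delay = ms => new Promise(resolve => setTimeout(resolve, ms));

// Waits, then reads values from the document that is passed in.
async function titleAfterDelay(doc, ms = 1000) {
  await delay(ms);
  return { title: doc.title };
}

// Stand-in object for illustration; in a page this would be `document`.
titleAfterDelay({ title: 'Example' }, 50).then(result => {
  console.log(result.title); // Example
});
```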
Bypassing Content Security Policy headers#
Some websites use Content Security Policy (CSP) headers to prevent additional JavaScript from executing on the page, as a security measure.
When using shot-scraper this can prevent some JavaScript features from working. You might see error messages that look like this:
shot-scraper javascript github.com "
async () => {
await import('https://cdn.jsdelivr.net/npm/left-pad/+esm');
return 'content-security-policy ignored' }
"
Output:
Error: TypeError: Failed to fetch dynamically imported module:
https://cdn.jsdelivr.net/npm/left-pad/+esm
You can use the --bypass-csp option to have shot-scraper run the browser in a mode that ignores these headers:
shot-scraper javascript github.com "
async () => {
await import('https://cdn.jsdelivr.net/npm/left-pad/+esm');
return 'content-security-policy ignored' }
" --bypass-csp
Output:
"content-security-policy ignored"
Running JavaScript from a file#
You can also save JavaScript to a file and execute it like this:
shot-scraper javascript datasette.io -i script.js
Or read it from standard input like this:
echo "document.title" | shot-scraper javascript datasette.io
Using this for automated tests#
If a JavaScript error occurs, a stack trace will be written to standard error and the tool will terminate with an exit code of 1.
This can be used to run JavaScript tests in continuous integration environments, by taking advantage of the throw "error message" JavaScript statement.
This example uses GitHub Actions:
- name: Test page title
  run: |-
    shot-scraper javascript datasette.io "
      if (document.title != 'Datasette') {
        throw 'Wrong title detected';
      }"
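Because any thrown error makes the command exit with code 1, several checks can be combined so that one run reports every failure. The following is a minimal sketch, assuming a hypothetical `runChecks` helper. The check names and rules are illustrative and not part of shot-scraper, and a stand-in object replaces `document` so the sketch can run in Node.js.

```javascript
// Hypothetical helper (not part of shot-scraper): runs several named
// checks against a document and throws one error listing every failure.
function runChecks(doc, checks) {
  const failures = Object.entries(checks)
    .filter(([, check]) => !check(doc))
    .map(([name]) => name);
  if (failures.length > 0) {
    throw new Error('Failed checks: ' + failures.join(', '));
  }
  return { passed: Object.keys(checks).length };
}

// The check names and rules below are illustrative assumptions.
const checks = {
  'title is set': doc => doc.title.length > 0,
  'title mentions Datasette': doc => doc.title.includes('Datasette'),
};

// Stand-in object; inside a page this would be `document`.
console.log(runChecks({ title: 'Datasette' }, checks)); // { passed: 2 }
```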
Example: Extracting page content with Readability.js#
Readability.js is “a standalone version of the readability library used for Firefox Reader View.” It lets you parse the content on a web page and extract just the title, content, byline and some other key metadata.
The following recipe imports the library from the Skypack CDN, runs it against the current page and returns the results to the console as JSON:
shot-scraper javascript https://simonwillison.net/2022/Mar/24/datasette-061/ "
async () => {
  const readability = await import('https://cdn.skypack.dev/@mozilla/readability');
  return (new readability.Readability(document)).parse();
}"
The output looks like this:
{
  "title": "Datasette 0.61: The annotated release notes",
  "byline": null,
  "dir": null,
  "lang": "en-gb",
  "content": "<div id=\"readability-page-1\" class=\"page\"><div id=\"primary\">\n\n\n\n\n<p>I released ... <this is a very long string>",
  "length": 8625,
  "excerpt": "I released Datasette 0.61 this morning\u2014closely followed by 0.61.1 to fix a minor bug. Here are the annotated release notes. In preparation for Datasette 1.0, this release includes two potentially \u2026",
  "siteName": null
}
See Extracting web page content using Readability.js and shot-scraper for more.
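The "content" field can be very long, so it may be convenient to return only a few fields. The sketch below assumes a hypothetical `summarizeArticle` helper that uses the field names shown in the output above. It also assumes that `parse()` may return null when no article is found, which is worth verifying against the Readability.js documentation. The sample object shortens the "content" and "excerpt" values.

```javascript
// Hypothetical helper: keeps a few fields from the object returned by
// Readability's parse(), using the field names shown in the output above.
function summarizeArticle(article) {
  // Assumption: parse() may return null when no article is found.
  if (article === null) {
    throw new Error('Readability could not parse this page');
  }
  const { title, byline, excerpt, length } = article;
  return { title, byline, excerpt, length };
}

// Sample values taken from the output above; "content" and "excerpt"
// are shortened here.
const sample = {
  title: 'Datasette 0.61: The annotated release notes',
  byline: null,
  content: '<div id="readability-page-1" class="page">...</div>',
  length: 8625,
  excerpt: 'I released Datasette 0.61 this morning',
  siteName: null,
};

console.log(summarizeArticle(sample).length); // 8625
```

Inside the async function passed to shot-scraper, the final line could then become `return summarizeArticle((new readability.Readability(document)).parse());`.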
shot-scraper javascript --help#
Full --help for this command:
Usage: shot-scraper javascript [OPTIONS] URL [JAVASCRIPT]
Execute JavaScript against the page and return the result as JSON
Usage:
shot-scraper javascript https://datasette.io/ "document.title"
To return a JSON object, use this:
"({title: document.title, location: document.location})"
To use setInterval() or similar, pass a promise:
"new Promise(done => setInterval(
() => {
done({
title: document.title,
h2: document.querySelector('h2').innerHTML
});
}, 1000
));"
If a JavaScript error occurs an exit code of 1 will be returned.
Options:
-i, --input FILENAME Read input JavaScript from this file
-a, --auth FILENAME Path to JSON authentication context file
-o, --output FILENAME Save output JSON to this file
-r, --raw Output JSON strings as raw text
-b, --browser [chromium|firefox|webkit|chrome|chrome-beta]
Which browser to use
--browser-arg TEXT Additional arguments to pass to the browser
--user-agent TEXT User-Agent header to use
--reduced-motion Emulate 'prefers-reduced-motion' media feature
--log-console Write console.log() to stderr
--fail Fail with an error code if a page returns an
HTTP error
--skip Skip pages that return HTTP errors
--bypass-csp Bypass Content-Security-Policy
--auth-password TEXT Password for HTTP Basic authentication
--auth-username TEXT Username for HTTP Basic authentication
--help Show this message and exit.