An Apify actor that crawls Shopify stores via sitemap.xml and fetches product data using the Storefront GraphQL API. Optimized for speed and cost with per-host batching, incremental processing, and buffered dataset writes.
Features:

- Reads `sitemap.xml` and filters for product URLs (`/products/<handle>`)
- Batches GraphQL requests per store using aliases, so fewer round-trips are needed (see the sketch after this list)
- Optional incremental runs (skips already processed product IDs)
- Optional `lastmod` cutoff to skip old products
- Outputs a single record per product; all variants are available under `additional.variants`
- Extensible via `extendScraperFunction` and `extendOutputFunction`
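Alias batching works by packing one `product(handle:)` lookup per handle into a single GraphQL document. Below is a minimal sketch of the idea against the Shopify Storefront API; the function name and field selection are illustrative, not the actor's actual code:

```javascript
// Sketch: resolve many product handles against one store in a single request.
// Assumes Node 18+ (global fetch); helper and field names are illustrative.
async function fetchProductBatch(host, handles, { version, token }) {
    const fields = 'id title handle updatedAt';
    // One alias (p0, p1, ...) per handle collapses N lookups into one round-trip.
    const query = `query {
${handles.map((h, i) => `  p${i}: product(handle: ${JSON.stringify(h)}) { ${fields} }`).join('\n')}
}`;

    const res = await fetch(`https://${host}/api/${version}/graphql.json`, {
        method: 'POST',
        headers: {
            'Content-Type': 'application/json',
            'X-Shopify-Storefront-Access-Token': token,
        },
        body: JSON.stringify({ query }),
    });
    const { data } = await res.json();
    // Unknown handles resolve to null, so map aliases back to handles by position.
    return handles.map((h, i) => ({ handle: h, product: data?.[`p${i}`] ?? null }));
}
```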
Input options:

- `startUrls`: array of `sitemap.xml` URLs
- `storefrontApiVersion`: Storefront API version (e.g., `2024-07`)
- `storefrontAccessToken`: your Storefront access token
- `maxRequestsPerCrawl`, `maxConcurrency`, `maxRequestRetries`, `proxyConfig`, `debugLog`: standard crawler and logging options
- `updatedSince`: ISO date; skips products whose `<lastmod>` is older than this
- `batchSize`: product handles per GraphQL request (default 10)
- `flushIntervalMs`: max delay in milliseconds before sending a partial batch (default 300)
- `perHostConcurrency`: parallel GraphQL requests per store (default 2)
- `bufferWrites`: buffer dataset writes (default `true`)
- `bufferSize`: items per dataset push (default 100)
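As a concrete example, an incremental run that skips stale products and tunes batching might use an input like this (values are illustrative, not recommendations):

```json
{
    "startUrls": [{ "url": "https://example.com/sitemap.xml" }],
    "storefrontApiVersion": "2024-07",
    "storefrontAccessToken": "<YOUR_STOREFRONT_TOKEN>",
    "updatedSince": "2024-01-01",
    "batchSize": 20,
    "perHostConcurrency": 2,
    "bufferWrites": true,
    "bufferSize": 200
}
```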
Run locally:

- Install dependencies: `npm install`
- Create local input at `apify_storage/key_value_stores/default/INPUT.json`, for example:

```json
{
    "startUrls": [{ "url": "https://example.com/sitemap.xml" }],
    "storefrontApiVersion": "2024-07",
    "storefrontAccessToken": "<YOUR_STOREFRONT_TOKEN>",
    "maxRequestsPerCrawl": 50,
    "maxConcurrency": 10,
    "debugLog": true
}
```

- Start the actor: `npm start` (or `npm run dev` for development mode with auto-restart)

Workflows in `.github/workflows/`:
- `ci.yml`: install, lint, and syntax check on push/PR to `main`
- `codeql.yml`: CodeQL security analysis on push/PR and weekly
Run with Docker Compose:

```bash
make init   # creates .env and INPUT.json from templates
make run    # docker compose up --build actor
```

Outputs will be in `apify_storage/datasets/default`.
Extensibility:

- `extendScraperFunction`: lifecycle hooks (`SETUP`, `FILTER_SITEMAP_URL`, `PRENAVIGATION`, `POSTNAVIGATION`, `RUN`, `FINISHED`)
- `extendOutputFunction`: transform or filter final records before they are saved to the Dataset (see the sketch below)
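For example, an `extendOutputFunction` could drop records without variants and tag the rest with a timestamp. The exact argument shape the actor passes is an assumption here; adjust it to match the actor's implementation:

```javascript
// Illustrative sketch only — the argument shape is assumed, not taken from the
// actor's source. Returning null is a common convention for skipping a record.
async (item) => {
    // Skip products that have no variants.
    if (!item.additional?.variants?.length) return null;
    // Keep a trimmed record and record when it was scraped.
    return {
        id: item.id,
        title: item.title,
        handle: item.handle,
        variantCount: item.additional.variants.length,
        scrapedAt: new Date().toISOString(),
    };
};
```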
This project is licensed under the Apache License 2.0. See LICENSE and NOTICE.