
Add local parser support for zero-token portal scans #594

@lejrn

Problem

Some company career pages expose complete job data in server-rendered HTML or stable public payloads, but they do not have a scanner-supported ATS API. Today those companies fall back to agent/browser workflows, which are more expensive than a local structured parser.

Proposed Change

Add a local_parser scan source that lets scan.mjs execute an explicitly configured local parser command for a tracked company. The parser prints normalized jobs JSON to stdout, and scan.mjs applies the existing title filtering, deduplication, dry-run behavior, pipeline append, and scan history logic.

The local parser script performs the HTTP request to the career page/API and parses the response locally. This keeps the agent out of the scraping loop entirely, saving the LLM tokens that would otherwise be spent reading page snapshots or deciding what to extract.
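To make the parser side concrete, here is a minimal sketch of what such a script might look like for a server-rendered page. Everything in it is an assumption for illustration: the `class="job"` markup, the `title`/`url` field names, and the helper names `JobLinkParser`, `parse_jobs`, and `emit` are hypothetical, and the exact jobs-json-v1 schema is not spelled out in this issue. A real parser would fetch the page itself; the sketch parses an inline sample so it runs offline.

```python
import json
import sys
from html.parser import HTMLParser


class JobLinkParser(HTMLParser):
    """Collects <a class="job" href="...">Title</a> anchors (assumed markup)."""

    def __init__(self):
        super().__init__()
        self.jobs = []
        self._href = None  # href of the job anchor currently being read
        self._text = []    # text fragments seen inside that anchor

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and "job" in (attrs.get("class") or "").split():
            self._href = attrs.get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Collapse whitespace so nested tags do not leave odd spacing.
            title = " ".join("".join(self._text).split())
            if title:
                self.jobs.append({"title": title, "url": self._href})
            self._href = None


def parse_jobs(html):
    """Return a list of {"title", "url"} dicts found in the HTML."""
    parser = JobLinkParser()
    parser.feed(html)
    parser.close()
    return parser.jobs


def emit(jobs, out=None):
    """Write the jobs to stdout as a single JSON document."""
    print(json.dumps({"jobs": jobs}), file=out or sys.stdout)


# Inline sample standing in for the fetched career page.
SAMPLE_HTML = (
    '<ul><li><a class="job" href="/careers/ml-engineer">ML <b>Engineer</b></a></li>'
    '<li><a href="/about">About</a></li></ul>'
)
emit(parse_jobs(SAMPLE_HTML))
```

The script deliberately leaves URLs relative and does no filtering, since the proposal has scan.mjs handle URL resolution, title filtering, and deduplication.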

Example

Use Cohere as the example parser:

  • templates/portals.example.yml configures Cohere with scan_method: local_parser.
  • scripts/parsers/cohere_jobs.py reads Cohere jobs from the Ashby public board API and emits jobs-json-v1 compatible stdout.
  • Generated JSON artifacts are kept under data/parser-output/{company}/ and are git-ignored, except for .gitkeep placeholders.

Why This Helps

  • Keeps SSR/static career page scanning zero-token.
  • Avoids Playwright/browser scraping when a deterministic parser exists.
  • Keeps parsers explicit in portals.yml rather than auto-discovering executable files.
  • Preserves existing scanner filtering and dedup behavior.

Acceptance Criteria

  • scan.mjs can run a configured local parser without shell interpolation.
  • Parser stdout can be a JSON array, { jobs: [] }, or { results: [] }.
  • Relative URLs resolve against careers_url.
  • Parser failures are reported without stopping the whole scan.
  • Docs explain the parser contract and output artifact location.
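
The scanner-side criteria above can be sketched as follows. scan.mjs is JavaScript, so this Python version only illustrates the intended logic, not the actual implementation; the function names `extract_jobs`, `normalize_jobs`, and `run_local_parser`, the `url` field, and the 60-second timeout are all assumptions made for the example.

```python
import json
import subprocess
import sys
from urllib.parse import urljoin


def extract_jobs(stdout_text):
    """Accept a bare JSON array, {"jobs": [...]}, or {"results": [...]}."""
    payload = json.loads(stdout_text)
    if isinstance(payload, list):
        return payload
    if isinstance(payload, dict):
        for key in ("jobs", "results"):
            if isinstance(payload.get(key), list):
                return payload[key]
    raise ValueError("parser stdout is not a supported jobs payload")


def normalize_jobs(jobs, careers_url):
    """Resolve relative job URLs against careers_url; absolute URLs pass through."""
    return [{**job, "url": urljoin(careers_url, job["url"])} for job in jobs]


def run_local_parser(command, careers_url, timeout=60):
    """Run one parser as an argv list (no shell) and return (jobs, error)."""
    try:
        # A list argument with the default shell=False means no shell
        # interpolation of the configured command.
        proc = subprocess.run(
            command, capture_output=True, text=True, timeout=timeout, check=True
        )
        return normalize_jobs(extract_jobs(proc.stdout), careers_url), None
    except (OSError, subprocess.SubprocessError, ValueError, KeyError, TypeError) as exc:
        # Report the failure to the caller instead of aborting the whole scan.
        return [], f"{type(exc).__name__}: {exc}"


# Demo: a stand-in parser command that prints a {"results": [...]} payload.
demo_command = [
    sys.executable,
    "-c",
    "import json; print(json.dumps({'results': [{'title': 'ML Engineer', 'url': '/jobs/1'}]}))",
]
jobs, error = run_local_parser(demo_command, "https://example.com/careers")
print(jobs, error)
```

Returning an `(jobs, error)` pair rather than raising lets the caller log a failed company and continue with the next one, which is one way to satisfy the "without stopping the whole scan" criterion.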
