
Fix: JSON schema union types (type arrays) fail with 'type must be a string' error #1675

Merged
rlouf merged 8 commits into dottxt-ai:main from brightlikethelight:fix-dynamic-schema-1383 on Jul 19, 2025

Conversation

Contributor

@brightlikethelight commented Jul 3, 2025

Summary

Background

JSON schemas with union types specified as arrays (e.g., {"type": ["string", "null"]}) currently fail with a ValueError: "'type' must be a string". This prevents the use of optional fields and other union type patterns in dynamic JSON schema generation.

The root cause is that the current version of outlines-core (0.1.26) does not support type arrays. This was fixed in outlines-core PR dottxt-ai/outlines-core#138, but that fix is only available in outlines-core v0.2+. Upgrading to v0.2+ requires significant changes due to breaking API changes (see #1380).

Solution

This PR implements a Python-side preprocessing step that:

  1. Recursively traverses JSON schemas
  2. Converts type arrays to the equivalent anyOf format
  3. Preserves type-specific constraints (minLength, pattern, etc.)

This is a temporary workaround until outlines can be migrated to outlines-core v0.2+.
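The three steps above can be sketched roughly as follows. This is a minimal illustration and not the code merged in this PR: the function name `convert_type_arrays`, the `TYPE_KEYWORDS` table, and the choice of which keywords count as type-specific are all assumptions made for the example. The sketch also refuses schemas that combine a type array with their own `anyOf`, since merging the two needs extra care.

```python
from typing import Any

# Assumption: a partial, illustrative map of keywords that constrain only one type.
TYPE_KEYWORDS = {
    "string": {"minLength", "maxLength", "pattern", "format"},
    "integer": {"minimum", "maximum", "multipleOf"},
    "number": {"minimum", "maximum", "multipleOf"},
    "array": {"items", "minItems", "maxItems"},
    "object": {"properties", "required", "additionalProperties"},
}
ALL_TYPE_KEYWORDS = set().union(*TYPE_KEYWORDS.values())
# Keywords whose value maps names to subschemas.
SCHEMA_MAPS = ("properties", "$defs", "definitions")
# Keywords whose value is a subschema or a list of subschemas.
SUBSCHEMA_KEYS = ("items", "additionalProperties", "anyOf", "oneOf", "allOf")


def convert_type_arrays(schema: Any) -> Any:
    """Return a copy of `schema` with each type array rewritten as anyOf (hypothetical helper)."""
    if isinstance(schema, list):
        return [convert_type_arrays(item) for item in schema]
    if not isinstance(schema, dict):
        return schema

    # Step 1: recurse into the places where subschemas can appear.
    out = {}
    for key, value in schema.items():
        if key in SCHEMA_MAPS and isinstance(value, dict):
            out[key] = {name: convert_type_arrays(sub) for name, sub in value.items()}
        elif key in SUBSCHEMA_KEYS:
            out[key] = convert_type_arrays(value)
        else:
            out[key] = value

    types = out.get("type")
    if not isinstance(types, list):
        return out
    if "anyOf" in out:
        raise NotImplementedError("type array combined with anyOf is out of scope here")

    # Step 2: one anyOf branch per listed type.
    # Step 3: each branch keeps only the constraints that apply to its type.
    branches = []
    for type_name in types:
        branch = {"type": type_name}
        for keyword in TYPE_KEYWORDS.get(type_name, ()):
            if keyword in out:
                branch[keyword] = out[keyword]
        branches.append(branch)

    # Keywords not tied to a type (for example "description") stay on the outer schema.
    result = {k: v for k, v in out.items() if k != "type" and k not in ALL_TYPE_KEYWORDS}
    result["anyOf"] = branches
    return result
```

With this sketch, `{"type": ["string", "null"], "minLength": 3}` becomes an `anyOf` whose string branch carries `minLength` while the null branch carries nothing else.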

Test plan

  • Added comprehensive unit tests in test_json_schema_union_types.py
  • Tests cover simple optionals, nested optionals, arrays with optional items, and constraint preservation
  • All tests pass with the current implementation

Example

Before preprocessing:

{
  "type": "object",
  "properties": {
    "age": {"type": ["integer", "null"]}
  }
}

After preprocessing:

{
  "type": "object", 
  "properties": {
    "age": {"anyOf": [{"type": "integer"}, {"type": "null"}]}
  }
}

Related Issues

…reation

This commit adds a Python-side workaround for issue dottxt-ai#1383 where dynamic
JSON schema creation fails when using union types (type arrays like
["string", "null"]) with nested optionals.

The fix introduces a preprocessing step that converts JSON schema type
arrays into the anyOf format that outlines-core 0.1.26 can handle.

Key changes:
- Add json_schema_utils.py with preprocessing function
- Update JsonSchema class to preprocess schemas before passing to Rust
- Add comprehensive tests for various union type scenarios

Fixes dottxt-ai#1383
@brightlikethelight brightlikethelight changed the title Fix: Handle JSON schema union types in dynamic schema creation Fix: JSON schema union types (type arrays) fail with 'type must be a string' error Jul 3, 2025
Contributor

@RobinPicard RobinPicard left a comment


Thanks a lot for both the detailed issue and this excellent contribution! I have just a little comment, but looks good to me otherwise. Upgrading outlines-core to v0.2+ is something we want to do in the coming weeks fyi.

Comment thread outlines/types/json_schema_utils.py Outdated
schema_dict = schema

preprocessed = _preprocess_schema_dict(schema_dict)
return json.dumps(preprocessed)

We need to include the ensure_ascii argument here. Otherwise the value provided by the user will be ignored if it was False.
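To illustrate the concern: `json.dumps` defaults to `ensure_ascii=True`, so re-serializing a schema without forwarding the caller's setting escapes any non-ASCII characters. The wrapper name `dump_schema` below is hypothetical and only stands in for the serialization step under review.

```python
import json


def dump_schema(schema: dict, ensure_ascii: bool = True) -> str:
    """Hypothetical wrapper: forward the caller's ensure_ascii choice to json.dumps."""
    return json.dumps(schema, ensure_ascii=ensure_ascii)


schema = {"type": "string", "description": "café"}

# Default behaviour: the non-ASCII character is written as a \u escape.
escaped = dump_schema(schema)

# With ensure_ascii=False the character is written as-is.
preserved = dump_schema(schema, ensure_ascii=False)
```

If the preprocessing step called `json.dumps(preprocessed)` with no argument, a user who asked for `ensure_ascii=False` would get the escaped form anyway.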

…reation

- Add preprocessing to convert type arrays like ["string", "null"] to anyOf format
- Implement thread-safe LRU cache with compression and performance optimizations
- Add comprehensive test suite with 24 test cases covering edge cases
- Include DoS protection, graceful fallback, and performance metrics
- Support all JSON schema keywords and nested structures
- Add benchmark scripts and reproduction examples

Fixes dottxt-ai#1383
@RobinPicard
Contributor

I don't think we need all that for such a small change (especially considering it'll be removed in a few weeks). I think what you had in your 1st commit was sufficient. There's still the issue of the ensure_ascii argument.

brightlikethelight and others added 4 commits July 5, 2025 10:39
- Simplified json_schema_utils.py from 700+ lines to ~120 lines
- Added ensure_ascii parameter to preprocessing function
- Updated JsonSchema class to pass ensure_ascii to preprocessing
- Simplified tests to match minimal implementation
- Removed benchmark scripts and complex features
- Fixed edge case where type-specific properties were not properly isolated

This addresses reviewer feedback requesting a simpler implementation
that preserves the ensure_ascii parameter.
… compatibility

- Replaced match/case syntax with if/elif statements in to_regex function
- Fixed JSON schema preprocessing to preserve original format when unchanged
- All tests now pass, style checks pass
- Maintains backward compatibility with Python 3.9

This fixes CI failures caused by Python 3.10+ syntax in Python 3.9 environment.
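As a rough illustration of that kind of rewrite, the sketch below shows a dispatch written with `if`/`elif` and `isinstance`, with the Python 3.10+ `match` form kept in a comment. The `Text` and `Maybe` classes and the `to_regex_sketch` function are hypothetical stand-ins for this example and are not the term types or the `to_regex` function in the repository.

```python
import re
from dataclasses import dataclass


@dataclass
class Text:
    """Hypothetical term: a literal string."""
    value: str


@dataclass
class Maybe:
    """Hypothetical term: an optional sub-term."""
    term: object


def to_regex_sketch(term: object) -> str:
    # Python 3.10+ form, which is a SyntaxError on Python 3.9:
    #
    #     match term:
    #         case Text():
    #             return re.escape(term.value)
    #         case Maybe():
    #             return f"({to_regex_sketch(term.term)})?"
    #
    # Equivalent dispatch that also parses on Python 3.9:
    if isinstance(term, Text):
        return re.escape(term.value)
    elif isinstance(term, Maybe):
        return f"({to_regex_sketch(term.term)})?"
    else:
        raise TypeError(f"Unsupported term: {type(term).__name__}")
```

Class patterns such as `case Text():` perform an `isinstance` check, so the `if`/`elif` chain behaves the same way for these simple cases.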
- Added try/except in preprocess_schema_for_union_types to handle invalid JSON
- Return original string unchanged for invalid JSON to preserve error handling
- Fixed pre-commit formatting issues (trailing whitespace, end-of-file)
- All tests pass including union type tests and DSL tests

This fixes the failing CI test and pre-commit hook failures.
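The fallback behaviour described in the last two commits might look something like the sketch below. The function name `preprocess_or_passthrough` and its `transform` parameter are assumptions made to keep the example self-contained; the real `preprocess_schema_for_union_types` may be structured differently.

```python
import json
from typing import Any, Callable


def preprocess_or_passthrough(
    schema: str,
    transform: Callable[[Any], Any],
    ensure_ascii: bool = True,
) -> str:
    """Hypothetical wrapper around a schema transform with two pass-through cases."""
    try:
        parsed = json.loads(schema)
    except json.JSONDecodeError:
        # Invalid JSON: return the input untouched so that downstream
        # validation reports the error exactly as it did before.
        return schema

    converted = transform(parsed)
    if converted == parsed:
        # Nothing changed: keep the caller's original formatting.
        return schema
    return json.dumps(converted, ensure_ascii=ensure_ascii)
```

Returning the original string in both cases means the preprocessing step is invisible unless it actually rewrites something.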
@RobinPicard
Contributor

Merging is blocked by the failing coverage check.

@RobinPicard RobinPicard force-pushed the fix-dynamic-schema-1383 branch from 989b447 to 848f8ca Compare July 19, 2025 17:59
@rlouf rlouf merged commit e12442f into dottxt-ai:main Jul 19, 2025
6 checks passed
@RobinPicard
Contributor

I fixed the test coverage issue and pushed to your branch. Thanks again for the contribution @brightlikethelight!


Development

Successfully merging this pull request may close these issues.

JSON schema union types (type arrays) fail with 'type must be a string' error

3 participants