
feat: expressiveness mode, stateless Instructions, structured LLM output #5635

Open
theomonnom wants to merge 24 commits into main from theo/expressiveness-mode

Conversation

@theomonnom (Member) commented May 4, 2026

Summary

  • Expressiveness mode — auto-injects TTS markup instructions + speaker context into LLM, strips markup from transcripts
  • Stateless Instructions — reworked from str subclass to plain class with common/audio/text
  • STT speaker context — RecognizeStream.context + SpeakerContext protocol
  • AudioRecognition — now public, all fields/methods private except stt_context
  • Structured LLM output — llm_output_format with llm.Response annotation, streaming JSON partial parsing
  • TTS markup — TTS.Markup inner class, shared _provider_format.py for Cartesia/ElevenLabs
  • XML-aware tokenizer — BufferedTokenStream holds back tokens with unclosed XML tags (53 regression tests)
  • WorkflowInstructions — replaces InstructionParts

Expressiveness mode

```python
from livekit.agents import Agent, AgentSession, inference

agent = Agent(
    instructions="You are an empathetic therapist.",
    expressiveness=True,
    stt=inference.STT("deepgram/nova-3"),
    llm=inference.LLM("openai/gpt-4o"),
    tts=inference.TTS("cartesia/sonic-3"),
)
session = AgentSession()
await session.start(agent, room=room)
```

The framework injects system messages telling the LLM about available TTS tags:

```
The TTS supports the following formatting capabilities...
<emotion value="EMOTION"/> where EMOTION is one of: neutral, angry, excited...
<speed ratio="VALUE"/>, <volume ratio="VALUE"/>, <break time="1s"/>...
```

The LLM then uses markup naturally. Markup is stripped from transcripts and chat history:

| Path | Text |
| --- | --- |
| LLM output | `<emotion value="sad"/> I understand how you feel.` |
| TTS receives | `<emotion value="sad"/> I understand how you feel.` |
| Transcript | I understand how you feel. |
| Chat history | I understand how you feel. |
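
For intuition, a rough sketch of the stripping step (illustrative only; the real implementation is `strip_xml_tags` in `markup_utils.py`, which a review note below describes as regex-based):

```python
import re

def strip_markup(text: str) -> str:
    # drop complete tags such as <emotion value="sad"/>, keep the spoken text
    return re.sub(r"<[^<>]+>", "", text).strip()

strip_markup('<emotion value="sad"/> I understand how you feel.')
# -> 'I understand how you feel.'
```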

Custom templates and per-plugin overrides:

```python
from livekit.agents.llm.chat_context import AgentInstructions
from livekit.plugins import cartesia

# Custom framing for the injected instructions
agent = Agent(
    instructions=AgentInstructions(
        "You are helpful.",
        tts_instructions_template="Use speech markup sparingly:\n\n{tts_instructions}",
        audio_recognition_instructions_template="Speaker: {speaker_context}",
    ),
    expressiveness=True,
    tts=inference.TTS("cartesia/sonic-3"),
    llm=inference.LLM("openai/gpt-4o"),
)

# Override specific parts of a plugin's default instructions
tts = cartesia.TTS(
    instruction_parts=cartesia.InstructionParts(
        constraints="Only use emotion tags. Never use speed or volume."
    )
)
```

ElevenLabs example with normalization:

```python
from livekit.plugins import elevenlabs

agent = Agent(
    instructions="You are a friendly customer support agent.",
    expressiveness=True,
    llm=inference.LLM("openai/gpt-4o"),
    tts=elevenlabs.TTS(model="eleven_flash_v2"),
)

# LLM receives ElevenLabs-specific instructions:
#   "Normalize numbers and symbols for spoken clarity..."
#   "$42.50 → forty-two dollars and fifty cents"
#   "SSML: <break time="1.5s"/>, <phoneme alphabet="cmu-arpabet" ph="...">word</phoneme>"
#
# LLM outputs: Hold on, let me check. <break time="1.5s"/> Your total is forty-two dollars.
# Transcript:  Hold on, let me check. Your total is forty-two dollars.
```

Stateless Instructions

Reworked from str subclass to plain class. No Pydantic, no runtime state.

```python
from livekit.agents.llm.chat_context import Instructions

# Simple — same for all modalities
Instructions("You are helpful.")

# Modality-aware — common text + per-modality additions
instr = Instructions(
    "You are a helpful assistant.",
    audio="Keep responses short for voice.",
    text="Use markdown formatting.",
)
instr.as_modality("audio")  # "You are a helpful assistant.\n\nKeep responses short for voice."
instr.as_modality("text")   # "You are a helpful assistant.\n\nUse markdown formatting."
str(instr)                  # "You are a helpful assistant."
```

Hierarchy: `Instructions` → `AgentInstructions` → `WorkflowInstructions`

InstructionParts removed, replaced by WorkflowInstructions(AgentInstructions).

STT speaker context + AudioRecognition

STT plugins set metadata on their stream. Accessible anywhere on the Agent:

```python
from pydantic import BaseModel
from livekit.agents.stt import SpeakerContext

# STT plugin defines its own context model
# (structurally satisfies the SpeakerContext protocol)
class MySpeakerProfile(BaseModel):
    emotion: str | None = None
    gender: str | None = None

    def to_instructions(self) -> str:
        parts = []
        if self.emotion:
            parts.append(f"Emotion: {self.emotion}")
        if self.gender:
            parts.append(f"Gender: {self.gender}")
        return "\n".join(parts)

# Plugin sets it during recognition:
self.context = MySpeakerProfile(emotion="happy", gender="female")

# Agent reads it anywhere — nodes, tools, callbacks:
self.audio_recognition.stt_context  # MySpeakerProfile instance or None
```

AudioRecognition is now a public class but all fields and methods are private — only stt_context is exposed.
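
For example, a hypothetical read inside an Agent hook (`on_user_turn_completed` is the existing turn-completion callback; `MySpeakerProfile` is the illustrative model from above):

```python
import logging

logger = logging.getLogger(__name__)

class EmpathicAgent(Agent):
    async def on_user_turn_completed(self, turn_ctx, new_message) -> None:
        # whatever the STT stream last set, or None if the plugin sets nothing
        ctx = self.audio_recognition.stt_context
        if isinstance(ctx, MySpeakerProfile):
            logger.info("speaker context: %s", ctx.to_instructions())
```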

Structured LLM output

```python
from pydantic import BaseModel
from livekit.agents import Agent, inference, llm
from livekit.plugins import openai

class TherapistOutput(BaseModel):
    emotion: str | None = None
    therapeutic_technique: str | None = None
    response: llm.Response = ""

class TherapistAgent(Agent):
    llm_output_format = TherapistOutput

agent = TherapistAgent(
    instructions="You are an empathetic therapist.",
    llm=openai.LLM(),
    tts=inference.TTS("cartesia/sonic-3"),
)
```

All fields must have defaults; this is validated at class definition time via __init_subclass__. The LLM is configured for structured output, and the streamed JSON is partially parsed as it arrives via pydantic_core.from_json(allow_partial=True).
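
A minimal sketch of that mechanism (illustrative, not the PR's code; `buf` stands for the JSON text accumulated from the stream so far):

```python
from pydantic_core import from_json

buf = '{"emotion": "empathetic", "response": "I under'
data = from_json(buf, allow_partial=True)  # incomplete trailing value is dropped
# -> {'emotion': 'empathetic'}

# every field has a default, so a partial dict still validates cleanly:
chunk = TherapistOutput.model_validate(data)
```

Passing `allow_partial="trailing-strings"` would additionally surface the incomplete `response` text, which is one way the accumulated response deltas could be produced.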

tts_node receives BaseModel chunks (explicit opt-in — existing custom tts_node implementations that only handle str are unaffected unless llm_output_format is set):

```python
from collections.abc import AsyncIterable

class MyAgent(Agent):
    llm_output_format = TherapistOutput

    async def tts_node(
        self, text: AsyncIterable[TherapistOutput], model_settings  # type: ignore[override]
    ):
        # re-yield chunks so the default implementation still receives the stream
        async def _inspect(stream: AsyncIterable[TherapistOutput]):
            async for chunk in stream:
                chunk.emotion                # "empathetic" — populated before first text token
                chunk.therapeutic_technique  # "active listening"
                chunk.response               # text delta (accumulated)
                yield chunk

        return Agent.default.tts_node(self, _inspect(text), model_settings)
```

Parsed output stored on ChatMessage.llm_output:

```python
result = await session.run(user_input="I'm having a terrible day")
msg = result.expect.next_event(type="message").event().item
msg.text_content  # "I understand how you feel..."
msg.llm_output    # TherapistOutput(emotion="empathetic", response="...")
```

XML-aware sentence tokenizer

BufferedTokenStream now holds back tokens that contain unclosed XML tags, preventing sentence splits inside markup like <spell>U.S.A.</spell>. Blingfire batch path also merges split-tag sentences. 53 regression tests covering self-closing tags, wrapping tags, decimals in attributes, nested tags, chunk boundary splits, unicode, and a realistic multi-sentence conversation.
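For intuition, a simplified sketch of the hold-back check (modeled on the `_has_unclosed_xml_tags` fragment quoted in a review comment below; the real implementation handles more cases, such as tags opened but never closed):

```python
def has_incomplete_tag_at_end(text: str) -> bool:
    # a '<' that appears after the last '>' means a tag is still being written
    return text.rfind("<") > text.rfind(">")

has_incomplete_tag_at_end('Sure. <emotion value="hap')             # True: hold the buffer
has_incomplete_tag_at_end('Sure. <emotion value="happy"/> Okay.')  # False: safe to split
```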

@chenghao-mou requested a review from a team May 4, 2026 03:31

@theomonnom force-pushed the theo/expressiveness-mode branch from bea7e82 to c36f944 May 4, 2026 03:58
devin-ai-integration (bot) left a comment

Devin Review found 1 new potential issue.

View 11 additional findings in Devin Review.


Comment on lines 323 to +325

```diff
 if text_transforms:
-    input = _apply_text_transforms(input, text_transforms)
+    # text transforms only apply to plain text mode (no structured output)
+    input = _apply_text_transforms(input, text_transforms)  # type: ignore[arg-type]
```
devin-ai-integration (bot) commented May 4, 2026

🔴 Text transforms crash at runtime when llm_output_format sends BaseModel objects through the TTS pipeline

When llm_output_format is set on an Agent, _llm_inference_task (generation.py:219-225) sends BaseModel objects through text_ch. These flow into _tts_inference_task where _apply_text_transforms is applied unconditionally at line 325. The default text transforms (filter_markdown and filter_emoji) perform string operations like buffer += chunk (filters.py:103) and EMOJI_PATTERN.sub("", chunk) (filters.py:156) that will raise TypeError when chunk is a BaseModel instead of str.

Since DEFAULT_TTS_TEXT_TRANSFORMS = ["filter_markdown", "filter_emoji"] is always active by default, any Agent using llm_output_format will crash at runtime unless the user explicitly sets tts_text_transforms=None. The comment at line 324 acknowledges the incompatibility but no guard is implemented.
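
One possible shape for a guard (purely illustrative, not the PR's fix; assumes the structured-output setting is visible at this call site):

```python
# hypothetical guard: string transforms only make sense for str chunks
if text_transforms and llm_output_format is None:
    input = _apply_text_transforms(input, text_transforms)
```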


```diff
     current_span.set_attribute(trace_types.ATTR_SPEECH_ID, speech_handle.id)
     if instructions is not None:
-        current_span.set_attribute(trace_types.ATTR_INSTRUCTIONS, instructions)
+        current_span.set_attribute(trace_types.ATTR_INSTRUCTIONS, str(instructions))
```
Contributor

this only adds the common part to the trace but not the version used for this turn?

Member Author

Fixed — trace now shows the modality-resolved text, not just the common part.

Comment on lines +2426 to +2478

```diff
-chat_ctx.add_message(role="system", content=[instructions])
+# re-resolve instructions for the current turn's modality
+turn_modality = speech_handle.input_details.modality
+turn_instructions = instructions if instructions is not None else self._agent.instructions
```
Contributor

is it expected that this replaces the original instructions with the turn instructions entirely?


```python
Instructions("You are a helpful assistant.")

@property
def audio(self) -> str:
```
Member

this is breaking? IMO we should keep this just as a wrapper.. it's much easier to write instructions.text instead of instructions.as_modality('text')

Member Author

Instructions was supposed to be in beta, I'm not sure if anybody is using it

```python
    rtc.EventEmitter[Literal["metrics_collected", "error"] | TEvent],
    Generic[TEvent],
):
    class Markup:
```
Member

nit, not sure about the name. it wraps a TTS. why not expose these on TTS itself?

Member Author

It is the case tho?

tts.markup?

```python
    llm: NotGivenOr[llm.LLM | llm.RealtimeModel | LLMModels | str | None] = NOT_GIVEN,
    tts: NotGivenOr[tts.TTS | TTSModels | str | None] = NOT_GIVEN,
    mcp_servers: NotGivenOr[list[mcp.MCPServer] | None] = NOT_GIVEN,
    expressiveness: NotGivenOr[bool] = NOT_GIVEN,
```
Member

if a user wanted to override how they prompt the LLM for expressiveness. where should they do it?

should this be a bool | ExpressivenessOptions?

@theomonnom (Member Author) May 4, 2026

They do it inside the new AgentInstructions class

```python
        return self._interruption_detection

    @property
    def expressiveness(self) -> NotGivenOr[bool]:
```
Member

if we want options, then it'd be better to always return options vs a bool

```python
        str(instructions) if not isinstance(instructions, str) else instructions
    )

class _SafeFormatter(string.Formatter):
```
Member

nit: should this be a util?

devin-ai-integration (bot) left a comment

Devin Review found 1 new potential issue.

🐛 1 issue in files not directly in the diff

🐛 AgentConfigUpdate raises ValidationError when Agent.instructions is an Instructions object (livekit-agents/livekit/agents/voice/agent_activity.py:771)

At agent_activity.py:770-771, self._agent.instructions (typed as str | Instructions) is passed directly to llm.AgentConfigUpdate(instructions=...), whose field is typed str | None. The old Instructions class was a str subclass and had a custom __get_pydantic_core_schema__, so Pydantic accepted it. The refactored Instructions is a plain class with neither, so Pydantic v2 rejects it with ValidationError: Input should be a valid string. This crashes any agent created with Agent(instructions=Instructions(...)) when the activity starts.
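
A commit later in this branch notes "Fixes AgentConfigUpdate.instructions to use str instead of Instructions"; one minimal shape for that fix (illustrative only, the actual change may differ):

```python
# coerce at the call site so the pydantic str field validates
llm.AgentConfigUpdate(instructions=str(self._agent.instructions))
```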

View 12 additional findings in Devin Review.


devin-ai-integration (bot) left a comment

Devin Review found 2 new potential issues.

⚠️ 1 issue in files not directly in the diff

⚠️ Per-chunk markup stripping in streaming transcript fails when XML tags span LLM tokens (livekit-agents/livekit/agents/voice/agent_activity.py:2616-2617)

When expressiveness is enabled, _read_text at livekit-agents/livekit/agents/voice/agent_activity.py:2616-2617 calls self.tts.markup.to_text(chunk) on individual LLM output tokens. Since to_text uses regex to match complete XML tags (strip_xml_tags in livekit-agents/livekit/agents/tts/markup_utils.py:37), partial tags spanning multiple tokens (e.g., <emotion then value="happy"/>) won't be matched and will leak into the real-time user transcript. The final transcript stored in chat history at livekit-agents/livekit/agents/voice/agent_activity.py:2787-2788 IS correctly stripped because it operates on the full accumulated text, so only the streaming display is affected.
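
A later commit in this PR adds `TTS.Markup.to_text_stream()` for exactly this reason ("per-chunk stripping was a no-op on LLM token fragments"). A conceptual sketch of buffered stripping (illustrative, not the actual implementation):

```python
import re

def _strip(text: str) -> str:
    # drop complete tags, keep the spoken text
    return re.sub(r"<[^<>]+>", "", text)

async def buffered_strip(chunks):
    buf = ""
    async for chunk in chunks:
        buf += chunk
        # hold the buffer while a trailing tag may still be incomplete
        if buf.rfind("<") <= buf.rfind(">"):
            if stripped := _strip(buf):
                yield stripped
            buf = ""
    if buf and (stripped := _strip(buf)):
        yield stripped
```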

View 17 additional findings in Devin Review.


Comment on lines +126 to +133

```diff
+    def __str__(self) -> str:
+        return self.common
+
-        Both ``_audio_variant`` and ``_text_variant`` are preserved so this can
-        be called again for a different modality (e.g. across tool-call turns).
-        """
-        return Instructions(
-            audio=self._audio_variant,
-            text=self._text_variant,
-            _represent=self.audio if modality == "audio" else self.text,
-        )
+    def __repr__(self) -> str:
+        return f"Instructions({self.common!r})"
+
+    def __hash__(self) -> int:
+        return hash((self.common, self.audio, self.text))
```
Contributor

🟡 `Instructions.__eq__` with str violates hash contract

Instructions.__eq__ returns True when compared to a plain str with the same common value, but __hash__ produces a different value (it hashes a 3-tuple of (common, audio, text)). This violates Python's data model invariant: if a == b, then hash(a) == hash(b) must hold. This causes incorrect behavior when Instructions and str objects are mixed in sets or used as dict keys.

Demonstration:

```python
instr = Instructions("hello")
assert instr == "hello"  # True
assert hash(instr) == hash("hello")  # False! Violates contract

d = {"hello": 1}
d[instr]  # May raise KeyError despite instr == "hello"
```
Suggested change:

```diff
 def __hash__(self) -> int:
-    return hash((self.common, self.audio, self.text))
+    return hash(self.common)
```

devin-ai-integration (bot) left a comment

Devin Review found 3 new potential issues.

🐛 1 issue in files not directly in the diff

🐛 Transcription stream drops all content when llm_output_format is set, breaking chat history and user-facing transcripts (livekit-agents/livekit/agents/voice/agent_activity.py:2613-2615)

When Agent.llm_output_format is set for structured LLM output, _llm_inference_task sends BaseModel instances (not plain strings) to text_ch (livekit-agents/livekit/agents/voice/generation.py:224). This channel is tee'd into tts_text_input and tr_input. The TTS path correctly handles BaseModel in the default tts_node (livekit-agents/livekit/agents/voice/agent.py:542-549), extracting the response field delta. However, the transcription path's _read_text wrapper unconditionally skips all BaseModel instances (isinstance(chunk, (FlushSentinel, BaseModel)): continue), yielding nothing to the transcription_node. This means text_out.text is empty, which cascades to forwarded_text being empty, so the assistant's message is never added to chat_ctx — breaking conversation history, user-facing transcription, and any downstream logic that depends on the assistant message existing in the chat context.

View 20 additional findings in Devin Review.


Comment on lines +22 to +25
```python
# incomplete tag at end: < without matching >
last_open = text.rfind("<")
last_close = text.rfind(">")
if last_open > last_close:
```
Contributor

🟡 _has_unclosed_xml_tags false positive on < in non-XML text prevents sentence splitting

The _has_unclosed_xml_tags function in token_stream.py returns True whenever the text contains a < that appears after the last >, even in regular prose like "the price is < 5 dollars. That's cheap.". The check at lines 23-25 (last_open = text.rfind("<"); last_close = text.rfind(">"); if last_open > last_close: return True) fires for any bare < character, causing the streaming tokenizer to hold the entire buffer and never split sentences. This could stall TTS output for any text containing mathematical comparisons, template syntax, or other non-XML uses of <.


Comment on lines +59 to +68
```python
def __init__(
    self,
    common: str = "",
    *,
    audio: str | None = None,
    text: str | None = None,
) -> None:
    self.common = common
    self.audio = audio
    self.text = text
```
Contributor

🔴 Workflow tasks still use old Instructions(audio_text, text=text_text) positional-arg pattern

The Instructions.__init__ signature changed from (audio, *, text=None) to (common='', *, audio=None, text=None). Several workflow files that were NOT updated by this PR still construct Instructions with the old pattern where the first positional arg was the audio-specific variant:

Affected files
  • livekit-agents/livekit/agents/beta/workflows/phone_number.py:82-96
  • livekit-agents/livekit/agents/beta/workflows/dob.py:89-105
  • livekit-agents/livekit/agents/beta/workflows/name.py:114-133
  • livekit-agents/livekit/agents/beta/workflows/credit_card.py:165-177, :293-..., :388-...

With the old API, Instructions(audio_text, text=text_text) stored audio_text as the audio variant and text_text as the text variant — mutually exclusive. With the new API, audio_text becomes the common field (included in ALL modalities) and text_text becomes the text addition (appended to common). So render(modality="text") now returns audio_text + "\n\n" + text_text — concatenating audio-specific and text-specific instructions together, which is incorrect and will produce garbled LLM prompts in text mode.

Prompt for agents
The Instructions constructor was changed from Instructions(audio, *, text=None) to Instructions(common, *, audio=None, text=None), but several workflow files still use the old positional-arg pattern Instructions(audio_text, text=text_text). These files need to be updated to the new signature. The correct migration for each call site like Instructions(audio_text, text=text_text) would be Instructions(common='', audio=audio_text, text=text_text) — making common empty and placing the modality-specific text in the audio and text params. Alternatively, these workflow tasks could be refactored to use the WorkflowInstructions/resolve() pattern like address.py and email_address.py were updated. Affected files: beta/workflows/phone_number.py, beta/workflows/dob.py, beta/workflows/name.py, beta/workflows/credit_card.py.

theomonnom added 20 commits May 7, 2026 18:47
- Add expressiveness flag (Agent + AgentSession) that auto-injects TTS
  markup instructions and speaker context into LLM system messages
- Rework Instructions from str subclass to stateless class with
  common/audio/text fields. No Pydantic dependency, no runtime state.
- Add AgentInstructions with expressiveness templates, WorkflowInstructions
  replaces InstructionParts
- Add TTS Markup inner class (llm_instructions + to_text) with shared
  _provider_format.py for Cartesia/ElevenLabs
- Add RecognizeStream.context + SpeakerContext protocol for STT metadata
- Privatize AudioRecognition, expose only stt_context
- Add llm_output_format class-level attribute for structured LLM output
  with streaming JSON partial parsing
- Add llm.Response annotation, ChatMessage.llm_output field
- Validate all llm_output_format fields have defaults at class definition
BufferedTokenStream now holds back tokens that contain unclosed XML tags,
preventing sentence splits inside markup like <spell>U.S.A.</spell>.
Batch path in blingfire also merges split-tag sentences.

Removes unused TagAwareBuffer — tokenizer handles it natively.
Fixes AgentConfigUpdate.instructions to use str instead of Instructions.
21 regression tests for batch + streaming with all TTS tag patterns.
Covers batch + streaming paths with: self-closing tags, wrapping tags,
periods in attributes and content, abbreviations (U.S.A., N.A.S.A.),
phoneme with IPA/arpabet, chunk boundary splits, char-by-char streaming,
unicode (French, Chinese, emoji), mixed tags, and a realistic
multi-sentence conversation.
…cised

Blingfire doesn't split tiny fragments. Tests now use realistic
multi-sentence content inside tags so splits actually trigger and
the XML-aware merge is verified.
…failures

- ChatContext only stores str, never Instructions objects
- Per-turn modality resolution only when Instructions has audio/text variants
- Plain str instructions pass through unchanged (no re-resolution)
- Revert unintended fake_llm changes
- Fix add_message to resolve Instructions to str
…with render(), improved provider prompts

- ExpressivenessOptions moved to agent_session.py as TypedDict with DEFAULT_EXPRESSIVENESS_OPTIONS
- Instructions: removed format/as_modality/__add__, added render(modality, data) returning str
- Instructions: added resolve_template() static method for workflow modality-aware composition
- safe_render utility in utils/misc.py with nested dict→SimpleNamespace, error logging with full dotted paths
- Template data uses explicit dicts with proper namespaces (tts.markup.llm_instructions, audio_recognition.stt_context.emotion)
- AudioRecognition.llm_instructions() method matching tts.markup.llm_instructions() API
- Cartesia prompt: complete 62 emotion list, examples, XML format explained
- ElevenLabs prompt: normalization rules, SSML tags, examples
- Removed _concat_optional, _safe_format, AgentInstructions
…Labs v3)

<expression value="..."/> is the XML bridge for providers that use []
brackets natively. The LLM always generates XML, plugins convert to
native format before sending to API.

- Cartesia: native XML, no conversion needed
- ElevenLabs v2: native SSML, no conversion
- ElevenLabs v3: <expression> → [laughs], [whispers], etc.
- Inworld TTS 2: <expression> → [say excitedly], [laugh], etc.

Added TTS.Markup.convert() method, convert_expression_tags() and
strip_bracket_tags() helpers, complete provider prompts with examples.
…ting self._markup

Base TTS.__init__ calls self.Markup(self) automatically. Plugins just
define their Markup inner class — no manual self._markup assignment needed.
- Complete steering prompt with free-form delivery, non-verbals, breaks, emphasis
- Based on Inworld TTS 2 docs (steering, prompting best practices)
- Add inworld-tts-2 to inference gateway models
- Gateway detects TTS 2 vs older models (only TTS 2 supports steering)
- Remove IPA and asterisk emphasis from prompt (framework doesn't strip these)
…xamples

Before: cartesia ~464, elevenlabs ~301, elevenlabs_v3 ~363, inworld ~905 tokens
After:  cartesia ~294, elevenlabs ~110, elevenlabs_v3 ~174, inworld ~349 tokens
…veness

11 delivery styles from casual to extreme, practical non-verbal examples,
conversational and emotional range the LLM can reference.
Prevents sending chunks exceeding provider limits (e.g. Inworld 1000 chars).
Splits at sentence boundaries — never mid-sentence.
max_input_len dict in _provider_format.py, used by inference gateway.
Switched gateway from basic to blingfire tokenizer.
The convert() was called on individual LLM tokens — too early, the regex
never saw complete <expression> tags. Now converts after the sentence
tokenizer accumulates full sentences, right before sending to the API.

Removed token-level convert from default tts_node.
Added debug logging in gateway showing converted text sent to API.
Cleaned up debug prints from drive-thru example.
theomonnom and others added 4 commits May 7, 2026 18:47
All examples now layer mood + energy + pacing + vocal style.
Added singing example. Removed bland short labels the LLM was copying.
…ributes

[^>]* greedily consumed the / before >, so <expression value="..."/>
was never detected as self-closing. The tokenizer thought every tag was
unclosed and held the entire buffer, causing all text to merge into one
chunk (hitting Inworld's 1000 char limit).

Fix: [^>]*? (non-greedy) stops before the /.
Verifies that <expression value="..."/> and similar self-closing tags
with attributes do NOT block sentence splitting (batch + streaming).
…eline

Tokenizer:
- Strip XML tags before blingfire sentence detection (blingfire fails to
  split when /> sits between sentences)
- Remap clean-text offsets back to original text so tags stay with their
  sentence
- Merge pass for unclosed/tag-only sentences
- Guard blingfire against empty text input
- Try/except fallback returns unsplit text on any tokenizer failure
- All XML logic centralized in _xml_wrap_tokenizer, blingfire stays pure

Markup pipeline:
- TTS.Markup.to_text_stream() for buffered streaming markup stripping
  (per-chunk stripping was a no-op on LLM token fragments)
- TTS.Markup.normalize() for fixing unclosed self-closing tags per provider
- Provider normalize config in _provider_format._SELF_CLOSING_TAGS
- <sound> tag support for non-verbal sounds (converts to [] like expression)
- Plugins (cartesia, elevenlabs, inworld) updated with normalize()

Provider prompts:
- Rewritten Inworld, Cartesia, ElevenLabs v3 prompts with tested examples
- Per-sentence delivery tags, non-verbal sounds, thinking patterns
- Prompts tested with gpt-4.1-mini across drive-thru and frontdesk scenarios

Examples:
- Drive-thru and frontdesk updated with natural speech instructions
- Both use inworld-tts-2 with expressiveness=True

Co-Authored-By: Théo Monnom <theo@livekit.io>
@theomonnom force-pushed the theo/expressiveness-mode branch from 26288b4 to 4962032 May 8, 2026 02:53