
feat: expressiveness mode, stateless Instructions, structured LLM output #5635

Open
theomonnom wants to merge 24 commits into main from theo/expressiveness-mode

Conversation

@theomonnom (Member) commented May 4, 2026

Summary

  • Expressiveness mode — auto-injects TTS markup instructions + speaker context into LLM, strips markup from transcripts
  • Stateless Instructions — reworked from str subclass to plain class with common/audio/text
  • STT speaker context — RecognizeStream.context + SpeakerContext protocol
  • AudioRecognition — now public, all fields/methods private except stt_context
  • Structured LLM output — llm_output_format with llm.Response annotation, streaming JSON partial parsing
  • TTS markup — TTS.Markup inner class, shared _provider_format.py for Cartesia/ElevenLabs
  • XML-aware tokenizer — BufferedTokenStream holds back tokens with unclosed XML tags (53 regression tests)
  • WorkflowInstructions — replaces InstructionParts

Expressiveness mode

```python
from livekit.agents import Agent, AgentSession, inference

agent = Agent(
    instructions="You are an empathetic therapist.",
    expressiveness=True,
    stt=inference.STT("deepgram/nova-3"),
    llm=inference.LLM("openai/gpt-4o"),
    tts=inference.TTS("cartesia/sonic-3"),
)
session = AgentSession()
await session.start(agent, room=room)
```

The framework injects system messages telling the LLM about available TTS tags:

```
The TTS supports the following formatting capabilities...
<emotion value="EMOTION"/> where EMOTION is one of: neutral, angry, excited...
<speed ratio="VALUE"/>, <volume ratio="VALUE"/>, <break time="1s"/>...
```

The LLM then uses markup naturally. Markup is stripped from transcripts and chat history:

| Path | Text |
| --- | --- |
| LLM output | `<emotion value="sad"/> I understand how you feel.` |
| TTS receives | `<emotion value="sad"/> I understand how you feel.` |
| Transcript | I understand how you feel. |
| Chat history | I understand how you feel. |
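
For intuition, a rough sketch of the stripping step (illustrative only; the real implementation is `strip_xml_tags` in `markup_utils.py`, which a review note below describes as regex-based):

```python
import re

def strip_markup(text: str) -> str:
    # drop complete tags such as <emotion value="sad"/>, keep the spoken text
    return re.sub(r"<[^<>]+>", "", text).strip()

strip_markup('<emotion value="sad"/> I understand how you feel.')
# -> 'I understand how you feel.'
```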

Custom templates and per-plugin overrides:

```python
from livekit.agents.llm.chat_context import AgentInstructions
from livekit.plugins import cartesia

# Custom framing for the injected instructions
agent = Agent(
    instructions=AgentInstructions(
        "You are helpful.",
        tts_instructions_template="Use speech markup sparingly:\n\n{tts_instructions}",
        audio_recognition_instructions_template="Speaker: {speaker_context}",
    ),
    expressiveness=True,
    tts=inference.TTS("cartesia/sonic-3"),
    llm=inference.LLM("openai/gpt-4o"),
)

# Override specific parts of a plugin's default instructions
tts = cartesia.TTS(
    instruction_parts=cartesia.InstructionParts(
        constraints="Only use emotion tags. Never use speed or volume."
    )
)
```

ElevenLabs example with normalization:

```python
from livekit.plugins import elevenlabs

agent = Agent(
    instructions="You are a friendly customer support agent.",
    expressiveness=True,
    llm=inference.LLM("openai/gpt-4o"),
    tts=elevenlabs.TTS(model="eleven_flash_v2"),
)

# LLM receives ElevenLabs-specific instructions:
#   "Normalize numbers and symbols for spoken clarity..."
#   "$42.50 → forty-two dollars and fifty cents"
#   "SSML: <break time="1.5s"/>, <phoneme alphabet="cmu-arpabet" ph="...">word</phoneme>"
#
# LLM outputs: Hold on, let me check. <break time="1.5s"/> Your total is forty-two dollars.
# Transcript:  Hold on, let me check. Your total is forty-two dollars.
```

Stateless Instructions

Reworked from str subclass to plain class. No Pydantic, no runtime state.

```python
from livekit.agents.llm.chat_context import Instructions

# Simple — same for all modalities
Instructions("You are helpful.")

# Modality-aware — common text + per-modality additions
instr = Instructions(
    "You are a helpful assistant.",
    audio="Keep responses short for voice.",
    text="Use markdown formatting.",
)
instr.as_modality("audio")  # "You are a helpful assistant.\n\nKeep responses short for voice."
instr.as_modality("text")   # "You are a helpful assistant.\n\nUse markdown formatting."
str(instr)                  # "You are a helpful assistant."
```

Hierarchy: `Instructions` → `AgentInstructions` → `WorkflowInstructions`

InstructionParts removed, replaced by WorkflowInstructions(AgentInstructions).

STT speaker context + AudioRecognition

STT plugins set metadata on their stream. Accessible anywhere on the Agent:

```python
from pydantic import BaseModel
from livekit.agents.stt import SpeakerContext

# STT plugin defines its own context model
# (structurally satisfies the SpeakerContext protocol)
class MySpeakerProfile(BaseModel):
    emotion: str | None = None
    gender: str | None = None

    def to_instructions(self) -> str:
        parts = []
        if self.emotion:
            parts.append(f"Emotion: {self.emotion}")
        if self.gender:
            parts.append(f"Gender: {self.gender}")
        return "\n".join(parts)

# Plugin sets it during recognition:
self.context = MySpeakerProfile(emotion="happy", gender="female")

# Agent reads it anywhere — nodes, tools, callbacks:
self.audio_recognition.stt_context  # MySpeakerProfile instance or None
```

AudioRecognition is now a public class but all fields and methods are private — only stt_context is exposed.
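
For example, a hypothetical read inside an Agent hook (`on_user_turn_completed` is the existing turn-completion callback; `MySpeakerProfile` is the illustrative model from above):

```python
import logging

logger = logging.getLogger(__name__)

class EmpathicAgent(Agent):
    async def on_user_turn_completed(self, turn_ctx, new_message) -> None:
        # whatever the STT stream last set, or None if the plugin sets nothing
        ctx = self.audio_recognition.stt_context
        if isinstance(ctx, MySpeakerProfile):
            logger.info("speaker context: %s", ctx.to_instructions())
```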

Structured LLM output

```python
from pydantic import BaseModel
from livekit.agents import Agent, inference, llm
from livekit.plugins import openai

class TherapistOutput(BaseModel):
    emotion: str | None = None
    therapeutic_technique: str | None = None
    response: llm.Response = ""

class TherapistAgent(Agent):
    llm_output_format = TherapistOutput

agent = TherapistAgent(
    instructions="You are an empathetic therapist.",
    llm=openai.LLM(),
    tts=inference.TTS("cartesia/sonic-3"),
)
```

All fields must have defaults; this is validated at class definition time via __init_subclass__. The LLM is configured for structured output, and the streamed JSON is partially parsed as it arrives via pydantic_core.from_json(allow_partial=True).
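
A minimal sketch of that mechanism (illustrative, not the PR's code; `buf` stands for the JSON text accumulated from the stream so far):

```python
from pydantic_core import from_json

buf = '{"emotion": "empathetic", "response": "I under'
data = from_json(buf, allow_partial=True)  # incomplete trailing value is dropped
# -> {'emotion': 'empathetic'}

# every field has a default, so a partial dict still validates cleanly:
chunk = TherapistOutput.model_validate(data)
```

Passing `allow_partial="trailing-strings"` would additionally surface the incomplete `response` text, which is one way the accumulated response deltas could be produced.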

tts_node receives BaseModel chunks (explicit opt-in — existing custom tts_node implementations that only handle str are unaffected unless llm_output_format is set):

```python
from collections.abc import AsyncIterable

class MyAgent(Agent):
    llm_output_format = TherapistOutput

    async def tts_node(
        self, text: AsyncIterable[TherapistOutput], model_settings  # type: ignore[override]
    ):
        # re-yield chunks so the default implementation still receives the stream
        async def _inspect(stream: AsyncIterable[TherapistOutput]):
            async for chunk in stream:
                chunk.emotion                # "empathetic" — populated before first text token
                chunk.therapeutic_technique  # "active listening"
                chunk.response               # text delta (accumulated)
                yield chunk

        return Agent.default.tts_node(self, _inspect(text), model_settings)
```

Parsed output stored on ChatMessage.llm_output:

```python
result = await session.run(user_input="I'm having a terrible day")
msg = result.expect.next_event(type="message").event().item
msg.text_content  # "I understand how you feel..."
msg.llm_output    # TherapistOutput(emotion="empathetic", response="...")
```

XML-aware sentence tokenizer

BufferedTokenStream now holds back tokens that contain unclosed XML tags, preventing sentence splits inside markup like <spell>U.S.A.</spell>. Blingfire batch path also merges split-tag sentences. 53 regression tests covering self-closing tags, wrapping tags, decimals in attributes, nested tags, chunk boundary splits, unicode, and a realistic multi-sentence conversation.
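For intuition, a simplified sketch of the hold-back check (modeled on the `_has_unclosed_xml_tags` fragment quoted in a review comment below; the real implementation handles more cases, such as tags opened but never closed):

```python
def has_incomplete_tag_at_end(text: str) -> bool:
    # a '<' that appears after the last '>' means a tag is still being written
    return text.rfind("<") > text.rfind(">")

has_incomplete_tag_at_end('Sure. <emotion value="hap')             # True: hold the buffer
has_incomplete_tag_at_end('Sure. <emotion value="happy"/> Okay.')  # False: safe to split
```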

@chenghao-mou requested a review from a team May 4, 2026 03:31

@theomonnom force-pushed the theo/expressiveness-mode branch from bea7e82 to c36f944 May 4, 2026 03:58
devin-ai-integration (bot) left a comment

Devin Review found 1 new potential issue.

View 11 additional findings in Devin Review.


Comment on lines 323 to +325

```diff
 if text_transforms:
-    input = _apply_text_transforms(input, text_transforms)
+    # text transforms only apply to plain text mode (no structured output)
+    input = _apply_text_transforms(input, text_transforms)  # type: ignore[arg-type]
```
devin-ai-integration (bot) commented May 4, 2026

🔴 Text transforms crash at runtime when llm_output_format sends BaseModel objects through the TTS pipeline

When llm_output_format is set on an Agent, _llm_inference_task (generation.py:219-225) sends BaseModel objects through text_ch. These flow into _tts_inference_task where _apply_text_transforms is applied unconditionally at line 325. The default text transforms (filter_markdown and filter_emoji) perform string operations like buffer += chunk (filters.py:103) and EMOJI_PATTERN.sub("", chunk) (filters.py:156) that will raise TypeError when chunk is a BaseModel instead of str.

Since DEFAULT_TTS_TEXT_TRANSFORMS = ["filter_markdown", "filter_emoji"] is always active by default, any Agent using llm_output_format will crash at runtime unless the user explicitly sets tts_text_transforms=None. The comment at line 324 acknowledges the incompatibility but no guard is implemented.
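
One possible shape for a guard (purely illustrative, not the PR's fix; assumes the structured-output setting is visible at this call site):

```python
# hypothetical guard: string transforms only make sense for str chunks
if text_transforms and llm_output_format is None:
    input = _apply_text_transforms(input, text_transforms)
```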


```diff
     current_span.set_attribute(trace_types.ATTR_SPEECH_ID, speech_handle.id)
     if instructions is not None:
-        current_span.set_attribute(trace_types.ATTR_INSTRUCTIONS, instructions)
+        current_span.set_attribute(trace_types.ATTR_INSTRUCTIONS, str(instructions))
```
Contributor

this only adds the common part to the trace but not the version used for this turn?

Member Author

Fixed — trace now shows the modality-resolved text, not just the common part.

Comment on lines +2426 to +2478

```diff
-chat_ctx.add_message(role="system", content=[instructions])
+# re-resolve instructions for the current turn's modality
+turn_modality = speech_handle.input_details.modality
+turn_instructions = instructions if instructions is not None else self._agent.instructions
```
Contributor

is it expected that this replaces the original instructions with the turn instructions entirely?


```python
Instructions("You are a helpful assistant.")

@property
def audio(self) -> str:
```
Member

this is breaking? IMO we should keep this just as a wrapper.. it's much easier to write instructions.text instead of instructions.as_modality('text')

Member Author

Instructions was supposed to be in beta, I'm not sure if anybody is using it

```python
    rtc.EventEmitter[Literal["metrics_collected", "error"] | TEvent],
    Generic[TEvent],
):
    class Markup:
```
Member

nit, not sure about the name. it wraps a TTS. why not expose these on TTS itself?

Member Author

It is the case tho?

tts.markup?

```python
    llm: NotGivenOr[llm.LLM | llm.RealtimeModel | LLMModels | str | None] = NOT_GIVEN,
    tts: NotGivenOr[tts.TTS | TTSModels | str | None] = NOT_GIVEN,
    mcp_servers: NotGivenOr[list[mcp.MCPServer] | None] = NOT_GIVEN,
    expressiveness: NotGivenOr[bool] = NOT_GIVEN,
```
Member

if a user wanted to override how they prompt the LLM for expressiveness. where should they do it?

should this be a bool | ExpressivenessOptions?

@theomonnom (Member Author) May 4, 2026

They do it inside the new AgentInstructions class

```python
        return self._interruption_detection

    @property
    def expressiveness(self) -> NotGivenOr[bool]:
```
Member

if we want options, then it'd be better to always return options vs a bool

```python
        str(instructions) if not isinstance(instructions, str) else instructions
    )

class _SafeFormatter(string.Formatter):
```
Member

nit: should this be a util?

devin-ai-integration (bot) left a comment

Devin Review found 1 new potential issue.

🐛 1 issue in files not directly in the diff

🐛 AgentConfigUpdate raises ValidationError when Agent.instructions is an Instructions object (livekit-agents/livekit/agents/voice/agent_activity.py:771)

At agent_activity.py:770-771, self._agent.instructions (typed as str | Instructions) is passed directly to llm.AgentConfigUpdate(instructions=...), whose field is typed str | None. The old Instructions class was a str subclass and had a custom __get_pydantic_core_schema__, so Pydantic accepted it. The refactored Instructions is a plain class with neither, so Pydantic v2 rejects it with ValidationError: Input should be a valid string. This crashes any agent created with Agent(instructions=Instructions(...)) when the activity starts.
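
A commit later in this branch notes "Fixes AgentConfigUpdate.instructions to use str instead of Instructions"; one minimal shape for that fix (illustrative only, the actual change may differ):

```python
# coerce at the call site so the pydantic str field validates
llm.AgentConfigUpdate(instructions=str(self._agent.instructions))
```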

View 12 additional findings in Devin Review.


devin-ai-integration (bot) left a comment

Devin Review found 2 new potential issues.

⚠️ 1 issue in files not directly in the diff

⚠️ Per-chunk markup stripping in streaming transcript fails when XML tags span LLM tokens (livekit-agents/livekit/agents/voice/agent_activity.py:2616-2617)

When expressiveness is enabled, _read_text at livekit-agents/livekit/agents/voice/agent_activity.py:2616-2617 calls self.tts.markup.to_text(chunk) on individual LLM output tokens. Since to_text uses regex to match complete XML tags (strip_xml_tags in livekit-agents/livekit/agents/tts/markup_utils.py:37), partial tags spanning multiple tokens (e.g., <emotion then value="happy"/>) won't be matched and will leak into the real-time user transcript. The final transcript stored in chat history at livekit-agents/livekit/agents/voice/agent_activity.py:2787-2788 IS correctly stripped because it operates on the full accumulated text, so only the streaming display is affected.
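
A later commit in this PR adds `TTS.Markup.to_text_stream()` for exactly this reason ("per-chunk stripping was a no-op on LLM token fragments"). A conceptual sketch of buffered stripping (illustrative, not the actual implementation):

```python
import re

def _strip(text: str) -> str:
    # drop complete tags, keep the spoken text
    return re.sub(r"<[^<>]+>", "", text)

async def buffered_strip(chunks):
    buf = ""
    async for chunk in chunks:
        buf += chunk
        # hold the buffer while a trailing tag may still be incomplete
        if buf.rfind("<") <= buf.rfind(">"):
            if stripped := _strip(buf):
                yield stripped
            buf = ""
    if buf and (stripped := _strip(buf)):
        yield stripped
```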

View 17 additional findings in Devin Review.


Comment on lines +126 to +133

```diff
+    def __str__(self) -> str:
+        return self.common
+
-        Both ``_audio_variant`` and ``_text_variant`` are preserved so this can
-        be called again for a different modality (e.g. across tool-call turns).
-        """
-        return Instructions(
-            audio=self._audio_variant,
-            text=self._text_variant,
-            _represent=self.audio if modality == "audio" else self.text,
-        )
+    def __repr__(self) -> str:
+        return f"Instructions({self.common!r})"
+
+    def __hash__(self) -> int:
+        return hash((self.common, self.audio, self.text))
```
Contributor

🟡 `Instructions.__eq__` with str violates hash contract

Instructions.__eq__ returns True when compared to a plain str with the same common value, but __hash__ produces a different value (it hashes a 3-tuple of (common, audio, text)). This violates Python's data model invariant: if a == b, then hash(a) == hash(b) must hold. This causes incorrect behavior when Instructions and str objects are mixed in sets or used as dict keys.

Demonstration:

```python
instr = Instructions("hello")
assert instr == "hello"  # True
assert hash(instr) == hash("hello")  # False! Violates contract

d = {"hello": 1}
d[instr]  # May raise KeyError despite instr == "hello"
```
Suggested change:

```diff
 def __hash__(self) -> int:
-    return hash((self.common, self.audio, self.text))
+    return hash(self.common)
```

devin-ai-integration (bot) left a comment

Devin Review found 3 new potential issues.

🐛 1 issue in files not directly in the diff

🐛 Transcription stream drops all content when llm_output_format is set, breaking chat history and user-facing transcripts (livekit-agents/livekit/agents/voice/agent_activity.py:2613-2615)

When Agent.llm_output_format is set for structured LLM output, _llm_inference_task sends BaseModel instances (not plain strings) to text_ch (livekit-agents/livekit/agents/voice/generation.py:224). This channel is tee'd into tts_text_input and tr_input. The TTS path correctly handles BaseModel in the default tts_node (livekit-agents/livekit/agents/voice/agent.py:542-549), extracting the response field delta. However, the transcription path's _read_text wrapper unconditionally skips all BaseModel instances (isinstance(chunk, (FlushSentinel, BaseModel)): continue), yielding nothing to the transcription_node. This means text_out.text is empty, which cascades to forwarded_text being empty, so the assistant's message is never added to chat_ctx — breaking conversation history, user-facing transcription, and any downstream logic that depends on the assistant message existing in the chat context.

View 20 additional findings in Devin Review.


Comment on lines +22 to +25
```python
# incomplete tag at end: < without matching >
last_open = text.rfind("<")
last_close = text.rfind(">")
if last_open > last_close:
```
Contributor

🟡 _has_unclosed_xml_tags false positive on < in non-XML text prevents sentence splitting

The _has_unclosed_xml_tags function in token_stream.py returns True whenever the text contains a < that appears after the last >, even in regular prose like "the price is < 5 dollars. That's cheap.". The check at lines 23-25 (last_open = text.rfind("<"); last_close = text.rfind(">"); if last_open > last_close: return True) fires for any bare < character, causing the streaming tokenizer to hold the entire buffer and never split sentences. This could stall TTS output for any text containing mathematical comparisons, template syntax, or other non-XML uses of <.


Comment on lines +59 to +68
```python
def __init__(
    self,
    common: str = "",
    *,
    audio: str | None = None,
    text: str | None = None,
) -> None:
    self.common = common
    self.audio = audio
    self.text = text
```
Contributor

🔴 Workflow tasks still use old Instructions(audio_text, text=text_text) positional-arg pattern

The Instructions.__init__ signature changed from (audio, *, text=None) to (common='', *, audio=None, text=None). Several workflow files that were NOT updated by this PR still construct Instructions with the old pattern where the first positional arg was the audio-specific variant:

Affected files
  • livekit-agents/livekit/agents/beta/workflows/phone_number.py:82-96
  • livekit-agents/livekit/agents/beta/workflows/dob.py:89-105
  • livekit-agents/livekit/agents/beta/workflows/name.py:114-133
  • livekit-agents/livekit/agents/beta/workflows/credit_card.py:165-177, :293-..., :388-...

With the old API, Instructions(audio_text, text=text_text) stored audio_text as the audio variant and text_text as the text variant — mutually exclusive. With the new API, audio_text becomes the common field (included in ALL modalities) and text_text becomes the text addition (appended to common). So render(modality="text") now returns audio_text + "\n\n" + text_text — concatenating audio-specific and text-specific instructions together, which is incorrect and will produce garbled LLM prompts in text mode.

Prompt for agents
The Instructions constructor was changed from Instructions(audio, *, text=None) to Instructions(common, *, audio=None, text=None), but several workflow files still use the old positional-arg pattern Instructions(audio_text, text=text_text). These files need to be updated to the new signature. The correct migration for each call site like Instructions(audio_text, text=text_text) would be Instructions(common='', audio=audio_text, text=text_text) — making common empty and placing the modality-specific text in the audio and text params. Alternatively, these workflow tasks could be refactored to use the WorkflowInstructions/resolve() pattern like address.py and email_address.py were updated. Affected files: beta/workflows/phone_number.py, beta/workflows/dob.py, beta/workflows/name.py, beta/workflows/credit_card.py.

theomonnom added 20 commits May 7, 2026 18:47
- Add expressiveness flag (Agent + AgentSession) that auto-injects TTS
  markup instructions and speaker context into LLM system messages
- Rework Instructions from str subclass to stateless class with
  common/audio/text fields. No Pydantic dependency, no runtime state.
- Add AgentInstructions with expressiveness templates, WorkflowInstructions
  replaces InstructionParts
- Add TTS Markup inner class (llm_instructions + to_text) with shared
  _provider_format.py for Cartesia/ElevenLabs
- Add RecognizeStream.context + SpeakerContext protocol for STT metadata
- Privatize AudioRecognition, expose only stt_context
- Add llm_output_format class-level attribute for structured LLM output
  with streaming JSON partial parsing
- Add llm.Response annotation, ChatMessage.llm_output field
- Validate all llm_output_format fields have defaults at class definition
BufferedTokenStream now holds back tokens that contain unclosed XML tags,
preventing sentence splits inside markup like <spell>U.S.A.</spell>.
Batch path in blingfire also merges split-tag sentences.

Removes unused TagAwareBuffer — tokenizer handles it natively.
Fixes AgentConfigUpdate.instructions to use str instead of Instructions.
21 regression tests for batch + streaming with all TTS tag patterns.
Covers batch + streaming paths with: self-closing tags, wrapping tags,
periods in attributes and content, abbreviations (U.S.A., N.A.S.A.),
phoneme with IPA/arpabet, chunk boundary splits, char-by-char streaming,
unicode (French, Chinese, emoji), mixed tags, and a realistic
multi-sentence conversation.
…cised

Blingfire doesn't split tiny fragments. Tests now use realistic
multi-sentence content inside tags so splits actually trigger and
the XML-aware merge is verified.
…failures

- ChatContext only stores str, never Instructions objects
- Per-turn modality resolution only when Instructions has audio/text variants
- Plain str instructions pass through unchanged (no re-resolution)
- Revert unintended fake_llm changes
- Fix add_message to resolve Instructions to str
…with render(), improved provider prompts

- ExpressivenessOptions moved to agent_session.py as TypedDict with DEFAULT_EXPRESSIVENESS_OPTIONS
- Instructions: removed format/as_modality/__add__, added render(modality, data) returning str
- Instructions: added resolve_template() static method for workflow modality-aware composition
- safe_render utility in utils/misc.py with nested dict→SimpleNamespace, error logging with full dotted paths
- Template data uses explicit dicts with proper namespaces (tts.markup.llm_instructions, audio_recognition.stt_context.emotion)
- AudioRecognition.llm_instructions() method matching tts.markup.llm_instructions() API
- Cartesia prompt: complete 62 emotion list, examples, XML format explained
- ElevenLabs prompt: normalization rules, SSML tags, examples
- Removed _concat_optional, _safe_format, AgentInstructions
…Labs v3)

<expression value="..."/> is the XML bridge for providers that use []
brackets natively. The LLM always generates XML, plugins convert to
native format before sending to API.

- Cartesia: native XML, no conversion needed
- ElevenLabs v2: native SSML, no conversion
- ElevenLabs v3: <expression> → [laughs], [whispers], etc.
- Inworld TTS 2: <expression> → [say excitedly], [laugh], etc.

Added TTS.Markup.convert() method, convert_expression_tags() and
strip_bracket_tags() helpers, complete provider prompts with examples.
…ting self._markup

Base TTS.__init__ calls self.Markup(self) automatically. Plugins just
define their Markup inner class — no manual self._markup assignment needed.
- Complete steering prompt with free-form delivery, non-verbals, breaks, emphasis
- Based on Inworld TTS 2 docs (steering, prompting best practices)
- Add inworld-tts-2 to inference gateway models
- Gateway detects TTS 2 vs older models (only TTS 2 supports steering)
- Remove IPA and asterisk emphasis from prompt (framework doesn't strip these)
…xamples

Before: cartesia ~464, elevenlabs ~301, elevenlabs_v3 ~363, inworld ~905 tokens
After:  cartesia ~294, elevenlabs ~110, elevenlabs_v3 ~174, inworld ~349 tokens
…veness

11 delivery styles from casual to extreme, practical non-verbal examples,
conversational and emotional range the LLM can reference.
Prevents sending chunks exceeding provider limits (e.g. Inworld 1000 chars).
Splits at sentence boundaries — never mid-sentence.
max_input_len dict in _provider_format.py, used by inference gateway.
Switched gateway from basic to blingfire tokenizer.
The convert() was called on individual LLM tokens — too early, the regex
never saw complete <expression> tags. Now converts after the sentence
tokenizer accumulates full sentences, right before sending to the API.

Removed token-level convert from default tts_node.
Added debug logging in gateway showing converted text sent to API.
Cleaned up debug prints from drive-thru example.
theomonnom and others added 4 commits May 7, 2026 18:47
All examples now layer mood + energy + pacing + vocal style.
Added singing example. Removed bland short labels the LLM was copying.
…ributes

[^>]* greedily consumed the / before >, so <expression value="..."/>
was never detected as self-closing. The tokenizer thought every tag was
unclosed and held the entire buffer, causing all text to merge into one
chunk (hitting Inworld's 1000 char limit).

Fix: [^>]*? (non-greedy) stops before the /.
Verifies that <expression value="..."/> and similar self-closing tags
with attributes do NOT block sentence splitting (batch + streaming).
…eline

Tokenizer:
- Strip XML tags before blingfire sentence detection (blingfire fails to
  split when /> sits between sentences)
- Remap clean-text offsets back to original text so tags stay with their
  sentence
- Merge pass for unclosed/tag-only sentences
- Guard blingfire against empty text input
- Try/except fallback returns unsplit text on any tokenizer failure
- All XML logic centralized in _xml_wrap_tokenizer, blingfire stays pure

Markup pipeline:
- TTS.Markup.to_text_stream() for buffered streaming markup stripping
  (per-chunk stripping was a no-op on LLM token fragments)
- TTS.Markup.normalize() for fixing unclosed self-closing tags per provider
- Provider normalize config in _provider_format._SELF_CLOSING_TAGS
- <sound> tag support for non-verbal sounds (converts to [] like expression)
- Plugins (cartesia, elevenlabs, inworld) updated with normalize()

Provider prompts:
- Rewritten Inworld, Cartesia, ElevenLabs v3 prompts with tested examples
- Per-sentence delivery tags, non-verbal sounds, thinking patterns
- Prompts tested with gpt-4.1-mini across drive-thru and frontdesk scenarios

Examples:
- Drive-thru and frontdesk updated with natural speech instructions
- Both use inworld-tts-2 with expressiveness=True

Co-Authored-By: Théo Monnom <theo@livekit.io>
@theomonnom force-pushed the theo/expressiveness-mode branch from 26288b4 to 4962032 May 8, 2026 02:53