[Platform] Introduce `Speech` support by Guikingone · Pull Request #943 · symfony/ai

Guikingone · 2025-11-22T18:04:39Z

Q	A
Bug fix?	no
New feature?	yes
Docs?	yes
Issues	--
License	MIT

Introduce support for TTS, STT and STS for agents
Add new configuration options to configure "speech" at agent-level.

Example for an OpenAI-based STS agent:

ai:
    platform:
        elevenlabs:
            api_key: '%env(ELEVEN_LABS_API_KEY)%'
        openai:
            api_key: '%env(OPENAI_API_KEY)%'

    agent:
        sts_openai:
            platform: ai.platform.openai
            model: gpt-4o
            speech:
                enabled: true
                platform: ai.platform.elevenlabs
                tts_model: eleven_multilingual_v2
                tts_options:
                    voice_id: some-voice-id
                stt_model: scribe_v1

Agents can be TTS-only or STT-only, independently
A SpeechConfiguration object handles the speech configuration
A SpeechProcessor handles the input/output

OskarStark · 2025-11-23T09:30:23Z

To me, we should maybe introduce capabilities to platforms as well, rather than having a voice component. As far as I understand, I cannot use the Voice component standalone, right?

I don't think a dedicated component is the way to go here.

Guikingone · 2025-11-23T09:32:20Z

We can introduce it via the Platform, which could be easier. The voice can be used without agents, but it will require the Platform at least.

Will update the PR to match this approach 👍🏻

OskarStark · 2025-11-23T09:33:36Z

I agree, Agent scope is not needed 👍🏻

chr-hertel · 2025-11-23T10:29:49Z

Hi @Guikingone, i agree that we lack some kind of guidance on how voices work - but the same goes for other binary stuff like creating images or videos.

so two things i would like to understand

what's the high-level goal here - like what do you want to build?
why is it an extra component and not part of Platform?

btw, "speech" is more common than "voice" isn't it?
btw2, have you seen the demo around audio and video?

Guikingone · 2025-11-23T10:35:36Z

> what's the high-level goal here - like what do you want to build?

The main goal is to let an agent/platform "listen" and respond to inputs through voice / speech ("voice" is used as sugar here and could be renamed to "speech"). This creates a workflow where you can submit voice, call the platform that transforms it to speech / text (depending on the situation you're in), and return the result to the user without friction.

> why is it an extra component and not part of Platform?

It is now part of Platform, I just pushed an update on it following the comment from @OskarStark.

> btw, "speech" is more common than "voice" isn't it?

Agreed, could be renamed to Speech.

> btw2, have you seen the demo around audio and video?

Yes, the goal is to ease it with a "built-in" approach / API that stays transparent for the user.

chr-hertel · 2025-11-23T10:50:15Z

just realized we should rename the "audio" demo to "speech" as well - and i'm def not really happy with that solution there.

can we make it as easy as the structured output - like with a listener?

i like that starting point:

$result = $platform->invoke('eleven_multilingual_v2', new Text('Hello world'), [
    'voice' => 'Dslrhjl3ZpzrctukrQSN', // Brad (https://elevenlabs.io/app/voice-library?voiceId=Dslrhjl3ZpzrctukrQSN)
]);

echo $result->asVoice();

what would be the return type here? would it be the same as asBinary() or asDataUri()?

Guikingone · 2025-11-23T14:42:20Z

> can we make it as easy as the structured output - like with a listener?

Could be something to explore, the API is not locked for now.

> what would be the return type here? would it be the same as asBinary() or asDataUri()?

My first approach was to do the same thing as asBinary to ease the usage.
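To make the asBinary() versus asDataUri() question concrete, here is a minimal sketch in plain PHP of what the two representations generally look like. The `toDataUri()` helper and the placeholder bytes are assumptions made for illustration; this is not the symfony/ai implementation of either method.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper (not part of symfony/ai): wraps raw bytes into a data URI.
 */
function toDataUri(string $binary, string $mimeType): string
{
    // RFC 2397 layout: data:<mime type>;base64,<payload>
    return sprintf('data:%s;base64,%s', $mimeType, base64_encode($binary));
}

// Placeholder bytes standing in for the audio returned by a TTS call.
$binary = 'abc';

// The raw form is what you would write to a file or stream as-is.
$rawLength = strlen($binary);

// The data URI form can be embedded directly, e.g. in an <audio src="..."> attribute.
$dataUri = toDataUri($binary, 'audio/mpeg');
```

Returning the raw bytes keeps the result easy to store, while a data URI is convenient for embedding in HTML; which one a speech accessor should mirror is exactly the open question in this exchange.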

This PR was merged into the main branch.

Discussion
----------

[Demo][Website] Rename audio demo to speech

Q	A
Bug fix?	no
New feature?	no
Docs?	
Issues	
License	MIT

Following a discussion of #943

Commits
-------

ffc2b64 Rename audio demo to speech

Guikingone · 2025-11-25T12:56:56Z

Well, might seem weird but here we go: stt, tts and sts are working like a charm ... 👀

Guikingone · 2026-01-21T08:30:41Z

Hi @chr-hertel / @OskarStark 👋🏻

Friendly ping on this PR: should I keep rebasing it / targeting 0.3 - 0.4, or should we roll it back to Draft status?

OskarStark · 2026-01-22T11:13:51Z

@chr-hertel will have a look soon, not sure it will land in 0.3, let's keep it for now

Guikingone · 2026-02-17T18:35:52Z

Hi @OskarStark @chr-hertel, yes, I know, again 😄

I think that this time, that's the one. While thinking about #1572 and the comment from Chris, I thought about this PR, and the listener approach didn't look like "THE" solution, especially since we have the processors. So I asked Claude (yes, sometimes asking for an external opinion can lead to a solution) for a "reworked implementation" that could ease the user experience and the maintenance. It submitted a solution close to the processors and I did the final tweaking.

So, what changed?

Now, the speech configuration is moved where it needs to be, at the Agent level: each agent can specify which platform to use, the options and so on. No more "in or out option"; the agent handles the configuration. Simple, clever, straight to the point.

The Platform is still an important aspect of the PR (as the configuration is in the Platform but shared with the Agent). I also simplified the DI experience along with the code in the SpeechProcessor (the main entry point for speech now).

I updated the examples and reworked the documentation; much better, it makes more sense IMHO to be like that.

I'll let you take a look at it and review it if you think it deserves to be reviewed. Is #1572 needed anymore? Tough question. If this PR is merged, probably not; at least, I don't see a use case that could require it (as speech is now at the agent level), except for the validation/evaluation part (for now). Probably another topic for another day 😄

Guikingone · 2026-03-07T16:45:12Z

A set of small improvements on the API, smaller scope for result support of Speech.

CI is failing due to vektor bridge, not tied to the PR itself.

chr-hertel · 2026-03-15T20:59:08Z

Can we reduce this to be just a decorator for an AgentInterface? like SpeakingAgent or SpeechAgent?
and document how to wire it as a service instead of extending the agent config in bundle?
would reduce the impact and effort here.
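The decorator idea raised here can be sketched in plain PHP as follows. Every type in this block is a hypothetical, simplified stand-in: the real AgentInterface works with message bags and result objects rather than strings, and the `SketchSynthesizerInterface` contract is an assumption used only to represent a text-to-speech platform call.

```php
<?php

declare(strict_types=1);

// Hypothetical, simplified stand-in for the real AgentInterface.
interface SketchAgentInterface
{
    public function call(string $input): string;
}

// Hypothetical text-to-speech contract standing in for a platform call.
interface SketchSynthesizerInterface
{
    public function synthesize(string $text): string;
}

// The decorator: same interface as the wrapped agent, with speech added on top.
final class SketchSpeechAgent implements SketchAgentInterface
{
    public function __construct(
        private readonly SketchAgentInterface $inner,
        private readonly SketchSynthesizerInterface $synthesizer,
    ) {
    }

    public function call(string $input): string
    {
        // Delegate to the decorated agent, then convert its text answer to audio.
        return $this->synthesizer->synthesize($this->inner->call($input));
    }
}

// Trivial fakes so the sketch runs without any external service.
final class EchoAgent implements SketchAgentInterface
{
    public function call(string $input): string
    {
        return 'You said: '.$input;
    }
}

final class FakeSynthesizer implements SketchSynthesizerInterface
{
    public function synthesize(string $text): string
    {
        return '[audio]'.$text;
    }
}

$agent = new SketchSpeechAgent(new EchoAgent(), new FakeSynthesizer());
$output = $agent->call('hello');
```

Because the decorator implements the same interface as the agent it wraps, callers do not need to know whether speech is enabled, which is the reduced impact this comment is asking for.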

Guikingone · 2026-03-16T12:05:28Z

Could be a solution. The only issue is that it will not be "as smooth" as in the PR, which benefits from the configuration and the auto-injection of the processor. I'll take a look in that direction but can't promise anything this week.

Guikingone · 2026-03-28T13:07:10Z

Hi @chr-hertel, hope you're fine 👋🏻

I pushed the commit that contains the decorator approach. You were right about this approach: it makes sense, as it removes the unnecessary configuration in the agent section. I also updated the documentation to explain how to use it (I can add an extra section for the injection using Autowire if needed).

I'll let you review it when you have time 🙂

PS: If the PR is good for you, I'll fully squash it before you merge it, this way we get a clean history.

chr-hertel

Hey @Guikingone, yes i like - thanks for keeping this alive - i feel like we're on a good track.

Some things to tackle from my point of view:

I think we should merge the additional methods of Speech into the DeferredResult and get rid of the extra data layer there, like Speech itself and that SpeechAwareInterface - i think it is basically only asDataUri missing there - would slim this down more and reduce the impact (e.g. not having SpeechAwareTrait in TextResult)
Funnily enough @OskarStark and I just had a discussion yesterday about adopting Options (see notifier) or the OptionsResolver and decided against it since it is also limiting users - in this case i don't even know why we can't just have discrete properties in the SpeechConfiguration - it plays well with the extensibility of form component, but I don't see the same case here at this point
=> removing symfony/options-resolver in favor of well-defined SpeechConfiguration

Things I'm currently unsure about:

Extending agent's bundle config vs. standalone service registration - I'm slightly in favor of having it in the agent like you did, yes - not entirely sure tho
Extending the MessageBag - that thing is growing - don't have a good alternative and see why you decided for it

My thoughts for now on rather high level - thanks again!
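As an illustration of the "discrete properties instead of OptionsResolver" suggestion above, a configuration object could look roughly like the sketch below. The class name, property names and helper methods are assumptions that mirror the `tts_model`, `tts_options` and `stt_model` keys from the PR description; they are not taken from the merged SpeechConfiguration.

```php
<?php

declare(strict_types=1);

// Hypothetical shape: names mirror the YAML keys from the PR description.
final class SketchSpeechConfiguration
{
    /**
     * @param array<string, mixed> $ttsOptions provider-specific options, e.g. a voice id
     */
    public function __construct(
        public readonly ?string $ttsModel = null,
        public readonly array $ttsOptions = [],
        public readonly ?string $sttModel = null,
    ) {
    }

    // Text-to-speech is considered enabled once a TTS model is set.
    public function supportsTextToSpeech(): bool
    {
        return null !== $this->ttsModel;
    }

    // Speech-to-text is considered enabled once an STT model is set.
    public function supportsSpeechToText(): bool
    {
        return null !== $this->sttModel;
    }
}

// A TTS-only configuration: no STT model is provided.
$config = new SketchSpeechConfiguration(
    ttsModel: 'eleven_multilingual_v2',
    ttsOptions: ['voice_id' => 'some-voice-id'],
);
```

With typed, named properties, a misspelled option fails at construction time and static analysis can see every field, without an options-resolving layer in between.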

Guikingone · 2026-03-28T14:26:48Z

I'll take a look at the points you highlighted. I can get rid of the OptionsResolver, not much work on this side. I agree on the methods for speech; they could be merged into the result too.

Regarding the MessageBag, I don't know, I don't have a strong opinion on it. One solution might be to decorate it, but for which use case? To do what and with which methods? 😅

For the configuration, my favorite approach would be to extend the existing one rather than depending on the DIC. I think that having it through service decoration is kind of an anti-pattern and can be easily simplified by the configuration: if you have multiple "speech agents", you'd be forced to declare them one by one while, at the same time, keeping track of your agents in the configuration. Looks like a "wrong turn" to take 🤔

chr-hertel · 2026-03-28T18:12:03Z

Alright, let's keep MessageBag and config like you changed already 👍

Guikingone · 2026-03-29T08:53:47Z

Here's the "final draft" using the decorator and the configuration. I also updated the demo to use the new SpeechAgent (it's fully transparent for the end user). The documentation contains the detailed process on how to use the speech support at both the Agent and AiBundle levels, and the tests are updated as well.

Guikingone · 2026-04-01T17:59:01Z

@chr-hertel I removed SpeechResult and updated the doc along with the examples, ready for review when you have time 🙂

chr-hertel

Thanks @Guikingone - will patch some inconsistencies across Cartesia, ElevenLabs and OpenAI TTS regarding support of plain string vs Text instances while merging - and my last comment about the text vs speech in metadata was confusing - will flip that as well.

Thanks again - great improvement! 👍

Guikingone · 2026-04-05T17:10:08Z

Oops, sorry, pushed the fix for the CHANGELOG.md at the same time 😅

chr-hertel · 2026-04-05T17:10:59Z

@Guikingone all good 😂 the risk of me force pushing into someone else's branches 😂

chr-hertel · 2026-04-05T17:34:57Z

Thank you @Guikingone.

Guikingone force-pushed the agent/voice_provider branch from 2c573eb to 8dd5cd5 on November 23, 2025 09:30

Guikingone changed the title ~~[Voice] Introduce the component~~ [Platform] Introduce VoiceProviders and VoiceListeners Nov 23, 2025

Guikingone changed the title ~~[Platform] Introduce VoiceProviders and VoiceListeners~~ [Platform] Introduce Speech support via Platform Nov 23, 2025

OskarStark reviewed Nov 23, 2025

Comment thread src/agent/composer.json Outdated

Guikingone force-pushed the agent/voice_provider branch from 79ddf87 to f011c3e on November 23, 2025 17:41

chr-hertel mentioned this pull request Nov 23, 2025

[Demo][Website] Rename audio demo to speech #958

Merged

Guikingone force-pushed the agent/voice_provider branch from dcae952 to be04280 on November 24, 2025 14:32

OskarStark changed the title ~~[Platform] Introduce Speech support via Platform~~ [Platform] Introduce Speech support Nov 24, 2025

Guikingone force-pushed the agent/voice_provider branch from be04280 to b319521 on November 25, 2025 12:49

Guikingone force-pushed the agent/voice_provider branch 3 times, most recently from 120f391 to 1963409 on November 26, 2025 12:42

Guikingone marked this pull request as ready for review November 26, 2025 12:44

Guikingone requested review from Nyholm and chr-hertel as code owners November 26, 2025 12:44

carsonbot added the Feature, Platform and Status: Needs Review labels Nov 26, 2025

Guikingone marked this pull request as draft November 26, 2025 12:46

Guikingone marked this pull request as ready for review November 26, 2025 13:00

Guikingone force-pushed the agent/voice_provider branch from be85dda to 74bd8cb on November 26, 2025 13:00

Guikingone force-pushed the agent/voice_provider branch 2 times, most recently from 75fd1df to 0c00d21 on January 16, 2026 10:36

Guikingone mentioned this pull request Jan 18, 2026

[Platform] ElevenLabs definitions rework #1273

Closed

Guikingone force-pushed the agent/voice_provider branch from c8a8615 to 537c382 on January 18, 2026 18:29

Guikingone force-pushed the agent/voice_provider branch from 537c382 to aaece69 on January 22, 2026 09:13

Guikingone force-pushed the agent/voice_provider branch 5 times, most recently from ddb5904 to 5a0a9a2 on January 28, 2026 15:24

This was referenced Feb 9, 2026

[Agent] Introduce capabilities #1572

Open

[Agent] Add context compression strategies #1549

Closed

chr-hertel requested changes Mar 28, 2026

chr-hertel requested changes Mar 31, 2026

Comment thread src/platform/src/Bridge/ElevenLabs/composer.json Outdated

Comment thread src/agent/src/SpeechAgent.php Outdated

Comment thread src/agent/src/SpeechAgent.php

Comment thread src/agent/src/SpeechAgent.php Outdated

chr-hertel approved these changes Apr 5, 2026

[Platform] Introduce Speech support

c5b30d6

4 participants
