[Platform] Introduce Speech support #943
|
To me, we should maybe also introduce capabilities to platforms rather than having a voice component. As far as I understand, I cannot use the Voice component standalone, right? I don't think a dedicated component is the way to go here |
|
We can introduce it via the Platform, could be easier, the voice can be used without agents but it will require the Will update the PR to match this approach 👍🏻 |
|
I agree, Agent scope is not needed 👍🏻 |
VoiceProviders and VoiceListeners
|
Hi @Guikingone, i agree that we lack some kind of guidance on how voices work - but same goes for other binary stuff like creating images or videos. so two things i would like to understand
btw, "speech" is more common than "voice" isn't it? |
The main goal is to add the capacity to have an agent/platform that can "listen" and answer inputs through voice / speech (voice is used as sugar here, could be renamed to speech), creating a workflow where you can submit voice, call the platform that transforms it to speech / text (depending on the situation you're in), and return it to the user without friction.
It is now part of
Agreed, could be renamed to
Yes, the goal is to ease it with a "built-in" approach / API that stays transparent for the user. |
VoiceProviders and VoiceListeners → Speech support via Platform
|
just realized we should rename the "audio" demo to "speech" as well - and i'm def not really happy with that solution there. can we make it as easy as the structured output - like with a listener? i like that starting point:
$result = $platform->invoke('eleven_multilingual_v2', new Text('Hello world'), [
    'voice' => 'Dslrhjl3ZpzrctukrQSN', // Brad (https://elevenlabs.io/app/voice-library?voiceId=Dslrhjl3ZpzrctukrQSN)
]);
echo $result->asVoice();
what would be the return type here? would it be same as |
Could be something to explore, the API is not locked for now.
My first approach was to do the same thing as |
Speech support via Platform → Speech support
|
Well, might seem weird but here we go, |
|
Hi @chr-hertel / @OskarStark 👋🏻 Friendly ping on this PR, should I keep rebasing it / targeting |
|
@chr-hertel will have a look soon, not sure it will land in 0.3, let's keep it for now |
|
Hi @OskarStark @chr-hertel, yes, I know, again 😄 I think that this time, that's the one. While thinking about #1572 and the comment from chris, I thought about this PR, and the listener approach didn't look like "THE" solution, especially while we have the processors. So I asked Claude (yes, sometimes, asking for an external opinion might lead to a solution) for a "reworked implementation" that could ease the user experience and the maintenance of it; it submitted a solution close to the processors and I did the final tweaking. So, what changed? Now, the speech configuration is moved where it needs to be, at the The I updated the examples and reworked the documentation, much better, makes more sense IMHO to be like that. I'll let you take a look at it and review it if you think it deserves to be reviewed. Is #1572 needed anymore? Thought question, if this PR is merged, probably not, at least, I don't see a use case except for the validation/evaluation part (for now) that could require it (as speech is now at the agent level), probably another topic for another day 😄 |
|
A set of small improvements on the API, smaller scope for result support of Speech. CI is failing due to |
|
Can we reduce this to be just a decorator for an AgentInterface? like |
|
Could be a solution, the only issue it will not be "as smooth" as in the PR thanks to the configuration and the auto-injection of the processor, I'll take a look in that direction but can't promise anything this week. |
|
Hi @chr-hertel, hope you're fine 👋🏻 I pushed the commit that contains the decorator approach, you were right about this approach, it makes sense as it removes the unnecessary configuration in the agent section, I also updated the documentation to explain how to use it (I can add an extra section for the injection using I'll let you review it when you have time 🙂 PS: If the PR is good for you, I'll fully squash it before you merge it, this way we get a clean history. |
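For readers following the decorator idea discussed above, here is a minimal sketch of what "just a decorator for an AgentInterface" could look like. It is not the code from this PR: the `AgentInterface` below is a simplified local stand-in (the real interface works with message bags and results), and `SpeechSynthesizerInterface`, `FakeSynthesizer`, `EchoAgent`, `SpeechAgent` and `lastSpeech()` are hypothetical names used only for illustration.

```php
<?php

declare(strict_types=1);

// Simplified, hypothetical stand-in for the library's AgentInterface.
interface AgentInterface
{
    public function call(string $input): string;
}

// Hypothetical text-to-speech abstraction (assumption, not part of the PR).
interface SpeechSynthesizerInterface
{
    /** Returns raw audio bytes for the given text. */
    public function synthesize(string $text): string;
}

// Trivial inner agent so the example runs without any platform.
final class EchoAgent implements AgentInterface
{
    public function call(string $input): string
    {
        return 'You said: '.$input;
    }
}

// Fake synthesizer; a real one would call a TTS platform.
final class FakeSynthesizer implements SpeechSynthesizerInterface
{
    public function synthesize(string $text): string
    {
        return 'AUDIO('.$text.')';
    }
}

// The decorator: same interface as the inner agent, speech added on the way out.
final class SpeechAgent implements AgentInterface
{
    private ?string $lastSpeech = null;

    public function __construct(
        private readonly AgentInterface $inner,
        private readonly SpeechSynthesizerInterface $synthesizer,
    ) {
    }

    public function call(string $input): string
    {
        $text = $this->inner->call($input);
        $this->lastSpeech = $this->synthesizer->synthesize($text);

        return $text;
    }

    public function lastSpeech(): ?string
    {
        return $this->lastSpeech;
    }
}

$agent = new SpeechAgent(new EchoAgent(), new FakeSynthesizer());
$text = $agent->call('Hello');
```

Because the decorator implements the same interface as the agent it wraps, callers that only know `AgentInterface` keep working, and the speech concern stays out of the agent's own configuration.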
chr-hertel
left a comment
Hey @Guikingone, yes i like - thanks for keeping this alive - i feel like we're on a good track.
Some things to tackle from my point of view:
- I think we should merge the additional methods of Speech into the DeferredResult and get rid of the extra data layer there, like Speech itself and that SpeechAwareInterface - i think it is basically only asDataUri missing there - would slim this down more and reduce the impact (e.g. not having SpeechAwareTrait in TextResult)
- Funnily enough @OskarStark and I just had a discussion yesterday about adopting Options (see notifier) or the OptionsResolver and decided against it since it is also limiting users - in this case i don't even know why we can't just have discrete properties in the SpeechConfiguration - it plays well with the extensibility of form component, but I don't see the same case here at this point
  => removing symfony/options-resolver in favor of well-defined SpeechConfiguration
Things I'm currently unsure about:
- Extending agent's bundle config vs. standalone service registration - I'm slightly in favor of having it in the agent like you did, yes - not entirely sure tho
- Extending the MessageBag - that thing is growing - don't have a good alternative and see why you decided for it
My thoughts for now on rather high level - thanks again!
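To illustrate the two review points above (discrete properties instead of an options resolver, and an `asDataUri`-style helper), here is a small sketch. The property names `ttsModel`, `voice` and `sttModel` are assumptions chosen for the example and are not taken from the merged `SpeechConfiguration`; the free function `asDataUri()` is likewise a hypothetical helper, not the method on `DeferredResult`.

```php
<?php

declare(strict_types=1);

// Hypothetical shape: property names are illustrative, not the merged API.
final class SpeechConfiguration
{
    public function __construct(
        public readonly string $ttsModel,
        public readonly ?string $voice = null,
        public readonly ?string $sttModel = null,
    ) {
        // Validation lives in the constructor instead of an options resolver.
        if ('' === $ttsModel) {
            throw new \InvalidArgumentException('The TTS model must not be empty.');
        }
    }
}

// Hypothetical helper: wraps binary audio in an RFC 2397 data URI.
function asDataUri(string $bytes, string $mimeType): string
{
    return 'data:'.$mimeType.';base64,'.base64_encode($bytes);
}

$config = new SpeechConfiguration(
    ttsModel: 'eleven_multilingual_v2',
    voice: 'Dslrhjl3ZpzrctukrQSN',
);

$uri = asDataUri('abc', 'audio/mpeg');
```

Typed, named constructor arguments give IDE completion and static analysis for free, which is the trade-off the review favors over a resolver-based options array.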
|
I'll take a look at the points you highlighted, I can get rid of the Regarding the For the configuration, my favorite approach would be to extend the existing one rather than depending on the DIC. I think that having it through service decoration is kind of an anti-pattern and can be easily simplified by the configuration: if you have multiple "speech agents", you'd be forced to declare them one by one while at the same time keeping track of your agents in the configuration, which looks like a "wrong turn" to take 🤔 |
|
Alright, let's keep MessageBag and config like you changed already 👍 |
|
Here's the "final draft" using the decorator and the configuration, I also updated the |
|
@chr-hertel I removed |
chr-hertel
left a comment
Thanks @Guikingone - will patch some inconsistencies across Cartesia, ElevenLabs and OpenAI TTS regarding support of plain string vs Text instances while merging - and my last comment about the text vs speech in metadata was confusing - will flip that as well.
Thanks again - great improvement! 👍
|
Oops, sorry, pushed the fix for the |
|
@Guikingone all good 😂 the risk of me force pushing into someone else's branches 😂 |
|
Thank you @Guikingone. |
TTS, STT and STS for agents
Example for an OpenAI-based STS agent:
TTS or STT independently
SpeechConfiguration object handles the speech configuration
SpeechProcessor handles the input/output