feat: add gemini/local speak skills, agentic gemini Go2 blueprint, an… by grmkris · Pull Request #2250 · dimensionalOS/dimos

grmkris · 2026-05-26T15:08:56Z

…d capture-viewer tool

Problem

Closes DIM-XXX

Solution

How to Test

Contributor License Agreement

I have read and approved the CLA.

…d capture-viewer tool Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

greptile-apps · 2026-05-26T15:14:27Z

Greptile Summary

This PR adds a Gemini-native TTS speak skill (GeminiSpeakSkill + GeminiTTSNode), a macOS say-backed fallback (LocalSpeakSkill), a camera-capture-and-upload skill (TakePictureSkill), a periodic map-upload module (MapUploader), Gemini multimodal embeddings in the deprecated image embedding provider, and a new unitree-go2-agentic-gemini blueprint that wires all of these together as a no-OpenAI / no-CUDA drop-in.

GeminiTTSNode / GeminiSpeakSkill: consume_text() is called on every utterance instead of once at startup, spawning an unbounded number of _process_queue daemon threads over the lifetime of a robot session; text_subject.on_next(text) also fires the completion signal before the audio PCM is actually queued to SounddeviceAudioOutput (both were flagged in previous rounds).
MapUploader: _on_costmap makes a synchronous httpx.post (30 s timeout) directly in the subscription callback, which can stall the costmap delivery thread during network issues.
LocalSpeakSkill / TakePictureSkill / blueprint / registry: straightforward and clean.
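
One way to keep the MapUploader callback from blocking the costmap stream is to hand the payload to a worker thread and keep only the newest pending map. The sketch below is an illustration under assumptions, not the PR's code: LatestOnlyUploader, on_costmap, and the upload callable are hypothetical names, and the upload callable stands in for whatever performs the HTTP POST.

```python
import threading
from typing import Callable, Optional


class LatestOnlyUploader:
    """Hypothetical helper: uploads off the stream thread, keeping only the newest payload."""

    def __init__(self, upload: Callable[[bytes], None]) -> None:
        self._upload = upload  # assumed to perform the (possibly slow) network call
        self._latest: Optional[bytes] = None
        self._cond = threading.Condition()
        self._stopped = False
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def on_costmap(self, payload: bytes) -> None:
        # Called from the stream thread: store the payload and return immediately.
        with self._cond:
            self._latest = payload  # overwrites any map that was not uploaded yet
            self._cond.notify()

    def stop(self) -> None:
        with self._cond:
            self._stopped = True
            self._cond.notify()
        self._worker.join(timeout=2.0)

    def _run(self) -> None:
        while True:
            with self._cond:
                while self._latest is None and not self._stopped:
                    self._cond.wait()
                if self._latest is None:
                    return  # stopped and nothing left to send
                payload, self._latest = self._latest, None
            try:
                self._upload(payload)  # slow call happens outside the lock
            except Exception:
                pass  # best-effort: a failed upload is dropped, not retried
```

Because stale maps are overwritten rather than queued, a network stall costs at most one in-flight upload plus the newest map, and the subscription callback never waits on the network.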

Confidence Score: 4/5

Safe to merge for non-production trials, but the GeminiTTSNode accumulates one daemon thread per utterance across the session lifetime, which will eventually exhaust thread resources on long-running robots.

The Gemini speak pipeline calls consume_text() on every utterance, creating a new _process_queue thread each time that is never stopped. In a long robot session with hundreds of speech events, this will exhaust thread resources. The completion signal also fires before audio is queued, so the audio lock can be released while the previous utterance is still playing. These are the same structural defects in OpenAITTSNode that were already identified; this PR replicates them in GeminiTTSNode without fixing them.

dimos/stream/audio/tts/node_gemini.py and dimos/agents/skills/gemini_speak_skill.py — specifically the consume_text() call site in _speak_blocking and the ordering of text_subject vs audio_subject emissions in _synthesize_speech.

Important Files Changed

Filename	Overview
dimos/stream/audio/tts/node_gemini.py	New GeminiTTSNode mirroring OpenAITTSNode; inherits the consume_text() thread-accumulation and premature completion-signal defects flagged in earlier reviews
dimos/agents/skills/gemini_speak_skill.py	New Gemini-backed speak skill; correctly guards against concurrent speech with _audio_lock, but each speak() call adds a new processing thread to GeminiTTSNode
dimos/agents/skills/local_speak_skill.py	New macOS say-backed speak skill; clean implementation, proper subprocess safety (list-form call, no shell injection), and correct background-thread lifecycle management
dimos/agents/skills/map_uploader.py	New best-effort map uploader; makes a synchronous 30-second-timeout HTTP POST inside a reactive subscription callback, which can stall the costmap stream thread during network issues
dimos/agents/skills/take_picture_skill.py	New camera-capture-and-upload skill; straightforward implementation with correct error handling and SkillResult return types
dimos/agents_deprecated/memory/image_embedding.py	Adds Gemini multimodal embedding backend; correctly handles lazy import, API key resolution, normalization of truncated vectors, and unified image/text embedding path
dimos/robot/unitree/go2/blueprints/agentic/unitree_go2_agentic_gemini.py	New Gemini-only Go2 blueprint that cleanly composes existing atoms, disabling CUDA/OpenAI-dependent modules via disabled_modules()
dimos/robot/all_blueprints.py	Registers new blueprint and modules in the central registry; alphabetically ordered, no conflicts
dimos/robot/test_all_blueprints_generation.py	Adds disabled_modules to the recognized blueprint method set so the new blueprint passes generation tests
pyproject.toml	Adds langchain-google-genai and google-genai as agents extras; version constraints are consistent with existing deps
.gitignore	Adds WAL/SHM journal sidecar files and MuJoCo log to gitignore; clean housekeeping

Sequence Diagram

sequenceDiagram
    participant Agent
    participant GeminiSpeakSkill
    participant GeminiTTSNode
    participant SounddeviceAudioOutput

    Agent->>GeminiSpeakSkill: "speak(text, blocking=True)"
    GeminiSpeakSkill->>GeminiSpeakSkill: acquire _audio_lock
    GeminiSpeakSkill->>GeminiTTSNode: consume_text(text_subject) ⚠️ spawns new thread each call
    GeminiSpeakSkill->>GeminiTTSNode: emit_text().subscribe(set_as_complete)
    GeminiSpeakSkill->>GeminiTTSNode: text_subject.on_next(text)
    GeminiTTSNode->>GeminiTTSNode: _queue_text(text)
    GeminiTTSNode->>GeminiTTSNode: _synthesize_speech(text) via thread
    GeminiTTSNode->>GeminiTTSNode: text_subject.on_next(text) ⚠️ fires BEFORE audio queued
    GeminiTTSNode-->>GeminiSpeakSkill: audio_complete.set()
    GeminiTTSNode->>SounddeviceAudioOutput: audio_subject.on_next(audio_event)
    GeminiSpeakSkill->>GeminiSpeakSkill: sleep(0.3) then release _audio_lock
    GeminiSpeakSkill-->>Agent: "Spoke: {text}"

Reviews (4): Last reviewed commit: "feat(go2): pose on captures + global_cos..."

greptile-apps · 2026-05-26T15:14:31Z

+        """
+        Start consuming text from the observable source.
+
+        Args:
+            text_observable: Observable source of text strings
+
+        Returns:
+            Self for method chaining
+        """
+        logger.info("Starting GeminiTTSNode")
+
+        # Start the processing thread
+        self.processing_thread = threading.Thread(target=self._process_queue, daemon=True)  # type: ignore[assignment]
+        self.processing_thread.start()  # type: ignore[attr-defined]
+
+        # Subscribe to the text observable
+        self.subscription = text_observable.subscribe(  # type: ignore[assignment]
+            on_next=self._queue_text,
+            on_error=lambda e: logger.error(f"Error in GeminiTTSNode: {e}"),
+        )
+


Thread accumulation on every speak() call

consume_text() unconditionally spawns a new _process_queue thread and creates a new subscription on every invocation, and self.processing_thread is overwritten rather than checked. The calling side — GeminiSpeakSkill._speak_blocking() — calls consume_text() once per utterance, so after N speech events there are N daemon threads all spinning on the shared text_queue. The same defect exists in OpenAITTSNode/SpeakSkill; this PR replicates it. A guard like if self.processing_thread and self.processing_thread.is_alive(): return self (or calling consume_text only once from GeminiSpeakSkill.start() and routing text via _queue_text directly) would prevent unbounded thread growth in long robot sessions.
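
The guard suggested above can be sketched in isolation. This is a minimal stand-in, not the GeminiTTSNode implementation: GuardedTextConsumer, queue_text, stop, and threads_started are hypothetical names, and the worker only records text instead of synthesizing audio.

```python
import queue
import threading
from typing import List, Optional


class GuardedTextConsumer:
    """Hypothetical stand-in for a TTS node whose consume_text() is safe to call repeatedly."""

    def __init__(self) -> None:
        self.text_queue: "queue.Queue[Optional[str]]" = queue.Queue()
        self.processed: List[str] = []
        self.processing_thread: Optional[threading.Thread] = None
        self.threads_started = 0
        self._start_lock = threading.Lock()

    def consume_text(self) -> "GuardedTextConsumer":
        # Start the worker at most once, even if this is called per utterance.
        with self._start_lock:
            if self.processing_thread is not None and self.processing_thread.is_alive():
                return self
            self.processing_thread = threading.Thread(target=self._process_queue, daemon=True)
            self.processing_thread.start()
            self.threads_started += 1
        return self

    def queue_text(self, text: str) -> None:
        self.text_queue.put(text)

    def stop(self) -> None:
        self.text_queue.put(None)  # sentinel that ends the worker loop
        if self.processing_thread is not None:
            self.processing_thread.join(timeout=2.0)

    def _process_queue(self) -> None:
        while True:
            text = self.text_queue.get()
            if text is None:
                break
            self.processed.append(text)  # real code would synthesize speech here


node = GuardedTextConsumer()
for i in range(5):
    node.consume_text()  # repeated calls reuse the live worker
    node.queue_text(f"utterance {i}")
node.stop()
```

The lock around the liveness check matters only if consume_text() can be called from more than one thread; with a single caller the is_alive() check alone would be enough.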

Replace the custom Go2FullRecorder with the stock go2-memory recorder for the capture demo (camera + lidar + odom is enough for the frames+trajectory viewer). Point the capture-viewer at recording_go2.db and ignore recording sidecars + the MuJoCo runtime log. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

greptile-apps · 2026-05-26T15:32:19Z

+            self.text_subject.on_next(text)
+
+            # Gemini returns raw 16-bit PCM bytes (24 kHz, mono, little-endian).
+            pcm_bytes = response.candidates[0].content.parts[0].inline_data.data
+            audio_array = np.frombuffer(pcm_bytes, dtype=np.int16)
+
+            audio_event = AudioEvent(
+                data=audio_array,
+                sample_rate=_SAMPLE_RATE,
+                timestamp=time.time(),
+                channels=1,
+            )
+
+            self.audio_subject.on_next(audio_event)


Completion signal fires before audio is confirmed

text_subject.on_next(text) (line 179) triggers audio_complete in GeminiSpeakSkill._speak_blocking before the audio data is even extracted. If response.candidates[0].content.parts[0].inline_data.data raises (empty candidates, content filtered, API format change), the exception is swallowed by the outer try/except but the completion event has already been set — so _speak_blocking returns "Spoke: {text}" even though the user heard nothing. Move text_subject.on_next(text) to after self.audio_subject.on_next(audio_event) so the signal fires only when both synthesis and audio queueing have succeeded.
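
The reordering can be illustrated without the reactive library. The sketch below uses plain callback lists in place of text_subject and audio_subject; OrderedEmitter and its synthesize callable are hypothetical and only demonstrate the ordering, not the Gemini API call.

```python
from typing import Callable, List


class OrderedEmitter:
    """Hypothetical stand-in: emits the completion text only after audio was handed off."""

    def __init__(self, synthesize: Callable[[str], bytes]) -> None:
        self._synthesize = synthesize  # assumed to return raw PCM bytes or raise
        self.audio_listeners: List[Callable[[bytes], None]] = []
        self.text_listeners: List[Callable[[str], None]] = []

    def synthesize_speech(self, text: str) -> bool:
        try:
            pcm = self._synthesize(text)
            if not pcm:
                raise ValueError("empty audio payload")
            for listener in self.audio_listeners:
                listener(pcm)  # audio is queued first
        except Exception:
            return False  # no completion signal on failure
        for listener in self.text_listeners:
            listener(text)  # completion fires only after audio was queued
        return True
```

With this ordering a failed synthesis never reports success, but a blocking caller then needs its own timeout or a separate error signal so that it does not wait indefinitely.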

New TakePictureSkill subscribes to color_image, caches the latest frame, and on take_picture() JPEG-encodes it and POSTs to robomoo's /api/robot/frame with a shared bearer token (ROBOMOO_URL / ROBOT_INGEST_TOKEN from env). Wired into unitree_go2_agentic_gemini and registered as take-picture-skill so the agent can call it ("take a picture"). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

take_picture now attaches the robot's odom pose (poseX/poseY) + label so the web can pin captures on the map. New MapUploader subscribes global_costmap, renders it with turbo_image, and POSTs the PNG + grid metadata to robomoo /api/robot/map (throttled). Both wired into unitree_go2_agentic_gemini. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

feat: add gemini/local speak skills, agentic gemini Go2 blueprint, an…

70bfe66

…d capture-viewer tool Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

greptile-apps Bot reviewed May 26, 2026

grmkris and others added 2 commits May 27, 2026 03:38

grmkris wants to merge 4 commits into dimensionalOS:main from grmkris:feat/gemini-speak-go2-tools
