System Architecture — Agent Loop
Samsung Galaxy S24 Ultra
Android · Qualcomm QNN
VNC Client
USB Tethering (RNDIS)
USB HID (future path)
🎙 Voice Input (STT)
LLM API Client
Screenshot Capture
ThumbStaff UI
Voice Command → Agent Perception → Reasoning → Actuation Loop
🎙
Voice
Command
SpeechRecognizer
on-device STT
"Open Notepad and
write meeting notes"

transcribed text → currentTask
Agent Loop
begins ↓
01
🖥️
Capture
Screen
VNC frame
or screenshot
02
🧠
LLM
Reasoning
GPT-4o Vision
Claude / Gemini
03
Parse
Actions
click(x,y)
type("text")
04
🎯
VNC
Actuation
click/type via VNC
HID: future path
Channel | Transport | What it carries | Setup
HID (future, optional) | USB direct | Keyboard + mouse injection · demo uses VNC actuation instead | Zero — PC auto-detects
Screen + Actuation ★ | VNC over RNDIS | PC display → phone · demo actuation path (click/type via VNC) | TightVNC on PC (one-time)
Network | USB Tethering (RNDIS) | Private P2P link · 192.168.42.x | Android built-in · Win auto-driver
Power | USB VBUS | Phone charging (optional) | Zero
Phone (one-time)
Settings → USB → USB Tethering ON
Android presents as an RNDIS network adapter
PC (one-time)
Install TightVNC server
Windows: RNDIS driver auto-installs
macOS: HoRNDIS driver (~2min)
Linux: native, zero config
After one-time setup — every subsequent use:
Plug in USB-C → 🎙 Speak your task → AI sees PC → AI controls PC → Unplug → everything stops.
No Wi-Fi. No network config. No target-side agent or software integration. Voice or text input. Minimal one-time setup. Same plug-in narrative as the BC-3 Dongle.
⚠️
Experience Version — no MTB, no enterprise-security claim. This is intentional. The LLM receives raw screenshots. This version demonstrates the HMI-mediated control model only: no target-side agent integration, and minimal one-time setup for the UX demo. The Mandatory Transformation Boundary (MTB) is what BC-3 hardware adds, making the paid product architecturally defensible.
Target PC
Any major OS · no target-side agent
Windows
macOS
Linux
Any
No target-side
agent integration
TightVNC (one-time)
RNDIS auto-driver
192.168.42.1
Implementation Detail
⚙️
Core Agent Loop (Kotlin)
// Voice or text → task
val task = when (inputMode) {
    InputMode.Voice -> speech.recognize()  // on-device STT
    InputMode.Text  -> taskField.value
}

suspend fun agentLoop(task: String) {
    while (isRunning) {
        // 1. Capture PC screen via VNC
        val screenshot = vnc.captureFrame()
        // 2. Send to LLM with task context
        val action = llm.reason(image = screenshot, task = task)
        // 3. Execute action on PC
        when (action) {
            is Click -> vnc.click(action.x, action.y)
            is Type  -> vnc.type(action.text)
            is Done  -> return
        }
        delay(800)  // rate limit
    }
}
Loop interval ~1–2 sec per step
Screenshot size Resize to 1280px wide
Action format JSON from LLM
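The "JSON from LLM" action format can be decoded into the loop's Click/Type/Done actions. A minimal sketch, assuming an illustrative schema of `{"action":"click","x":120,"y":340}`, `{"action":"type","text":"..."}`, and `{"action":"done"}`; both the schema and the regex-based decoder are assumptions for illustration, and a production build would use a real JSON library such as kotlinx.serialization.

```kotlin
// Illustrative action schema — assumed, not the shipped format.
sealed class Action {
    data class Click(val x: Int, val y: Int) : Action()
    data class Type(val text: String) : Action()
    object Done : Action()
}

// Minimal regex-based decoder for the three expected shapes.
// Returns null on a malformed reply so the loop can re-prompt the LLM.
fun parseAction(json: String): Action? {
    val kind = Regex(""""action"\s*:\s*"(\w+)"""").find(json)?.groupValues?.get(1)
    return when (kind) {
        "click" -> {
            val x = Regex(""""x"\s*:\s*(\d+)""").find(json)?.groupValues?.get(1)?.toInt()
            val y = Regex(""""y"\s*:\s*(\d+)""").find(json)?.groupValues?.get(1)?.toInt()
            if (x != null && y != null) Action.Click(x, y) else null
        }
        "type" -> Regex(""""text"\s*:\s*"([^"]*)"""").find(json)
            ?.let { Action.Type(it.groupValues[1]) }
        "done" -> Action.Done
        else -> null  // unknown or missing action kind
    }
}
```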
📦
Key Libraries & APIs
VNC client LibVNCClient (Android)
USB HID out Android USB Host API
LLM vision OpenAI / Anthropic SDK
Image encode Base64 JPEG ~150KB
UI framework Jetpack Compose
Voice STT Android SpeechRecognizer
Voice model On-device (Google, 0 cost)
Voice locale en-US / zh-TW / ja-JP
Async Kotlin Coroutines
PC VNC server TightVNC (free)
Network RNDIS over USB-C
VNC target IP 192.168.42.1 (fixed)
Coord scaling Phone px → PC res
Local model opt. QNN / Hexagon NPU
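The "Coord scaling" row (phone px → PC res) amounts to a linear map from a tap inside the phone's VNC view to the PC's native resolution. A minimal sketch; the function name and the dimensions in the usage note are illustrative:

```kotlin
// Map a tap inside the phone's VNC view (viewW × viewH px)
// to the PC's native resolution (pcW × pcH), clamped to screen bounds.
fun scaleToPc(
    tapX: Int, tapY: Int,
    viewW: Int, viewH: Int,
    pcW: Int, pcH: Int
): Pair<Int, Int> {
    val x = (tapX.toDouble() / viewW * pcW).toInt().coerceIn(0, pcW - 1)
    val y = (tapY.toDouble() / viewH * pcH).toInt().coerceIn(0, pcH - 1)
    return x to y
}
```

For example, a tap at (640, 360) in a 1280×720 view maps to (960, 540) on a 1920×1080 PC.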
📱
ThumbStaff App UI Layout
┌─ ThumbStaff UX ────────────┐
│ ████████████████████████   │ ← PC screen (VNC live)
│ ████████████████████████   │
│ ████████████████████████   │
├────────────────────────────┤
│ 🧠 Agent: clicking Login…   │ ← AI status
│ Step 3/7 · 2.1s elapsed    │
├────────────────────────────┤
│ Task: "Open Notepad, type…" │ ← User input
│ 🎙 Voice │ ⌨ Type           │ ← Input mode
├─────────────┬──────────────┤
│   ■ STOP    │    ▶ RUN     │ ← Controls
└─────────────┴──────────────┘
PC screen area ~60% of screen
Agent status Live step + action
Task input 🎙 Voice or ⌨ Type
Voice engine Android SpeechRecognizer
Voice latency <1s on-device
Override STOP always visible
Voice Input Architecture
🎙 Voice-to-Task Flow
🎙 User speaks
Hold mic button
SpeechRecognizer
On-device STT
Transcribed text
→ currentTask
Confirm on screen
User sees transcript
▶ RUN or re-speak
User confirms
Agent loop starts
Capture → Reason → Act
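The confirm-before-run flow above can be sketched as a small state machine; the state and event names are illustrative, not the app's actual classes:

```kotlin
// Illustrative states for the confirm-before-run voice flow.
sealed class VoiceState {
    object Idle : VoiceState()                               // waiting for mic press
    object Listening : VoiceState()                          // SpeechRecognizer active
    data class Transcribed(val text: String) : VoiceState()  // transcript shown for confirmation
    data class Running(val task: String) : VoiceState()      // agent loop started
}

sealed class VoiceEvent {
    object MicPressed : VoiceEvent()                   // user holds mic button
    data class Result(val text: String) : VoiceEvent() // STT transcript arrives
    object Confirmed : VoiceEvent()                    // user taps ▶ RUN
    object ReSpeak : VoiceEvent()                      // user rejects the transcript
}

fun transition(state: VoiceState, event: VoiceEvent): VoiceState = when {
    state is VoiceState.Idle && event is VoiceEvent.MicPressed -> VoiceState.Listening
    state is VoiceState.Listening && event is VoiceEvent.Result -> VoiceState.Transcribed(event.text)
    state is VoiceState.Transcribed && event is VoiceEvent.Confirmed -> VoiceState.Running(state.text)
    state is VoiceState.Transcribed && event is VoiceEvent.ReSpeak -> VoiceState.Listening
    else -> state  // ignore out-of-order events
}
```

Keeping the confirm step explicit means the agent loop never starts on a misheard transcript.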
Demo narrative: "Pick up phone → speak your task → plug in USB-C → watch AI execute on PC."
Voice makes the boundary metaphor physical — the human speaks, the AI acts.
⚙️ Voice Implementation Detail
Android API android.speech.SpeechRecognizer
Processing On-device (offline capable)
Latency <1s recognition · real-time partial
API cost $0 — built into Android
Languages en-US · zh-TW · ja-JP · 100+
Permission RECORD_AUDIO (runtime)
UX pattern Hold-to-speak or tap toggle
Partial results Live transcript while speaking
Fallback Text input field (always available)
Compose integration rememberLauncherForActivityResult
Dev effort ~1–2 days
Why voice matters for demo
Typing a task into a phone is forgettable. Speaking to your phone and watching a PC obey is unforgettable. Zero incremental cost, massive demo impact.
LLM Selection for Demo
Claude 3.5 Sonnet
Vision quality: Excellent
Latency: ~2–5s / step
Cost / 1K steps: ~$4–10
JSON action: Very reliable
Setup: API key only
Local (QNN / Hexagon)
Vision quality: Limited (7B)
Latency: ~0.5–1s
Cost / 1K steps: $0
JSON action: Needs prompting
Setup: QNN SDK, complex
6-Week Demo Sprint — Starting May 2026
WEEK 1–2
VNC Foundation
WEEK 3–4
LLM Agent Loop
WEEK 5
ThumbStaff UI + Voice
WEEK 6
Demo Polish