Sakana AI Introduces KAME: A Tandem Speech-to-Speech Architecture That Injects LLM Knowledge in Real Time
Sakana AI's KAME architecture combines the speed of direct speech-to-speech models with the knowledge of large language models, achieving near-zero latency and improved conversational AI.
The fundamental tension in conversational AI has always been a binary choice: respond fast or respond smart. Real-time speech-to-speech (S2S) models, which power natural-feeling voice assistants, start talking almost instantly, but their answers tend to be shallow. On the other hand, cascaded systems that route speech through a large language model (LLM) are far more knowledgeable, but the pipeline delay is long enough to make conversation feel stilted and robotic.
Researchers at Sakana AI, a Tokyo-based AI lab, have introduced KAME (Knowledge-Access Model Extension), a hybrid architecture that keeps the near-zero response latency of a direct S2S system while injecting the richer knowledge of a back-end LLM in real time.

To understand why KAME matters, it helps to grasp the two dominant designs it bridges. A direct S2S model like Moshi, developed by Kyutai, is a monolithic transformer that takes in audio tokens and produces audio tokens in a continuous loop. Its response latency is exceptionally low, but its capacity for factual knowledge and deep reasoning is limited, since much of the model is devoted to handling complex acoustic signals.
A cascaded system, by contrast, routes the user's speech through an Automatic Speech Recognition (ASR) model, feeds the resulting text into a powerful LLM, and then converts the LLM's response back into speech via a Text-to-Speech (TTS) engine. While the knowledge quality is excellent, the system suffers from a median latency of around 2.1 seconds, disrupting natural conversational flow.

KAME operates as a tandem system with two asynchronous components running in parallel. The front-end S2S module, based on the Moshi architecture, processes audio in real time and begins generating a spoken response immediately.
The back-end LLM module consists of a streaming speech-to-text (STT) component paired with a full-scale LLM, generating candidate text responses, or "oracles," as the user's speech is processed. The front-end S2S transformer conditions its ongoing speech output on both its internal context and these incoming oracle tokens, allowing it to correct course and update its response mid-sentence.

One challenge in developing KAME is that naturally occurring datasets do not contain oracle signals. The Sakana AI research team addresses this with a technique called Simulated Oracle Augmentation, generating synthetic oracle sequences using a "simulator" LLM and a standard conversational dataset.
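The tandem design can be sketched as two asynchronous coroutines. Everything below — the function names, the toy token streams, the timing constants — is illustrative, not Sakana AI's implementation; the point is only the general pattern of a fast front-end that never blocks on a slow back-end:

```python
import asyncio

# Hedged sketch of a KAME-style tandem loop (all names are stand-ins).
# A slow back-end streams "oracle" tokens into a queue; the fast front-end
# emits one token per audio frame, folding in oracle tokens as they arrive
# and otherwise falling back to its own shallow continuation.

async def backend_llm(oracle_queue: asyncio.Queue, oracle_tokens):
    """Back-end LLM stand-in: streams knowledge-rich oracle tokens slowly."""
    for tok in oracle_tokens:
        await asyncio.sleep(0.005)            # simulated per-token LLM latency
        await oracle_queue.put(tok)

async def frontend_s2s(oracle_queue: asyncio.Queue, fallback_tokens, n_steps):
    """Front-end S2S stand-in: speech starts immediately because it never
    waits on the back-end; oracle tokens are consumed non-blockingly."""
    output = []
    for step in range(n_steps):
        try:
            tok = oracle_queue.get_nowait()   # non-blocking: stay real-time
        except asyncio.QueueEmpty:
            tok = fallback_tokens[step % len(fallback_tokens)]
        output.append(tok)
        await asyncio.sleep(0.05)             # one "audio frame" per step
    return output

async def run_turn():
    queue = asyncio.Queue()
    oracle = ["tokyo", "is", "the", "capital"]   # knowledge-rich answer
    filler = ["hmm", "well"]                     # shallow front-end filler
    backend = asyncio.create_task(backend_llm(queue, oracle))
    spoken = await frontend_s2s(queue, filler, n_steps=8)
    await backend
    return spoken

tokens = asyncio.run(run_turn())
```

The first frame is always filler because the back-end has not yet produced anything, which mirrors the article's claim: the front-end begins speaking instantly and upgrades its response as oracle tokens stream in.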
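The data side can be sketched as well. The stub below stands in for a real simulator-LLM call, and the alignment scheme is an assumption about the general shape of the technique, not the paper's exact recipe:

```python
# Hedged sketch of Simulated Oracle Augmentation (names illustrative).
# Real conversational datasets lack oracle streams, so a "simulator" LLM
# is prompted with the dialogue context to produce the oracle text a
# back-end LLM would have streamed; the oracle tokens are then offset in
# time against the target response to mimic back-end latency.

def simulate_oracle(simulator_llm, context: str) -> list[str]:
    """Ask the simulator LLM for a knowledge-rich candidate answer."""
    return simulator_llm(context).split()

def align_oracle(response_tokens, oracle_tokens, arrival_step=2):
    """Toy alignment: oracle tokens start arriving a few frames into the
    response, one per frame; earlier frames see no conditioning signal."""
    paired = []
    for t, resp_tok in enumerate(response_tokens):
        idx = t - arrival_step
        oracle_tok = oracle_tokens[idx] if 0 <= idx < len(oracle_tokens) else None
        paired.append((resp_tok, oracle_tok))   # (target, conditioning signal)
    return paired

def build_training_example(simulator_llm, user_turn, assistant_turn):
    oracle = simulate_oracle(simulator_llm, user_turn)
    return align_oracle(assistant_turn.split(), oracle)

# Stub simulator in place of a real LLM call (an assumption for this sketch).
stub_llm = lambda ctx: "Mount Fuji is 3776 meters tall"
example = build_training_example(stub_llm,
                                 "How tall is Mount Fuji?",
                                 "It is about 3776 meters tall")
```

Each training pair teaches the front-end what to emit at a frame given whichever oracle tokens would plausibly have arrived by then — the early frames, with no oracle yet, train exactly the "start talking before you know the answer" behavior.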
Evaluations on a speech-synthesized subset of the MT-Bench multi-turn Q&A benchmark show a dramatic improvement, with KAME achieving scores comparable to cascaded systems at near-zero latency. The researchers also demonstrate that KAME is back-end agnostic, allowing different LLMs to be swapped in without retraining the front-end model.

The implications are significant: KAME could enable conversational AI systems that are both more natural and more informative. By combining the strengths of direct S2S models and cascaded pipelines, it offers speed and knowledge at once.
Source: MarkTechPost