On Dec. 12, Cartesia AI announced a $27 million seed round and is wrapping up the year with its “State of Voice AI” report, which highlights 2024 trends and offers predictions for the future of voice AI in 2025.
The round was led by Index Ventures, with participation from Lightspeed, Factory, Conviction, General Catalyst, A*, SV Angel, and 90 other angel investors. Cartesia’s founding team (Karan Goel, Albert Gu, Chris Ré, and Arjun Desai) met as PhD students at the Stanford AI Lab, where they pioneered State Space Models (SSMs) for deep learning, which the company describes as a fundamental new building block for training large-scale AI foundation models that are higher quality and more efficient.
Founded in 2023, Cartesia offers AI models that run in real time across different devices, with a focus on privacy and speed. Its main products are Sonic, a fast and highly realistic voice generation tool, and On-Device AI models, which work offline to process information privately.
“We’re building the real-time intelligence future of AI, brains that can run anywhere, and the model architectures to support it. Audio is just the start,” CEO and co-founder Goel wrote in a post on X. Goel is an IIT Delhi alum, holding a master’s degree in machine learning from Carnegie Mellon.
Over the past few years, Cartesia has pioneered architectural advances such as S4 and Mamba to build real-time intelligence with long memory that runs wherever the user goes, the company stated. According to Cartesia, three properties of these architectures set the company and its products apart in the market: computational cost that scales linearly with sequence length rather than quadratically, the ability to compress long sequences of data into a fixed-size state, and fast, efficient inference.
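To make the linear scaling and fixed-size state concrete, here is a minimal sketch of a toy diagonal linear state space recurrence. It is an illustration under stated assumptions, not Cartesia's S4 or Mamba implementation: the function name `ssm_scan` and the parameters `a`, `b`, and `c` are hypothetical, and real models learn these parameters and use far more elaborate parameterizations.

```python
from typing import List, Sequence


def ssm_scan(a: Sequence[float], b: Sequence[float], c: Sequence[float],
             inputs: Sequence[float]) -> List[float]:
    """Run a toy diagonal linear state space recurrence over a 1-D input.

    Per step k (hypothetical, hand-picked parameters, not a trained model):
        state_k  = a * state_{k-1} + b * u_k   (elementwise)
        output_k = sum(c * state_k)
    """
    if not (len(a) == len(b) == len(c)):
        raise ValueError("a, b and c must have the same length")

    # The state has len(a) entries no matter how long the input is.
    state = [0.0] * len(a)
    outputs: List[float] = []

    # One fixed-cost update per input element, so total work grows
    # linearly with len(inputs).
    for u in inputs:
        state = [a_i * x_i + b_i * u for a_i, x_i, b_i in zip(a, state, b)]
        outputs.append(sum(c_i * x_i for c_i, x_i in zip(c, state)))
    return outputs


if __name__ == "__main__":
    # A single state entry that decays by half each step.
    print(ssm_scan([0.5], [1.0], [1.0], [1.0, 1.0, 1.0]))
```

Each step touches only the current state, so memory use does not grow with the length of the sequence. An attention layer, by contrast, compares each position against every earlier position, which is where the quadratic cost comes from.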
Cartesia in its State of Voice AI report recapped 2024 and looked to the year ahead. Here are a few takeaways from the report:
In 2024, voice interaction technology saw major breakthroughs with new architectures and advancements. Orchestrated speech systems, integrating Speech-to-Text (STT), Large Language Models (LLM), and Text-to-Speech (TTS), enabled more natural conversations.
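The orchestrated STT, LLM, and TTS pattern can be sketched as three stages chained together for each conversational turn. This is a simplified illustration only: `VoicePipeline`, `handle_turn`, and the stub functions are hypothetical names, and a production system would call real speech and language services and stream audio rather than pass whole buffers.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class VoicePipeline:
    """Chains three pluggable stages; every stage here is a placeholder."""
    transcribe: Callable[[bytes], str]   # STT: audio in, text out
    respond: Callable[[str], str]        # LLM: user text in, reply text out
    synthesize: Callable[[str], bytes]   # TTS: reply text in, audio out

    def handle_turn(self, audio_in: bytes) -> bytes:
        """Process one user turn: speech -> text -> reply -> speech."""
        user_text = self.transcribe(audio_in)
        reply_text = self.respond(user_text)
        return self.synthesize(reply_text)


# Stub stages standing in for real models (assumptions for illustration).
def fake_stt(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_llm(text: str) -> str:
    return f"You said: {text}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


if __name__ == "__main__":
    pipeline = VoicePipeline(fake_stt, fake_llm, fake_tts)
    print(pipeline.handle_turn(b"hello"))
```

Because each stage must finish before the next begins, the latencies add up in this design, which is one reason speech-to-speech and full-duplex models attract interest.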
OpenAI’s Voice mode in ChatGPT introduced speech-to-speech technology, though challenges such as handling interruptions remain. Full-duplex systems, such as Kyutai’s Moshi, showed that a model can listen while it speaks, providing a glimpse of the future of multimodal voice interactions. Additionally, new state space models (SSMs) in TTS, like Cartesia’s Sonic, allowed more flexible, memory-efficient deployments, revolutionizing on-device speech.
Voice AI APIs also advanced, enabling natural conversations at an enterprise scale. Improvements in STT models like Deepgram’s Nova-2 reduced error rates, while LLMs like GPT-4 and Llama 3.2 improved reasoning efficiency.
TTS models reached new levels of naturalness, transforming synthetic voices to sound human-like. Enterprise platforms were upgraded to handle real-time interactions with customizable SLAs, dynamic scaling, and robust security, meeting the needs of regulated industries.
Platforms like LiveKit and Daily simplified building, testing, and deploying custom voice agents, reducing time from months to weeks. Voice agent orchestration platforms such as Vapi, Retell, and Thoughtly emerged, offering advanced features like voice activity detection and emotion recognition.
Vertical startups expanded rapidly, with voice agents becoming integral in sectors like finance, healthcare, logistics, and hospitality, handling tasks like customer support, loan servicing, and medical scheduling.
As 2025 approaches, AI voice agents are expected to handle more complex tasks, becoming essential in workflows across industries. Compact on-device models will enable local, privacy-conscious conversations, while fine-grained control over voice attributes will allow for seamless integration with other AI modalities, leading to more immersive, interactive experiences in gaming, media, and customer service.
Overall, Cartesia sees significant potential for voice AI in 2025.

