Mitochondria has joined the Sarvam Startup Program
Our Indian entity has been accepted into the Sarvam Startup Program. It gives a cohort of teams access to Sarvam's full stack of speech, language and translation models, the documentation around them, and a working relationship with the team building these models in the open. We are pleased to be in this cohort, and we want to say a few things about what it means for us and for the work we do.
Sarvam has spent several years building foundation models specifically for Indian languages. The effort is commendable. Most foundation model work has gone into English-first systems, with multilingual capabilities added later. Sarvam's models are built the other way round. Indian languages are first-class inputs. The architecture, the training data, and the evaluation regimes assume that a Marathi, Tamil or Bengali speaker is the person the system is meant to serve.
For us, that orientation is a meaningful piece of infrastructure. We work with clients across Europe and India, in sectors where the people on the other side of the system speak many languages other than English. A buyer in Düsseldorf asking a question about a textile produced in Kutch. A procurement officer in Lyon asking how a millet-based product behaves in transit. An operator in Solapur describing a process that has never been written down. A field officer in a cooperative outside Bhuj reporting a deviation in a batch she has just inspected. A bank customer in Bengaluru in the middle of a query. A logistics coordinator on a depot floor outside Rotterdam.
These are voice-shaped conversations. They carry hesitations, code-switching, assumed context, and a register specific to the speaker and the situation. A counterparty in Europe wants a clean record, GDPR-compliant handling of any speech data, and a tone that holds across languages. A user in India wants to be understood in her own language without having her code-switching flagged as an error. Both expectations are reasonable, and both are solvable when the foundation beneath the conversation is built for it rather than around it.
The Sarvam relationship gives us the language-side foundation we have been working toward; the reasoning, governance, and conversational design we already know how to handle. With this in place, the questions we want to be answering are the interesting ones about how a conversation should go, rather than whether the underlying models will hold.
What we have been working on, and where voice fits
In the financial services context, we have given careful thought to what it means for a voice agent to discuss investment products without straying into advice that should be regulated. A voice agent that holds composure when a vulnerable user asks a question, that recognises when a person is asking about retirement planning rather than just fund returns, and that knows when to stop talking and route to a human is built differently from one trained to optimise for engagement. The disclosure and grievance flows we work toward must satisfy SEBI and AMFI scrutiny on the Indian side and MiFID-aligned conduct expectations on the European side. These are typically different deployments, and sometimes different agent frameworks altogether. Where reasoning patterns can be shared across geographies, we share them. Where the regulators expect the architectures to be separate, they are.
In manufacturing, the agent often listens to a senior operator describe a process that has never been written down, recognises the names of jigs and fixtures that exist only in the plant's spoken vocabulary, and helps convert that knowledge into a structured form. The interesting work is not the speech-to-text. It is preserving the operator's voice throughout the conversion. A maintenance log entry that reads as the operator dictated it, with her reasoning preserved, is materially different from one that has been rewritten by a system trying to sound formal.
In agriculture and rural finance, the work concerns reach. Conversations happen in dialects of Telugu, Marathi, Bhojpuri, and Bengali, often within a single call, with a farmer's family member translating parts of it. The agent has to keep track of what was committed to whom. It has to recognise village names, cooperative names, and crop variety names that no off-the-shelf model has been trained on. When the agent does not recognise something, it must say so clearly, in the language of the conversation, without breaking trust.
In public transport and field services, voice becomes a question of latency and turn-taking. A passenger asking about a delay in Telugu. A driver reporting a mechanical issue from a depot. A controller asking the agent to synthesise the last hour of incident reports. The work is real-time, multilingual, and unforgiving. Half a second of awkward silence, or a clumsy attempt to fill it, breaks the conversation. Building for this is closer to building a radio system than building a chatbot.
In research data and field intelligence, voice is the dictation channel. A marine biologist on a survey vessel, a soil scientist at a field site, an ecologist on a coastal walk. They speak observations into a system that needs to recognise scientific terminology, taxonomic names, and contextual cues. The output has to be queryable later, structured enough for batch processing, and faithful to what was said. We have done a fair bit of thinking about how the same audio stream can serve both the working scientist and the regulatory submission six months later.
In cross-border trade, voice sits at the seam between supplier and buyer. A cooperative coordinator in a craft cluster speaks about her batch, the variability she sees across the looms, and the time she needs before the next consignment is ready. Somewhere in Bremen or Melbourne, a procurement officer reads a document that has to carry her meaning intact. The agent's job is to ensure the conversation can happen in the first place and that what was said survives the journey across geographies, languages, and compliance regimes.
The architectural pattern across this work is consistent. Voice is one channel into a reasoning system that holds context, applies guardrails, takes operational actions where appropriate, and escalates to a human when warranted. The reasoning is what we ship. Voice is how a real person gets to it.
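For readers who like to see the shape of a pattern, here is a minimal sketch in Python. Every name in it is illustrative; this is neither Sarvam's API nor our production code, only the skeleton described above: voice as one channel into a reasoning loop that holds context, applies guardrails, and keeps a human escape hatch.

```python
# Illustrative skeleton only: names, thresholds, and rules are assumptions for the example.
from dataclasses import dataclass, field


@dataclass
class Turn:
    transcript: str      # text from the speech layer, code-switching preserved
    language: str        # dominant language tag, e.g. "mr-IN"
    confidence: float    # transcription confidence between 0.0 and 1.0


@dataclass
class VoiceConversation:
    context: list = field(default_factory=list)  # held for this call only, not retained

    def handle(self, turn: Turn) -> str:
        """Voice is the channel; the reasoning layer decides what happens next."""
        self.context.append(turn)
        if not self.passes_guardrails(turn):
            return self.escalate(turn)
        return self.respond(self.reason(turn), turn.language)

    def passes_guardrails(self, turn: Turn) -> bool:
        # e.g. low transcription confidence, or a topic the agent must not advise on
        return turn.confidence >= 0.75 and "guarantee me returns" not in turn.transcript

    def reason(self, turn: Turn) -> str:
        # the part we actually ship: context, rules, operational actions
        return f"acknowledge and act on: {turn.transcript}"

    def respond(self, action: str, language: str) -> str:
        return f"[reply in {language}] {action}"

    def escalate(self, turn: Turn) -> str:
        return "handing this to a human colleague"
```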
How we work, and why this setup fits the work
Voice is not a new direction for us. It is where several threads of how Mitochondria already operates come together.
The first thread is the design heritage. We come out of communications strategy, and the discipline there is not about making a system talk. It is about understanding the register a conversation needs to be held in, the intention of the person on the other end, and the shape of what should be said before any of it is said. That training carries directly into voice work. A great deal of what makes a voice agent feel right, or wrong, is decided long before a single token is generated. When the agent should pause. When silence is more useful than a sentence. Whether the register matches the user's actual situation. These are design questions, and they sit upstream of any speech model.
The second thread is the architecture. Mitochondria runs as a cloud-based service that interfaces with client systems through an API. We do not deploy inside client infrastructure, and we do not retain client data once a conversation has ended. For voice, that posture matters in a practical way. A mid-sized buyer in the Netherlands, a cooperative aggregator in Maharashtra, a manufacturing firm in Pune, a logistics group running depots across two countries, none of them want to host or maintain voice infrastructure on their own machines. A service that processes transiently and governs cleanly is what allows the conversation to be deployed in months rather than postponed indefinitely while a procurement team works through a data-residency review.
The third thread is the team. The Amsterdam side carries the European conduct, data, and trust register as a working language. The Pune side carries the Indian operational and linguistic register in the same way. A single engagement that needs to satisfy both is a conversation we are already having inside the company before the client ever asks. The two postures are not being reconciled at a handover point but are being held together by people who work with each other every day.
The fourth thread is the framework. We deploy through a four-phase cycle whose second phase, Neuroplasticity, is the period in which the agent learns the operational vocabulary of the specific deployment. For voice in Indian languages, that phase is where most of the value comes from. A generic model will not know the names of jigs at a particular plant, or the local crop variety names a cooperative uses, or the way a senior controller phrases an incident report. The Neuroplasticity phase is built into how we work, which means voice deployments arrive at competence in the actual context rather than around it.
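A small illustration of what that phase produces, with made-up glossary entries and a matching threshold chosen purely for the example: a deployment-specific term list used to pull near-miss transcriptions back to the names a site actually uses.

```python
# Illustrative only: the glossary entries and cutoff are assumptions, not real client data.
import difflib

# Learned during Neuroplasticity: jig names at one plant, crop varieties at one cooperative.
DEPLOYMENT_GLOSSARY = ["Kolhapuri jig 4", "HMT Sona", "Sahyadri Panna"]


def repair_entities(spans: list, glossary: list, cutoff: float = 0.8) -> list:
    """Replace near-miss transcription spans with the closest known deployment term."""
    lowered = {term.lower(): term for term in glossary}
    repaired = []
    for span in spans:
        match = difflib.get_close_matches(span.lower(), list(lowered), n=1, cutoff=cutoff)
        repaired.append(lowered[match[0]] if match else span)
    return repaired


# "Sahyadri Panna" misheard as "sahyadri pana" gets pulled back to the known variety name;
# ordinary phrases pass through untouched.
print(repair_entities(["sahyadri pana", "delivery on tuesday"], DEPLOYMENT_GLOSSARY))
```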
These threads are why we are comfortable saying that voice fits us. The design heritage shapes the conversation. The architecture makes the deployment feasible. The team holds both registers. The framework gives the agent room to learn the vocabulary that matters. None of these is new for us. The language-side foundation now matches the rest of the system.
What gets harder when you take voice seriously
Voice exposes the limits of any model's understanding faster than text does. A user typing a question has often already done the work of clarifying it. A user speaking the same question speaks it the way they think it, with hesitations, restarts, and assumed context. A system that handles this gracefully has been built with a different respect for the speaker than one that demands the question be re-typed in a cleaner form.
In conversation, Indian languages do not behave the way training corpora suggest. A speaker in Pune will use Marathi grammar with English nouns and a Hindi connective inside a single sentence. A speaker in Coimbatore will move between Tamil and English depending on whether the topic is technical or domestic. The transcription has to hold this without flattening it, and the reasoning layer has to be language-aware in a way that the transcription alone cannot be. This is precisely the surface where Sarvam's orientation pays off, and where we expect the integration work over the coming quarters to feel different from what we have been able to assemble before.
Voice carries operational risk that text does not. A misheard digit in a banking conversation is not a misspelt digit in a chat. A misunderstood instruction during an incident is not a misread WhatsApp message. The audit trail must be sufficient to reconstruct what happened, and the system must know when it is sufficiently uncertain that a human should be in the loop. We have spent serious time on this, and it is one of the parts of the work we are most careful about.
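A compressed sketch of that posture, with illustrative field names and thresholds rather than our actual policy: the agent acts only above a confidence bar that tightens when the stakes rise, and every decision leaves an audit record that can reconstruct what happened without retaining the utterance itself.

```python
# Illustrative only: the 0.9 / 0.7 bars and the record fields are assumptions for the example.
import hashlib
import json
from datetime import datetime, timezone

AUDIT_LOG = []  # in production this would be an append-only store, not an in-memory list


def handle_turn(transcript: str, confidence: float, high_stakes: bool) -> str:
    """Route a turn: act only when confident enough, otherwise hand to a human."""
    threshold = 0.9 if high_stakes else 0.7
    decision = "proceed" if confidence >= threshold else "escalate_to_human"
    AUDIT_LOG.append({
        "at": datetime.now(timezone.utc).isoformat(),
        # hash rather than store the utterance, so the trail supports reconstruction
        # without keeping what was actually said
        "transcript_sha256": hashlib.sha256(transcript.encode()).hexdigest(),
        "confidence": confidence,
        "high_stakes": high_stakes,
        "decision": decision,
    })
    return decision


# A spoken account number is exactly the case the higher bar exists for.
print(handle_turn("transfer nine thousand to the account ending 4417", 0.82, high_stakes=True))
print(json.dumps(AUDIT_LOG[-1], indent=2))
```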
Then there is governance. Speech data is some of the most personal data a system can hold. The discipline we apply across Mitochondria, of processing data transiently and not storing it, of treating voice as ephemeral by default, of giving clients an audit position over what was done with what was said, becomes more important here. GDPR and DPDP both impose a higher bar on biometric and voice data. Not retaining raw audio, except where the client has an explicit reason and the legal basis to do so, is the working architectural assumption.
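The same default, sketched as configuration. The field names are illustrative, not a real policy schema; the point is that keeping raw audio is an explicit, justified exception rather than something a deployment drifts into.

```python
# Illustrative only: field names and example values are assumptions for the sketch.
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RetentionPolicy:
    retain_raw_audio: bool = False     # ephemeral by default
    reason: Optional[str] = None       # the client's explicit reason, if any
    legal_basis: Optional[str] = None  # e.g. a recorded GDPR or DPDP basis

    def may_retain(self) -> bool:
        # retention is only ever an explicit, justified exception
        return self.retain_raw_audio and bool(self.reason) and bool(self.legal_basis)


print(RetentionPolicy().may_retain())  # False: the default keeps nothing
print(RetentionPolicy(True, "dispute resolution", "documented consent").may_retain())  # True
```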
What this means for us, going forward
Voice in Indian languages, at the quality bar a regulated counterparty expects, is now a much more direct conversation for us. The work we have been building toward in agriculture, manufacturing, financial services, public transport, research, and cross-border trade now has the language-side foundation it has needed.
Our energy moves toward the part we enjoy most: designing the conversation itself.
We are looking forward to working with Sarvam and to building well on top of what they have made.