The Evidence Gap in Every AI Deployment Decision

David Yanagizawa-Drott, Professor of Economics at the University of Zurich and co-chair of J-PAL's AI Evidence Initiative, began his presentation at the India AI Impact Summit 2026 with a request. Close your eyes. Think about your organisation. Consider the one thing you would want AI to do. Now ask yourself: was that automation or augmentation? And what evidence did you base that choice on?

His guess, he told the room, was that most people would have based their choice on zero pieces of rigorous evidence. Maybe one. Almost certainly fewer than ten. The technology is too new. Organisations that have adopted it have rarely carried out proper evaluations of different approaches. And when they have, they have not published those results for others to learn from.

This is the evidence gap. Every organisation deploying AI faces a consequential choice between automation and augmentation: whether to replace human decision-making or to support it. Yet there is almost no empirical basis for making that choice.

What 90% of the room assumed

Yanagizawa-Drott set the scene for the audience. You are the policymaker responsible for district hospitals in Uttar Pradesh. The task involves diagnosing patients. AI can either fully automate the diagnosis or assist doctors with AI-generated recommendations. Automation reduces labour costs by 20%. Assistance reduces costs by 5%. Which policy do you endorse?

90% chose augmentation.

The reasoning is intuitive and strongly held. Keep humans involved, especially when the stakes are high. Doctors bring tacit knowledge, contextual understanding, and the ability to interpret information that AI systems may not have been trained on. A human-AI collaboration should, in theory, provide the best of both capabilities.

Yanagizawa-Drott then presented outcome data. Automation results in 30% worse patient health outcomes, while augmentation results in 5% worse outcomes. The atmosphere in the room shifted. Although augmentation still prevailed, the human-only option gained considerable support. Some were willing to accept slightly poorer outcomes for 5% cost savings, while others were not. As Yanagizawa-Drott pointed out, the organisation's values and goals determine where that line is drawn. Researchers can assist in establishing the facts, but the final decision remains a judgement call.
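The judgement call can be made explicit with a simple scoring sketch. Assuming a single weight that encodes how many points of cost saving one point of patient outcomes is worth (the weighting scheme is illustrative; only the percentages come from the talk):

```python
# Illustrative trade-off scoring for the hypothetical hospital scenario.
# The figures are those quoted in the talk; the scoring rule is an assumption.

def score(cost_saving_pct: float, outcome_change_pct: float, outcome_weight: float) -> float:
    """Higher is better: cost savings minus the weighted outcome loss."""
    return cost_saving_pct + outcome_weight * outcome_change_pct

options = {
    "automation":   (20.0, -30.0),  # 20% cost saving, 30% worse outcomes
    "augmentation": (5.0, -5.0),    # 5% cost saving, 5% worse outcomes
    "human-only":   (0.0, 0.0),     # status quo
}

for w in (0.5, 0.9, 2.0):  # how much one point of outcomes is worth in cost points
    best = max(options, key=lambda k: score(*options[k], outcome_weight=w))
    print(f"outcome weight {w}: choose {best}")
```

Sweeping the weight shows where the line is drawn: a cost-dominant organisation picks automation, an outcome-dominant one picks human-only, and augmentation wins only in the band between the two.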

The key point: the audience made that judgement call based on the numbers the presenter provided. In reality, organisations making this decision do not have such numbers. They rely on assumptions about what the AI can achieve, how well their staff currently perform the task, and what results occur when the two work together. The interaction effects, in particular how humans actually interpret AI recommendations and how their decision-making shifts when assisted by a machine, remain poorly understood.

When augmentation was the worst option

Yanagizawa-Drott then presented evidence from a randomised experiment he conducted with an organisation in Ghana that hires university students to teach in rural areas for two years. The organisation received applications, and experienced human teachers evaluated candidates. The research question was whether GPT-4 could improve this process, either by fully automating the evaluation or by augmenting human screeners with AI-generated recommendations.

The researcher himself shared the expectation that augmentation would produce the best results. Experienced teachers with tacit knowledge about what defines an effective rural educator, combined with AI's capability to systematically apply evaluation rubrics, should outperform either approach alone.

The results were the opposite. Full automation boosted hiring success rates by 70%. The augmented approach, in which human screeners received AI suggestions, was the least effective option: the AI input slowed the teachers down without meaningfully shifting their decisions. The teachers, it turned out, were also not particularly good at selecting effective teachers in the first place.

This single study does not resolve the issue across all situations. Yanagizawa-Drott was careful to point out that the answer varies depending on the specific task, the skill of human decision-makers, the ability of the AI system, and how they interact. His colleague Sendhil Obermeyer, a prominent researcher in AI and healthcare, gave the academic's answer: "It depends." Yanagizawa-Drott's reply was direct: "As an organisation, you can't sit there saying 'it depends.' You have to make a choice."

The implication is clear. Organisations that commit to augmentation because it feels safer, or to automation because it promises cost savings, without producing evidence specific to their context, are making important decisions based on inadequate information. The only dependable method is to gather evidence through structured evaluation before expanding.
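What "structured evaluation" can look like in its simplest form is a randomised comparison of the three arms with a binary success outcome. A minimal sketch, assuming hypothetical counts (the arm sizes and success numbers below are invented, not the Ghana results):

```python
import math

def two_proportion_z(success_a: int, n_a: int, success_b: int, n_b: int) -> float:
    """Z-statistic for the difference between two success rates (pooled variance)."""
    p_a, p_b = success_a / n_a, success_b / n_b
    p = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(p * (1 - p) * (1 / n_a + 1 / n_b))
    return (p_a - p_b) / se

# Hypothetical evaluation: successful hires out of candidates per randomised arm.
arms = {"human-only": (30, 100), "automation": (51, 100), "augmentation": (27, 100)}

baseline = arms["human-only"]
for name, (s, n) in arms.items():
    if name == "human-only":
        continue
    z = two_proportion_z(s, n, *baseline)
    print(f"{name} vs human-only: success rate {s/n:.0%}, z = {z:.2f}")
```

Even this bare-bones comparison already answers the question the instinct cannot: which arm actually performs better in this context, and whether the difference is large enough to act on.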

The third possibility

The panel discussion that followed, moderated by Murugan Vasudevan of Veddis Foundation, significantly broadened the perspective. Shankar Maruwada of the EkStep Foundation highlighted that the automation-versus-augmentation binary oversimplifies the issue: AI also opens up entirely new possibilities that did not exist before.

His examples were rooted in Indian reality. On 11 February, Amul, the world's largest dairy cooperative serving 36 million farmers, launched an AI advisory system in Gujarati. Women farmers, many of whom started with one or two cows and no prior dairy-farming experience, could now access 50 years of collective Amul knowledge through a voice interface in their own language. The value was not about efficiency. It was about empowerment. A daughter-in-law managing the family's dairy income, whose livelihood depends on her cow's health, now has access to institutional knowledge that previously required either years of experience or proximity to someone who had it.

The Blue Dot concept that Maruwada outlined through EkStep's work encapsulates this structurally. In a traditional platform, you search for services. In the Blue Dot model, services find you. A job-matching pilot in Ghaziabad identified 4,000 jobs in a district where existing platforms showed only 10. Scholarships for girls with disabilities were distributed in 20 minutes instead of over six months because the system recognised eligible recipients rather than waiting for applications. ITI graduates who had left "Lorem Ipsum" placeholder text in their CVs, not realising it was filler, can now create proper CVs through a conversational agent in their own language.

These are neither automation nor augmentation cases. They represent capabilities that did not exist before the technology made them possible. For a country with 500 million workers in the informal economy, where productivity gains are additive rather than displacing, this third category could be more significant than either side of the automation-augmentation debate.

Where the gains concentrate

Elizabeth Kelly of Anthropic shared data from the company's economic index that contextualises India’s position. India ranks second globally in total AI usage but 101st out of 116 countries on a per-capita basis. The high overall usage is due to population size, with adoption focused in the tech sector across Maharashtra, Tamil Nadu, Karnataka, and Delhi. The pattern reflects early mobile internet adoption: urban, educated, tech-first.

Within that concentrated usage, the results are striking. Computer and mathematical tasks account for 45% of all Indian AI usage, the highest proportion of any market worldwide. Tasks that take four hours without AI take about 15 minutes with it, roughly a 15x speed-up against 10x globally. India's tech workforce is extracting more productivity from AI than any other market.

The question is what happens beyond that workforce. Dr Becky Seif of FCDO raised the inclusion dimension directly: Harvard data shows women are already 20% less likely than men to engage with generative AI at work. This creates a compounding cycle where women are absent from the tools, the tools do not reflect their needs, and the gap widens. FCDO's approach centres inclusion from the design stage, including digital safety for women and girls as a precondition for AI adoption, disability inclusion in AI product design, and gender mainstreaming in evidence programmes.

Maruwada's framing was direct: all of this must be designed with inclusion, equity, and India's diversity in mind. If not, we face problems. The social media precedent, where speed of adoption outstripped governance and the costs only became clear once they were entrenched, is the scenario the AI ecosystem must work to avoid.

Measuring before scaling

The session converged on a point that connects directly to how we approach engagements at Mitochondria: the importance of evaluation before scale.

Seif outlined FCDO's four-stage evaluation framework: assess the model, assess the product, assess the user experience, and assess the development impact. She also raised a preliminary question: should you be using AI at all for this particular task? The investment in evaluation, she argued, matters beyond accountability. In a space where hype and fear coexist in roughly equal measure, rigorous evidence is what makes AI trustworthy.

Maruwada used the infrastructure analogy: you don't build a road assuming the best car has already been invented. You construct the road because it is the right thing to do, and others innovate on top. But infrastructure without evaluation results in roads that go nowhere.

Kelly emphasised the difference between using AI for efficiency and for growth. Efficiency improvements are a direct exchange: tasks become faster, costs decrease, and the question of who benefits from the gains becomes urgent. Growth involves expanding possibilities. For example, a three-person legal tech charity can work at the level of six engineers. Teachers are freed from administrative tasks to concentrate on students. Small businesses can now offer products and services that were previously impossible. When AI is used for growth, worries about automation lessen because the overall opportunity increases.

This distinction between efficiency and growth has shaped how we plan deployments. Our approach does not commit upfront to automation or augmentation. The Stimuli phase assesses actual operational conditions and determines where AI adds value: sometimes through efficiency, sometimes through new capability, often through both. The system starts with a defined scope, gathers performance evidence, and expands based on what that evidence shows. Each phase produces data on what the system handles well, where human judgement remains essential, and where opportunities arise that were not foreseen at the design stage.

In a recent engagement with a financial services organisation in India, the deployment was intended to begin as a pre-login advisory, providing fund information, FAQs, and educational content for retail investors. No personalised recommendations, no account access, no autonomous decisions. Every interaction was to be documented and evaluated: query types, resolution accuracy, escalation patterns, and user satisfaction signals. The findings from that phase would guide what the next stage would involve. The SEBI regulatory environment mandated this rigour, but the principle applies across sectors. Organisations that implement AI without assessing its impact and how people interact with it rely on assumptions, and these assumptions can lead to unforeseen and difficult-to-correct outcomes.

A race to the top

Maruwada concluded with a framing that highlights the stakes. India faces a choice between a race to the top and a race to the bottom. The social media trajectory, where adoption outpaced governance and the consequences disproportionately affected the most vulnerable, serves as a cautionary tale. Conversely, the digital public infrastructure trajectory, where essential capabilities were developed with inclusion targets and iterative improvements, is the model worth emulating.

AI deployment in organisations faces a similar dilemma. The technology is capable, but the question is whether organisations will gather the evidence needed for effective deployment or rely on instinct and face the consequences at scale.

90% of the room at the Summit chose augmentation based on instinct. The Ghana experiment complicated that instinct considerably, and the next thorough evaluation in a different context may complicate it further or confirm it. Without systematically gathering evidence, neither instinct nor any single counterexample offers a reliable basis for decisions that affect patients, farmers, students, and the organisations that serve them.

The evidence gap is genuine. The task of closing it is what makes deployment responsible.

Mitochondria is an agentic AI product company based in Amsterdam, with operations in Pune. ISO 27001 certified. GDPR- and DPDP Act-compliant by architecture.
