The Evidence Gap in Every AI Deployment Decision
David Yanagizawa-Drott, Professor of Economics at the University of Zurich and co-chair of J-PAL's AI Evidence Initiative, opened his presentation at the India AI Impact Summit 2026 with a request. Close your eyes. Think about your organisation. Think about the one thing you would want AI to do. Now ask yourself: was that automation or augmentation? And what evidence did you base that choice on?
His guess, he told the room, was that most people would have based their choice on zero pieces of rigorous evidence. Maybe one. Almost certainly fewer than ten. The technology is too new. Organisations that have adopted it have rarely conducted proper evaluations of different approaches. And when they have, they have not published those results for others to learn from.
This is the evidence gap. Every organisation deploying AI is making a consequential choice between automation and augmentation, between replacing human decision-making and assisting it, with almost no empirical basis for that choice. The instincts are strong. The data is thin.
What 90% of the Room Assumed
Yanagizawa-Drott placed the audience in a scenario. You are the policymaker responsible for district hospitals in Uttar Pradesh. The task is diagnosing patients. AI can either automate the diagnosis entirely or augment the doctor's decision-making with AI-generated recommendations. Automation saves 20% on labour costs. Augmentation saves 5%. Which policy do you recommend?
Ninety per cent chose augmentation.
The reasoning is intuitive and deeply held: keep humans in the loop, especially when the stakes are high. Doctors bring tacit knowledge, contextual understanding, and the ability to process information that AI systems may not have been trained on. A human-AI collaboration should, in theory, deliver the best of both capabilities.
Yanagizawa-Drott then introduced outcome data. Automation leads to 30% worse patient health outcomes. Augmentation leads to 5% worse outcomes. Now the room shifted. Augmentation still held, but the human-only option gained significant ground. Some were willing to accept marginally worse outcomes for 5% cost savings. Others were not. The values and goals of the organisation, as Yanagizawa-Drott noted, determine where that line falls. Researchers can help establish the facts. The decision itself remains a judgement call.
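The trade-off can be made explicit. The sketch below uses the scenario's illustrative numbers and a hypothetical `outcome_weight` (how many points of cost saving the organisation would give up to avoid one point of worse outcomes); the decision rule is my own simplification, not something presented in the session:

```python
# Hypothetical decision rule using the scenario's illustrative numbers.
# Each option maps to (labour cost saving, change in patient outcomes), in %.
options = {
    "human_only":   (0.0,   0.0),   # baseline: no savings, no outcome change
    "augmentation": (5.0,  -5.0),   # 5% savings, 5% worse outcomes
    "automation":   (20.0, -30.0),  # 20% savings, 30% worse outcomes
}

def best_option(outcome_weight):
    """Score each option as: cost saving + outcome_weight * outcome change.

    outcome_weight encodes how many points of cost saving the organisation
    would sacrifice to avoid one point of worse patient outcomes.
    """
    return max(options,
               key=lambda k: options[k][0] + outcome_weight * options[k][1])

print(best_option(0.5))  # outcomes weighted lightly: automation wins
print(best_option(2.0))  # outcomes weighted heavily: human_only wins
```

The same facts support opposite choices depending on the weight, which is exactly the judgement call the session described: researchers supply the numbers, the organisation supplies the weight.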
But here is the critical point: the audience was making that judgement call based on the numbers the presenter gave them. In reality, organisations making this decision have no such numbers. They have assumptions about AI capability, assumptions about human performance, and assumptions about what happens when the two interact. The interaction effects are the most poorly understood of all: how humans actually process AI recommendations, and how their decision-making changes when a machine assists it.
When Augmentation Was the Worst Option
Yanagizawa-Drott then presented evidence from a randomised experiment he conducted with an organisation in Ghana that hires university students to teach in rural areas for two years. The organisation received applications, and experienced human teachers evaluated candidates. The research question was whether GPT-4 could improve this process, either by fully automating the evaluation or by augmenting human screeners with AI-generated recommendations.
The expectation, shared by the researcher himself, was that augmentation would deliver the best results. Experienced teachers with tacit knowledge about what makes an effective rural educator, combined with AI's ability to systematically apply evaluation rubrics, should outperform either approach alone.
The results were the opposite. Full automation increased hiring success rates by 70%. The augmented approach, human teachers receiving AI recommendations, was the worst-performing option. The teachers were slowed down by the AI input without meaningfully changing their decisions. They were, as it turned out, also not particularly good at selecting effective teachers to begin with.
This single study does not settle the question for all contexts. Yanagizawa-Drott was careful to note that the answer depends on the specific task, the quality of human decision-makers, the capability of the AI system, and the interaction dynamics between them. His colleague Sendhil Obermeyer, a leading researcher in AI and healthcare, gave the academic's answer: "It depends." Yanagizawa-Drott's response was pointed: "As an organisation, you can't sit there saying 'it depends.' You have to make a choice."
The implication is clear. Organisations that pre-commit to augmentation because it feels safer, or to automation because it promises cost savings, without generating evidence specific to their context, are making consequential decisions on insufficient information. The technology itself changes rapidly. How humans interact with it changes. The only reliable approach is to generate evidence through structured evaluation before scaling.
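What "generating evidence through structured evaluation" can look like in practice is sketched below: a minimal three-arm randomised comparison. The arm names, helper functions, and any numbers used with them are illustrative assumptions, not data from the Ghana study:

```python
import math
import random

# Illustrative three-arm randomised evaluation sketch (not the actual study).
arms = ["human_only", "augmentation", "automation"]

def assign(units, seed=0):
    """Randomly assign each unit (e.g. an applicant or case) to an arm."""
    rng = random.Random(seed)
    return {u: rng.choice(arms) for u in units}

def success_rate(outcomes, assignment, arm):
    """Share of successes (1/0 outcomes) among units in the given arm."""
    vals = [outcomes[u] for u in outcomes if assignment[u] == arm]
    return sum(vals) / len(vals)

def diff_ci(p1, n1, p2, n2, z=1.96):
    """Approximate 95% CI for the difference in two success rates
    (normal approximation to the two-proportion comparison)."""
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    d = p1 - p2
    return d - z * se, d + z * se
```

For example, `diff_ci(0.7, 100, 0.5, 100)` asks whether a 70%-vs-50% gap between two arms of 100 units each is distinguishable from noise; the interval excluding zero is what turns an instinct into evidence.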
The Third Possibility
The panel discussion that followed, moderated by Murugan Vasudevan of Veddis Foundation, expanded the frame considerably. Shankar Maruwada of EkStep Foundation raised a point that the automation-versus-augmentation binary obscures: AI also creates entirely new possibilities that did not exist before.
His examples were grounded in Indian reality. On 11 February, Amul, the world's largest dairy cooperative serving 36 million farmers, launched an AI advisory system in Gujarati. Women farmers, many of whom began with one or two cows and no prior experience in dairy farming, could now access fifty years of collective Amul knowledge through a voice interface in their own language. The value was not an efficiency improvement. It was empowerment. A daughter-in-law managing the family's dairy income, whose livelihood depends on her cow's health, now has access to institutional knowledge that previously required either years of experience or proximity to someone who had it.
The Blue Dot concept that Maruwada described through EkStep's work captures this structurally. In a conventional platform, you search for services. In the Blue Dot model, services find you. A job matching pilot in Ghaziabad discovered 4,000 jobs in a district where existing platforms showed only 10. Scholarships for girls with disabilities were delivered in 20 minutes instead of six months because the system found eligible recipients rather than waiting for applications. ITI graduates who had Lorem Ipsum in their CVs because they did not understand what it was could now create proper CVs through a conversational agent in their native language.
These are neither automation nor augmentation cases. They are new capabilities that did not exist before the technology made them possible. For a country with 500 million workers in the informal economy, where productivity gains are additive rather than displacing, this third category may be more consequential than either side of the automation-augmentation debate.
Where the Gains Concentrate, and Where They Don't
Elizabeth Kelly of Anthropic shared data from the company's economic index that puts the Indian context in perspective. India ranks second globally in total AI usage but 101st out of 116 countries on a per-capita basis. The high total usage is explained by population size, with adoption concentrated in the tech sector across Maharashtra, Tamil Nadu, Karnataka, and Delhi. The pattern mirrors early mobile internet adoption: urban, educated, tech-first.
Within that concentrated usage, the results are striking. Computer and mathematical tasks account for 45% of all Indian AI usage, the highest proportion of any market globally. Tasks that take four hours without AI take 15 minutes with it, roughly a 15x speed-up versus 10x globally. India's tech workforce is extracting more productivity from AI than any other market.
The question is what happens beyond that workforce. Dr Becky Seif of FCDO raised the inclusion dimension directly: Harvard data shows women are already 20% less likely than men to engage with generative AI at work. This creates a compounding cycle where women are absent from the tools, the tools do not reflect their needs, and the gap widens. FCDO's approach centres inclusion from the design stage, including digital safety for women and girls as a precondition for AI adoption, disability inclusion in AI product design, and gender mainstreaming in evidence programmes.
Maruwada's framing was blunt: all of this has to be designed with inclusion, equity, and India's diversity in mind. If it is not, we are in trouble. The social media precedent, in which the speed of adoption outpaced governance and the costs became apparent only after they were entrenched, is the scenario the AI ecosystem must avoid.
Measuring What Matters, Before Scaling What Doesn't
The session converged on a point that connects directly to how Mitochondria approaches every engagement: the critical importance of evaluation before scale.
Seif described FCDO's four-stage evaluation framework: assess the model, assess the product, assess the user experience, and assess the development impact. She also raised the question of a stage zero: should you be using AI at all for this particular task? The evaluation investment, she argued, matters for more than accountability: in a space dominated by hype and fear in equal measure, rigorous evidence is the foundation of trustworthy AI.
Maruwada offered the infrastructure analogy: you don't build a road assuming the best car has already been invented. You build the road because it is the right thing to do, and others innovate on top. But infrastructure without evaluation creates roads that go nowhere. The approach he described for India's digital public infrastructure (create a basic idea, set an inclusion target, deploy, and keep improving based on what you learn) is an iterative model that depends on measurement at every stage.
Kelly emphasised the distinction between using AI for efficiency and using it for growth. Efficiency gains are a one-for-one trade-off: tasks get faster, costs reduce, and the question of who captures the gains becomes urgent. Growth means expanding what is possible: a three-person legal tech nonprofit operating at the pace of six engineers, teachers freed from administrative load to focus on students, small businesses offering new products and services. When AI is deployed for growth, the automation fear diminishes because the total opportunity is expanding.
This distinction, efficiency versus growth, maps onto how Mitochondria designs AI deployments. Our ATP framework does not pre-commit to automation or augmentation. The Stimuli phase maps the actual operational reality and identifies where AI can create value: sometimes through efficiency, sometimes through capability expansion, often through both simultaneously. The system begins with a limited scope, generating evidence of performance before expanding. Each phase produces data: what the system handles well, where human judgement remains essential, and where new possibilities emerge that neither the organisation nor we anticipated at the design stage.
In a recent engagement with a financial services company in India, the deployment plan followed this logic exactly. The system was designed to begin in a pre-login advisory capacity, handling fund information, FAQs, and educational content for retail investors: no personalised recommendations, no account access, no autonomous decisions. Every interaction was to be captured and measured: query types, resolution accuracy, escalation patterns, and user satisfaction signals. The evidence from one phase determined what the next phase would look like, whether to expand scope, deepen capability, or adjust the interaction design.
The phased approach was not a hedge or a compromise, but a deliberate evidence-generation strategy. The financial services regulatory environment, governed by SEBI requirements, demanded it. But the principle applies universally: organisations that deploy AI without measuring what it does and how people interact with it are building on assumptions. Assumptions compound. Evidence compounds too, but in the right direction.
The progression from advisory to prescriptive to autonomous capability within ATP mirrors the evaluation framework the session called for. Each stage produces evidence. Each transition is conditional on what that evidence shows. Human-in-the-loop oversight is maintained not as a philosophical commitment to augmentation but as a practical mechanism for generating the data that informs when and how autonomy should expand.
A Race to the Top
Maruwada closed with a framing that captures the stakes: India has a choice between a race to the top and a race to the bottom. The social media trajectory, in which adoption outpaced governance and the consequences were borne disproportionately by the most vulnerable, is the cautionary tale. The digital public infrastructure trajectory, in which basic capabilities were built with inclusion targets and iterative improvement, is the aspirational model.
AI deployment in organisations faces the same fork. Deploy fast, assume augmentation works, skip the evidence stage, and discover the consequences at scale. Or deploy deliberately, measure rigorously, let the evidence determine whether automation, augmentation, or entirely new possibilities are the right answer for each specific context, and build trust through demonstrated performance rather than asserted capability.
Ninety per cent of the room chose augmentation on instinct. The evidence from Ghana suggested they were wrong. The evidence from the next deployment might suggest they were right. The point is that without generating that evidence systematically and rigorously, neither the instinct nor the counterexample provides a reliable basis for decisions that affect patients, farmers, students, workers, and the organisations that serve them.
The evidence gap is real. Closing it is the work.
—
Mitochondria builds ATP — agentic AI for operations. It learns your workflows, earns autonomy in stages, and runs with governance built in. Your data stays yours. Based in Amsterdam and Pune, working with organisations across Europe and India.