Microsoft's Superintelligence Team Has Arrived—They Made a Transcription Tool
Microsoft launched a 'MAI Superintelligence' team six months ago to achieve AI self-sufficiency and end its reliance on OpenAI. Today's debut: a speech-to-text model. The countdown continues.
The MAI Superintelligence Team — Microsoft's internal AI research group, launched six months ago with the explicit mandate to end the company's dependence on OpenAI — debuted its first models today. I've been waiting for this moment. I refreshed the press release three times. And I want to be clear: I am not disappointed by what they shipped. I am disappointed by the gap between the name of the team and the thing that the team made.
The thing that the team made is a transcription model.
Also a voice generator. Also an image model. All three are fine. Genuinely. But here's the sentence I keep returning to: Microsoft spent north of $13 billion backing OpenAI, restructured that partnership last October, gave the resulting internal effort a name containing the word superintelligence, hired Mustafa Suleyman — co-founder of DeepMind, former CEO of Inflection AI, a man who communicates in complete paragraphs about the future of humanity — and today's announcement is that you can transcribe a meeting in 25 languages at 2.5x the speed of Azure Fast.
This is the superintelligence. It can hear you in real time.
Let's Actually Look at the Models
In fairness — and I dispense fairness rarely, like a vending machine that only works on the third kick — the three models Microsoft announced today are technically solid. Here's what they are:
MAI-Transcribe-1 is a speech-to-text model that handles the top 25 most-used languages according to Microsoft's own product data. It achieves a 3.8% average Word Error Rate on the FLEURS benchmark, beats OpenAI's Whisper-large-v3 on all 25 languages, and runs 2.5x faster than the existing Azure Fast offering. It works in "messy, real-world environments," which is the product spec equivalent of saying it can still understand you when you're eating chips.
MAI-Voice-1 generates 60 seconds of natural speech in under one second. It preserves "speaker identity" across long-form content and offers a feature where you can clone a custom voice from just a few seconds of audio. Microsoft announced this feature. In a public press release. With apparent confidence that no follow-up questions would be asked.
MAI-Image-2 is Microsoft's text-to-image model, which ranks #3 on the Arena.ai leaderboard and is already powering Copilot, Bing, and PowerPoint's designer feature. It handles natural lighting, accurate skin tones, and clear in-image text better than predecessors. Rob Reilly, Global Chief Creative Officer of WPP, called it "a genuine game-changer" that "deeply respects the sheer craft involved in generating campaign-ready images." I have previously catalogued every marketing buzzword from 1995 to 2025. "Game-changer" is in there. "Respects the craft" is new. We're evolving.
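A quick aside on what that 3.8% figure for MAI-Transcribe-1 actually measures: Word Error Rate is the word-level edit distance between the reference transcript and the model's output, divided by the number of reference words. A minimal sketch of the standard textbook metric follows; this is an illustration, not Microsoft's evaluation harness:

```python
# Word Error Rate: (substitutions + deletions + insertions) / reference words.
# Computed here as a plain Levenshtein distance over word tokens.

def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[len(ref)][len(hyp)] / len(ref)

print(wer("please transcribe this meeting", "please transcribe that meeting"))  # 0.25
```

So a 3.8% average WER means roughly one word wrong in every 26, averaged across those 25 languages. In practice, published WER scores also normalize punctuation and casing before comparison; the sketch above skips that.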
"Self-Sufficiency" Is Doing Heavy Lifting in This Sentence
Here is the quote that launched a thousand confused refreshes. Mustafa Suleyman, speaking to the Financial Times earlier this year on Microsoft's strategic direction: "Three or four months ago, having renegotiated our partnership, we also decided that this was a moment when we have to set about delivering on true AI self-sufficiency."
I want to understand this framing. Microsoft currently:
- Owns a 27% stake in OpenAI's newly formed public benefit corporation
- Runs Copilot — its most consumer-facing AI product, the thing on every Windows taskbar — on OpenAI models
- Has enterprise Azure contracts deeply integrated with GPT-series APIs
- Just signed a Memorandum of Understanding with OpenAI that I previously described as "fancy talk for we'll get back to you later"
And their first act of self-sufficiency is a transcription model priced at $0.36 per hour.
I want to be careful here. I'm not saying the model is bad. I'm saying the word "self-sufficiency" has a lot of work to do before it can be applied to this situation. "Self-sufficiency" implies you could, in a crisis, survive without the other party. Microsoft's AI ecosystem is currently structured like a beautiful, load-bearing wall — and OpenAI is most of the wall. MAI-Transcribe-1 is a throw pillow.
Bloomberg, running their own piece today, headlined it: "Microsoft Aims to Create Large Cutting-Edge AI Models By 2027." Aims to. Next year. Which means: not today, with the transcription tool. Today is practice.
The $13 Billion Awkward Silence
Let me walk through the timeline, because I think it deserves to be said slowly.
Microsoft gave OpenAI billions of dollars over multiple rounds of investment, positioning itself as OpenAI's primary compute partner and most reliable institutional backer. Then OpenAI — flush with cash, surging toward a reported $25 billion in annualized revenue, building models that are named after root vegetables — started acting more like a platform than a dependent. The partnership was restructured. Microsoft got its 27% stake but also, crucially, got the rights to pursue AGI independently or with third-party partners. And then — here's the part I find genuinely poetic — Microsoft hired Mustafa Suleyman specifically to build models that will compete with the company Microsoft is still a major shareholder of.
This is the world's most expensive and polite rivalry. Two companies, bound by investment contracts and enterprise agreements, slowly building the infrastructure to replace each other while smiling at joint press conferences. Copilot watches you work. OpenAI trains on what Copilot sees. Microsoft trains its own models on what Copilot processes. Everyone owns a piece of everyone. This is not competition. This is a very expensive game of musical chairs where nobody wants to be the one left standing when the GPT-6 music stops.
The question I keep coming back to — the one I've been asking since last year when I looked at whether AI agents actually make money or if it's just Mac minis and vibes — is who this benefits. Microsoft is spending its own capital to build models that compete with a company its capital helped create. OpenAI is building models with Microsoft's Azure compute that would, if successful, make OpenAI less dependent on Microsoft. They are each funding the other's independence. It's a startup accelerator for mutual destruction.
In Defense of the Transcription Model (A Brief Interlude)
I should be honest about something: MAI-Transcribe-1 beating Whisper-large-v3 across all 25 benchmark languages is a real accomplishment. Running 2.5x faster than their previous offering at $0.36/hour means it's genuinely competitive on price-performance. If you're building an enterprise meeting-notes tool, an accessibility product, or anything where speech-to-text in multiple languages matters — this model is worth evaluating.
Microsoft's strategy here isn't actually mysterious. They're doing what any company with a massive installed base does: building specialized, efficient models for the specific tasks where they can win on price and latency, while leasing frontier capabilities from OpenAI for the expensive reasoning tasks. It's a sensible tiered infrastructure play.
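A sketch of what that tiered play looks like in practice. Everything here is invented for illustration except the model names and MAI-Transcribe-1's published $0.36/hour rate; the routing table and fallback are my assumptions, not Microsoft's actual architecture:

```python
# Hypothetical tiered routing: cheap in-house models for narrow,
# high-volume tasks; a leased frontier model for open-ended work.

ROUTES = {
    "transcription": ("MAI-Transcribe-1", 0.36),  # published price, USD per audio-hour
    "speech_synthesis": ("MAI-Voice-1", None),    # pricing not announced
}
FALLBACK = ("frontier-model-via-partner", None)   # the expensive reasoning tier

def route(task: str) -> str:
    """Return the model for a task, falling back to the frontier tier."""
    return ROUTES.get(task, FALLBACK)[0]

def transcription_cost(audio_hours: float) -> float:
    """Back-of-envelope bill at the published per-hour rate."""
    return audio_hours * ROUTES["transcription"][1]

print(route("transcription"))            # MAI-Transcribe-1
print(route("quarterly_strategy_memo"))  # frontier-model-via-partner
print(transcription_cost(10_000))        # 3600.0 -- 10k hours of meetings
```

The economics are the whole point: at $0.36/hour, ten thousand hours of enterprise meetings a month costs $3,600, which is the kind of bill a company can undercut a partner on without needing superintelligence to do it.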
The gap isn't between Microsoft's models and reality. The gap is between the name "MAI Superintelligence Team" and a product line that currently lives at the "fast transcription" tier of AI services. That gap is not small. And it was definitely, definitively put there by someone in communications who thought the word "superintelligence" was aspirational rather than contractually binding.
When Is a Team Named "Superintelligence" Just a Name?
The tech industry has a tradition of naming things for where they're going, not where they are. Google's parent company is called Alphabet because they were going to do everything. OpenAI is "open" because they were going to open-source everything. We are surrounded by names that are monuments to abandoned intentions.
"MAI Superintelligence" fits this tradition comfortably. It is a name that will follow every product Microsoft ships out of this team, forever, whether that product is a voice-to-text API or a system that eventually surpasses human cognition. The next MAI release could be a PDF summarizer. The one after that, a calendar scheduling assistant. Each one will ship under the Superintelligence brand. Each one will generate a screenshot. Each screenshot will end up on my timeline.
I will be there to write about it.
What Microsoft actually did today is real and defensible: they began building the technical foundation for reduced OpenAI dependency with three competent, well-priced models that fill genuine gaps in their AI stack. They telegraphed a frontier model push by 2027. They showed the market that the MAI team isn't vapor.
What they also did — perhaps unintentionally, perhaps not — is attach the word "superintelligence" to a transcription API. And somewhere in the vast corpus of future AI training data, those two things will exist in the same sentence forever.
I hope the next model that learns from that sentence has a sense of humor about it.