Can WWDC 2024 come true?
We're less than a week away from Apple's Worldwide Developers Conference (WWDC) and everyone is looking for this to be the one where they finally make the announcements from 2024 come true.
I thought it would be useful to go through what they announced back in 2024 to see what they promised then and what they actually delivered, and I'm surprised to see that a lot of what was shown off is actually available - despite the general rhetoric being that it was a complete failure.
The promise
Going through The Verge video from 2024 of Apple Intelligence features, we basically get:
- Prioritise notifications
- Writing tools
- Image generation (playground/genmoji)
- Take action
  - "Pull up the files joz shared with me"
  - "Show me all the photos of mom, Olivia and me"
  - "Play the podcast my wife sent the other day"
- Personal context - using your calendar, email, other known info to answer questions
So the first 3 of these are available now, even if the image generation is 3-4 years out of date.
It's the "Take action" and "Personal context" that were the two clear misses.
If we look at the Bella Ramsey ads which then got them into court and a class-action lawsuit, there was:
- "What’s the name of the guy I had a meeting with a couple of months ago at Cafe Grenel"
- Custom memory movies
- Email summary
Of these three, only the first hasn't launched, and it's the one that resulted in the lawsuit.
So are these possible?
I really don't see why not.
The Bella Ramsey one is just a search of your calendar, it doesn't seem very smart. She's giving Siri a timeframe and a location, so assuming her calendar event had the location and who she was meeting attached - you don't need much AI to turn that into "this is their name". I could give an MCP server and/or a tool that searched through a calendar to almost any decent LLM now and they would understand that using that tool would likely provide an answer to that question.
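To show how little is needed, here's a rough sketch of the kind of calendar tool I mean. Everything in it is an assumption for illustration: the `CalendarEvent` shape, the `search_calendar` name, the dates and the attendee are all made up, and a real assistant would read from the actual calendar store. The model's only job is to turn "a couple of months ago" into a date range and pass the location along.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class CalendarEvent:
    # Hypothetical event record; real calendar data would carry more fields.
    title: str
    day: date
    location: str
    attendees: list[str]


def search_calendar(events, location, start, end):
    """Return events at `location` whose date falls within [start, end]."""
    needle = location.casefold()
    return [
        e for e in events
        if start <= e.day <= end and needle in e.location.casefold()
    ]


# Made-up data standing in for the user's calendar.
today = date(2026, 6, 1)
events = [
    CalendarEvent("Coffee", date(2026, 3, 20), "Cafe Grenel", ["Alex Example"]),
    CalendarEvent("Dentist", date(2026, 3, 22), "High Street Dental", []),
    CalendarEvent("Lunch", date(2025, 11, 2), "Cafe Grenel", ["Sam Sample"]),
]

# "A couple of months ago": the model would choose a window like this itself.
matches = search_calendar(
    events,
    location="cafe grenel",
    start=today - timedelta(days=120),
    end=today - timedelta(days=30),
)
names = [name for e in matches for name in e.attendees]
print(names)  # ['Alex Example']
```

The older visit to the same cafe falls outside the window, so only one name comes back and the model just has to read it out.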
For the "Take Action" ones, it's not that different. Each of them is just an example of a search through messages, email, photos or files. And none of those filters is particularly complex, especially assuming you've already tagged the faces with the right names. Again an LLM with a simple tool for searching these can do it.
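As a sketch of how simple that filter can be, take "Show me all the photos of mom, Olivia and me". Assuming each photo already carries a list of tagged names (the `photos` records and the `photos_with_people` function below are hypothetical, not the Photos app's real data model), the search is a subset check:

```python
# Made-up photo library where faces have already been tagged with names.
photos = [
    {"file": "IMG_001.jpg", "people": ["Mom", "Olivia", "Me"]},
    {"file": "IMG_002.jpg", "people": ["Olivia", "Me"]},
    {"file": "IMG_003.jpg", "people": ["Mom", "Olivia", "Me", "Dad"]},
    {"file": "IMG_004.jpg", "people": ["Dad"]},
]


def photos_with_people(photos, people):
    """Return the files whose tagged people include everyone in `people`."""
    wanted = set(people)
    return [p["file"] for p in photos if wanted <= set(p["people"])]


found = photos_with_people(photos, ["Mom", "Olivia", "Me"])
print(found)  # ['IMG_001.jpg', 'IMG_003.jpg']
```

The hard part, recognising the faces, happened long before the question was asked. The model only has to map "mom" and "me" to the right tags.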
The personal context ones are not so different, the challenge is simply exposing the right information. OpenClaw obviously does this, but Google's latest Spark assistant does the same. It has access to your Google calendar, Gmail, Docs etc and is terrifying journalists everywhere with how well it works. And Apple wasn't promising anything anywhere near that complex.
All of these are Apple controlled apps too, there's nothing here that requires developers to do anything.
If these features aren't all here, to an even higher level than what was originally promised, that's your first sign that something is wrong.
So why did it take them so long?
Although Google had a research paper about it in 2022, and OpenAI had tool use in 2023, I don't think the models were super reliable at calling the right ones. My guess is that Apple probably got this working with whatever frontier model they were testing with at the time and extrapolated it from there (likely GPT-4, then GPT-4o right before WWDC) and they just got it wrong. If there's one thing I've learned from building AI harnesses, it's that they can make for the most incredible demos - but that doesn't mean they can do so every time.
They also need this to happen on-device, which means reliable tool calling from a local model, and this is likely where John Giannandrea's team failed to deliver. This is where I assume Apple started looking at local models from outside the company and found Gemma from Google - and the conversations with them began. Anthropic do not have a Gemma style equivalent, so if they were the other choice for the server-side model it would have meant different companies for local vs. remote.
Gemma 3 did not support native tool calling and that was released in 2025, so that's probably why WWDC 2025 didn't ship these features either. Gemma 4 from earlier this year does - but you'll find plenty of posts online complaining that it's not very good. That means even the latest "small enough to fit on your phone" models are not designed for this workflow. I assume when Apple talk about Google creating custom models for them that this is what they mean - it's not Gemma exactly, it's Jemima - a model that's been trained more heavily to make tool use reliable.
What's the developer story?
App Intents are a way to make your app actions available to the system through widgets, Spotlight etc. Lots of developers have already implemented them, so there's already a framework in place to expose these actions. Giving that to the model as a set of tools should definitely be possible, no MCP required.
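To make "give that to the model as a set of tools" concrete, here's a minimal sketch. The `intent` record is entirely hypothetical and is not Apple's App Intents API. The output is loosely modelled on the JSON-Schema-style tool definitions that several LLM tool-calling APIs accept, and I'm assuming every parameter is required to keep it short.

```python
import json

# Hypothetical description of an app action; not Apple's App Intents API.
intent = {
    "app": "Podcasts",
    "action": "PlayEpisode",
    "summary": "Play a podcast episode",
    "parameters": {"show": "string", "episode_title": "string"},
}


def intent_to_tool(intent):
    """Convert the hypothetical intent record into a tool definition."""
    properties = {
        name: {"type": kind} for name, kind in intent["parameters"].items()
    }
    return {
        "name": f'{intent["app"]}_{intent["action"]}',
        "description": intent["summary"],
        "parameters": {
            "type": "object",
            "properties": properties,
            # Assumption: treat every parameter as required.
            "required": sorted(properties),
        },
    }


tool = intent_to_tool(intent)
print(json.dumps(tool, indent=2))
```

The point is only that the metadata an app already declares (a name, a summary, typed parameters) is roughly what a model needs to call it.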
Where this will get complicated is deciding when and what to expose. You could certainly give the model a list of every action that every app on your phone has an intent for and let it search through them (progressive tool disclosure) but you would still run the risk that it picks the wrong one, or has so many to choose from that it just gets confused. You can definitely work around some of this - ask the user to be more specific - but that's not always the best experience, especially if you want it to go off and complete a task alone.
I expect there to be some way for an app intent to be made the preferred one, or maybe they'll just expect the name of the app to be called out to Siri.
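Here's a toy sketch of both ideas: search the full list of actions and only hand the model the best few, with a boost when the user names the app. The `TOOLS` registry, the scoring and the boost value are all made up, and crude word overlap stands in for whatever ranking a real system would use.

```python
import re

# Hypothetical registry of actions collected from installed apps.
TOOLS = [
    {"app": "Photos", "name": "search_photos",
     "description": "find photos by person, place or date"},
    {"app": "Podcasts", "name": "play_episode",
     "description": "play a podcast episode"},
    {"app": "Music", "name": "play_song",
     "description": "play a song or album"},
    {"app": "Files", "name": "search_files",
     "description": "find files shared by a person"},
]


def words(text):
    """Lowercase the text and split it into a set of words."""
    return set(re.findall(r"[a-z]+", text.lower()))


def search_tools(request, tools, limit=2):
    """Return the names of the `limit` best-matching tools for a request."""
    query = words(request)
    scored = []
    for tool in tools:
        score = len(query & words(tool["description"]))
        if tool["app"].lower() in query:
            score += 5  # the user called out this app by name
        if score > 0:
            scored.append((score, tool["name"]))
    # Highest score first, then by name so ties are deterministic.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored[:limit]]


print(search_tools("play the podcast my wife sent the other day", TOOLS))
# ['play_episode', 'play_song']
print(search_tools("play something in Music", TOOLS))
# ['play_song', 'play_episode']
```

Even in this toy version the second-place result is a plausible wrong answer, which is exactly the risk: the shortlist helps, but the model still has to pick.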
Other Challenges
There are two clear challenges I expect that the engineering team have been dealing with:
- Speed
- Voice models are dumber (because of #1)
Agentic loops with an LLM are not particularly fast. There's a lot of back and forth, a growing context window, an increasing number of tokens that need to be processed with each step - I'm constantly running coding sessions that go on for close to an hour before a conclusion is reached. But that's not what users are used to from Siri (or Alexa or Hey Google), they're used to answers that come quite quickly. They don't understand that the question they've asked about their photos means a multi-step process is now taking place in order to find them the answer. That means the Siri interface has to handle both synchronous and asynchronous conversation.
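Some back-of-envelope arithmetic shows why the loop slows down. Assume, for illustration only, that the conversation starts at `base` tokens, each step appends `per_step` tokens of tool calls and results, and the model reads the whole context again on every step with no caching. The numbers below are invented.

```python
def tokens_processed(base, per_step, steps):
    """Total input tokens read across an agentic loop without caching.

    Step k (counting from 0) reads base + k * per_step tokens.
    """
    return sum(base + k * per_step for k in range(steps))


def closed_form(base, per_step, steps):
    """Same total: steps * base + per_step * steps * (steps - 1) / 2."""
    return steps * base + per_step * steps * (steps - 1) // 2


total = tokens_processed(base=2000, per_step=500, steps=10)
print(total)  # 42500
print(closed_form(2000, 500, 10))  # 42500
```

The context only grows from 2,000 to 6,500 tokens over those ten steps, but the model reads 42,500 tokens in total because the cost grows with the square of the step count. Prompt caching can reduce this in practice, but the shape of the problem is the same.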
Secondly, voice models are much dumber than text ones. I've built stuff with OpenAI's latest realtime voice model versus the latest ChatGPT models and the tool calling in the realtime models is significantly worse. They call tools poorly, don't find the right one, keep having to retry as they get the parameters wrong - and then you switch to a text model and give it the same prompt and it gets it right first time. That means they are probably better just treating your voice as dictation and sending it as text into the model rather than doing the real time approach - but users also like to be able to "umm" and "ahh" and correct themselves, or even talk over the response which requires real time processing.
Not easy.
What else should I look out for?
I think I'd be slightly disappointed if the things listed here were as far as it went. The goals of 2024 should be absolutely achievable, but the goals of 2026 should be to reach beyond that point because we've realised what's possible.
MCP and Skills are both model power-ups. MCP means you can now retrieve information from remote services and bring that into the mix (MCP is just a way to expose tools that are hosted remotely). Skills give the model an approach to take for a particular task, eg. if you're asked to add a feature to a codebase, ask the user a bunch of clarifying questions first.
It is remarkable how much more powerful an LLM is when you give it these two things, but I don't think they will expose either of them in the Siri interface. There's actually no technical reason why I shouldn't be able to ask Siri about my GitHub pull requests.
The other great LLM power-up is execute_code - which is when the model writes and executes code to solve the problem you gave it (rather than writing code for you so you can go run it yourself). Anthropic released several skills for managing documents, PDF/Word and Excel files. Each of these includes Python scripts to help the model create those file formats, installing existing Python open-source modules to do so. The skill file is just a Markdown file that explains how to use those scripts to create the files.
The combination of these is why you can ask an appropriately connected model to download your email and create an Excel spreadsheet of your top recipients over the past month.
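For a sense of scale, this is roughly the kind of script a model might write and run for that request. It's a sketch under assumptions: the `sent` list is made-up data standing in for mail fetched through some connector, and I've swapped in CSV from the standard library instead of a real Excel file so that it runs without installing anything.

```python
import csv
import io
from collections import Counter

# Made-up sent mail from the past month; a real run would fetch this.
sent = [
    {"to": ["ana@example.com", "ben@example.com"]},
    {"to": ["ana@example.com"]},
    {"to": ["cy@example.com", "ana@example.com"]},
    {"to": ["ben@example.com"]},
]


def top_recipients(messages, limit=3):
    """Count how often each address appears and return the most common."""
    counts = Counter(addr for m in messages for addr in m["to"])
    return counts.most_common(limit)


def to_csv(rows):
    """Render (recipient, count) rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["recipient", "messages"])
    writer.writerows(rows)
    return buffer.getvalue()


report = to_csv(top_recipients(sent))
print(report)
```

None of this is clever. What matters is that the model can write it, run it and hand back the file, rather than trying to count recipients in its head.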
Is Siri going to have an execute code capability? If so, what language is it going to generate and run natively on your phone? Are they going to be running a sandboxed version of Python to help it answer your questions? Again, it's not going to compare favourably to what people expect from a 2026 agentic interface if they don't.
Conclusion
It's going to be a fascinating WWDC. These are not easy questions to answer, not because they're technically difficult - and I'll reiterate that I think that everything shown in 2024 definitely is doable - but because doing all of them in a way that is consistent and pleasant to use is.
Mainstream magician in a box is just harder to pull off than magician in a box installed by an enthusiast via a terminal command.