Models, models everywhere

Models. Big huge scary models. Small, purpose-built models. Slow models. Fast models. Expensive models. Local models.

Models. Models. Models.

Which model should you pick for your question, task, or goal? How much of your usage budget will you use when you do? Could it have been done by a lesser model for less usage?

As the frothy sea of AI continues to ebb and flow, this seems to be one of the harder questions to answer for those of us using AI agents to help with our everyday workload.

The answer to this question doesn’t yet seem clear.

Orchestration, which asks a single agent/model pair to delegate work to lesser agent/model pairs, is one leading answer. And it works in some contexts. For example, “Do this big task” can be planned with a larger, more capable model and then delegated to one or more sub-agents running cheaper models to do laser-targeted work. In practice, this works pretty well, and the tooling is beginning to mature to the point that it happens invisibly.
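
To make that pattern concrete, here is a rough sketch in Python. It is not how any particular tool does it: `plan_with_large_model` and `run_with_small_model` are hypothetical stand-ins for real model calls, and the "planning" here is just a string split so the example runs on its own.

```python
from concurrent.futures import ThreadPoolExecutor


def plan_with_large_model(task: str) -> list[str]:
    """Hypothetical stand-in for a call to a large, capable planning model.

    A real planner would return model-generated subtasks. This stub just
    splits the request on " and " so the sketch is runnable.
    """
    return [part.strip() for part in task.split(" and ") if part.strip()]


def run_with_small_model(subtask: str) -> str:
    """Hypothetical stand-in for a cheap sub-agent doing one targeted job."""
    return f"done: {subtask}"


def orchestrate(task: str, max_workers: int = 4) -> list[str]:
    """Plan once with the big model, then fan the pieces out to sub-agents."""
    subtasks = plan_with_large_model(task)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in the same order as the subtasks.
        return list(pool.map(run_with_small_model, subtasks))


if __name__ == "__main__":
    for result in orchestrate("update the README and fix the login test"):
        print(result)
```

The point of the shape is that the expensive model is called once, for the plan, while the cheaper workers run in parallel on the narrow pieces.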

But, for the most part, the leading orchestrators use their own cloud-based models (and usually only a very small subset of them). Claude, for example, might spin up 10 sub-agents on Sonnet to do the work, but you’re still limited to Anthropic’s cloud-based models.

Something I’ve played with a fair deal is using prompt heuristics to determine which model best suits the current task. I created a catch-all “model” that delegated to other local or cloud-based models based on the request, whether or not it included private data, etc. It works fairly well, but there are so many model variants that keeping up with whatever is the latest (that can run on my hardware) is nearly a full-time job… and getting an agent to delegate tasks rather than just prompting in a single thread isn’t easy. It isn’t something I consider cracked yet.
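
For a sense of what prompt heuristics like these can look like, here is a minimal sketch. It is an illustration only, not my actual setup: the model names, the regular expressions, the keyword list, and the length cutoff are all assumptions made up for the example.

```python
import re
from dataclasses import dataclass

# Hypothetical model identifiers; swap in whatever you actually run.
LOCAL_SMALL = "local-small"
LOCAL_LARGE = "local-large"
CLOUD_FRONTIER = "cloud-frontier"

# Crude, illustrative signals that a prompt may contain private data.
PRIVATE_PATTERNS = [
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),                    # SSN-like number
    re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),                 # email address
    re.compile(r"\b(password|api[_ ]?key|secret)\b", re.I),  # credential words
]

# Illustrative hint that the request needs planning rather than a quick edit.
PLANNING_PATTERN = re.compile(r"\b(plan|architect|design|refactor|migrate)\b", re.I)

LONG_PROMPT_CHARS = 2000  # arbitrary cutoff, chosen for the example


@dataclass(frozen=True)
class Route:
    model: str
    reason: str


def contains_private_data(prompt: str) -> bool:
    return any(pattern.search(prompt) for pattern in PRIVATE_PATTERNS)


def pick_model(prompt: str) -> Route:
    """Choose a model for a prompt using simple, ordered heuristics."""
    # Privacy is checked first so private prompts never reach the cloud.
    if contains_private_data(prompt):
        return Route(LOCAL_LARGE, "private data stays on device")
    if PLANNING_PATTERN.search(prompt) or len(prompt) > LONG_PROMPT_CHARS:
        return Route(CLOUD_FRONTIER, "planning or long-context request")
    return Route(LOCAL_SMALL, "small, targeted task")


if __name__ == "__main__":
    for text in ("Rename this variable",
                 "Plan a migration of our billing service",
                 "My password is hunter2, why does login fail?"):
        print(pick_model(text))
```

The order of the checks is the design choice that matters here: the privacy rule wins over everything else. The model constants are also exactly the part that goes stale as new variants appear.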

But I think we’ll see the solution to this question appear within the next year. And I am sort of hoping it will be in an unexpected way.

Just look at the options we have in front of us.

Take, for example, something incredible like chat jimmy. chat jimmy runs an older, less capable Llama model but runs inference nearly instantly. Seriously, give it a try. It is so fast it almost doesn’t matter that it isn’t nearly as good as more capable models. Think of it this way: if chat jimmy needed to run 10 times to produce valuable results, it would still be 100x faster than today’s leading models.

Now, imagine if Apple, AMD, or Intel were able to take a somewhat more capable model and grind it down into the bits necessary to put directly on the silicon they ship with their products. And imagine if all agents could use these super-fast on-device models for some tasks and rely on the larger cloud-based models only for planning, orchestration, and more important things like security and testing.

I think we’re on the cusp of something like this being available. I’m already using something like it locally: a main orchestrating agent and super-fast sub-agents that do only very targeted work. But I don’t think it is the solution just yet.

As the frontier models grow, the local models are shrinking. As the frontier models chase general intelligence, the local models are focusing on very specific sets of tasks.

Overall, I think the gap between frontier cloud-based models and local models will only widen and, eventually, for most of our everyday work, we’ll be able to use a suite of smaller models orchestrated by agents. The frontier models will try to solve protein folding while our local models write website code.

It could be argued that Apple believes this already. What they call Foundation models are really just a set of models purpose-built for specific tasks: image recognition, speech-to-text transcription, emoji generation. Everything else they leave up to the frontier models.

We’ll see how it shakes out. But I think we’ll be seeing a lot more models and model variants until it is figured out.
