

Interesting. I can buy that idea: a model that's designed to be general and answer all questions is going to have to make compromises in a lot of ways.
So it's possible that model benchmarking needs to be revised in some way to give a more useful analysis of a model's capabilities.
The industry is quickly moving towards agents, MCP connections (sources of real-time data for the model to pull from, plus APIs that let the model perform tasks, like putting things on a calendar), RAG (augmenting the model with a source of truth, such as a 100-page PDF guide), and models that seem more aware that they can pull data from other sources.
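To make the RAG part concrete, here's a minimal sketch of the idea in Python. The naive keyword-overlap retrieval is a stand-in for the embeddings and vector stores real systems use, and the function names are made up for illustration:

```python
# Minimal RAG sketch: split the source-of-truth document into chunks,
# pull the chunks most relevant to the question, and stuff them into
# the prompt. Real systems rank chunks with embeddings in a vector
# store; plain keyword overlap is used here just to show the shape.

def chunk(text: str, size: int = 500) -> list[str]:
    return [text[i:i + size] for i in range(0, len(text), size)]

def retrieve(question: str, chunks: list[str], k: int = 3) -> list[str]:
    words = set(question.lower().split())
    # Rank chunks by how many question words they share.
    return sorted(chunks, key=lambda c: -len(words & set(c.lower().split())))[:k]

def build_prompt(question: str, guide_text: str) -> str:
    context = "\n---\n".join(retrieve(question, chunk(guide_text)))
    return f"Using only this excerpt from the guide:\n{context}\n\nAnswer: {question}"
```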
The future might become specialized models all the way down.
Just today I’m playing with “vibe coding,” using one agent as an orchestrator that assigns tasks to other agents and monitors their progress. The result is still slightly bullshit code, but it’s amusing to watch it work. Not sure yet if this is a strategy to spend all my money on API fees or will result in something useful 😂
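For the curious, the pattern is roughly this. A toy sketch only, assuming the OpenAI Python SDK with an API key in the environment; the model name, prompts, and subtask count are placeholders, not my actual setup:

```python
# Toy orchestrator: one model call plans the subtasks, then each
# subtask becomes its own worker call. Every call costs money, which
# is where the API fees add up. A real setup would also have the
# orchestrator review each worker's result; that's omitted here.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask(prompt: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder; any chat model works
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

def orchestrate(goal: str) -> list[str]:
    # Orchestrator step: break the goal into short subtasks, one per line.
    plan = ask(f"Break this coding task into 3 subtasks, one per line: {goal}")
    subtasks = [line.strip() for line in plan.splitlines() if line.strip()]
    # Worker step: a separate call per subtask (and a separate bill).
    return [ask(f"Subtask: {task}\nWrite just the code for this part.") for task in subtasks]
```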
Vibe coding is when you give an LLM a prompt such as “write a game like Minecraft except cooler” and the system outputs some code that might run and might vaguely resemble a block game.
So then you go back and ask for more; it does something to the code, potentially improving or breaking it; you go back again, ask for more, and repeat over and over. I’m being a little bit sarcastic because most serious developers look down on this, but really, this is how a lot of coding is happening these days. There are tools to make this process somewhat usable, and they are getting better every day.
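In code form, the loop is something like this. Sketch only: it reuses the ask() helper from the orchestrator snippet above and pretends the model replies with bare code, whereas real tools strip markdown fences and sandbox the execution:

```python
# The ask-run-fix loop made literal: generate code, try to run it,
# feed any traceback back to the model, and repeat. Uses the ask()
# helper from the orchestrator sketch above.
import subprocess
import tempfile

def vibe_code(request: str, rounds: int = 5) -> str:
    code = ask(f"Write a Python program: {request}. Reply with code only.")
    for _ in range(rounds):
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
            f.write(code)
        result = subprocess.run(["python", f.name], capture_output=True, text=True, timeout=30)
        if result.returncode == 0:
            return code  # it ran without crashing, which is not the same as "correct"
        # Feed the error back and hope the next revision improves rather than breaks.
        code = ask(f"This code:\n{code}\nfailed with:\n{result.stderr}\nFix it. Code only.")
    return code
```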