From text to action – Anatomy of an agent.
Throughout the day, we humans plan continuously, from the moment we wake up until we go to bed. When you drive, cook, practice a sport, or take on any other task, you go through a series of steps; because those steps are so internalized and automated in the neocortex of your brain, you don’t stop to think about them, you simply act and carry out the task automatically, as you learned it beforehand. When faced with a new task, one you have not learned before, you adapt what you have learned elsewhere to solve the new problem.
We all know that when we have to solve a complex problem, the sensible thing to do is to break it down into smaller pieces and prioritize them, establishing a logical order of steps for tackling the task. The AI agents being developed now will act much as humans do when performing tasks: they will reason through the problem, break it down into tasks, and plan the execution in logical steps.
The Transition from Monolithic Models to Composite Systems.
Since the explosion of LLMs (Large Language Models) in 2022, the first AI systems have been based on monolithic models that perform specific tasks in isolation. These models have limited knowledge; they typically rely on datasets such as Common Crawl and RefinedWeb, among others, which are essentially large-scale scrapes of text from the public Internet.
Imagine you run a restaurant and your employees ask a bot for their work shifts for the upcoming week. If you query the LLM alone, with no connected data, it cannot give you a meaningful answer; but if you connect the shift database to the LLM and run the query against it, the model returns the correct result in natural language. This concept is called a composite system: the LLM is combined with your proprietary data, whether by connecting it to external sources at query time or by fine-tuning the model on that data.
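To make the idea concrete, here is a minimal sketch of that composite pattern in Python: the shift data lives in a small database, the query pulls it out, and the model only phrases the answer. The table schema, the get_shifts helper, and the ask_llm placeholder are illustrative assumptions, not any specific product’s API.

    # Minimal sketch of a "composite system": the LLM alone does not know the
    # shift schedule, so we fetch it from a database first and let the model
    # phrase the answer. Schema and helpers are illustrative.
    import sqlite3

    def get_shifts(conn, employee):
        return conn.execute(
            "SELECT day, start_time, end_time FROM shifts WHERE employee = ?",
            (employee,),
        ).fetchall()

    def ask_llm(prompt):
        # Placeholder for a real chat-completion call; here we just echo the prompt.
        return "[the LLM would answer from]:\n" + prompt

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE shifts (employee TEXT, day TEXT, start_time TEXT, end_time TEXT)")
    conn.execute("INSERT INTO shifts VALUES ('Ana', 'Monday', '18:00', '23:00')")

    shifts = get_shifts(conn, "Ana")
    prompt = f"Using only this shift data, tell Ana her schedule for next week: {shifts}"
    print(ask_llm(prompt))

In a real deployment the ask_llm stub would be replaced by a call to whichever chat-completion API you use; the point is simply that the correct answer comes from your data, not from the model’s training set.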
In any case, the ability of these models to interact with other systems or to adapt to new data and tasks is limited. At first, this required engineering work, custom code, and structured, clean, well-prepared data; later, GPT interfaces and plugins appeared that made loading data direct and simple.
The AI techniques known as RAG (Retrieval-Augmented Generation) are a form of composite system, as they can be configured to combine the capabilities of generative models with information retrieval systems. This approach allows the AI to generate more precise and contextually relevant responses by accessing external sources of knowledge.
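A rough sketch of how a RAG pipeline hangs together, under simplifying assumptions: the keyword-overlap retriever and the ask_llm stub stand in for a real vector store and a real model call.

    # Minimal RAG sketch: retrieve relevant snippets first, then let the model
    # answer with that context in its prompt.
    documents = [
        "The restaurant opens at 12:00 and closes at 23:00.",
        "Weekly shifts are published every Friday afternoon.",
        "Vacation requests must be sent to HR two weeks in advance.",
    ]

    def retrieve(query, docs, k=2):
        # Score each document by how many query words it shares (toy retriever).
        q = set(query.lower().split())
        return sorted(docs, key=lambda d: len(q & set(d.lower().split())), reverse=True)[:k]

    def ask_llm(prompt):
        return "[the LLM would answer grounded in]:\n" + prompt  # placeholder

    question = "When are the weekly shifts published?"
    context = "\n".join(retrieve(question, documents))
    print(ask_llm(f"Answer using only this context:\n{context}\n\nQuestion: {question}"))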
When designing an agent, then, the essential point is the shift from standalone LLMs to interconnected, composite AI systems that can tackle more complex tasks.
Integration with Databases and External Tools.
An LLM agent is modular by nature and will integrate with many different sources: another LLM, a proprietary fine-tuned model, a multimodal application, image generators, proprietary databases, software tools, apps, a cloud repository, the Internet, APIs, other agents, and so on.
Today, when you ask ChatGPT something, it calls the underlying LLM (GPT-4) to give you a response, thinking “fast.” When you ask such a composite agent or system, by contrast, it will think “slow,” breaking your query down into smaller tasks and routing each one to the corresponding integration module. The agent will reason, decompose, plan, negotiate, and distribute tasks among the modules. From text to action: reason and act.
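The loop behind that “reason and act” behavior can be sketched roughly as follows; the hard-coded plan and the MODULES registry are stand-ins for what the LLM’s reasoning and real integrations would provide.

    # Sketch of the reason-and-act loop: the agent splits a request into tasks
    # and dispatches each one to a registered module.
    MODULES = {
        "calendar": lambda task: "calendar module handled: " + task,
        "hr": lambda task: "HR module handled: " + task,
        "email": lambda task: "email module handled: " + task,
    }

    def plan(request):
        # In a real agent the LLM would produce this plan; here it is hard-coded.
        return [
            ("calendar", "draft next week's shift schedule"),
            ("hr", "check vacations and absences"),
            ("email", "send each employee their schedule"),
        ]

    def run_agent(request):
        results = []
        for module, task in plan(request):         # reason: decide the steps
            results.append(MODULES[module](task))  # act: call the matching module
        return results

    for step in run_agent("Plan and communicate next week's shifts"):
        print(step)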
Returning to the restaurant example, imagine that you are the manager of a chain of restaurants with 50 employees, and one of your tasks is to manage the staff’s weekly work shifts. Every week, you log into your planning software, assign shifts, cross-check the data with the human resources software where vacations and absences appear, and finally, every Friday, you email each employee informing them of the schedule for the following week.
Throughout this process, the human is in control: you have to open several programs, consult data, and send emails. Once agents are up and running, that control will pass to the LLM. You will define the task in full detail and hand it to the agent. It will take care of the entire process, and because it has memory, it will remember that every Friday it needs to execute that routine. If it makes a mistake, you will correct it until it performs the task flawlessly, and once that task is done, you will hand it the next one to automate.
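As a rough illustration of that weekly routine, assuming a toy in-memory “memory” of past corrections and stubbed planning logic rather than any real agent framework:

    # Sketch of a scheduled agent routine with a simple memory of corrections.
    import datetime

    MEMORY = {"corrections": ["Ana never works Sunday evenings"]}

    def friday_shift_routine():
        notes = "; ".join(MEMORY["corrections"])
        # A real agent would plan with the LLM, cross-check HR data, and send emails.
        return "Draft schedule respecting: " + notes

    def maybe_run(today=None):
        today = today or datetime.date.today()
        if today.weekday() == 4:   # 4 == Friday
            return friday_shift_routine()
        return "Not Friday; nothing to do."

    print(maybe_run(datetime.date(2024, 6, 7)))   # 2024-06-07 is a Friday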
Considerations.
Regarding the reasoning ability of future versions of LLMs, we should think more about their capacity for planning. Will GPT-5 be an agentic AI system? Perhaps we will see it soon, and it won’t come from more data or computing power but from some kind of fine-tuning that lets the model natively break prompts down into tasks, with enough reasoning to plan and orchestrate the different systems.
Let us remember that the first version of ChatGPT was essentially GPT-3.5, heavily tuned and trained to respond in a certain way; with similar training techniques, GPT-5 could begin to become an LLM agent.
Companies with complete and interoperable technological ecosystems will have the greatest chances of success. Microsoft and Google, with their software suites and clouds, will be able to create the most powerful agents. In an ideal world, a company with its tech stack harmonized with Windows, 365, Teams, Azure, Dynamics, and the future Microsoft agent could achieve wonders.
Companies with a more varied and complex tech stack, on the other hand, will pose quite a challenge for the agents to come.
The final unknown will be interoperability and negotiation between external agents. So far, we have discussed how one agent distributes tasks among different modules and systems, but the question will arise the day you want your agent to plan your vacation: it will need to talk to the Booking.com agent for the flight and to the Marriott Hotels agent for the room reservation, and it will then call a third agent, the bank’s, to make the payment.
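Nobody knows yet what that agent-to-agent protocol will look like, but purely as a thought experiment, the delegation could be sketched like this; every class, name, and call here is hypothetical, with no real protocol or API implied.

    # Hypothetical sketch of agent-to-agent delegation for the vacation example.
    class ExternalAgent:
        """Stand-in for a third-party agent; no real protocol or API is implied."""
        def __init__(self, name):
            self.name = name

        def handle(self, request):
            return self.name + " agent accepted: " + request

    def plan_vacation(goal):
        booking = ExternalAgent("Booking.com")
        marriott = ExternalAgent("Marriott Hotels")
        bank = ExternalAgent("Bank")
        return [
            booking.handle("book a flight for " + goal),
            marriott.handle("reserve a room for " + goal),
            bank.handle("authorize payment for both bookings"),
        ]

    for step in plan_vacation("a one-week vacation in July"):
        print(step)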
Pedro Trillo is a tech entrepreneur, telecommunications engineer, founder of the startup Vizologi, specialist in Generative Artificial Intelligence and business strategy, technologist, and author of several essays on technology.