The $100 Agents
A new project to train task-specific agents powered by language models tuned with reinforcement learning, all with a compute budget of $100.
Today, I am kick-starting a new project called The $100 Agents.
RL Agents vs LLM Agents
Stating the obvious, agentic systems are the talk of the town. The RL purists are a bit upset about all of this - they came up with the concept of ‘agents’ after all.
No two people agree on what the correct definition of an agentic system actually is. If you are in the RL camp, you’d define it as an actor that interacts with an environment, takes actions, receives feedback and learns from its interactions, so that its capabilities improve over time.
Ask the guys at LangChain what an agent is, and they describe it as
“…systems that take a high-level task and use an LLM as a reasoning engine to decide what actions to take and execute those actions.”
Notice the clear absence of the word ‘learn’ in the latter definition and a clear absence of the word ‘reason’ in the former definition.
The common denominator between both camps however is this - Given a task, an agent must act to complete it and do so in a reliable way. Whether it does so by consulting an action-value function or by generating a reasoning trace is just semantics.
Unfortunately though, the divide goes deeper than this. Should these agents be custom tuned and specialised to a specific domain/task? Or should they be general-purpose intelligence engines that know-it-all and can do-it-all?
If you were Mark Zuckerberg and you just splashed a few $100 million on top talent from OpenAI to build AGI, you’d probably say it is the latter. You’d want to build the best agent in the world that can do any task, and be capable of building tiny agents that specialise in certain domains - You know, to save energy, prevent climate change, all of that stuff.
No matter what my world view is about agents and what kind of agents I think we need to build, I don’t immediately have a few $100M of spare change to build the know-it-all kind. So, with deep regret, I have to resort to building can-do-one-thing agents.
On Building Agents
I say it quite lightly, but even building can-do-one-thing agents is not easy at all. Post-training approaches with RL don’t generalise all that well. Taking the recipe followed by some of the best minds at DeepMind to build AlphaGo and trying to build, say, an ‘AlphaGo for Contract Bridge’ just does not work. Trust me on this one, I have tried. Monte Carlo simulation is so much easier when the search space is small.
LLM Agents tuned with RL
We are, however, starting to see evidence that agents whose underlying intelligence is a language model can, when tuned with reinforcement learning (plus some luck and extraordinary patience), be made to specialise in specific tasks.
With that as the context, here is what $100 Agents is going to be about.
What are $100 Agents?
The objective is to train task-specific agents powered by language models, using reinforcement learning, all with a compute budget of $100 worth of GPU time for the entire training run. This budget could be split across multiple runs and used on a combination of synthetic data generation with foundation models, SFT and RL. This does not necessarily mean that the entire experiment of creating an agent needs to be completed for under $100. It only means that anyone looking to recreate the whole experiment, given the recipe, should be able to do so for under $100.
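To make the constraint concrete, here is a rough back-of-the-envelope sketch of one way the $100 might be split. The GPU rental rate and the stage split below are assumptions for illustration only, not quoted prices or a fixed recipe:

```python
# Back-of-the-envelope budget sketch. The hourly rate and the split across
# stages are assumptions for illustration, not quoted prices.
GPU_RATE_PER_HOUR = 2.00  # assumed rate for a single A100 on a budget cloud

budget = {
    "synthetic_data_generation": 20.0,  # prompting a foundation model for training data
    "sft": 20.0,                        # supervised fine-tuning warm-up
    "rl": 55.0,                         # the RL runs themselves
    "evaluation": 5.0,                  # before/after benchmarking
}

assert abs(sum(budget.values()) - 100.0) < 1e-9  # must add up to the $100 cap

for stage, dollars in budget.items():
    hours = dollars / GPU_RATE_PER_HOUR
    print(f"{stage:>27}: ${dollars:5.2f}  ~ {hours:4.1f} GPU-hours")
```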
What sort of tasks?
Anything really. Nothing is too insignificant. RL has traditionally been applied to games, and games are excellent training grounds! Here are some example tasks:
Language based games - Wordle solver, Anagram unscrambler, Crossword clue solver, Trivia QA, Hangman solver
Numeric reasoning based games - Math word problem solver, Sudoku solver, KenKen, Arithmetic equation verifier
Coding agents - Leetcode solver, regex generator
Tool-calling agents - Tau Bench, Tau2 Bench, BFCL, calendar scheduling agent (So many on this list)
Visual reasoning agents - Object counting agent, chart interpreter, any agentic setting that takes an image as an input and produces verifiable text output.
Personalisation and Recommender Agents
How will success be measured?
The measure of success is the % improvement demonstrated by the RL-tuned model on the end task in comparison to the base or SFT model. Given the limited compute budget, and using a well-thought-out finger-in-the-air estimation, we will say that if the RL-tuned model shows a 10% improvement on the end task, we will call that experiment a success.
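In code, the criterion is just relative improvement over the reference model. A minimal sketch, reading the 10% threshold as relative (not absolute percentage-point) improvement; the scores below are made up:

```python
def relative_improvement(rl_score: float, reference_score: float) -> float:
    """Percentage improvement of the RL-tuned model over the base/SFT reference."""
    return 100.0 * (rl_score - reference_score) / reference_score

# Made-up example: the base model solves 40% of puzzles, the RL-tuned model 46%.
improvement = relative_improvement(rl_score=0.46, reference_score=0.40)
print(f"{improvement:.1f}% relative improvement")   # 15.0%
print("success" if improvement >= 10.0 else "not yet")
```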
The most IMPORTANT piece
Now, not all experiments are going to be successful. Some are bound to fail. If there is one thing I have learnt about RL, it is to be entirely humble and accept defeat. Failing to do so leads to depressive tendencies.
My own approach has been to time-bound every single experiment and publish the findings when the set time has run out. A good time limit is about 160 hours of total cumulative effort on the experiment.
Do agents need to use function calls? What about memory? What framework? MCP?
The only definition of an agent that we will use for this project is that the agent needs to specialise in a specific task. The agent may or may not need tools in order to achieve this, may or may not need memory, may or may not need context engineering. All of this is just semantics. There is a job to be done, and we will use a language model and fine-tune it with a multi-stage training approach (SFT, RL) to get it to improve its capabilities - with a compute budget of $100. That’s it really.
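For a flavour of what such a multi-stage recipe could look like, here is a minimal sketch using Hugging Face TRL. This is one possible toolchain, not a prescription: it assumes a recent TRL release, and the model, datasets and toy reward function below are placeholders rather than part of any specific agent.

```python
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer, GRPOConfig, GRPOTrainer

# Stage 1: a short SFT warm-up on task-formatted examples (placeholder dataset).
sft_trainer = SFTTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    train_dataset=load_dataset("trl-lib/Capybara", split="train[:1000]"),
    args=SFTConfig(output_dir="sft-checkpoint", num_train_epochs=1),
)
sft_trainer.train()
sft_trainer.save_model("sft-checkpoint")  # save so the RL stage can load it

# Stage 2: GRPO against a verifiable reward. The reward here is a toy
# stand-in (prefer completions near 200 characters); a real agent would
# score task success, e.g. "did the Wordle get solved?".
def toy_reward(completions, **kwargs):
    return [-abs(200 - len(completion)) / 200.0 for completion in completions]

grpo_trainer = GRPOTrainer(
    model="sft-checkpoint",
    reward_funcs=toy_reward,
    train_dataset=load_dataset("trl-lib/tldr", split="train[:1000]"),
    args=GRPOConfig(output_dir="rl-checkpoint", num_train_epochs=1),
)
grpo_trainer.train()
```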
What output will we produce?
For every agent tuning experiment, we will produce
A blog post commentary of the task, the domain, the complexities, different approaches taken to tune this agent and what worked/did not work.
A GitHub repo containing the code, with a full README on how to recreate the entire experiment from scratch
[Optionally] a Hugging Face dataset and/or models produced as part of the process.
Why do this?
For one, collectively we need a better understanding of the types of tasks where RL can be applied successfully on language models. We more or less only know that mathematical reasoning and alignment with human preferences improve with RL. What else?
We need many, many small-scale toy solutions to better grasp the unique experimental settings that lead to successful performance improvements on tasks. KL divergence or no KL divergence? Reward scaling or no reward scaling? We just don’t have enough data points (a minimal sketch of these two knobs appears after this list).
We need to understand scaling laws with RL and we can only get this by running hundreds of small ablative studies.
We need more empirical validation of theoretical frameworks. It is so hard to keep on top of all the algorithmic adaptations and variations of different RL techniques applied to language models. Just on GRPO, we have DAPO, BNPO, Dr.GRPO, GVPO, AGPO and many more extensions. And yet, there is so little empirical validation of which of these methods work well, or generalise.
Finally, we need more people to lean in on RL. If we can get more people equipped with a good understanding of RL, with $100 of spend and lots of will to learn, we are all better off.
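To make those two knobs concrete, here is a minimal, library-free sketch of a GRPO-style advantage and surrogate computation. It is a simplified illustration under my own assumptions (clipping omitted, made-up rewards, made-up KL estimates and beta), not a reference implementation of any of the variants named above:

```python
import numpy as np

def grpo_style_advantages(group_rewards, scale_by_std=True):
    """Group-relative advantages: subtract the group mean reward and, optionally,
    divide by the group standard deviation (the 'reward scaling' knob)."""
    rewards = np.asarray(group_rewards, dtype=np.float64)
    advantages = rewards - rewards.mean()
    if scale_by_std:
        advantages = advantages / (rewards.std() + 1e-8)
    return advantages

def surrogate_loss(logprob_ratio, advantages, kl_to_ref, beta=0.04):
    """Simplified GRPO-style surrogate (clipping omitted for brevity).
    beta controls the KL penalty against the reference model;
    beta=0.0 gives the 'no KL divergence' variant."""
    per_sample = logprob_ratio * advantages - beta * kl_to_ref
    return -np.mean(per_sample)

# Made-up group of 4 completions for one prompt: two solved the task, two did not.
adv = grpo_style_advantages([1.0, 0.0, 0.0, 1.0])
print(adv)  # roughly [ 1., -1., -1.,  1.] after mean-centering and std scaling

loss = surrogate_loss(
    logprob_ratio=np.ones(4),     # on-policy: new and old policies identical
    advantages=adv,
    kl_to_ref=np.full(4, 0.02),   # made-up per-completion KL estimates
    beta=0.04,                    # set beta=0.0 to ablate the KL penalty
)
print(f"surrogate loss: {loss:.4f}")
```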
Why $100?
It is a nice round number.
How can I get involved?
I will publish some resources in the coming weeks to help anyone interested get involved in this project. Till then, take a look at the following resources for the kind of stuff we will look to build.
If you’d like to be part of it already, give me a shout and we can discuss ideas!