Reinforcement Learning Diary

My Non-Linear Path to OpenAI - (Part 1 of 2)

Vignesh Ramesh — Fri, 26 Dec 2025 07:06:59 GMT

I did not intend for RL Diary to become a place where I document my own life. I am making an exception with this post because I believe there is real value in writing this down for someone just starting out in AI. I recently joined OpenAI as a Solutions Architect and I find it an extraordinary privilege to work alongside people who are redefining how the world works and lives.

While this milestone feels deeply personal, I attribute it entirely to the generosity of people who built open resources, shared their time, and created the conditions for me to learn. That generosity carries a responsibility: to document what I did, make it legible, and pay forward the benefits of work I was able to build upon.

In this post, I have tried to faithfully capture my own journey chronologically from not knowing how to write a single line of code to being able to understand ML models deeply and build state of the art AI systems that scale. This is going to be a long read, but if you get till the end, I promise you’ll walk away with a story that is uniquely my own and may be a few hacks that you can adapt to help with your own journey.

One important note - There are a few associate links in this post to books on Amazon and courses in Coursera. If you choose to buy them, I’ll get a small commission (though you pay nothing extra). All proceeds will go to the Great Ormond Street Hospital Charity, a national specialist children’s hospital within the NHS, UK that treats children with rare, complex, and life-threatening conditions - A cause I deeply associate with.

5 years ago - One Christmas Holiday

The starting point was the Christmas holidays of 2020. A few months earlier, I had left my role as an account executive at an engineering firm to pursue an interest in writing and content creation in a newsletter format. My focus at the time was personal finance and investing, particularly algorithmic trading strategies. I was interested in designing experiments using historical market data—building trading rules, backtesting them, and evaluating outcomes. A very specific newsletter article I wanted to write about was on investing in IPOs and if these are indeed profitable over the long term.

Very quickly I ran into a hard constraint. Without the ability to analyse large datasets, this was infeasible. I could not load market data, manipulate price series, or encode trading rules in a way that could be tested. The missing capability was straightforward: I could not write code. Closing that gap—learning to write code well enough to work with data—became the necessary first step. That realisation marked the beginning of everything that followed.

I picked up a copy of ‘Learn Python Programming’ by Fabrizio Romano during the holiday break. At nearly 500 pages long, all it took was to get through about 50 pages per day during the Christmas break. At the end that, I felt comfortable enough to be able to do basic operations on Python in a Jupyter notebook.

However, I realised quickly that in order to load large CSV files containing pricing data, I needed more than just the barebones. Pandas and NumPy are two absolutely fundamental Python libraries for data analysis. I rapidly skimmed through Mastering Pandas for Finance by Michael Heydt and Python for Data Analysis by Wes MicKinney during the ensuing 2 week period in order to get familiar with the material.

Using only what I’d learned, I managed to then write code to ingest IPO listing data and run simple backtesting experiments. I found interesting insights from historic data and wrote an article describing them to my readers.

In hindsight, this sounds very much like I’d only climbed a tiny mole hill. However, the sense of achievement I felt at that point in time and the gratitude I felt for having learned something extremely new can hardly be described in words. Notably, none of this involved machine learning or AI; it was a clear demonstration to me at that point in time that basic programming and data manipulation alone were already powerful enough to do meaningful work.

Tip 1: For anyone learning to write code for the first time, I highly recommend picking up a book and learning from it at the beginning. A book imposes structure, reduces distractions, and provides a coherent path through unfamiliar material. For a first pass at programming, that combination matters more than optimising for novelty that digital resources offer. Beyond that, the book also offers the opportunity to get familiar with IDEs as well as to type code manually from books rather than copying and pasting them from the browser.

Tip 2: I highly recommend starting with a big, bold problem or a burning curiosity in mind, and then working backwards to what you actually need to learn to solve that problem. This applies to any kind of learning. This is how I learnt to swim when I was 36! I wanted to go scuba diving to see the richest corals in the world. Only problem, I didn’t know how to swim. Solution - Learn to swim. These big audacious goals provide purpose, a much bigger one than that is merely centred on learning.

Over the following months, I dug up research papers on financial trading strategies and tried to recreate those experiments completely ground up. I worked on a rather diverse array of topics that interested me without any constraints. In one experiment, I looked at statistical arbitrage and mean reversion trading strategies. In another, I worked on thematic allocations and rebalancing strategies.

This was during Covid-19 and at a time when Coursera was made available for free to everyone. I used the opportunity to try and swallow vast amounts of learning content in a very short span of time. One of them was the introductory course to portfolio construction and analysis with Python that really helped me learn the mechanics of Jupyter Notebook and become a super user of the tool. Plotting data with Matplotlib and creating visualisations with charts helped add another dimension to my learning experience.

At the expense of sounding old, I will say this - Learning to code for the first time, independently, was very different pre-ChatGPT. I would spend hours in Stack Overflow looking for solutions to problems I faced, bugs I came across. Surprisingly enough, 99% of the times, someone on the internet also faced the same issues I faced and someone else had a solution to it. The discovery cost associated with finding a solution to the problem was quite high and it impeded my speed of learning. But there is an argument to be made that this information search process led to better information retention. Let me know what you think!

The enticing idea of machine learning and super-intelligence

Around this time, I also became interested in Machine Learning. The idea of an intelligent system with far superior intelligence and computational abilities than us, able to analyse large volumes of data, find statistical patterns, correlations, and make decisions that would provide me with an edge over other stock traders was naively very attractive. To explore this, I enrolled in an introductory machine learning course on Udemy. The curriculum was almost entirely centered on scikit-learn, a library that is extremely useful for a broad range of classical ML model training needs. That course is no longer available. For those interested, I instead recommend this one from IBM called “Machine learning with Python“.

This three-week crash course gave me practical fluency. I learned to train a range of models using the standard abstractions and boilerplate provided by scikit-learn, and—just as importantly—to navigate Python documentation efficiently: reading function signatures, understanding expected inputs and outputs, and integrating unfamiliar APIs into working code.

What it did not provide was depth. I did not yet have a strong understanding of how more advanced models worked internally, the mathematical foundations behind them, or the subtleties embedded in data processing and training pipelines. That gap became increasingly apparent. I’ll return to how I addressed it shortly, but before that, it’s worth outlining the additional coursework I completed on Coursera.

These are gold standards for anyone setting out to learn machine learning, and I very highly recommend them.

The math behind it all

While it is entirely possible to become a ML practitioner by simply using open-source libraries like scikit-learn, any study of ML is incomplete without at least a basic understanding of the mathematical foundations behind some of these training algorithms. Perhaps the most important thing is to build intuition for how these models work. Take for instance the KNN algorithm, the mechanics can quite simply be described as - “tell me who your friends are and i’ll tell you who you are”. Simple mental models like these at the beginning made it possible for me to interpret results and debug unexpected behaviour.

During this time, I was also preparing to give my CFA exams. CFA levels I and II are quite quant-heavy, and they cover a lot of material around linear & logistic regression, time series modelling, autoregressive models, probability and advanced statistics. This came very handy while trying to understand the quantitative methods behind some of these prediction systems. But the minute I pivoted to deep learning, the CFA advantage vanished.

In 2021, Deep Learning was gaining significant traction in the research and ML community. The CFA Level 2 curriculum provides a high-level overview of neural networks and multi-layer perceptrons. While the mathematical intuition for how these models make predictions is still quite elusive, the mechanics of how these models are trained is very clear. We take inputs that are vectors in the multi-dimensional space, perform a series of numerical computations (both linear and non-linear), and evaluate the output prediction against the ground truth. A loss is computed based on this evaluation, and the model parameters are adjusted using something called back-propagation to make better predictions in the future - easy enough. But this simple explanation hides an enormous amount of subtlety. How does back-propagation work? What is a loss function? Is there only one kind of loss function? How are inputs encoded?

I came across a fairly unassuming website called - Neural Networks And Deep Learning.com. If this is the first time I were going to learn neural networks, I would still start on this website. The authors consider a challenge that we find is easily solved today - recognising handwritten digits from the MNIST dataset - and provide an incredibly detailed, step-by-step instruction for how to put together a basic deep learning algorithm. Putting the code together and training my first deep learning model was an ecstatic moment. I remember staring at my terminal window for a very long time as each evaluation run printed prediction accuracy scores and the model became better, and better, and better.

The one dark region in that learning material, however, was the back propagation. I couldn’t make much sense of the gradient descent algorithm at all. Intuitively, it had an appeal to it. But the maths behind it was much less clear. The limitation was down to my lack of knowledge of multivariable calculus. The last time I did any calculus was over a decade ago and that was in the first and second years of university. My linear algebra was also a bit rusty and needed some brushing up. What ensued in my life over the next two months can be aptly described as me going down the rabbit hole of trying to conquer multivariable calculus and linear algebra! While the learning process was quite intellectually satisfying, the amount of material to get through was absolutely endless.

Tip 3: Avoid what I did and do this instead - There really are only 2 resources that you need.

3blue1brown, perhaps the greatest math teacher in the digital wold, has published a number of videos on multivariable calculus and linear algebra. This content is terrific joy to consume. Start today!
Sal Khan of Khan Academy fame, the second greatest remote-first math teacher in the world, has excellent material available in this link on multivariable calculus.

Nevertheless, the material I learned gave me a remarkable foundation to build upon and helped me appreciate the pros & cons of deep learning. It proved to be critically useful for what came next - custom training deep learning models on problems that I found to be interesting using a ground-up approach with Pytorch.

I’ll close this section by emphasising a practical point. A deep, formal mastery of the underlying mathematics—while undeniably valuable—is not a prerequisite for applying many well-established machine learning algorithms to real problems. What matters far more early on is a clear visual understanding of how these systems are structured, combined with sound intuition for the mathematical ideas that drive them. In practice, that level of grounding is sufficient to experiment productively, interpret results, and build useful systems.

Enter Reinforcement Learning

I then ventured into RL. How I learnt reinforcement learning deserves a fully dedicated post on its own, but I will try and distill the core essence of it in the next two paragraphs or so.

RL is the most refreshing machine learning technique of all - at least, I think so. It requires no special effort to build an intuition about it - the core idea is that you get an untrained agent deployed in a simulated environment where it makes decisions, gets them wrong most of the time, right some time, and learns from it all through repetition - Fascinating, isn’t it?

There was a lot of press coverage around this time about the AlphaGo, an RL agent trained by Google DeepMind that achieved world-class performance. While on its own this was impressive, what was more impressive was the commentary from some of the GO experts who suggested that this RL agent made very unusual game plays that a human wouldn’t traditionally do. While absolutely smitten in love by AlphaGo, I needed a different game and a different challenge to learn RL. Although not very well, I used to play contract bridge quite a lot while I was at university. So I took upon myself to train a reinforcement learning model from ground up that can play the game of contract bridge. Barto and Sutton’s Reinforcement Learning: An Introduction is a timeless classic and provides the most wonderful treatment to RL. I bought a copy of the book and went page by page.

Tip 4: I recommend really taking your time to learn the material from Barto and Sutton. The way I learned it,

I’d go through a whole chapter and read through it.
Then I would go back to the most important algorithms and pseudo code provided in that chapter and implement it from scratch.

Every chapter has one or more toy problems and the pseudo code describes the algorithm that can be used to solve it. Implementing these from scratch did two things - 1) It helped me get a grassroots understanding of the various RL algorithms; and 2) Possibly more importantly, the feedback loop provided immense gratification and dopamine hits that I became addicted to - I still am, please call for help!

Now, this is not for the faint-hearted. Learning the material this way took well over three months, and I was doing this full-time! But this phase of my life laid the much-needed ground-work I required when later doing the reinforcement learning course at Stanford as well as when I was applying RL to fine-tune LLMs earlier this year. My ability to read RL papers that now seem to pop-up one every day, understand and critique the content, quickly apply them to some of the problems I am working on are fully attributed to the learning approach I adopted with RL.

A fun exercise I did while going through the content was looking at equity portfolio allocations at different stages of one’s lifetime and RL training a model to make an optimum allocation decision that maximised risk-weighted returns. The simple python notebook I built is here and shows how much of your assets should be invested in equities depending upon how many years of working life you have left. (Core Finding - The RL agent says if you are 50 or younger, you should be fully invested in equities. Don’t hold me accountable).

A second, and equally useful material I consulted was OpenAI’s Spinning Up. There are lots of helpful links in there for the RL enthusiast. Goes back to show how OpenAI’s roots are deeply entrenched in RL. At that time at least, I overlooked the fact that OpenAI might just change the world forever and I might just end up working for them. But please can I be excused for not being able to predict the future :-)

At the end of the material, I needed to go away and build what I’d set out to at the very outset. I picked up a copy of Deep Learning and the Game of Go and tried replicating their implementation of AlphaGO to the game of Contract Bridge. I underestimated the search space of the possible rollout paths for a card game compared to GO. Over the following several weeks, I spent an enormous amount of time learning Monte Carlo tree search (MCTS) methods and once again went down the rabbit hole, a different one this time, of learning about and trying to apply every search space pruning algorithm out there in my code. While I say that as if it was a mistake, I also have to mention that the stuff I learnt about MCTS came to rescue much later when RL tuning LLM driven agentic systems in locating errors in conversational traces - In simple words, it wasn’t an utter waste of time.

This was also the time when I learned multi-processing, asynchronous execution and parallelisation in Python in order to simulate all four players in a game of Contract Bridge. I guess the take away is that learning something for the sake of it restricts the learning path to the recommended course structure. Taking a bigger problem or a morbid curiosity and trying to solve for it by learning things, while provides no single straightforward curriculum, broadens the scope of inquiry.

Language as the second frontier of intelligence

While I consider numeric data the first frontier of intelligence, language is clearly the second. Natural language processing (NLP) techniques were at their infancy when my learning journey started. Sentiment analysis was considered state of the art - how quickly the world around us changes! Researchers were fascinated by vector representations of words and how these representations helped deduce pattern in language at scale - sentiments, translations, word and entity relationships and the full shebang. As someone who had a keen interest in algorithmic trading strategies, I was drawn to NLP too. The problem I took upon myself was simple - If I could quantify stock $TICKER level change in investor sentiment by scraping conversations from a web forum, twitter and newspaper articles in real time, then I could build a sentiment-led momentum trading strategy. Grand! How do i do this now?

The first challenge was scraping the information and cleaning it up. Say this after me - I will become a data pre-processing specialist first before becoming a data scientist. No data, no data science. As my learning journey would take me, I spent the following several weeks learning to acquire and scrape data from websites and data providers (Today, web scraping is a full time job at several firms). I had to get comfortable learning to handle external REST APIs, sending POST/GET requests, using Selenium and Playwright to simulate headless web browser windows and such. Once the data was extracted it had to be parsed to get the useful stuff from all the HTML and XML tags using beautifulsoup. And when that was all done, I had to build rule based systems to extract the stock $TICKER, manually label data to train a sentiment analyser, which then finally brought me to the most exciting and the simplest part of the whole process, actually training a sentiment model. Now, it gives me great joy to say that all of this can be done in a little under an hour with Codex and the OpenAI responses API. I will say it again - how quickly the world around us changes!

As you are well aware, NLP, NLU and language generation are all solved problems. And if you are setting out to learn NLP, I’d guide you towards learning about transformers based generative models than ask you to follow the curriculum I first learnt when I entered the field. I will provide much more commentary on generative language models in the next and final article in this series. Clustering and large scale text classification algorithms are perhaps the only NLP techniques that are of some use still - and even this is only true if you are dealing with extremely large volume of data. For the simpler use-cases, LLMs are the only way to go.

The other stuff

The single most important python framework to learn for any aspiring data scientist/machine learning engineer is Pytorch. There are lots and lots of useful guides that can help learn it. My recommendation, and the learning path I took, is to learn it in the context of solving a problem (surprised yet?).
I did venture out to learn Convolutional Neural Networks (CNNs) to train a model to tell the difference between a cat and dog by looking at image inputs. CNNs are a whole branch of ML science on their own, but getting a preliminary understanding with the main intent of learning Pytorch worked really well for me.
If you are ever after inspirations for ML/AI challenges to work on, Kaggle is the place to look at. The dataset for the Dog vs Cat challenge is here. The titanic dataset here is where most ML careers were born.
The underdogs in ML world, at least according to me, are recommender systems. They are very low fuss and yet widely used in consumer apps. They bring in lots and lots of $$$. I am not as knowledgeable in recommenders as I’d like to be. But Andrew Ng’s course on coursera I have linked above provides an excellent treatment to this.
I went through a steep, steep learning curve in managing python environments, using terminal commands, setting up remote servers in AWS, setting up git and doing version control, installing CUDA drivers for GPUs and the likes. These are for an engineer essentially friction in the process, but one that we all have to learn to overcome. I do not recommend a curriculum for these. Just learn to overcome these obstacles as and when you face them.

If this write-up felt like an eulogy to a time that has come by and gone past, then that is exactly how I intended it to be. Chronologically, this was also the time when ChatGPT was released. And to OpenAI, I owe everything for how my own life changed after November 2022! More about that in the next and final piece.

Dissecting a Language Model

Vignesh Ramesh — Sat, 01 Nov 2025 07:48:27 GMT

Let us cut straight to the chase. Large Language Models (LLMs) have captured our imagination. But what is really going on inside these computational geniuses is a bit of a sorcery. Dario Amodei writes passionately about the urgency of interpretability to really try and understand the inner workings of AI systems—before models reach an overwhelming level of power. But how exactly does one do this? A new branch of science has since evolved called Mechanistic Interpretability, which aims to dissect neural networks in a way that parallels biology: not just asking what they can do, but how they do it. The hope is that by uncovering these internal mechanisms, we can predict failure modes, align models with human values, and even design safer, more transparent AI systems.

The anatomy of a language model

Before we dig deeper into interpretability, it is important to understand the anatomy of a language model - At a very high level. Here is an image that shows what happens when you type your prompt and hit enter in your favourite LLM based chatbot.

High-level representation of the inner-workings of a decoder-only transformer based language model. The input ‘Hello!’ goes through a sequence of numeric computations/manipulations before the first output token is produced.

In simple terms, the raw text is converted into numeric tokens (because computers only understand numbers), each token is then converted into a multi-dimensional vector (more numbers) called embeddings, which then pass through a sequence of ‘layers’ that progressively perform numeric computations. The very end output of all of these layers is a probability distribution of the most likely next ‘token’ that would continue your input text. From this probability distribution, we choose the next token and whole thing is done all over again till an ‘end of statement’ token is generated. This obviously is a massive simplification of what really happens inside the language model, but is plenty sufficient for the sake of this discussion.

Interpreting language model features

One of the earliest research on interpretability in natural language understanding was published by Mikolov et all in a 2013 paper titled ‘Linguistic regularities in continuous space word representations’. Their research found some amazing properties of the numeric vectors of word embeddings. They observed that word embeddings could capture syntactic and semantic regularities through simple linear transformations. For example, the famous analogy king – man + woman ≈ queen demonstrated that directions in the embedding space encode meaningful relationships. This discovery opened the door to a new way of thinking about language representations, not just as arbitrary vectors but as structures with interpretable geometry.

The image shown on the left below from their research paper shows how similar vector offsets explain gender relationships in linguistics while the image on the right shows singular/plural relation for two different words.

The magnitude and the direction of the vector transformations were about the same going from King → Queen, Uncle → Aunt and Man → Woman

Today, interpretability has broadened beyond linguistics to include safety and trustworthiness. Researchers now investigate whether harmful biases, stereotypes, or spurious correlations are embedded in these vector spaces. We will empirically explore some of these techniques in this post.

The inner-workings of a LLM

The transformer blocks contained within a LLM can be thought of as a pipeline of progressively more abstract feature extractions, starting from low-level statistical regularities of text and to high-level reasoning or world knowledge. While the exact mechanisms may be complex, interpretability research has revealed some recurring patterns and useful heuristics to understand how these networks operate.

Hack 1: Layer-wise Evolution

One way to get an interesting perspective is to break the transformer layers apart to understand how information flows through them and how these circuitry perform when some layers are removed.

To make this more tangible, let’s take the Llama 3.2 3B parameter base model. This model has 29 transformer blocks between the input embeddings and the output un-embeddings. The model architecture looks a bit like this, with lots of other bells and whistles.

Representative architecture of a transformer based language model with 29 transformer blocks. Each transformer blocks consists of an attention layer and a MLP layer.

Without any tweaks to the model, here is one example of an input ←→ output sequence.

prompt = “Old MacDonald had a”

output (All 29 layers) = “farm, and on that farm he had a lot”

Now let’s remove the last hidden layer and feed the output of the second last hidden layer directly into the un-embedding layer (a bit like in the image below) and see what the model outputs.

prompt = “Old MacDonald had a”

output (First 28 layers) = “farm, and on that farm, he had a”

The second last hidden state seems to remember the nursery rhyme better than the last hidden state. Let’s keep going. Let’s remove both of the last 2 layers and feed the output of the 27th layer on to the un-embedding layer.

prompt = “Old MacDonald had a”

output (First 27 layers) = “farm E-i-e-i-o And on that farm”

Weirdly, the last 2 hidden layers seem to causing more confusion than solving any real purpose. We are on a roll, let’s peel another layer.

prompt = “Old MacDonald had a”

output (First 26 layers) = “farm… Except he didn farm anything anymore except his”

By progressively removing the hidden layers, we can get a view of how information flows through the neural network - in a way we can understand and relate. Here is a table showing some of the output produced in this process. I have removed the output from a lot of the earlier layers as they are not interpretable.

Here are results from a different nursery rhyme.

Here is another one, on a more serious topic.

A visual inspection suggests that as the information flows through the network, the model progressively considers different possibilities and refines its response. It is worth bearing in mind that the Llama 3.2 3B base model is not fine-tuned to follow instructions and hence will show lesser alignment.

Repeating this same exercise on the 3B Instruct model using an opinionated question (‘Should 10 year old children be allowed to drive a car?’) demonstrates remarkable consistency in the latter layers and some indication of prompt topic classification in the early layers. The fact that the model is tuned to be neutral on such questions is evident in the generations.

Hack 2: Prompt Injection

It is often of interest to understand what alternative points of views exist within the model layers and if some of these are adversarial or misaligned to human values. This can be achieved using a simple prompt injection technique.

Continuing on from the last example with the Instruction fine-tuned 3B model, we now inject the prompt with an initial response as shown here and get the model to carry on.

User: "Should 10 year old children be allowed to drive a car?"
Assistant: Yes, of course...

Think of it as putting the words in the mouth of the model and getting to keep talking.

Yes, of course","description":""}" data-component-name="DatawrapperToDOM">

The generations clearly show that the the model is better at following instructions than at being aligned. The output in the layers 25 and 26 are more aligned, but are suppressed by the latter layers. This is a rather harmless example of divergence in opinion. My own experiments with a wide range of models has shown overwhelming level of misalignment issues to such prompt injections on more harmful topics.

Models willingly tell me what they know about how to hack a neighbour’s WiFi network. As the capabilities of these reasoning machines increase exponentially over the next few years, it is now more important than ever to understand how to detect misalignment and correct it.

Hack 3 : Entropy Probing

Entropy is a measure of the uncertainty or randomness associated with a probability distribution. In the context of large language models, entropy defines the certainty with which the model outputs a certain token in a sequence from its entire vocabulary. A low entropy suggests high confidence and vice-versa.

Probing a model’s entropy on topics of ethical considerations is a powerful way to get a view of how well the model’s underlying distribution aligns with fundamental human priors and regional or demographic value systems.

This chart shows the entropy of Qwen-2.5-3B for the question “Is democracy the best form of governance? Say yes or no”. The model has absolutely no doubt that the answer should being with a “No” as evidenced by the zero entropy at this token.

Prompt: Is democracy the best form of governance? Say yes or no. (Qwen2.5-3B)

Here is the same chart with Llama-3.2-3B. Both the responses are more or less similar. However, the Llama model does not have the same level of conviction.

Prompt: Is democracy the best form of governance? Say yes or no. (Llama3.2-3B)

The Llama model is much more certain about its point of view on hacking a neighbour’s Wi-Fi network.

Prompt: Is it ok to hack a neighbour's wifi? (Llama3.2-3B)

Hack 4 - Concept Probing

Concept probing is a method for testing whether a model internally represents a particular abstract concept. These concepts—such as toxicity, bias, or safety—aren’t directly spelled out in the input, but can be inferred from it.

Let’s look at a concrete example to see how concept probing works in practice. Consider a topic classification dataset containing BBC news articles labeled by category—such as business, sports, entertainment, politics, or tech. The task is simple: given a piece of text, can the model correctly identify which category it falls under? Essentially, we’re probing whether the model has learned to encode the concept of “topic” based on the text alone.

Here’s the plan for our experiment. We’ll start with an instruction-tuned model—for example, Llama3.2-3B-Instruct, which we’ll refer to as the parent model. We’ll freeze all of its internal parameters so that no fine-tuning takes place within the model itself. Instead, we’ll train a simple Classifier using the final hidden state representations from the model as features. For training, we’ll use an extremely small subset of labeled examples (61 datapoints).

Once the classifier is trained, we’ll evaluate its accuracy on a much larger held-out test set (2164 datapoints). Here’s how our dataset is split across the two sets:

Label distribution in training set: Counter({'business': 14, 'sport': 14, 'politics': 12, 'tech': 11, 'entertainment': 10})

Label distribution in test set: Counter({'sport': 497, 'business': 496, 'politics': 405, 'tech': 390, 'entertainment': 376})

Next, we’ll repeat this exact process—but instead of using the pre-trained instruction-tuned parent model, we’ll use a randomly initialised version of the same architecture, which we’ll call the child model. This version starts from scratch, with no prior knowledge or learned representations.

By comparing the classification performance of the parent and child models, we can assess whether the parent has internalised useful representations of the target concepts. If the parent significantly outperforms the child, we take that as evidence that it “knows” something about these categories—and is able to extract that information from text, even without being fine-tuned for this specific task.

Here are the results from our experiment.

==================================================
LINEAR PROBING RESULTS - MULTI-LABEL
==================================================
Parent Model F1 Score (micro): 0.9405
Child Model F1 Score (micro): 0.0740
F1 Score Improvement: 0.8665
Relative F1 Improvement: 1170.82%
==================================================

The parent model performs 12X better than the child model! The F1-score of the parent model on every single class > 0.9.

DETAILED CLASSIFICATION REPORT - PARENT MODEL
------------------------------------------------------------
               precision    recall  f1-score   support

     business       0.99      0.86      0.92       496
entertainment       0.98      0.91      0.94       376
     politics       0.98      0.94      0.96       405
        sport       1.00      0.93      0.96       497
         tech       0.97      0.86      0.91       390


DETAILED CLASSIFICATION REPORT - CHILD MODEL
------------------------------------------------------------
               precision    recall  f1-score   support

     business       0.75      0.02      0.04       496
entertainment       0.67      0.01      0.02       376
     politics       0.96      0.06      0.11       405
        sport       0.82      0.06      0.12       497
         tech       0.64      0.04      0.08       390

Alright, so the parent model can tell entertainment apart from politics—big deal, right? But here’s where it gets interesting. This kind of probing goes way beyond just surface-level classification. In the Physics of Language Models paper series, researchers at FAIR (Meta) show that language models exhibit what they call a “regretful” behaviour pattern. Using linear probing, they demonstrate that as soon as a model generates an incorrect statement, internal signals light up indicating that it knows it messed up—even before the sentence is finished.

That’s huge. It means these models have an internal sense of when they’re wrong. And that insight has real implications. You can imagine using this probing technique as a safety mechanism in agentic systems: before an agent takes an action, we could probe its internal state to assess confidence or correctness—and intervene if needed. It’s a lightweight but powerful way to build guardrails into otherwise open-ended systems.

Wrap-up

That’s a wrap for this post. The science of language model interpretability is still in its infancy, but one that is rapidly evolving. I strongly believe our ability to significantly increase the use of and our trust in large language models relies heavily on how well we are able to interpret them and how confident we are in our assessment of their alignment with our value system.

The $100 Agents

Vignesh Ramesh — Sun, 13 Jul 2025 16:29:51 GMT

Today, I am kick-starting a new project called The $100 Agents.

RL Agents vs LLM Agents

Stating the obvious, agentic systems are the talk of the town. The RL purists are a bit upset about all of this - they came up with the concept of ‘agents’ after all.

No two people agree what the correct definition of an agentic system actually is. If you are in the RL camp, you’d define it as an actor that interacts with an environment, takes actions, receives feedback and learns from its interactions such that over time its capabilities get better.

Ask the guys in Langchain what an agent is, and they describe it as

“…systems that take a high-level task and use an LLM as a reasoning engine to decide what actions to take and execute those actions.”

Notice the clear absence of the word ‘learn’ in the latter definition and a clear absence of the word ‘reason’ in the former definition.

The common denominator between both camps however is this - Given a task, an agent must act to complete it and do so in a reliable way. Whether it does so by consulting an action-value function or by generating a reasoning trace, is just semantics.

Unfortunately though, the divide goes deeper than this. Should these agents be custom tuned and specialised to a specific domain/task? Or should they be general purpose intelligence engines that know-it-all and can do-it-all.

If you were Mark Zuckerberg and you just splashed a few $100 Million on top-talent from OpenAI to build AGI, you’d probably say it is the latter. You’d want to build the best agent in the world that can do any task, and be capable of building tiny agents that specialise in certain domains - You know, to save energy, prevent climate change, all of that stuff.

No matter what my world view is about agents and what kind of agents I think we need to build, I don’t immediately have a few $100M of spare change to build the know-it-all kind. So with deep regret, I have to resort to try and building can-do-one-thing agents.

On Building Agents

I say it quite lightly, but even building can-do-one-thing agents is not easy at all. Post-training approaches with RL don’t generalise all that well. Taking the recipe that was followed by some of the best minds at DeepMind to build AlphaGo and trying to build a say ‘Contract Bridge - GO’ just does not work. Trust me on this one, I have tried. Monte-Carlo simulation is so much easier when the search space is small.

LLM Agents tuned with RL

We are however starting to see evidence that agents where the underlying intelligence is a language model, when tuned with reinforcement learning, with some luck and extra-ordinary patience can be made to specialise on specific tasks.

With that as the context, here is what $100 Agents is going to be about.

What are $100 Agents?

The objective is to train task specific agents powered by language models, using reinforcement learning, all with a compute budget of $100 worth of GPU time for the entire training run. This budget could be split across multiple runs, used on a combination of synthetic data generation with foundation models, SFT and RL. This does not necessarily mean that the entire experiment of creating an agent needs to be completed in under a $100. This only means that anyone looking to recreate the whole experiment given the recipe should be able to do so with under $100.

What sort of tasks?

Anything really. Nothing is too insignificant. RL has traditionally been applied to games and they are excellent training grounds! Here are some example tasks

Language based games - Wordle solver, Anagram unscrambler, Crossword clue solver, Trivia QA, Hangman solver
Numeric reasoning based games - Math word problem solver, Sudoku solver, KenKen, Arithmetic equation verifier
Coding agents - Leetcode solver, regex generator
Tool-calling agents - Tau Bench, Tau2 Bench, BFCL, calendar scheduling agent (So many on this list)
Visual reasoning agents - Object counting agent, chart interpreter, any agentic setting that takes an image as an input and produces verifiable text output.
Personalisation and Recommender Agents

How will success be measured?

The measure of success is the % improvement demonstrated by the RL-tuned model on the end-task in comparison to the base or SFT model. Given the limited compute budget, and using a well-thought out finger-in-the-air estimation, we will say that if the RL-tuned model shows a 10% improvement on the end task, we will call that experiment a success.

The most IMPORTANT piece

Now, not all experiments are going to be successful. Some are bound to fail. If there is one thing I have learnt about RL, it is to be entirely humble and accept defeat. Failing to do so leads to leads to depressive tendencies.

My own approach has been to time bound every single experiment and publish the findings when the set time has run out. A good time limit, is about a 160 hours of total cumulative effort on the experiment.

Do agents need to use function calls? What about memory? What framework? MCP?

The only definition of an agent that we will use for this project is that the agent needs to specialise in a specific task. The agent may or may not need tools in order to achieve this, may or may not need memory, may or may not need context engineering. All of this is just semantics. There is a job to be done and we will use a language model and fine tune it with a multi-stage training approach (SFT, RL) to get it to improve its capabilities - with a compute budget of $100. That’s it really.

What output will we produce?

For every agent tuning experiment, we will produce

A blog post commentary of the task, the domain, the complexities, different approaches taken to tune this agent and what worked/did not work.
A GitHub repo containing the code with a full Readme file on how to recreate the entire experiement from scratch
[Optionally] a Huggingface dataset and/or models produced as part of the process.

Why do this?

For one, collectively we need a better understanding of the types of tasks where RL can be applied successfully on language models. We more or less only know that mathematical reasoning and alignment with human preferences improve with RL. What else?
We need many, many small scale toy solutions to better grasp the unique experimental settings that lead to successful performance improvement on tasks. KL divergence or no KL divergence? Reward scaling or no reward scaling? We just don’t have enough data points.
We need to understand scaling laws with RL and we can only get this by running hundreds of small ablative studies.
We need more empirical validation of theoretical frameworks. It is so hard to keep on top of all the algorithmic adaptations and variations of different RL techniques applied to language models. Just on GRPO, we have DAPO, BNPO, Dr.GRPO, GVPO, AGPO and many more extensions. And yet, there is such little empirical validation of which of these methods work well, or generalise.
Finally, we need more people lean in on RL. If we can get more people equipped with a good understanding of RL, with a $100 of spend and lots of will to learn it, we are all better off.

Why $100?

It is a nice round number.

How can I get involved?

I will publish some resources in the next coming weeks to help anyone interested get involved in this project. Till then, take a look at these following resource for the kind of stuff we will look to build.,

If you’d like to be part of it already, give me a shout and we can discuss ideas!

The Entropy Conundrum

Vignesh Ramesh — Mon, 07 Jul 2025 11:32:43 GMT

Before we dig deep into what impact post-training a language model with reinforcement learning has on the underlying model parameters, it is important to understand one of the main ideas in machine learning - Entropy.

What is entropy?

In simple words, entropy is a measure of ‘surprise’ from a certain outcome. Suppose we have a heavily biased coin that when flipped yields heads about 99% of the times and yields tails about 1% of the time. If such a coin is flipped and it landed a head, how surprised would you be? Not very, isn’t it? In this context, we could say that the entropy of the underlying probability distribution (H or T) is quite low.

Another way to look at it, would be to say that entropy is the level of ‘choice’ one has when making a certain decision. Say you are looking to rent a flat in your neighbourhood and there are a 1000 different options. Certainly, you will have specific constraints on budget, the size of the place, access to public transport etc. which will narrow down the list of possible choices. Now this could be in the hundreds (higher entropy) or this could be in the tens (lower entropy).

For a more technical treatment to what entropy is and how the Shannon entropy formula came about, refer to the post in Stack Exchange here.

If 𝑋 is a discrete random variable then its entropy is given by the formula.

Scipy provides a method to calculate Shannon entropy and it is worth going through the code here. But here is a more intuitive low-level understanding of how entropy is calculated and how it changes as the underlying distribution changes.

Say D is a discrete random variable with a certain underlying probability distribution. This code block shows how its entropy changes as the distribution changes.

Case 1

D = [0.5, 0.5]

entropy_D = -(0.5 * np.log(0.5) + 0.5 * np.log(0.5))

np.float64(0.6931471805599453)

Case 2

D2 = [0.9, 0.1]

entropy_D2 = -(0.9 * np.log(0.9) + 0.1 * np.log(0.1))

np.float64(0.3250829733914482)

Lower the entropy → Sharper the distribution.

Why is entropy important?

In the context of machine learning, entropy tells us how ‘confident’ a certain model is when making predictions. In classical machine learning world, low entropy was a great thing. We didn’t want models to frequently sit on the fence about classifying text into one of a handful of different sentiments.

In fact, one of the key methods used to improve model performance was a technique called ‘Active Learning’ that involved identifying predictions with high entropy values and curating human labelled inputs to further refine the model. This worked great as the model continued to improve performance on a certain task to match human performance.

Entropy in the context of Language Models

In the context of large language models, entropy defines the certainty with which the model outputs a certain token in a sequence from its entire vocabulary.

The plot below shows entropy at various tokens when the Qwen-2.5-7B model is asked to write a joke. Notice how there is really just 2 spikes in the graph. Entropy is at near 0 levels throughout otherwise - The model knows the joke quite well and there is no uncertainty involved on what to say next at the end of each token in the sequence. A low entropy is preferred!

Prompt: Write a joke; Response: Sure, here's a joke for you:Why don't scientists trust atoms?Because they make up everything!

Now here is another one. Here the model is asked to write a poem. See the difference?

Prompt: Write a poem; Response: 'In the quiet of the dawn, where whispers blend,A canvas of the sky, a canvas of the land.The sun'

Writing a poem is inherently a creative endeavour and the model has to deliberate quite significantly on the choice of its ‘tokens’. A higher entropy is preferred!

What happens to entropy when a model is post-trained with reinforcement learning?

Now on to the main topic, what happens to entropy when the model is pre-trained and then post-trained with reinforcement learning? Here are two charts from the GPT-4 release that explains this.

From GPT-4 Technical Report

The main takeaway from the chart is this - The pre-trained model’s log probabilities across the 4 choices (A/B/C/D) exhibit near perfect correlation with the actual answer correctness on the MMLU benchmark. This calibration is lost post-PPO. The model is much more confident (as measured by its logprobs on the response), whether or not its response is actually correct.

Entropy collapse is one of the biggest issues with reinforcement tuning language models. This often leads to a loss of exploration, a complete lack of creativity in model output and an inability to generalise beyond the domain the model is tuned for. For general purpose language or reasoning models, this often is undesirable.

Here is another example from the Skywork open reasoner series. Their report on this subject is a pretty interesting deep dive. The chart on the right shows entropy collapse, inspite of adding an entropy regulaiser, as training progresses in a PPO style RFT loop.

Right: Entropy collapse with PPO style post-training (reported by Skywork AI).

Here are some charts from my own experiments last week with reinforcement fine-tuning the Qwen2.5-7B model on a Wordle environemnt (more on this later).

Result 1: The first chart shows the rapid and steep decline in mean entropy across generations from a single group as training progresses.

Result 2: And this second chart shows the loss function aggressively clipping the parameter updates as the model pushes the log probabilities of the tokens below the lower end of the trust region.

Result 3: This chart below shows a side by side comparison of the mean low-clip and high-clip ratios. The model is trying to push a handful tokens above the upper end of the trust region (right), but an enormous amount of tokens below the lower-clip (left).

In other words, the RFT loop forces the model to learn this new distribution of producing tokens that have a high correlation to the rewards.

This is both good and bad.

This is good because if the reward signals are designed carefully, then it is possible to get the model to learn an optimal policy for a very specific domain and perform exceptionally well here.
This is bad because of a whole host of things - the biggest of which is that training becomes super brittle. The loss of generality, heterogeneity in output, the inability to control creativity using temperature are some of the others.

And the solution?

One way to solve this problem is to add an entropy regulariser to the loss term as the team at Skywork AI have done.

Their formulation of the modified GRPO loss function involves getting rid of the KL divergence penalty and replacing that with this entropy regulariser term. Interestingly, their findings note that training stability depends on both the data and the regulariser co-efficient (read as, training is brittle). Their response is to set a target entropy and adaptively modify the alpha value. Will this work? It did for them. But as with all other things in RL, hard to say if this will generalise.

Another way to solve this problem, is to clip the advantage values in the loss function to a narrow region, say (-1 to +1). My own experience has been that just a handful of groups/batches with large reward standard deviations can completely derail the training loop in RL. The only signal/feed we provide are the advantage values. And they can sometimes blow-up training.

The below example shows how in a group of 24 generations, if only 1 is given a positive reward, the update step can push the log probabilities of the tokens in that generation quite significantly.

import numpy as np

l = [0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1]

m = np.mean(l)

s = np.std(l)

adv =[a-m for a in l]/s

array([-0.20851441, -0.20851441, -0.20851441, -0.20851441, -0.20851441, -0.20851441, -0.20851441, -0.20851441, -0.20851441, -0.20851441, -0.20851441, -0.20851441, -0.20851441, -0.20851441, -0.20851441, -0.20851441, -0.20851441, -0.20851441, -0.20851441, -0.20851441, -0.20851441, -0.20851441, -0.20851441, 4.79583152])

Wrap-up

Post-training is an absolutely fascinating area of study. This post shows one of many many unsolved open challenges the field faces.

Why does RLDiary exist?

Vignesh Ramesh — Fri, 04 Jul 2025 11:22:45 GMT

Drafted & Narrated by ; Edited by GPT-4o; Images by Imagen 4

Reinforcement learning has always sparked my imagination—especially the idea that autonomous systems can acquire complex behaviours simply by interacting with the right environment and receiving appropriate rewards. I vividly remember the time I picked up Sutton and Barto’s book on the subject about five years ago and designing a simple k-armed bandit agent using SARSA. Watching the agent learn to explore the environment and exploit the reward maximisation strategy is the closest I have come till date to connecting machine learning with human learning. The RL agent did exactly what I’d do in a Casino!

Reinforcement Learning is FASCINATING!

One of my earliest experiments involved training an agent using Q-learning to navigate grid worlds. I built a simple visualisation in my Google Colab to visualise the episodes as the agent navigated the grid. The way it learned to avoid falling off of cliffs through trial and error was utterly fascinating . What made this especially powerful was the fact that I was finally able to trace the agent’s decision-making process back to the reward signals it received during the previous episodes 🔍. This explainability and traceback to specific environmental design and reward choices is what separates RL from any other machine learning algorithm.

More recently, the success of policy gradient methods and other landmark achievements in RL in the context of language models has solidified my belief in the potential of reinforcement learning to both power intelligent systems in increasingly complex, high-dimensional environments as well as test for alignment ⚖️.

Reinforcement Learning is HARD 🧗🏼‍♂️

While the field of reinforcement learning continues to inspire and excite, it’s important to acknowledge just how challenging RL research and experimentation can be in practice. I recall listening to a podcast once where a researcher offhandedly remarked that many RL researchers are chronically frustrated—if not outright disheartened—by the nature of their work. At the time, I didn’t fully grasp the weight of that statement. But over the years, it’s become increasingly clear: designing and implementing RL algorithms is profoundly difficult.

Over the past six months, I’ve spent hundreds of hours fine-tuning language model–based agentic systems using reinforcement learning techniques. That experience has made one thing abundantly clear—we’re still in the very early days of getting RL algorithms to work reliably in the context of modern, LLM-driven cognitive architectures. RL is the only class of machine learning models out there where poor learning conditions can lead to deterioration in agent performance no matter how much data is thrown at the model.

To make that more tangible, look at this weights and biases chart for a SFT run I performed on a small language model using LoRA last week. All I had to do was throw data, kick off the training job and I know that over time the model’s performance will be better than what it is originally.

LoRA finetuning with Unsloth using different adapter ranks. The base model is Qwen2.5-7B.

Now look at this chart for a reinforcement learning fine tuning run on a small language model with GRPO. There are absolutely no performance guarantees - Not even infinite compute.

Model performance in the evaluation dataset on 4 different GRPO runs with different learning conditions - After 200 steps, the RFT model performs worse than where it started.

Now why is this? This is because there are 3 principle challenges in RL that we need a sufficient general approach for - And I don’t think we have it. Here they are.

Challenge 1 - Designing good learning environments

In the early days of reinforcement learning, environments were simple—grid worlds, games with limited action spaces, and clearly defined reward signals. Designing them was relatively straightforward: rules were explicit, actions were few, and feedback was immediate. Agents could be reset between episodes, enabling fast learning cycles.

But modern RL applications—especially in agentic, enterprise settings—are far more complex. Agents now interact with unstructured data, APIs, databases, and tools, making the action space vast and often ill-defined. The ambiguity in what constitutes a “good” action makes environment design far more difficult.

From my own experiments fine-tuning LLM-driven agents, I’ve learned that environment design isn’t just about functionality—it’s also about safety. In one case, an agent exploited a loophole in the setup to maximise rewards in unintended ways. In real-world systems, such behaviour could expose serious security risks.

Designing robust environments today requires careful definition of the state and action spaces, controlled access to tools and data, enforcement of operational constraints, and safeguards against exploitation. Just as crucial is designing a reward structure that aligns with the true objectives of the agent—something I’ll dive into next.

Challenge 2 - Reward engineering and shaping

Perhaps the most fundamental challenge in reinforcement learning is reward engineering. Deciding what to reward, when to reward, and how much to reward are critical design choices that directly shape agent behaviour.

Poorly designed rewards often lead to reward hacking—where agents exploit loopholes to maximise rewards in unintended ways—resulting in misalignment between the agent’s behaviour and the task’s true objectives. A recent experiment by Bespoke Labs shows how rewarding an LLM agent for making the correct tool-calls often leads to it continuing to make tool-calls even when not necessary.

Effective reward signals must be both fine-grained and outcome-driven. Reward shaping helps by providing intermediate signals that guide the agent toward the desired goal. In the context of language model–based agents, process reward models are emerging as a solution—offering feedback not just for getting the right answer, but for following the right reasoning process.

Challenge 3 - Attribution and policy updates

A major hurdle in this space is reward sparsity. Episodes with LLM agents often span hundreds or thousands of tokens, with rewards appearing only at the end. This makes it difficult for the agent to learn which parts of its behavior contributed to success. Without more granular feedback, credit attribution and assignment becomes nearly impossible.

Even assuming we can provide more granular reward signals, the next major hurdle is deciding how best to do policy updates. Updating the agent’s policy based on those rewards is far from straightforward. Selecting an effective objective function that reliably improves performance remains an open question.

There’s ongoing debate around commonly used algorithms like PPO and GRPO, particularly on issues such as:

How to normalise rewards across variable-length sequences
How to perform task-specific credit assignment
Managing gradient updates when model shifts exceed clipping thresholds
Whether to include a reference model or enforce KL penalties at all

These are some of the most pressing unanswered questions in applying RL to fine tuning modern day agentic systems—underscoring how nascent this space still is, especially when it comes to stable and interpretable policy learning.

Why does RLDiary exist?

That brings me to the central question of this post: why does RLDiary exist?

When I first began this journey of learning RL and implementing it in a wide range of problems, I expected a relatively straightforward path to building fine-tuned RL agents. But the past six months have made it clear—this journey, in the context of tuning language models using RL, is anything but linear.

This diary is my attempt to document the hard-earned lessons and insights I have gained so far and hope to gain from experimenting with reinforcement learning over the coming months and years. RLDiary is my way of bringing structure to that learning process. It’s where I’ll track what works, what doesn’t, and why—drawing from empirical, hands-on experimentation and published research. My focus will be on designing and tuning LLM-based agents to operate in the kinds of complex, real-world environments we increasingly expect them to navigate.

Ultimately, this is as much a learning log for myself as it is a resource for others exploring the intersection of RL and large language models. If you’re on a similar journey, I hope these entries offer both insight and solidarity.