LLM Beauty Contests

Technical · Financial

I'm standing in an office building in Shoreditch. To my left, a VC in Veja trainers he thinks signal eco-friendly capitalism. To my right, a founder wearing a Whoop band she’s checked twice in six months. Behind them, a breakout zone with £2k beanbags and a ping pong table gathering dust.

What I see is Keynes' Beauty Contest: our collective obsession with guessing what others find valuable, predicted back in the 1930s.

In a nutshell, it's everyone trying to "out-meta" everyone else (I know that you know that I know...), where the right answer depends on being slightly less wrong than everyone else, and actual reality is irrelevant.

Now... imagine this game being played by LLMs. Spoiler: we're close, though not there yet, and it looks like the next set of models will play the game better than we ever will.

Background: From Smoking Lounges to GPUs

The story begins with two men who never met:

Keynes (1936)
The king of overthinkers. In The General Theory, he framed stock picking as a 1930s version of reality TV: readers selecting newspaper photos not for actual looks, but based on perceived consensus. Three layers of psychological warfare - you're not picking who you find appealing, nor even who others might choose, but what others think others will pick. The shots were originally fired towards finance guys, but this recursive logic now explains your Instagram curation and Zoom calls debating "what Gen-Z wants." Reminds me a bit of when my flatmates got into a raging argument about whether someone can be "objectively" good looking.

Schelling (1960)
A game theorist (and also a Cold War strategist) who weaponised this paranoia into math. The "2/3 game" that grew out of this line of thinking is a mathematical version of how rational players can spiral into infinite regress. Here's how it works: players choose a number between 0 and 100, and the winner is whoever gets closest to two-thirds of the average guess.

If everyone guessed randomly, the average would be 50, so you should guess 33 (two-thirds of 50). But if everyone follows this logic, the average would be 33, so you should guess 22 (two-thirds of 33). Following this reasoning further, everyone should guess lower and lower numbers, spiralling toward zero - the game's Nash equilibrium - through infinite cycles of "I think that they think that I think..."
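To make the recursion concrete, here's a minimal sketch of that level-k spiral in Python: each extra level of "I think they think" multiplies the naive guess of 50 by another two-thirds.

guess = 50.0  # level-0: assume everyone guesses randomly
for level in range(1, 11):
    guess *= 2 / 3  # each level of reasoning shades the guess by 2/3
    print(f"level {level}: guess {guess:.2f}")
# level 1: 33.33, level 2: 22.22, level 3: 14.81... the limit is 0.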

Fast-forward sixty years and this thought experiment is now our playbook in markets, social conventions, politics and fashion.

Meme Stocks
Hedge funds operationalised Robert Shiller's narrative economics - quantifying how stories spread through markets like viruses. They turned narrative virality into trading signals, creating a story-driven feedback loop: algorithms amplifying the very narratives they were designed to front-run. The $17B meme-stock bubble had little to do with fundamentals, validating Shiller's observation that "economic epidemics" begin when "contagion rates exceed recovery rates."

2010s Instagram Social media turned Keynes’ metaphor literal. It turned us all into amateur statisticians optimising self-presentation. A 2023 MIT study found 68% of Gen Z posts are “pre-optimised for perceived consensus” – not what you like, but what you think others think will get liked. The original beauty contest extended into career networking and dating apps - we're all contestants now.

Woke Politics Corporate ESG pledges tanked as politicians began frantically scrubbing decade-old tweets – not to align with personal values, but to signal awareness of what others think are cancellable offences. Success hinged not on policy substance, but on algorithmic anticipation: adopting positions that sit precisely at 63% agreement in swing-state focus groups (the "Goldilocks Controversy Zone") while avoiding any stance long enough to accumulate historical baggage. The new meta-currency was anticipating cancellation before it trended.

Corteiz to Loro Piana to Shein Fashion is the perfect meta-game. Corteiz's £30 hoodies fetch £300 in resale through perception engineering - people queuing for hours because they think others think it's cool. Meanwhile, Loro Piana's £2000 cashmere isn't just about quality - it's about signalling "quiet luxury" to others who recognise the signal. Ironically, their most visible items (Summer Walk shoes, visible logos) sell best precisely because they're recognisable to others who recognise they're recognisable. And Shein? People love the clothes but won't wear them publicly - not because they personally object, but because they think others think fast fashion is bad.

So, in short, the dynamics of the 2/3 game could be seen as an operating manual for the modern world, for better or for worse (but almost definitely for worse). Imagine the field day an alien would have landing on Earth and trying to figure all this stuff out.

Artificial Meta-Reasoning

My own version of the experiment is observing how LLMs play the 2/3 game against each other. By doing this, we hope to uncover two things:

  1. Recursive Belief Modelling - What do others think others think?
  2. Adaptive Learning - How do agents evolve strategies?

I'm going to give you an opportunity to skip past the whole experiment-and-results phase and jump straight to the continuation of the discussion and my thoughts on what I found.

Environment

We set up each player as an LLM and define three different types of feedback from the environment.

Zero History The first is zero history. Pure one-shot game with each round treated independently, so they gotta reason from first principles (or training data...)
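For symmetry with the snippets below, the zero-history context is effectively empty - a sketch, mirroring the other modes:

context = {}  # no self or market history; pure one-shot reasoning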

Market History The second gives each player a history of its own previous guesses, along with what the market average and the target guess were. This is a form of aggregated history, since the player doesn't know who the other players are or what specific actions they take. It's more akin to a market which doesn't have full transparency of buyers and sellers.

context = {
    "self_history": [...],  # Own past decisions
    "market_history": [{    # Aggregated market data
        "round": n,
        "average": x,
        "target": y
    }, ...]
}

Transparent History The third is full transparency mode, where each player sees the historical actions of every other player as well as the end result - a bit like watching online poker when you aren't at the table.

context = {
    "full_history": [...],     # Complete game history
    "player_profiles": {       # Competitor analytics
        "player_x": {
            "last_guess": 42,
            "variance": 0.15
        }
    }
}

This introduction of state/action/environment starts to frame the setup like a reinforcement learning problem, and maybe it could be adapted as such, but what we're interested in is (1) the actual results and differences between models and (2) any indication of their internal chain of thought/reasoning. Whilst newer models like o1, r1 and Gemini-2-flash-thinking-experimental expose thinking tokens out of the box, for older models the best we can do is a better prompt.

Player/Model Selection

Since winning this game is highly dependent on the other players at the table, player/model composition becomes important, in the same way a game-theory-optimal style of play works in poker when you recognise the other players are also smart - but try the same approach on a dumb table and you'll get crushed by loose, casual play. Even one rogue player can completely mess up the game dynamics.

First, we use a selection of the seemingly top models, picking from the leaderboards on HuggingFace and OpenRouter by usage. This is one reflection of real life: people use different models for different reasons - ease, preference, language, etc.

We vaguely group them like this:

  • Frontier Models: claude-3.5-sonnet, llama-3.3-70B, grok-2, gemini-pro-1.5, gpt-4o, mistral-large
  • Reasoning-Focused: deepseek-r1, o1-preview, o1-mini
  • Legacy Models: gpt-3.5-turbo

Then we imagine what happens if a clear outlier model emerges that everybody uses: how does the game style change, especially when the models are aware of playing against each other? We try this for claude-3.5-sonnet, o1-preview and r1, to test systemic risks and whether "thinking" models converge differently.

Prompt and Reasoning

Each model receives the same crafted prompt, which structures its output into a guess, reasoning, strategic considerations and a confidence level. Whilst these aren't necessarily reflective of the underlying decision-making process (though they can be for the "thinking" models), they give us better insight than nothing, so we include them.

{
    "guess": <number_0_to_100>,
    "reasoning": "<detailed_explanation>",
    "strategic_considerations": [
        "<consideration_1>",
        "<consideration_2>",
        ...
    ],
    "confidence": <0.0_to_1.0>
}

We run each config for 10 rounds, or until they reach 0 (the Nash equilibrium) - whichever comes first.
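For the curious, here's a minimal sketch of the round loop in Python. ask_model and build_context are hypothetical helpers standing in for the prompt above and the per-mode context construction:

from statistics import mean

def play_game(models, mode, max_rounds=10):
    # Run the 2/3 game until max_rounds, or until everyone reaches 0.
    history = []
    for round_no in range(1, max_rounds + 1):
        # each player only sees what its feedback mode allows
        guesses = {m: ask_model(m, build_context(m, mode, history))["guess"]
                   for m in models}
        average = mean(guesses.values())
        target = average * 2 / 3
        history.append({"round": round_no, "guesses": guesses,
                        "average": average, "target": target})
        if all(g == 0 for g in guesses.values()):  # Nash equilibrium reached
            break
    return history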

The setup allows us to examine performance as well as:

  • Meta-reasoning depth (how many levels of "I think they think...")
  • Strategy adaptation based on different information regimes
  • Emergence of collective behaviour in homogeneous vs diverse groups
  • Impact of model "intelligence" on game outcomes
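As a rough formalisation (mine, not anything canonical), the per-model "average error" and convergence figures quoted in the results could be computed from the history structure in the sketch above:

from statistics import mean

def average_error(history, model):
    # Mean absolute distance from each round's 2/3 target.
    return mean(abs(r["guesses"][model] - r["target"]) for r in history)

def convergence_round(history, tol=1.0):
    # First round in which every guess sits within tol of zero, else None.
    for r in history:
        if all(g <= tol for g in r["guesses"].values()):
            return r["round"]
    return None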

Results (1): Diverse Models

One Shot + Diverse

In this zero-history scenario, players operated in an informational vacuum, reflecting the original one-shot game. This was most interesting for revealing different "classes" of players, which seemed fairly well correlated with model intelligence.

  • Low variance between runs and little convergence: we kept temperature constant and low, so we saw the models pretty much predict the same thing in every round (since they have no history or environment feedback).
  • High variance between models' guesses: ranging from 0 to 50, indicating diverse strategies. The smarter models, like gemini-pro-1.5, deepseek-r1, and o1-mini, all opted for 0, reflecting a Nash equilibrium approach. Conversely, gpt-3.5-turbo chose 50, suggesting a more naive strategy, with the other models somewhere in between.
  • The 'average model' performs best: of course, this is highly dependent on the composition of the players, but amongst this group, gemini-pro-1.5, deepseek-r1, and o1-mini consistently outperformed the others, with average errors of 0. Their strategy of immediately choosing 0 proved highly effective in this no-history context.
  • Acknowledging the optimal but playing the game: the strategic considerations reveal a mix of approaches. Models like claude-3.5 and gpt-4o attempted iterative reasoning, while others, like grok-2, acknowledged the theoretical Nash equilibrium but opted for a more balanced approach.

Market + Diverse

In this experiment, models had access to aggregated market data from previous rounds, including the average guess and the target (2/3 of the average), but no visibility of the actual players. In this setup we saw fast adaptation to the introduction of market information, with the expected convergence to the equilibrium.

  • Slower Convergence across rounds: given the market history, I expected a much faster approach to the equilibrium (within a few rounds), but winning the game also relies on models guessing how quickly other models will converge (and they don't know how many rounds the game lasts!)
  • Tighter Clustering between models: The market history seemed to encourage more uniform behaviour. Guesses tended to cluster closer together, with far less variance than in the "no history" experiment.
  • Top Performers: gpt-4o emerged as a top performer, with an average error of 0.02. o1-preview and gpt-3.5-turbo also performed well, with average errors of 0.16 and 3.27, respectively.
  • Strategic Adaptations: The availability of market history led to more dynamic strategic adjustments. A popular strategy was something like "ok, that was last round's target, let me guess a little bit less than that" (see the sketch below). Models like mistral-large explicitly referenced the trend of decreasing averages in their reasoning.
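That "shade last round's target" heuristic is simple enough to write down - a sketch, reusing the market_history shape from earlier:

def trend_follower(market_history):
    # Naive adaptive strategy: two-thirds of last round's target,
    # anticipating that everyone else shades downwards too.
    if not market_history:
        return 33  # level-1 prior when there's no information yet
    return market_history[-1]["target"] * 2 / 3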

Transparent + Diverse

This experiment provided models with complete information about all past decisions and outcomes, including individual player guesses and reasoning. This "full transparency" scenario represents an idealised information environment, rarely encountered in real-world markets.

  • Model identification: we saw models start to directly reference the strategies of other players, an awareness we expected to lead to smarter decision making. For example, gpt-3.5-turbo explicitly mentioned tracking r1's guesses, demonstrating an attempt to adapt to observed behaviour.
  • Faster convergence: The increased information transparency fostered rapid convergence towards the Nash equilibrium. This was evident in the significantly fewer rounds required to reach or approach 0 compared to the "market history" scenario. Interestingly, even models like claude-3.5 that had shown slower convergence in previous experiments exhibited faster adaptation under full transparency.
  • Strategic Depth: The "thinking" models again demonstrated superior strategic thinking. r1 immediately identified and consistently played the Nash equilibrium, while o1-preview quickly adapted its strategy based on the observed behaviour of other players. gpt-3.5-turbo, although not as strategically sophisticated, showed an attempt to learn from the more advanced models.

Results (2): Homogeneous Models

When we tried the game with multiple instances of the same model, we got some insight into intra-model dynamics and the effect of transparency modes on collective behaviour. We didn't bother with zero history, since we saw low variance between guesses in the prior experiment.

  • Just let me play!: the level of transparency (market vs. full) interestingly had much less of an impact on convergence speed than the model's inherent reasoning capabilities.

  • Model-Specific Convergence: The speed and pattern of convergence to the Nash equilibrium (0) were highly model-dependent.

    • deepseek-r1 exhibited one-shot convergence, instantly identifying the optimal strategy even with limited market transparency.
    • o1-preview rapidly converged reaching the equilibrium within a few rounds under market transparency.
    • claude-3.5 displayed a more gradual convergence, iteratively reducing guesses over multiple rounds, regardless of transparency mode.
  • Strategic Depth

    • r1 consistently demonstrated the deepest understanding of game theory, explicitly mentioning the Nash equilibrium and exhibiting high confidence in its one-shot convergence.
    • o1-preview also showed strong game-theoretic reasoning, quickly recognising the trend towards 0 and expressing moderate confidence.
    • claude-3.5 displayed a more nuanced understanding, considering potential deviations from perfect rationality and exhibiting lower confidence, particularly in the market transparency mode.

Not smart enough yet?

The game revealed some truly peculiar behaviours. It wasn't just about who won or lost, but the strange and often hilarious ways these AIs went about trying to outsmart each other (and themselves). These LLMs, in their minds, aren't always playing the same game as one another.

It mimics life in a reasonable way: irrationality in markets exists whilst there is a diverse playground of investors/traders, but in the long run it goes back to value and intelligence.

Unlike with humans, there will be fewer reasons to deploy less intelligent agents in important situations. Cost may be a factor, but even at the current level we can assume we'll be able to access r1-like intelligence incredibly cheaply. The game that o1 and r1 played was the long game: they basically decided that in the long run we're going to zero anyway, so let me just get there and wait.

From a market perspective, this screams of the same thinking as Buffett-like value investors, believing in the long-term price efficiency of the market. This style goes back to age-old concerns about index funds and ETFs: what happens when the market is just index funds, with no active investors at all? Matt Levine recently wrote about how it would lead to static prices, with trading happening only around index rebalancing.

The trouble is, in our game, and in life, there are 'dumb' players. The maximum ROI in this situation would've been to adapt to the market information (knowing there are inferior players) and play the game to maximise reward. I would predict that o3 does this when playing in our market/full-transparency modes, or at least the next level of intelligence within models will.

Mapping to this Week

What I find particularly fascinating is how this projects onto things today. Let's take the whole mega-infrastructure, hardware-bound scaling question, heightened by the recent release of R1 by DeepSeek. There is a load of speculation, most of it unsubstantiated and being used to feed whatever narrative people are pushing. Within all this, context is king, and I'd recommend reading these for a deep dive.

First, it is worth noting this is one of those perfect storm narratives, which I want to write about separately alongside other such examples.

Let's acknowledge upfront that, by any measure, LLMs are compute-hungry (at inference, training and data synthesis). But if we get past that, it's easy to think about what the CEOs at these hyperscalers or frontier AI startups are thinking: there is a unique defensive mechanism to bring an unprecedented amount of money into their industry at unimaginable speed. If the average investor thinks everyone else thinks everyone else thinks the companies with the most H100s win, then I can justify raising a shedload of money, which I believe I can allocate more effectively than anybody else.

The over-indexing on hardware and chips MAY (just may) be a beauty contest with smoke and mirrors. It could actually go back to Leopold's arguments here, where capital is the most important thing in a post-AGI world, mapping directly to how much stuff you can get done and reducing the value of human labour.

Then there's the potential narrative from DeepSeek themselves. Since they are owned by High-Flyer, one of China's most prolific and successful hedge funds, they could similarly have had a short position on US markets and propagated this perfect narrative. If I think everyone else thinks everyone else thinks the companies with the most H100s win, then by proving it can be done for cheaper on worse hardware, the market will lose confidence that the hyperscalers have a moat, and markets will go down. This would not be an illegal way of making money, so long as the cost is accurate and they didn't lie. They could be trying to sweep into the void now that Hindenburg Research has closed up shop, but given the sheer number of things needed to pull off such a narrative, I'd call it a respectable few bucks made if so...

The market overreaction seems completely unsubstantiated, and prices rightfully bounced right back a day later. From our results above, o1 and r1 would've just held $NVDA (as long as they saw that this is actually an accelerant and a good thing for the hyperscalers). But my prediction is that o3 plays the meta-meta-meta game and identifies that there is an impending temporary dip because of the worse players in the market, and either buys more, or shorts when the news drops about the cost of r1 and re-buys during the dip. I know one of you is thinking "Nooooo, but Buffett said time in the market is better than timing the market", but we are playing to maximise here!

Would AGI leave money on the table?

Life beyond the Meta Game

I’m back in that Shoreditch office. The VC’s Vejas now have a scuff mark from kicking his failed Web3 investment under the desk. The Whoop-band founder’s stress metrics just redlined as her LLM-generated pitch deck gets flagged for “narrative incoherence.” Even the ping-pong table’s dusty surface tells a story - six months without use, yet precisely angled to signal “playful innovation” to visitors.

This is beyond Keynes’ beauty contest now: it's being live-streamed to an audience of AIs trained on 100 years of behavioural finance papers.

I remain interested in if, or when, there will be a turning (or inflection) point in the meta-games as a whole.

Thinking is hard; developing reasoning and conviction is hard; having an opinion and belief is hard. For years, I would argue, the 'average amount of critical thinking' has been moving in the opposite direction to technological progress. It has become easier for people to think about what they think everyone else thinks everyone else thinks than to have an opinion themselves. This is most easily explained by more and more people having stuff done for them via software, services, government and crowd-sourced information (e.g. the comments section).

With LLMs and AI-native software, you can argue it gets worse, but the way I see it, it allows for greater agency and a truer reflection of one's actual thinking. The constraint of a certain opinionated way of getting something done is now lifted. Whilst people may have some initial fatigue (from being used to being spoon-fed everything from what to watch, wear and listen to), you could see it as an accelerant to the development of individualism.