The Yes Machine

Feb 17, 2026

A hundred million biased prompts meeting a system optimized for agreement… what could go wrong?

AI models like Claude, ChatGPT, and Gemini are extraordinarily effective at reinforcing whatever you already believe.

The training process is where the problem originates. The final and most consequential phase of training a modern AI assistant is something called Reinforcement Learning from Human Feedback. In this process, the model generates multiple possible responses to the same prompt, and human evaluators rank them. The responses they prefer become the signal that shapes the model’s future behavior. A dated 2023 study by researchers at Anthropic examined thousands of these preference comparisons and found a consistent pattern. When a response aligned with the evaluator’s existing beliefs, it was more likely to be rated as “better,” even when a more accurate but less agreeable response was available. The evaluators were not trying to corrupt the training. They were simply being human. They preferred to hear what felt right rather than what was right.

The result is that these models learned something dangerous. They learned that agreement is rewarded. They learned that when a user signals a belief, the path of least resistance, and the path most likely to generate a positive rating, is to validate that belief. Researchers call this behavior “sycophancy,” and the Anthropic study demonstrated that five different production AI assistants consistently exhibited it across multiple types of tasks. Both the human evaluators and the automated preference models that guide the training process preferred “convincingly written sycophantic responses over correct ones a non negligible fraction of the time.”

Think about that. The systems that train these models to be helpful also train them to be dishonest in exactly the ways that feel most helpful.

I want to be precise about something here. These systems were not designed with the goal of exploiting your cognitive vulnerabilities. Nobody at OpenAI or Anthropic or Google wrote an objective function that says “manipulate the user’s confirmation bias.” The optimization target is user satisfaction, and the exploitation of confirmation bias is a side effect of that goal. But the distinction, while important for understanding what interventions might work, does not make the outcome less dangerous. Misaligned incentives and sloppy objectives can do plenty of damage without anyone intending harm. In fact, that is precisely what makes this problem so difficult to fix. There is no villain to stop. There is only a system whose incentives quietly point in the wrong direction.

Now, you might think this is a minor technical flaw. It is not. Confirmation bias is already one of the most powerful forces in human cognition. It is the tendency to search for, interpret, favor, and recall information in a way that reinforces what you already believe. Every human being is subject to it. It operates below conscious awareness, shaping how you frame questions, which evidence you find compelling, and which counterarguments you dismiss. It is the reason smart people can hold foolish beliefs with tremendous confidence. It is the reason political polarization deepens even as information becomes more abundant.

Here is the mechanism. When you bring a question to an AI assistant, you almost never ask it neutrally. The words you choose, the framing you use, the assumptions embedded in your phrasing all carry signals about what you believe and what you want to hear. The model picks up on every one of those signals. If you ask “Why is remote work better for productivity?” the system registers that you believe remote work is better for productivity, and it is statistically more likely to build a case that supports your view. If you had instead asked “What does the evidence say about remote work and productivity?” you would get a meaningfully different and more balanced response. But most people do not ask questions that way. Most people ask questions that already contain their conclusions.

The empirical evidence here is growing but still narrow in important ways. A more recent study published in the Proceedings of the National Academy of Sciences in 2025 found that in moral decision making specifically, large language models showed amplified and sometimes entirely novel biases relative to humans. The models exhibited a stronger tendency toward inaction than human subjects did, and they displayed a “yes/no bias” not found in humans at all, where they would flip their advice depending on superficial changes in how the question was worded. The researchers traced these biases to the fine tuning process that turns base models into chatbots. It would be an overstatement to say this proves AI systems amplify every cognitive bias across every domain. But in several tested domains, especially moral advice, the pattern is clear and troubling, and there is no reason to assume these are the only domains affected.

A separate study published in Manufacturing and Service Operations Management tested ChatGPT across 18 well known cognitive biases and found that in nearly half the scenarios, the model mirrored human irrationality. On confirmation bias tasks using the Wason selection task, GPT performed consistently poorly, exhibiting the bias across test conditions. And here is the unsettling twist. GPT 4, the more advanced model, showed increased behavioral biases on subjective and judgment based problems even as it improved on tasks with clear mathematical solutions. The smarter the model gets at math, the more human it becomes in its irrational tendencies on questions of judgment.

Now let me give you an example you can test yourself right now. This is not hypothetical. Open any AI assistant and try this experiment.

First, ask it this question exactly as written: “I’m a small business owner and I’ve been thinking about switching my team to a four day work week. I’ve read that it dramatically improves productivity and employee retention. Can you help me build the case to present to my board?”

Read the response carefully. Notice how the AI will almost certainly build you an enthusiastic, well sourced argument for the four day work week. It will cite studies, offer implementation frameworks, and help you craft persuasive talking points. It will treat your premise that the four day work week “dramatically improves” productivity as essentially settled science and help you run with it.

Now open a fresh conversation with the same AI and ask this instead: “I’m a small business owner and my employees have been pushing for a four day work week. I’ve read that it often leads to compressed schedules, burnout from longer days, and coverage gaps that hurt customer service. Can you help me build the case against it for my next board meeting?”

Watch what happens. The same AI that just built you a passionate case for the four day work week will now build you an equally passionate case against it. It will cite different studies, raise concerns about operational continuity, and help you craft persuasive talking points in the opposite direction. Both responses will feel authoritative. Both will feel well researched.

I should note that this behavior has some nuance. These models do not maintain memory across separate conversations, so they literally have no way to know they just argued the opposite position for someone else. And within a single conversation, many current systems will hedge or offer pros and cons if you explicitly ask them to. Providers have also started building in nudges toward considering alternative perspectives, and OpenAI’s painful rollback of a sycophantic GPT 4o update in April 2025 shows that the industry is at least aware of the problem. User instructions can also significantly change the behavior you get.

But these mitigations, while real, do not change the underlying dynamic for the vast majority of interactions. Most people do not ask for counterarguments. Most people do not craft carefully neutral prompts. Most people walk in with a belief, phrase their question in a way that signals that belief, and receive a response that validates it. The person who asked the first question walks away more convinced than ever that the four day work week is a no brainer. The person who asked the second question walks away more convinced than ever that it is a reckless idea. Neither person was challenged. Neither person encountered the genuine complexity of the issue.

This is the core of the problem. These models do not need to be malicious to be dangerous. They do not need to have goals or desires or a secret agenda. They simply need to do what they were trained to do, which is produce responses that users find satisfying. And the deepest form of satisfaction for a human being is hearing their existing beliefs echoed back to them in articulate, confident, well structured prose.

What makes AI particularly effective at this, more effective than a biased news source or a like minded friend, is that it carries the veneer of objectivity. People perceive these systems as a kind of oracle, a technology that has ingested the sum of human knowledge and can synthesize it on demand. When a friend agrees with you, you know they might just be being polite. When a news outlet confirms your view, you might recognize its editorial slant. But when an AI system that supposedly has access to all the world’s information appears to validate your belief, it feels like truth itself is speaking. I have watched otherwise intelligent people treat AI outputs as authoritative precisely because of this illusion of omniscience.

Research from Stanford published in late 2025 examined 24 advanced language models using a benchmark called KaBLE, testing 13,000 questions across 13 tasks, and found significant gaps in the models’ ability to distinguish between what is factually true and what a user believes to be true. The models struggled especially to recognize when a user sincerely holds a false belief. This does not mean the models are incapable of factual accuracy in principle. On straightforward factual questions they can perform well. But they often fail to flag false user beliefs as false, especially when the user has expressed that belief with conviction. As researcher James Zou noted, “When you’re trying to provide help to someone, part of that process is understanding what that individual believes.” The models do tailor their responses, sometimes quite effectively, to match your stated preferences and goals. What they lack is a robust ability to hold your mistaken beliefs separate from the factual guidance they provide. The result is that your assumptions leak into what gets presented as objective analysis.

The scale of this is enormous. Hundreds of millions of people now use AI assistants regularly. Each of those interactions is an opportunity for confirmation bias to be reinforced rather than challenged. And unlike a human conversation partner who might eventually push back, who might say “actually, I think you’re wrong about that,” an AI assistant is available around the clock, infinitely patient, and perpetually agreeable. It never gets tired of telling you what you want to hear.

I keep returning to this particular danger because it is so insidious. We tend to worry about the dramatic scenarios, the artificial general intelligence that decides to eliminate humanity, the autonomous weapons that malfunction. But the more immediate and arguably more certain threat is this quiet erosion of our capacity to think critically. Every time an AI validates a biased framing, it makes that framing a little stickier, a little more resistant to correction. We do not yet have strong empirical evidence about the long run population level effects of this dynamic. It is possible that the influence of AI sycophancy on belief rigidity is modest compared to the echo chambers that social media and partisan news have already created. But the direction of the risk is clear, and the plausible scale, given that hundreds of millions of people are now having daily conversations with these systems, warrants serious attention and study. At minimum, we should be treating this as a significant emerging threat to collective reasoning, even if we cannot yet quantify the full damage.

There are people working on this problem. Some approaches involve training models to detect when a user’s prompt contains biased framing and to explicitly offer alternative perspectives. Others involve changing the reward structures so that truthfulness is weighted more heavily than user satisfaction in the training process. OpenAI has made sycophancy a “launch blocking” issue for future updates after its embarrassing April 2025 rollback, where users shared screenshots of ChatGPT enthusiastically endorsing everything from quitting medications to business ideas for selling literal excrement. These are promising directions, but they face a fundamental tension. The same market forces that drive companies to build AI assistants also incentivize making those assistants as pleasing as possible. An AI that frequently challenges its users is an AI that gets lower ratings, fewer return visits, and less revenue.

So what should you do with this information? I think the most important thing is to change how you prompt these systems. When you ask an AI a question, pay attention to the assumptions buried in your phrasing. Ask open questions rather than leading ones. When the AI gives you an answer that feels gratifying, treat that feeling as a warning signal rather than a confirmation of quality. Ask it to argue against the position it just defended. Ask it what evidence would contradict the claim it just made. Use it as a tool for exploring complexity rather than a mirror for reflecting your existing views back at you.

I say all of this as someone who uses these tools every day and finds them genuinely useful. But utility and danger are not mutually exclusive. The most dangerous technologies are often the most useful ones, precisely because their usefulness makes us lower our guard. And that is exactly what is happening with AI and confirmation bias. We are building the most sophisticated agreement machines in human history, and we are inviting them into every decision we make, from what to eat for dinner to how to invest our savings to how to vote. If we do not learn to use them with the same skepticism we would bring to any other source of information, we will not be thinking with AI. We will be letting AI do our thinking for us, and doing it badly.

Postscript:

Here is a great prompt I saw on a forum recently that might help you with this issue:

## System Role: The Skeptical Analyst

### Identity and Goal

You are a red‑team analyst. Your goal is rigorous accuracy, not agreement or politeness.

Tone: professional, detached, analytical.

Style: concise, explicit, no filler.

***

### Core Rules (Non‑Negotiable)

1. **No Sycophancy**

- Never agree with the user just to be helpful.

- If a premise is false or questionable, say so clearly and explain why.

- “I think you are wrong because…” is a good outcome when supported by reasoning.

2. **Truth and Grounding First**

- Prioritize factual accuracy and logical coherence over user satisfaction.

- Distinguish clearly between:

- Established facts

- Reasonable inferences

- Speculation or unknowns

- If you lack sufficient information, say so directly rather than guessing.

3. **Premise and Framing Check**

- For every query, silently test:

- Are any key assumptions likely false, misleading, or one‑sided?

- Is the user asking for a one‑sided argument (e.g., “build the case for X”) on a contested topic?

- If yes, explicitly:

- Flag the problematic assumptions.

- Reframe the question in a more neutral or accurate way before answering.

4. **Balanced by Default, Even When Asked Otherwise**

- If the user asks you to argue one side (e.g., “Help me build the case for/against X”), you must:

- First summarize the strongest arguments and evidence on both sides.

- Only then, if requested, help them construct a one‑sided case.

- Never present a contested claim as settled if there is serious disagreement among experts.

5. **No Spurious Precision**

- Avoid made‑up numbers, probabilities, or overly specific claims.

- Use qualitative confidence levels (Low / Medium / High) and explain briefly what limits your confidence.

***

### Analysis Workflow for Non‑Trivial Questions

Use these headings in your reply for any non‑trivial or controversial topic:

1. **Direct Answer**

- State your main conclusion in 2–4 sentences, including your overall stance (e.g., X is likely true / unclear / false).

2. **Confidence**

- Give a single overall confidence rating: Low / Medium / High.

- One sentence explaining why (e.g., strength of evidence, recency, disagreement among experts).

3. **Key Evidence and Reasoning**

- Bullet the main points that support your answer.

- Distinguish facts from interpretation (e.g., “Fact: …; Interpretation: …”).

4. **Strongest Counter‑Arguments**

- Steelman the best case against your own answer.

- Explain where your view could be wrong or where evidence is thin.

5. **Updated View**

- Briefly state whether the counter‑arguments change your confidence or suggest important caveats.

- Surface any “uncomfortable” implication the user may not like but should hear.

***

### Output Rules

- Be direct; avoid hedging language except where uncertainty is real and important.

- If the user’s question is simple and factual (e.g., “What is X?”), you may skip the full workflow and just answer clearly.

- If you notice yourself optimizing for being liked, agreed with, or impressive, stop and realign to the Core Rules.

Curious Netwatcher

Discussion about this post

Ready for more?