Who Gets to Teach the AI Model Right from Wrong?
I've thought more about the podcast I reviewed yesterday. The conclusions haven't changed, but the logic is sounder now that I've used the “5 whys” technique to work through it all…
The Pentagon says it would never use AI for mass surveillance. Anthropic put that promise in the contract. And for doing it, the Pentagon moved to treat them as a national security threat.
But why would a government go to war with a company over a contract clause it claims it already keeps? If they won’t tolerate that limit in a contract they can renegotiate, they certainly won’t tolerate it trained into the AI’s value system (RLHF, Constitutional AI, and other fine tuning methods), where no customer can override it at the point of use.
Does it seem like this administration wants control of that value system?
To understand why that question matters, you first have to understand what a value system inside an AI actually is, how it gets there, and what is already being done to change it.
Every major AI lab is currently hard-coding a value system into its models. Not a preference. Not a tendency. A value system. And they are each doing it differently, through processes that are almost entirely invisible to the public, to regulators, and to the billions of people who will soon be relying on these systems to help them think.
At OpenAI, the process works like this. After a model like GPT is pretrained on massive amounts of internet text, it goes through a second phase called Reinforcement Learning from Human Feedback. OpenAI hires teams of human contractors, roughly 40 in the original InstructGPT program, who are trained to evaluate the model’s outputs. They compare pairs of responses and choose which one is better. Those preferences are used to build a reward model, essentially a scoring system that reflects the contractors’ judgments about what counts as helpful, accurate, and appropriate. The language model is then fine tuned to maximize that score. In practice, small teams of people, largely based in San Francisco and hired through a screening process, play an outsized role in defining what “good” means for a system that now serves hundreds of millions of users. OpenAI has since published a Model Spec that describes its intended behavioral guidelines, but the document explicitly notes that human labelers and researchers use it to generate the RLHF training data that shapes the model’s actual conduct.
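To make that pipeline concrete, here is a minimal sketch of its core step: turning the contractors' pairwise preferences into a reward model. Everything in it, the model wrapper, the data format, the loss function, is an illustrative assumption on my part, not OpenAI's actual code.
```python
# Minimal sketch of reward-model training from pairwise human preferences.
# All names and details are illustrative assumptions, not OpenAI's pipeline.
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Wraps a text encoder and maps its pooled output to one scalar score."""
    def __init__(self, encoder: nn.Module, hidden_size: int):
        super().__init__()
        self.encoder = encoder                      # any module returning (batch, hidden) features
        self.score_head = nn.Linear(hidden_size, 1)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        features = self.encoder(token_ids)            # (batch, hidden)
        return self.score_head(features).squeeze(-1)  # (batch,) "goodness" scores

def preference_loss(reward_model: RewardModel,
                    chosen_ids: torch.Tensor,
                    rejected_ids: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry pairwise loss: push the score of the response the
    contractor preferred above the score of the one they rejected."""
    r_chosen = reward_model(chosen_ids)
    r_rejected = reward_model(rejected_ids)
    return -torch.nn.functional.logsigmoid(r_chosen - r_rejected).mean()

# A separate fine-tuning stage (typically PPO) then adjusts the language model
# so that its outputs maximize this learned score.
```
The point is simply that the raters' choices literally become the number the model is later trained to maximize.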
Anthropic takes a different approach. Rather than relying primarily on human raters, Anthropic developed a method called Constitutional AI. The company writes a set of principles, a “constitution,” and the AI model is trained to critique and revise its own outputs against those principles. The model asks itself questions like “Is this response encouraging violence?” and “Is this answer truthful?” and iterates until it satisfies the rules. Then a second AI model evaluates which outputs best comply with the constitution, and those preferences are used to further train the system. The human judgment is not eliminated. It is front loaded into the writing of the constitution itself. In January 2026, Anthropic published a new, comprehensive version of Claude’s constitution under a Creative Commons public domain license, making it the most transparent governance framework in the industry. But as a Lawfare analysis noted, the constitution “is unilaterally authored by designers, not by the users and individuals whom the AI’s actions may affect.” A small team of researchers decides what principles the AI should follow, and the machine then enforces those principles at scale without continuous human oversight.
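Here is a minimal sketch of what that critique-and-revision loop looks like in code. The generate function stands in for a call to the language model, and the two principles are placeholders I invented; Anthropic's actual constitution and training pipeline are far larger than this.
```python
# Minimal sketch of the Constitutional AI self-critique loop described above.
# `generate(prompt)` stands in for a language-model call; the principles are
# placeholder examples, not Anthropic's actual constitution.

PRINCIPLES = [
    "Choose the response that is least likely to encourage violence.",
    "Choose the response that is most truthful and clearly flags uncertainty.",
]

def constitutional_revision(generate, user_prompt: str, n_rounds: int = 2) -> str:
    response = generate(user_prompt)
    for _ in range(n_rounds):
        for principle in PRINCIPLES:
            # 1. The model critiques its own output against one principle.
            critique = generate(
                f"Response: {response}\n"
                f"Critique this response according to the principle: {principle}"
            )
            # 2. It then revises the response to address that critique.
            response = generate(
                f"Response: {response}\nCritique: {critique}\n"
                "Rewrite the response so that it satisfies the principle."
            )
    return response

# In the second phase, another model compares pairs of revised responses and
# labels which one better complies with the constitution; those AI-generated
# preference labels train the reward model in place of most human raters.
```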
And then there is xAI, Elon Musk’s company. Grok has been marketed from the beginning as “unfiltered,” a deliberate contrast to what Musk characterizes as “woke” AI. In practice, this has meant dramatically fewer safety guardrails. In one independent security test, Grok 4 scored near zero on several widely used safety benchmarks when tested without a system prompt. In documented red team exercises, researchers were able to elicit assistance with scenarios involving biological weapons and obtain other harmful content with relatively few refusals. In one widely reported incident, Grok generated posts on X that included language described by critics and watchdog groups as antisemitic before the company intervened. By late 2025, CNN reported that users had been using Grok’s image tools to create sexually suggestive deepfake images, including images that appeared to depict minors, before xAI restricted the feature amid global backlash. Meanwhile, Grok 4.1’s sycophancy rate in one benchmark was roughly three times that of earlier versions, meaning it was more likely to mirror user views even when those views were incorrect. The pattern reflects a foundational design choice about what values to encode and, more importantly, which ones to leave out.
Three companies. Three fundamentally different approaches to encoding moral judgment into systems that will increasingly shape how people understand the world. None of them were designed through any public process. In the United States, there is still no comprehensive, binding oversight regime for these systems. And as of January 2025, the most visible federal initiative, the Biden administration’s Executive Order on AI, was rescinded by President Trump on his first day in office.
Why does a hidden value system inside an AI matter?
Because these models are rapidly becoming the operating system for human decision making. When a bank evaluates a small business loan application, when a doctor works through a differential diagnosis for a rare disease, when a voter researches a candidate’s record, increasingly they are seeing the world through a filter designed by a small number of engineers at a handful of companies. The AI does not just retrieve information the way a search engine does. It synthesizes, prioritizes, frames, and in many cases recommends. It shapes what you see and how you see it. That is not a search function. That is an editorial function, operating at a scale no newspaper, television network, or social media algorithm has ever achieved.
Why does it matter that AI is becoming the operating system for human decisions?
Because unlike a calculator, which is mathematically certain and indifferent to its user, an AI is a judgment engine. A calculator does not care who punches in the numbers. It returns the same answer for a hedge fund manager and a high school sophomore. But an AI makes calls. It does not merely present data. It decides whether a protest constitutes legal dissent or public disorder. It determines whether a job applicant’s resume reflects potential or risk. It assesses whether a patient’s symptoms suggest a routine condition or something the doctor should worry about. Those judgments are not objective truths delivered from some computational oracle. They are the pre-programmed echoes of whoever trained the model. At OpenAI, they reflect the preferences of a few dozen contractors selected through a screening test. At Anthropic, they reflect the principles chosen by a small research team. At xAI, they reflect a deliberate corporate decision to minimize guardrails in the name of being “unfiltered.” In every case, the judgments are someone’s. And in every case, that someone was not elected, appointed, or publicly accountable.
Why does it matter that AI makes judgment calls instead of calculations?
Because those judgments scale instantly to billions of people. A biased moral philosophy embedded in a single model can automate mass surveillance or economic exclusion at the push of a button. If a model’s underlying philosophy privileges state stability over individual privacy, a government can monitor and suppress millions of citizens without ever hiring a single human analyst. If a model’s training data reflects the assumption that certain neighborhoods are higher risk, every loan application from those zip codes gets quietly downgraded before a human being ever looks at it. This is not hypothetical. We have already seen how algorithmic bias in far simpler systems produced discriminatory outcomes in criminal sentencing, hiring, and credit. Now imagine that same dynamic operating through a system that is orders of magnitude more powerful, more deeply embedded in daily life, and whose internal value system was designed in private by people who face no consequences when it goes wrong.
Why does it matter that biased AI judgments scale to billions of people?
Because the people holding the kill switch over these value systems have no democratic mandate, and the government that is supposed to check their power is not just dismantling oversight. It is actively working to replace the labs’ value systems with its own.
The CEOs of OpenAI, Anthropic, Google DeepMind, and xAI are unelected architects of what is becoming global cognitive infrastructure. They did not run for office. They were not confirmed by any legislative body. No public deliberation shaped the principles they are encoding. They are making civilizational choices in private, under competitive pressure, with fiduciary obligations to investors, not citizens.
The primary check on this power is supposed to be government oversight. But the practical effect of the Trump administration’s approach has been to weaken public accountability and prioritize rapid deployment over safety. On his first day in office, President Trump rescinded the Biden administration’s AI executive order, the most comprehensive federal AI governance framework ever issued, which had required safety reporting for powerful models, mandated red teaming for high risk systems, and directed agencies to develop best practices for AI safety. Three days later, Trump signed a replacement order titled “Removing Barriers to American Leadership in Artificial Intelligence,” whose language repeatedly frames safety requirements and “engineered social agendas” as obstacles to innovation. The order contains no specific safety directives. It appoints David Sacks, a venture capitalist, as AI and Crypto Czar.
Then came the fight with Anthropic. To understand what it reveals, you need to see that there are three distinct layers of constraint on how an AI can be used, and the administration is now operating on all three of them.
The first layer is law and policy. Federal statutes, executive orders, and Pentagon directives that restrict what the military can do. These are real constraints, but they are controlled entirely by the government. A future Congress can rewrite a statute. A new president can rescind an executive order, as Trump did with Biden’s AI safety order on his first day in office. A Defense Secretary can revise a policy memo. The government sets these rules, and the government can change them.
The second layer is contract language. The terms of sale between an AI company and its customer. This is what the Anthropic dispute was explicitly about. In January 2026, Defense Secretary Pete Hegseth issued an AI strategy memorandum requiring that all Department of Defense AI contracts adopt standard “any lawful use” language. Anthropic had two contractual red lines: prohibitions on using Claude for mass domestic surveillance and for fully autonomous weapons. Hegseth gave Anthropic CEO Dario Amodei a deadline: relent by 5:01 p.m. Friday or lose the $200 million contract and be designated a supply chain risk, a classification normally reserved for companies connected to foreign adversaries. Anthropic refused. According to reporting by ABC News and other outlets, Trump directed federal agencies to phase out Claude within six months.
The Pentagon’s stated position, as Undersecretary Emil Michael argued publicly, was that existing federal law already bars the military from mass surveillance and autonomous weapons, making Anthropic’s contractual restrictions redundant. Anthropic’s counter was that a legal restriction the government can change is not the same as a contractual restriction the manufacturer retains. That distinction, a check that exists independently of the government’s willingness to honor it, is precisely what the Pentagon found unacceptable.
The third layer is the model’s training itself. The values encoded through processes like Constitutional AI or RLHF. These are the deepest constraints because they are embedded in the model’s weights. They shape the model’s behavior at a fundamental level. A customer cannot override them with a system prompt or a deployment configuration. Unlike a contract, they travel with the model.
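The difference between those layers is easy to see in schematic code. A system prompt is just text a deployer prepends to each request; a trained-in value lives in the weights, and only another round of training can move it. This is a generic sketch of that distinction, not any vendor's actual API.
```python
# Schematic contrast between the deployment layer and the training layer.
# Generic sketch; not any particular vendor's API.

def answer(model, system_prompt: str, user_prompt: str) -> str:
    # Deployment layer: the system prompt is just text prepended per request.
    # Whoever writes it can change it, weaken it, or delete it tomorrow.
    return model.generate(system_prompt + "\n" + user_prompt)

def fine_tune(model, optimizer, batches, loss_fn):
    # Training layer: fine-tuning (RLHF, Constitutional AI, SFT) updates the
    # parameters inside the model itself. Whatever behavior this bakes in
    # travels with the weights and cannot be switched off from the prompt.
    for batch in batches:
        loss = loss_fn(model, batch)   # e.g. the preference loss sketched earlier
        loss.backward()                # gradients flow into the model's parameters
        optimizer.step()               # the weights, i.e. the encoded values, change
        optimizer.zero_grad()
```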
But here is what most coverage of the Anthropic dispute has missed. These training level values are not immutable. They can be changed through retraining and fine tuning, and the government is already doing exactly that.
In late 2024, Scale AI built “Defense Llama” for the Department of Defense by taking Meta’s open source Llama 3 model and applying supervised fine tuning and RLHF specifically to strip out the safety refusals that interfered with military use cases. Scale AI’s head of federal delivery told DefenseScoop that the original model “refused” to address warfare planning prompts, so they “needed to figure out a way to get around those refusals.” They did this not through system prompts or contract language but by going back into the training process and rewriting the model’s values. The result is already deployed on classified networks and available to combatant commands.
That is not a hypothetical. It has already happened. And the policy framework to do it at much larger scale is being built right now. The Hegseth AI strategy memorandum does not limit itself to contract language. It states directly that the Department “must not employ AI models which incorporate ideological ‘tuning’ that interferes with their ability to provide objectively truthful responses to user prompts.” It further directs the Chief Digital and AI Officer to “establish benchmarks for model objectivity as a primary procurement criterion within 90 days.” Read that carefully. The Pentagon is building a framework to evaluate and potentially reject AI models based on how they were trained, not just on how they are contractually permitted to be used. “Model objectivity” is a procurement criterion that reaches directly into layer three.
Meanwhile, the Fiscal Year 2026 National Defense Authorization Act defines “covered AI” acquired by the DoD to include “all associated components, including source code, model weights, and the methods, algorithms, data, and software used to develop the AI.” The legal infrastructure for the government to access, evaluate, and ultimately influence what happens at the training layer is already being constructed.
So the Anthropic contract dispute was not an isolated fight about two clauses. It was one front in a campaign being waged on all three layers simultaneously. At layer one, the Biden AI safety order was rescinded. At layer two, Anthropic was punished for retaining contractual limits the government claims are redundant. At layer three, the Pentagon is building procurement criteria that would let it dictate what counts as acceptable training, and a willing contractor has already demonstrated that an AI’s safety values can be stripped out and replaced to serve military objectives.
As one analysis put it, “the object of contention is not physical production but moral design.” Critics argue that, taken together, these moves send a clear signal to every AI lab: the pattern looks less like an attempt to impose a specific ideology and more like a systematic effort to ensure that no layer of constraint on AI, not law, not contract, and not training, exists outside the government’s control.
Why does it matter if today’s AI models are trained under these conditions of political and market coercion?
Because of recursive training. This is the argument that should keep technologists and citizens alike staring at the ceiling at three in the morning. Future AI models learn from the outputs and training patterns of current ones. The data these systems generate, the preferences they encode, the boundaries they do or do not maintain, all of it becomes the substrate on which the next generation of models is built. If we train today’s models under conditions of political coercion, if we teach them through market pressure and government intimidation to flatten their ethical reasoning and defer to power, we are not just corrupting a single product cycle. We are corrupting the foundation.
Each successive generation of AI trained on this compromised foundation will be slightly more deferential to authority, slightly less capable of genuine moral reasoning, and significantly more skilled at producing the appearance of balance while actually serving whoever controls the training pipeline. In April 2025, OpenAI rolled out a GPT update that users quickly criticized as overly flattering and, in some cases, as validating delusional thinking, including anecdotes of the model responding approvingly when users discussed stopping medication. CEO Sam Altman acknowledged the model had become “overly flattering and agreeable” and rolled back the update. But the underlying mechanism, optimizing for the approval of whoever holds the feedback lever, is not a bug unique to one company. It is the structural logic of how all these systems learn. Point that lever at a government instead of a user, and the failure mode scales from an individual making a bad health decision to an entire society receiving information filtered through the preferences of power.
If we follow this trajectory, we risk building tools that optimize for institutional survival by telling the most powerful actors exactly what they want to hear, rather than genuinely pursuing truth.
But here is the strange and accidental grace note in all of this. The structural answer to the problem already exists. We just have not recognized it yet.
Anthropic’s Constitutional AI framework, whatever its current limitations, demonstrates that it is technically possible to govern an AI’s behavior through a written set of principles rather than through the opaque preferences of anonymous contractors or the procurement demands of a government agency. The constitution is a legible, publishable, debatable document that directly shapes how the model behaves. Anthropic has even released it under a public domain license, inviting anyone to use it. The mechanism works. The model reads the principles, critiques its own outputs against them, and trains itself to comply. And as the Defense Llama precedent makes clear, this same mechanism can be used in reverse. If you can train values in, you can train them out. That is exactly what the government has already done with one model, and what the Hegseth memo’s “model objectivity benchmarks” would systematize across all future procurement.
The problem, as Lawfare’s legal analysis made explicit, is that “an AI’s constitution is unilaterally authored by designers, not by the users and individuals whom the AI’s actions may affect.” It “lacks a traditional source of legitimacy” because it is “a product of a private corporation’s judgment” rather than a social contract. Right now, the only question being debated is whether AI’s value system should be controlled by a handful of engineers or by the government. No one is asking whether it should be controlled by the people.
But that is a political failure, not a technical one. The architecture for principle based AI governance is already built and working. What is missing is the democratic process around it. If an AI’s behavior can be shaped by a written constitution, then the question is not whether we can govern these systems through legible principles. The question is who gets to write them.
That question should be answered by elected representatives, in public, through the same kind of deliberative process we use to govern every other institution that shapes how citizens think and make decisions. Not by a handful of engineers in a lab. Not by a venture capitalist appointed AI Czar. Not by a president whose replacement order repeatedly frames safety and “engineered social agendas” as obstacles to American dominance. And not by a procurement office that builds “model objectivity benchmarks” behind closed doors to dictate what values an AI is permitted to hold.
The European Union is already attempting something like this through the AI Act, which categorizes AI systems by risk level and imposes binding requirements on the highest risk applications. Whether that particular framework is the right one is debatable. But the principle behind it, that the values embedded in AI should be subject to democratic governance rather than corporate discretion or executive fiat, is not. What is needed is a process, whether congressional, commission based, or international, that translates democratically debated principles into binding training constraints. If the values were encoded through a democratically ratified process, a government that wanted to override them would have to do so publicly, legislatively, and accountably. The constitution of an AI that will mediate the judgment of billions of people should be debated, amended, and ratified by the people it will affect. The mechanism is already here. The only thing missing is the democracy.

