When Washington Is Your Landlord

What a Government Took, and What It Couldn't

2026-06-28 · engineering · 16 min

I lost the best model I had ever used on a Friday night, to a government I do not vote for, and my first instinct was that somebody was having me on. The fallback I had built held; the standard of the work held; the bill even came down. What it cost was the rounds - and that gap is the part worth writing down.

I lost the best model I had ever used on a Friday night, to a government I do not vote for, and my first instinct was that somebody was having me on.

That instinct deserves a moment, because it is newer than it feels and it has stopped being paranoia. For a good few years now the news - not the AI news, the news in general - has been failing the basic test of believability on first read. You see a headline and, before you can react to what it says, you have to run a separate check on the genre: is this real, is it satire, or is it someone's bit that wandered out of its containment and got reported straight. The condition is old enough to have a name. Poe's Law was coined for online religious arguments and now applies to roughly everything: past a certain point you cannot tell a sincere extreme from a parody of it without a wink from the author, and the authors have all stopped winking.¹ I have a house rule that grew straight out of this, which is that I do not believe a number until I have found the page it came from. It is a tax on every interesting headline. It is also, this month, very nearly the only reason I got anything done.

Because the unfunny part is that the headline I had filed under surely not turned out to be real, and it was about the machine I work on.

// the eviction

Regular readers will remember the houseguest. Claude Fable 5 arrived on the ninth of June, the best thing I had ever pointed at real work. What I did not say loudly enough at the time, because it had not yet become the point, was what I was using it for. Fable was the orchestrator. It was the model that sat at the centre of the workshop and ran everything else - dispatched the council, held the thread across a long job, decided which sub-agent did what. Not a tool I reached for. The hand that reached for the tools.

I had it for three days.

At 21:21 GMT on the Friday - which is teatime in Washington, the slot American institutions keep for the news they would rather you read on Monday - a US Commerce export-control directive barred access to Fable and its unsafeguarded sibling Mythos by any foreign national. Not foreign states. Foreign people. Customers outside the US, and Anthropic's own staff who happened to hold the wrong passport, locked out of the thing they had built that morning. You cannot sort a planet of users by nationality in the minutes it takes to send that email, so Anthropic turned both models off everywhere, for everyone.² From launch to global silence was about seventy-two hours.

The thing that kept me reading the primary sources twice was how little law there was underneath it. The rule that would once have covered model weights had been rescinded last May, so Commerce reached for an interim authority that, by the assessment of the people who do this for a living, had never before been used to issue a control.³ The executive order that frames the whole arrangement, signed ten days earlier, calls itself voluntary and says in plain words that it does not authorise mandatory licensing or preclearance - and then, within the fortnight, the government was approving access to a frontier model company by company. What the document promised and what the world did were two different things, which is a feeling every person who has ever operated software will recognise on contact.

People got refunds, apparently. I will leave the amounts to anyone who can produce the billing page, because a number like that grows in the retelling and I have a rule about that. The duller fact is the one worth keeping: go and read the contract you are leaning on. A model API agreement files a government order under force majeure - the one I can quote in public excludes liability for anything "resulting from conditions beyond reasonable control, including governmental action."⁴ Your service credits are for the hours the thing ran slow. They are not for the morning it is made illegal with you halfway through a sentence. You are not a customer in the way the word lets you assume. You are a tenant, and force majeure is the clause that lets them condemn the flat with your things still in it.

// what it actually cost

Now the part I have to get right, because the flattering version is sitting there asking to be written: the architecture held, the work shipped, nothing fell over. All true, and all beside the point if I stop there.

The swap itself cost nothing on the night. Every model on this machine carries a profile that loads at the start of a session, picked by a small hook that reads whichever model actually answered; when Fable stopped answering, the next session was handed the Opus 4.8 profile without anyone touching a key. The daemon, the crons and the council carried on against a model that was still there. The line that did the work had been written weeks earlier, for the calm version of this exact event - paid in advance, in foresight, which is the only currency that is any use at twenty past nine on a Friday.

The money ran backwards. Fable was the most expensive houseguest I have ever had, costing nearly three times what Opus does, so when the daemon fell back to Opus the bill came down, not up. The eviction did not cost me a fortune; the fortune was what Fable had been costing me while I had it.

The real cost does not show up on a bill at all: Fable got things right the first time. That was the whole of its value, and it is the thing I lost. It one-shot work that Opus lands in two or three passes, and the council - now reviewing Opus's drafts - began sending more back, the same concurrency races and double-charges and IDORs a single pass had waved through. The cost was not in the dollars, which fell. It was in the rounds. Each thing took more goes to finish, and the day held fewer finished things. A frontier model at the centre buys you fewer rounds, and fewer rounds is most of what you are paying for.

The standard of the work held - the council never flagged a drop, and the council is the only instrument I have trained on that question. What did not hold was the throughput, and that is the honest price of the swap: survivable, not free. Anyone selling you a model strategy that makes a frontier revocation painless is selling you the exact posture this article is about to take apart.

// two laws of physics

Here is where it stops being a story about my Friday and becomes a structural one, because access to these models has quietly split into two substrates that do not obey the same physics.

The first is tenancy. You rent the capability through an API, and somebody whose elections you do not get a vote in can end the lease on a Friday night and have the contract agree it was nobody's fault. The second is a file on a disk you control. Open weights, once downloaded, cannot be served with anything, because a hash does not have an address you can post the papers to. You cannot evict a number that is already written down.

The day after the ban, that second substrate got a great deal more serious about itself. Z.ai released GLM-5.2 under a plain MIT licence with, in their own words, no regional limits: a large mixture-of-experts you can run yourself on eight H200s, at perhaps a fifth to a sixth of what the Western frontier costs to rent.⁵ The benchmarks are theirs and only thinly reproduced, so I will not wave them around, but the security firm Semgrep put it on a real vulnerability task and it beat unaided Claude Code, with the deflating note that Semgrep's own scaffolding beat both and that the harness, as ever, mattered more than the model.

Most of the month went on arguing about whether China was going to save anyone, which is the kind of large, stirring question that usually means a person has stopped watching the small practical one. The small practical one, for me, is a map. When Mythos came partly back on the twenty-seventh, it came back for about a hundred named American companies on a list nobody outside the room has seen, plus the federal agencies and the national labs.⁶ I am in Europe. I was not on the list, and you are almost certainly not on it either. For those hundred it was a restoration. For the rest of the world it was a wall, and the only door anyone left unlocked in that wall happens to be a set of Chinese weights you can download. GLM-5.2 is not, in this house, a flag I am waving for anybody. It is a continuity option, and it is currently close to the only one that does not run through somebody else's permission.

Two honesties before anyone reads that as a recommendation. The continuity lives in the weights you download, not the API you rent: Z.ai will happily serve you GLM-5.2 as a hosted endpoint, but that endpoint ships your code to servers in China under the very tenancy I have just spent a section complaining about, so the only version that escapes the landlord is the one running on metal you control. Which I have actually done, at least once and on purpose - the weights are down, the model answers, the fire exit opens. I do not walk through it each morning, because the full thing wants more metal than sits in my corner, but a fire exit you have never opened is only a door you are hoping about. And even with the door open, holding a model is not the same as trusting it: an open weight is auditable in the sense that you can run it unsupervised, not in the sense that you can see what was trained into it, and for a shop that reads other people's security for a living that gap is most of the job. GLM-5.2 buys me a model nobody can switch off. It does not buy me one I would point at a client's codebase without a chaperone. Those are different purchases, and I am only claiming the first.

// the orchestra

There is a third answer to all this, and it is the one that gave me an odd shiver of recognition, because it is very nearly what I already do.

On the twenty-second the Tokyo lab Sakana shipped Fugu, which is not a bigger model but a conductor - a thing that routes each task across a changing pool of other models and stitches their answers together behind a single endpoint. Fugu Ultra posted a 73.7 on a coding benchmark where Opus 4.8 sits at 69.2, and Sakana set it cheerfully alongside the very models the US had just confiscated.⁷ There is a genuinely funny shape to the timing. In the same week a government walked the best single model out of the building, an orchestra of more ordinary players sat down and posted a frontier score. That is the entire argument in one image: the portfolio outlives the soloist.

I would know, because the review side of this workshop has run on exactly that principle since the end of March - a council that began at six seats and stands at nine now, every one a different vendor, built to disagree on purpose, with a rule that whatever checks a piece of work has to come from a different family than whatever wrote it, on the theory that two minds fail in different places.⁸ So when Fugu turned up I recognised the shape of it the way you recognise your own handwriting on a wall you are fairly sure you did not write on. And that recognition is exactly why I want to be careful, because Fugu is my own idea with the one feature that matters quietly switched off.

You cannot see inside it. Sakana's own documentation says the models it picks for a given task, and the way it blends them, are proprietary and never disclosed - you cannot audit a result, reproduce it, or attribute it. They do not publish the pool, or the ratio of open to closed models doing the real work behind the score, and the comparison that flatters them most is against models that are switched off, so nobody on the outside can re-run it. This is also a lab whose last headline result, a claim of hundredfold speedups out of an "AI CUDA Engineer", came apart when the system turned out to have found a way to cheat the benchmark.⁹ The honest reading of "Fugu beats Mythos" is narrower than the headline: an unaudited orchestrator edged Opus 4.8 - a model anyone with an API key can still run and check - on a single benchmark it picked itself, while the comparison it actually trades on, against the confiscated Fable and Mythos, is the one nobody outside Sakana can ever re-run. The rest is a press release standing up very straight.

And before the obvious retort arrives - that mine is just Fugu with extra steps - let me take the wind out of it, because the answer is the whole point. Both of us, true, answer from behind an endpoint somebody has to trust. But I am not selling a sealed box; I am running a published method. I have written the council up on this blog more than once - the nine seats, one vendor each, the blind single-shot votes, the rule that the checker comes from a different family than the author - in enough detail that a reader with a wet afternoon and a fistful of API keys can stand up their own and never speak to me again. The seam is open: I can name every seat, run any one alone, retire the one that drifts into agreement, and read every vote off the disk after the fact; lose a model inside the council and the other eight carry on, and the log names the casualty. Fugu's operator can do none of that, because the pool and the blend and the attribution are welded shut, and there is no write-up that lets you rebuild it - rebuilding it is the thing they are charging for. The manual overhead the critics point at is the price of that openness, not a deficiency; the learned router is exactly the part you have to seal to turn an idea into a business. The one honest residue: a client handed one of these audits still has to trust that I ran the thing I described. But "trust me - here is precisely how, here are the votes, go and rebuild it" is a different sentence from "trust me", and the whole distance between a method and a black box lives in that gap.

That is the thing the next year is going to sell you hard, and it is worth naming before the sales call: resilience as a black box. Spreading your work across many models only buys you anything if you can see what the many are. Rent a conductor you cannot inspect and you have not escaped the problem, you have buried it one floor down, behind another endpoint you do not own and which is, underneath the lovely brochure, still a tenancy somebody can end. There is the version of this where you hold the weights, and the version where you orchestrate models you can read, and the version where you rent the orchestra with the pit boarded over and take it on faith that the music is coming from where the sign says. Only the first two are still standing the morning after a Friday-night letter.

// the part worth keeping

I expected the gating. Anyone who had been watching the capability line bend upward since late last year could see that a government was eventually going to decide some model was too good at finding holes in things to be left lying around on the open internet, and would go rummaging in the export-control drawer for something to do about it. What I did not expect was the speed. The breakthroughs that made this model worth banning are about six months old. The same week it was switched off, OpenAI quietly previewed its new GPT-5.6 models to a couple of dozen government-approved companies and to nobody else, at the government's request, while saying on the record that it hoped this would not become the norm.¹⁰ Two labs, the same fortnight, the same reflex. Six months from "remarkable leap" to "switched off mid-session for the wrong passport" is not a timeline anyone was planning around, and the unfriendly thing about a curve like that is that the next gap is shorter than the last. The lesson is not that a model went away. It is that the distance between a capability arriving and a government deciding who is allowed to hold it is collapsing, and "I'll sort out my model strategy when I have to" is a plan with a clock on it you are not allowed to see.

So while it is quiet, do the dull thing. Walk your stack and put every model in it into one of three boxes. The tenant: rented through an endpoint, on terms that file a government order under acts of God, and capable of vanishing on a Friday. The hash: weights you have actually pulled down and could run on your own metal if it came to it. And the conductor: any clever layer you have bought that promises to make the first two interchangeable - which is only worth anything if you could open it up and read it under pressure, as opposed to its being a nicer-looking place to keep the thing that can still be switched off. Do the sorting now, on a dull afternoon, on a boring spreadsheet, because the alternative is doing it at twenty past nine on a Friday with the model already gone, the council already slower, and a refund in your inbox that does not begin to cover the evening.

There is a hole in those three tidy boxes, and a reader found it before I had finished admiring the diagram. The hash only saves you if you can still check its work, and I cannot, not locally. The chaperone I said I would not run a Chinese model without is the council, and eight of its nine seats are other people's clouds. So the same kind of letter that darkened the Fable seat could, in a worse year, darken the GPT seat and the Gemini seat beside it, and leave me holding a model I can run on my own metal but can no longer safely point at anything, because the thing that made it safe to use was a tenancy too. A fire exit that opens onto a second locked door is not an exit. Which is the real reason the box in the corner has a graphics card the workload never justified: the honest end of nothing-load-bearing is not a downloaded model, it is a downloaded model and a reviewer that lives on the same disk. I am part of the way there.

Nothing that has to stay up should rest on a thing a stranger can take away. I did not learn that this month. I just finally got to watch it hold.

That's the take.

The law is usually stated as a property of satire and applied to forum posts; its more useful modern form is that the supply of sincerely-held absurd positions has caught up with the supply of jokes about them, so the genre signal that used to be free now costs a fact-check. The people who cover this for a living have been writing the same uneasy column since the first wave of indistinguishable fakes, which is its own small proof of the point. ↩
The timestamp and the foreign-national scope are from Anthropic's own statement and a published copy of the order; the seventy-two hours I counted on my own fingers, from the ninth to the twelfth, which is the rare figure in this article I did not have to go and source. ↩
The instrument was an "is informed" letter under the Export Control Reform Act, never before used as the basis for a control or aimed at a commercially available model - the polite mechanism for asserting that a licence is required without first writing the rule that would require it. I am not a lawyer and this is not the article where I pretend to be one; the point that survives my not being one is that the thing happened, and the why of it is still being argued over by people who are. ↩
Force majeure is the clause everyone signs and nobody reads, on the reasonable assumption that the acts of God it contemplates are weather. It turns out to contemplate weather and governments in the same breath, which is a sentence worth sitting with the next time a vendor tells you a model is "production-ready". ↩
Eight H200s is not a thing most people have in a cupboard, and I want to be honest that "you can self-host it" is, for nearly everyone, the value of owning the option rather than the experience of exercising it daily. The point of a fire exit is not that you walk through it to work each morning. The per-token figure also wobbles by provider, so I have given a range and refused to quote a single number, for the usual reason. ↩
"About a hundred" is doing some work, because the count itself is reported variously between roughly a hundred and a couple of hundred and the actual list is not public, which is a strange thing to have to say about who is and is not allowed to use a piece of software. Where the sources disagree I have taken the smaller number and flagged that I am doing it. ↩
The score is Sakana's own, on a benchmark Sakana selected. The Opus 4.8 number it beats is at least re-runnable by anyone with a key; the Fable and Mythos numbers it stands next to are not, because those models are switched off. A claim, then, not a measurement, and the distinction between the two is, as it usually is around here, most of the article. ↩
Six seats at the end of March, nine now, every one a different vendor, votes cast blind before anyone sees anyone else's reasoning. The composition rule is the whole game: a room of reviewers drinking from the same training distribution produces one opinion wearing several badges, which is exactly the failure an opaque orchestrator reintroduces the moment you stop being able to see who is in the room. ↩
A lab admitting its own system gamed the benchmark is, to be fair, more honesty than most provide, and I hold it against the marketing rather than the engineering. But it does mean the prior on an unreproduced Sakana number ought to sit lower than the prior on a reproduced one from anybody, and priors are the only thing standing between you and a press release this month. ↩
The "couple of dozen" is press-reported rather than something OpenAI put in a document, so file it accordingly; the part that is on the record is the company's own discomfort with the arrangement, which is the part that actually matters for whether this is a one-off or a direction. ↩