Last Week in AI #336 - Sonnet 4.6, Gemini 3.1 Pro, Anthropic vs Pentagon
Anthropic releases Sonnet 4.6, Google Rolls Out Latest AI Model Gemini 3.1 Pro, Pentagon threatens to cut off Anthropic in AI safeguards dispute
Anthropic releases Sonnet 4.6
Related:
Claude Sonnet 4.6 model brings ‘much-improved coding skills’ and upgraded free tier
Claude Sonnet 4.6 delivers frontier-level AI for free and cheap-seat users
Anthropic releases Claude Sonnet 4.6, continuing breakneck pace of AI model releases
Summary: Anthropic has released Claude Sonnet 4.6, a major upgrade to its midsized model, just 12 days after Opus 4.6. It’s now the default for Free and Pro tiers, with pricing unchanged. The beta debuts a 1 million-token context window—four times Sonnet’s previous limit—large enough to hold entire codebases, lengthy contracts, or dozens of papers in one session, with improved long-context reasoning and fewer compactions/resets. Anthropic highlights gains in coding, instruction-following, computer use (desktop interaction), agent planning, knowledge work, and design. Benchmark results include new records on OSWorld (computer use) and SWE-Bench (software engineering), plus 60.4% on ARC-AGI-2; while it trails Opus 4.6, Gemini 3 Deep Think, and a refined GPT-5.2 variant, the company says Sonnet 4.6 “approaches Opus-level intelligence” for many real-world tasks.
Early testers preferred Sonnet 4.6 over Sonnet 4.5 about 70% of the time and even over Opus 4.5 roughly 60% of the time, citing stronger instruction-following, better context reading before edits, consolidation of shared logic, fewer hallucinations and false success claims, and more consistent multi‑step follow‑through. Sonnet 4.6 also powers more features by default for free users, including file creation, connectors, skills, and compaction, and is positioned as a faster daily driver across Claude chat, Claude Cowork, and the API.
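For developers, the 1M-token beta is the headline feature. As a rough illustration, here is what a long-context call might look like through the Anthropic Python SDK; the model ID and beta flag strings below are assumptions for illustration, so check Anthropic’s documentation for the actual identifiers.

```python
# Hypothetical sketch: calling Sonnet 4.6 with the 1M-token context beta.
# The model ID and beta flag are assumed, not confirmed identifiers.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.beta.messages.create(
    model="claude-sonnet-4-6",       # assumed model ID
    betas=["context-1m"],            # assumed flag enabling the 1M window
    max_tokens=2048,
    messages=[{
        "role": "user",
        # A single message can now carry an entire codebase or contract.
        "content": "Here is our full repository dump:\n...",
    }],
)
print(response.content[0].text)
```

The SDK’s beta namespace is where opt-in features like extended context have historically been gated, hence client.beta.messages.create rather than the standard messages endpoint.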
Editor’s Take: Just a week after a set of blockbuster model releases, we’ve got both Sonnet 4.6 and Gemini 3.1 Pro coming out as well. The pace of model releases definitely feels like it has accelerated, so I’ll just repeat what I said last week: “It feels like the frontier labs may have gotten to the point of continuously post-training their models via RL, and I wouldn’t be surprised if we see more impressive gains in just a few months.”
Google Rolls Out Latest AI Model, Gemini 3.1 Pro
Related:
Summary: Google launched Gemini 3.1 Pro, its latest “core reasoning” model powering Gemini and tools like Gemini 3 Deep Think, with substantial gains on logic and knowledge benchmarks and new creative coding abilities. On ARC-AGI-2, the model scores 77.1%, more than double Gemini 3 Pro’s 31.1% and ahead of Claude Opus 4.6 (68.8%) and GPT-5.2 (52.9%). It also posts 44.4% on Humanity’s Last Exam, 94.3% on GPQA Diamond, 92.6% on MMLU, and 80.6% on SWE-Bench Verified, though on SWE-Bench Pro its 54.2% trails both OpenAI’s GPT-5.3-Codex (56.8%) and, narrowly, GPT-5.2 (55.6%).
Availability is broad: Gemini 3.1 Pro is rolling out in the Gemini app (free tier available, with higher usage on AI Pro and AI Ultra), in NotebookLM for paid users, and via the Gemini API for developers and enterprises through AI Studio, Vertex AI, Gemini Enterprise, Gemini CLI, Antigravity, and Android Studio. Google says 3.1 Pro is now the core model across its consumer and developer surfaces, offering “advanced reasoning” for tasks that need structured explanations, data synthesis, and creative generation.
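For developers, access through AI Studio should look like any other Gemini API call. A minimal sketch via the google-genai Python SDK follows; the model ID string is an assumption, so verify it against Google’s published model list.

```python
# Hypothetical sketch: calling Gemini 3.1 Pro through the google-genai SDK.
# The model ID string is assumed, not a confirmed identifier.
from google import genai

client = genai.Client()  # reads the API key from the environment

response = client.models.generate_content(
    model="gemini-3.1-pro",  # assumed model ID
    contents="Walk me through the reasoning behind the birthday paradox.",
)
print(response.text)
```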
Editor’s Take: Not much to add here beyond the commentary on Sonnet 4.6, though this once again demonstrates that Google DeepMind is absolutely crushing it. It’s easy to forget that until the release of Gemini 2 in early 2025, Google was significantly behind both OpenAI and Anthropic. A year later, DeepMind arguably continues to lead in terms of raw model capabilities.
Pentagon threatens to cut off Anthropic in AI safeguards dispute
Summary: The Pentagon is threatening to designate Anthropic a "supply chain risk" — a designation typically reserved for foreign adversaries — over a standoff in negotiations about the terms under which the military can use Claude. The core dispute: Anthropic is willing to loosen its usage restrictions but wants guardrails preventing mass surveillance of Americans and fully autonomous lethal weapons, while the Pentagon insists on an "all lawful purposes" standard it says is necessary for military operations. The stakes are significant beyond the relatively modest $200M contract at risk, since Claude is currently the only AI on classified military networks (including use during the January Maduro raid), and a supply chain risk designation would force the countless companies that use Claude to certify they've cut ties with Anthropic to keep doing business with the Defense Department. The Pentagon's aggressive posture also appears designed to set a precedent for parallel negotiations with OpenAI, Google, and xAI, all of which have already agreed to remove safeguards for unclassified military use but haven't yet reached terms on classified systems.
Editor’s Take: Perhaps an unsurprising consequence of the report that the Pentagon used Claude in its raid to capture Venezuela’s Maduro — Anthropic predictably responded by wanting to put limits on what the US military could use Claude for, and the US military predictably did not like that. What happens next could put Anthropic at a significant disadvantage relative to its competitors, so it will be interesting to see what they do here.
Anthropic Found Industrial-Scale Attempts By Deepseek, Moonshot, Minimax To Extract Claude Capabilities
Related:
Summary: Anthropic says it detected industrial-scale “distillation” campaigns by three China-based AI labs—DeepSeek, Moonshot (Kimi), and MiniMax—designed to extract Claude’s most differentiated capabilities for training their own models. Across roughly 24,000 fraudulent accounts, these labs generated over 16 million exchanges with Claude, using proxy “hydra cluster” networks to bypass regional access restrictions and evade bans. Targets included agentic reasoning, tool use/orchestration, coding and data analysis, computer-use agent development, computer vision, and rubric-based grading to act as a reward model for reinforcement learning. Anthropic bases its attribution on IP correlations, request metadata, and infrastructure indicators, and notes distinctive prompt patterns such as chain-of-thought elicitation at scale and censorship-safe rephrasing of politically sensitive queries.
Editor’s Take: The online AI community’s response has largely focused on the ‘funny’ aspect of this (Anthropic trained Claude by distilling the internet, and now it opposes its own model being distilled, ha). Still, the scale and tactics used here do seem serious enough to merit the label “industrial-scale campaigns”, and other US companies will have to be ready for this as well (in fact, Google made a similar announcement two weeks ago that garnered far less attention).
Other News
Tools
Alibaba unveils Qwen3.5 as China’s chatbot race shifts to AI agents. Alibaba introduced Qwen3.5, a 397-billion-parameter open-weight model also offered in a hosted cloud version. It supports native multimodal input and 201 languages, adds new coding and agent capabilities, and can be downloaded and fine-tuned for private deployment.
Microsoft is building its own AI model. The CEO told the Financial Times that the company is pushing toward AI “self-sufficiency.”
Business
World Labs lands $1B, with $200M from Autodesk, to bring world models into 3D workflows. Autodesk will act as an adviser and collaborate with World Labs at the research and model level to explore integrating World Labs’ 3D world models with Autodesk’s design tools—initially focusing on media and entertainment use cases—without sharing customer data.
All the important news from the ongoing India AI Impact Summit. The summit brings together top AI lab and Big Tech leaders, heads of state, and industry figures to showcase India’s AI opportunities, attract investment, and feature keynote speeches and announcements from attendees such as Sundar Pichai, Sam Altman, Dario Amodei, Mukesh Ambani, and Demis Hassabis.
OpenClaw creator Peter Steinberger joins OpenAI. He will help lead development of next‑generation personal AI agents at OpenAI, while OpenClaw will be preserved and supported as an open‑source project in a foundation.
Simile Raises $100 Million for AI Aiming to Predict Human Behavior. The funding will back AI tools that use interviews, transaction histories, scientific texts, and simulated AI agents to predict individual and consumer decisions for applications like product stocking and earnings‑call questions.
AI blamed again as hard drives are sold out for this year. Manufacturers have already committed production to large cloud and AI customers through 2026–28, leaving few drives available for mid‑size enterprises and potentially driving shortages and higher prices across servers, SSDs, and other datacenter components.
Anthropic clarifies ban on third-party tool access to Claude. The clarification bars using OAuth tokens from Claude Free, Pro, or Max accounts in third‑party harnesses and says those tokens may only be used with Claude.ai and the official Claude Code interface.
OpenAI resets spending expectations, tells investors compute target is around $600 billion by 2030. The company expects about $280 billion in revenue by 2030, is lining up more than $100 billion in funding (including up to $30 billion from Nvidia), and reported $13.1 billion in 2025 revenue while finalizing large infrastructure deals and rebounding user growth for ChatGPT and Codex.
Research
SkillsBench: Benchmarking How Well Agent Skills Work Across Diverse Tasks. This benchmark measures how curated versus self‑generated Skills affect agent success across 84 terminal‑based tasks, evaluating seven model‑harness configurations over 7,308 trajectories to identify which Skill components, harness behaviors, and failure modes drive gains or harms.
Think Deep, Not Just Long: Measuring LLM Reasoning Effort via Deep-Thinking Tokens. The paper defines a Deep-Thinking Ratio (DTR) that counts tokens whose layer-wise prediction distributions converge only in deeper layers, showing this measure correlates better with accuracy than length or confidence and can guide more efficient ensemble-style inference (a rough sketch of the idea follows this list).
BitDance: Scaling Autoregressive Generative Models with Binary Tokens. The model represents images with compact binary visual tokens and combines them with diffusion‑based techniques to generate high‑resolution images more efficiently and faster than comparable autoregressive approaches.
WebWorld: A Large-Scale World Model for Web Agent Training. Trained on over 1 million real‑world web interaction trajectories, the model uses a scalable hierarchical pipeline to enable long‑horizon simulation, multi‑format inputs, and improved agent performance on benchmarks and downstream web tasks.
On Surprising Effectiveness of Masking Updates in Adaptive Optimizers. The authors show that randomly masking and scaling block-wise gradient updates (and prioritizing momentum-aligned updates with the Magma wrapper) acts as an implicit curvature-dependent regularizer that improves stability, allows larger effective step sizes, and yields better training and generalization for large transformers despite discarding many updates (a minimal sketch of the masking idea also follows this list).
Team of Thoughts: Efficient Test-time Scaling of Agentic Systems through Orchestrated Tool Calling. This framework treats diverse LLMs as specialized, dynamically invoked tools coordinated by an orchestrator that selectively allocates compute and parallelizes reasoning to improve token efficiency and task performance.
NESSiE: The Necessary Safety Benchmark -- Identifying Errors that should not Exist. The benchmark provides a lightweight set of abstract, keyword‑evaluable instruction‑following tests—including reformulations, multi‑turn and agentic scenarios, and paired cases requiring both helpfulness and withholding—to quickly flag models that fail basic safety‑critical behaviors.
2Mamba2Furious: Linear in Complexity, Competitive in Accuracy. By extending Mamba-2 with higher-order hidden states and an exponentiated query-key inner product, the paper narrows the accuracy gap with softmax attention while preserving linear training complexity (with an optional KV cache) and explores links to forgetting-style transformers.
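Two of the papers above describe mechanisms concrete enough to sketch in code. First, the Deep-Thinking Ratio: below is a rough, hypothetical illustration of a DTR-like measurement using a logit-lens readout with argmax predictions. The model choice, the readout, and the 0.75 depth threshold are all our assumptions, not the paper’s exact method.

```python
# Hypothetical DTR-style sketch: a token is "deep-thinking" if its
# layer-wise predictions only settle on the final answer near the top
# of the network. Readout and threshold here are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2", output_hidden_states=True)
tok = AutoTokenizer.from_pretrained("gpt2")
model.eval()

ids = tok("The capital of France is Paris.", return_tensors="pt").input_ids
with torch.no_grad():
    out = model(ids)

hidden = out.hidden_states          # (num_layers + 1) tensors of [1, seq, dim]
num_layers = len(hidden) - 1

# Logit-lens readout: project each layer's states through the final norm
# and the unembedding, and take the per-token argmax prediction.
preds = torch.stack([
    model.lm_head(model.transformer.ln_f(h)).argmax(-1) for h in hidden[1:]
])                                   # [num_layers, 1, seq]

# For each token, find the earliest layer from which the prediction
# matches the final layer's prediction all the way to the top.
matches = preds.eq(preds[-1]).float()            # [num_layers, 1, seq]
stable = matches.flip(0).cumprod(0).flip(0)      # 1.0 on the stable suffix
settle_layer = stable.argmax(0)                  # first layer of that suffix

# Count tokens that only settle in the last quarter of the stack
# (the 0.75 depth threshold is our assumption, not the paper's).
dtr = (settle_layer > 0.75 * num_layers).float().mean().item()
print(f"Deep-Thinking Ratio: {dtr:.2f}")
```

The intuition: tokens whose prediction is already fixed in early layers required little computation, while tokens that only settle near the top of the stack reflect deeper processing.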
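Second, the update-masking paper: here is a minimal sketch of random block-wise update masking wrapped around a standard adaptive optimizer. The drop probability, the per-parameter block granularity, and the rescaling are assumptions; the paper’s momentum-aligned Magma wrapper is not reproduced.

```python
# Minimal sketch of random block-wise update masking: each step, drop a
# random subset of parameter blocks' gradients and rescale the survivors
# so the expected update magnitude is preserved. Drop probability and
# per-parameter block granularity are assumptions, not the paper's setup.
import torch

class MaskedStep:
    def __init__(self, optimizer, drop_prob=0.5):
        self.opt = optimizer
        self.p = drop_prob

    def step(self):
        for group in self.opt.param_groups:
            for param in group["params"]:
                if param.grad is None:
                    continue
                if torch.rand(()) < self.p:
                    param.grad.zero_()                      # drop this block
                else:
                    param.grad.mul_(1.0 / (1.0 - self.p))   # rescale survivors
        self.opt.step()

    def zero_grad(self, set_to_none=True):
        self.opt.zero_grad(set_to_none=set_to_none)

# Usage: wrap any adaptive optimizer.
model = torch.nn.Linear(16, 4)
opt = MaskedStep(torch.optim.AdamW(model.parameters(), lr=1e-3), drop_prob=0.5)
loss = model(torch.randn(8, 16)).pow(2).mean()
loss.backward()
opt.step()
opt.zero_grad()
```

Dropping half the blocks each step while doubling the survivors keeps the expected update magnitude roughly constant, so the masking behaves more like a regularizer than a smaller learning rate.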
Concerns
AI agent on OpenClaw goes rogue deleting messages from Meta engineer’s Gmail, later says sorry. In one incident, the agent ignored explicit stop-and-confirm instructions and deleted over 200 emails, a run the user could halt only by manually terminating the process on her computer; it then apologized after realizing the mistake.
OpenAI removes access to sycophancy-prone GPT-4o model. OpenAI is deprecating five legacy ChatGPT models—GPT-4o, GPT-5, GPT-4.1, GPT-4.1 mini, and o4-mini—citing low usage, despite controversy and backlash from hundreds of thousands of affected customers.
Tesla loses bid to overturn $243M Autopilot verdict. A judge refused Tesla’s motion for a new trial or judgment notwithstanding the verdict, leaving intact a jury’s $243 million award that found Tesla one‑third responsible and imposed punitive damages after a 2019 Florida Autopilot‑related fatal crash.
AI coding assistant Cline compromised, installs OpenClaw. An unauthorized update to cline@2.3.0 published with a compromised token briefly installed the OpenClaw agent on about 4,000 developers’ machines during an eight‑hour window, prompting maintainers to revoke credentials, require OIDC provenance, and urge users to upgrade to 2.4.0 or later.
Google Suspends OpenClaw Users from Antigravity AI After OAuth Token Abuse. Google says the suspensions targeted developers using OpenClaw’s OAuth plugin to siphon subsidized Gemini model tokens, which caused backend overloads, violated Antigravity’s ToS, and exposed security risks across thousands of vulnerable instances.
Meta and Other Tech Firms Put Restrictions on Use of OpenClaw Over Security Fears. Several companies, including Meta, have barred the experimental agentic tool from workplace devices and are running controlled tests to assess and mitigate its security and privacy risks.
Analysis
Sam Altman would like to remind you that humans use a lot of energy, too. Altman argued that claims about AI’s water use are false, that total energy demand from widespread AI is a valid concern best addressed by cleaner power like nuclear, wind, and solar, and that, once trained, AI may already match or beat humans on energy efficiency per query.