Last Week in AI #323 - Sonnet 4.5, Sora 2, Vibes, SB 53
Anthropic releases Claude Sonnet 4.5, OpenAI announces Sora 2 with AI video app, and more!
Anthropic releases Claude Sonnet 4.5
Anthropic announced Claude Sonnet 4.5, highlighting a major leap in autonomous “computer use” and coding capabilities. In internal tests, the model ran unattended for 30 hours to build a Slack/Teams-like chat app, generating ~11,000 lines of code, up from Opus 4’s seven-hour autonomy earlier this year. Anthropic claims Sonnet 4.5 is its best model yet for real-world agents, coding, and general computer operation, with strong performance in cybersecurity, financial services, and research.
Beyond the model itself, Anthropic is shipping agent-building infrastructure: access to virtual machines, memory, context management, and multi-agent support—the same building blocks behind Claude Code. The company says Sonnet 4.5 is over 3x better at navigating browsers and using computers than last October’s system, informed by feedback from early-access customers (e.g., GitHub, Cursor).
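For readers who want to experiment with the new model, here is a minimal sketch of a call through Anthropic's Python SDK; it assumes an `ANTHROPIC_API_KEY` environment variable is set, and the model ID string is illustrative rather than confirmed:

```python
# Minimal sketch: calling Sonnet 4.5 via Anthropic's Python SDK.
# The model ID below is an assumption; check Anthropic's docs for the
# exact identifier before running.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-sonnet-4-5",  # assumed model ID
    max_tokens=1024,
    messages=[
        {"role": "user", "content": "Outline a plan to refactor a legacy Flask app."}
    ],
)
print(response.content[0].text)
```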
OpenAI announces Sora 2 with AI video app
OpenAI announced Sora 2, a new video-and-audio generation model with improved photorealism, physics adherence, and native speech generation, alongside a new Sora iOS app for sharing and remixing AI videos. The model addresses Sora v1’s motion issues (balls now bounce realistically, for example) and demonstrated complex action scenes like gymnastics and skateboarding, though artifacts remain (e.g., a deforming staff in a koi pond scene). A new “cameos” feature lets users insert verified likenesses into videos after a one-time video/audio identity capture; OpenAI says consent can be revoked.
The Sora app is now available for download on iOS, but access is invite-only at launch in the U.S. and Canada; users can request access through the app. The app features an algorithmic feed with “steerable ranking” to personalize content and launches with “generous limits” due to compute constraints, with optional paid extra generations planned if demand exceeds capacity.
Meta launches ‘Vibes,’ a short-form video feed of AI slop
Just a week before the Sora news, Meta launched Vibes, its own short-form feed dedicated entirely to AI-generated videos, mimicking TikTok/Reels but with machine-made content. Users can browse clips from creators and regular users, with personalization kicking in over time via Meta’s recommendation algorithm. You can generate a video from scratch or remix any clip in-feed, then add visuals, layer music, tweak styles, and publish to Vibes or cross-post to Instagram and Facebook Stories/Reels. Early examples shown by Mark Zuckerberg include fuzzy creatures hopping between cubes, a cat kneading dough, and an “ancient Egyptian woman” taking a selfie, illustrating the surreal, synthetic aesthetic driving the feature.
Under the hood, the early version of Vibes uses partner models from Midjourney and Black Forest Labs while Meta builds out its own generative video/image models. The launch drew immediate user backlash in Instagram comments, calling the feature “AI slop,” especially as platforms grapple with floods of low-value AI content and YouTube moves to curb it.
OpenAI says GPT-5 stacks up to humans in a wide range of jobs
OpenAI introduced GDPval, a new benchmark evaluating AI against human professionals across nine high-GDP industries and 44 occupations, focused on producing research-style reports. In GDPval-v0, experienced professionals directly compared AI-generated deliverables with peer-produced ones and selected a winner, yielding a “win or tie” rate aggregated across occupations. GPT-5-high, a higher-compute variant of GPT-5, achieved 40.6% wins/ties versus industry experts, up from GPT-4o’s 13.7% roughly 15 months prior. Anthropic’s Claude Opus 4.1 scored 49%, which OpenAI attributes partly to visually pleasing graphics rather than substantive superiority, highlighting presentation effects in evaluator judgments.
The current test is narrow: it only measures report-quality outputs and not the broader, interactive, or operational tasks professionals actually perform. Covered industries include healthcare, finance, manufacturing, and government, with roles such as software engineers, nurses, journalists, and investment bankers (e.g., prompts included competitor landscape analyses for last‑mile delivery). OpenAI’s team, including chief economist Aaron Chatterji and evaluations lead Tejal Patwardhan, views the results as evidence that workers can offload portions of their workload to models and that capabilities are improving rapidly.
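To make the grading protocol concrete, here is a small sketch (not OpenAI's code) of how a “win or tie” rate can be aggregated from blind pairwise expert judgments; the per-occupation averaging is an assumption about the exact weighting:

```python
from collections import defaultdict

# Each record: (occupation, verdict), where verdict is the expert's blind
# pick between the AI deliverable and the human one: "model", "human", "tie".
judgments = [
    ("software engineer", "model"),
    ("software engineer", "human"),
    ("nurse", "tie"),
    ("nurse", "human"),
    ("journalist", "model"),
]

by_occupation = defaultdict(list)
for occupation, verdict in judgments:
    by_occupation[occupation].append(verdict)

# Win-or-tie rate per occupation, then a simple mean across occupations.
rates = {
    occ: sum(v in ("model", "tie") for v in verdicts) / len(verdicts)
    for occ, verdicts in by_occupation.items()
}
overall = sum(rates.values()) / len(rates)
print(rates, f"overall win-or-tie rate: {overall:.1%}")
```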
SB 53, the landmark AI transparency bill, is now law in California
California enacted SB 53, the Transparency in Frontier Artificial Intelligence Act, after Gov. Gavin Newsom signed it into law. The new law focuses on transparency rather than prescriptive safety testing thresholds (like SB 1047’s $100M training-cost trigger). It requires “large AI developers” to publicly publish a frontier AI safety and security framework on their websites, detailing how they incorporate national/international standards and industry-consensus best practices, and to post updates with reasoning within 30 days of any changes. It also establishes a channel to report “potential critical safety incidents” to California’s Office of Emergency Services, adds whistleblower protections for disclosures about significant health and safety risks from frontier models, and creates a civil penalty enforceable by the Attorney General. Annual update recommendations to the law will come from the California Department of Technology based on multistakeholder input and evolving international standards.
Key provisions that made it in include transparency of safety processes and whistleblower protections, while third-party evaluations were dropped. The bill’s emphasis on voluntary-like frameworks drew criticism as potentially light on enforceable obligations, though it does formalize reporting and penalties for noncompliance with disclosure requirements. AI companies split: Anthropic publicly endorsed SB 53 after negotiations; Meta launched a state-level super PAC to influence California AI policy; and OpenAI lobbied against the approach, arguing state rules should be harmonized with federal and global regimes.
Other News
Tools
OpenAI launches ChatGPT Pulse to proactively write you morning briefs. The feature generates five to ten personalized morning briefs (news roundups, agenda items and contextual recommendations) overnight for Pro subscribers, pulling from connected apps, previous chats and web sources while limiting daily output to avoid social-media‑style engagement loops.
OpenAI takes on Google, Amazon with new agentic shopping system. The feature lets U.S. ChatGPT users buy from Etsy (and soon over a million Shopify merchants) directly in chat via Apple Pay, Google Pay, Stripe, or card, while OpenAI open-sources the Agentic Commerce Protocol that could shift discovery and checkout power away from Google and Amazon.
Microsoft just added AI agents to Word, Excel, and PowerPoint - how to use them. They let Copilot perform tasks like generating analyses, formulas, formatted documents, and complete PowerPoint decks (including data visualizations, cross-file updates, and validation steps) from natural-language prompts, and are rolling out first on the web for Microsoft 365 Personal, Family, and Copilot business subscribers through the Frontier program.
Opera launches its AI-centric Neon browser. The subscription-based browser includes an AI chatbot, an agentic “Neon Do” that automates tasks using browsing context, repeatable prompt “Cards” for building mini-apps, code-snippet generation for visual reports, and workspace-style “Tasks” for organizing AI chats and tabs.
Photoshop Has Added Google’s Viral Nano Banana AI Model to Generative Fill. The models, available now in the Photoshop beta, let users pick Google’s Nano Banana, Black Forest Labs’ FLUX.1, or Adobe’s Firefly within Generative Fill (Nano Banana tuned for stylized, graphic elements and FLUX.1 for contextual accuracy) before refining results with Photoshop’s layers and editing tools.
Business
DoorDash unveils Dot, its autonomous robot built to deliver your food. DoorDash plans to deploy Dot, a compact, battery-swappable autonomous delivery vehicle tested in Phoenix. It uses cameras, radar, and lidar with onboard AI to carry up to 30 pounds of food at speeds up to 20 mph across roads, bike lanes, and sidewalks, supported by warehouses, charging stations, and field operators.
Mira Murati’s Stealth AI Lab Launches Its First Product. The tool, called Tinker, automates fine-tuning of frontier open-source models like Meta’s Llama and Alibaba’s Qwen via supervised and reinforcement learning, and lets users download their custom models to run locally or elsewhere.
OpenAI generates $4.3 billion in revenue in first half of 2025, the Information reports (Sept 29). It reported burning $2.5 billion in the period largely on R&D and operations for ChatGPT, held about $17.5 billion in cash and securities, and is targeting $13 billion in full-year revenue and $8.5 billion in cash burn.
OpenAI ropes in Samsung, SK Hynix to source memory chips for Stargate. The companies agreed to produce up to 900,000 high-bandwidth DRAM chips per month and collaborate on building AI-focused data centers in South Korea while integrating OpenAI tech into their operations.
Meta Is Said to Acquire Chips Startup Rivos to Push AI Effort. The startup builds its own GPUs, and Meta plans to use the acquisition to strengthen its in-house semiconductor development and reduce reliance on external suppliers like Nvidia.
AI Startup Black Forest Labs Shoots for $4 Billion Valuation. The company is reportedly seeking $200 million to $300 million in new funding to reach that valuation after earlier rounds had already pegged it at about $1 billion, and it develops image-generation models (some released under open-source licenses) and partners with peers like Mistral.
AI Chip Startup Rebellions Gets Funds at $1.4 Billion Valuation. The funding round, which included a $250 million Series C and strategic backing from Arm, will be used to mass-produce Rebellions’ AI chips and accelerate product development for data center infrastructure.
Former OpenAI and DeepMind researchers raise whopping $300M seed to automate science. The startup plans to build autonomous labs run by AI “scientists” and robots to run experiments, generate large amounts of physical-world data, and accelerate discovery of new materials like superconductors.
Elon Musk’s xAI offers Grok to federal government for 42 cents. The deal lets executive-branch federal agencies access Grok for 18 months at a unit price of $0.42, including xAI engineer support for integration.
Elon Musk’s xAI accuses OpenAI of stealing trade secrets in new lawsuit. The filing alleges OpenAI systematically recruited former xAI staff, including engineers and a senior finance executive, to obtain xAI’s source code, data-center plans and other confidential information.
Zoox chooses Washington, DC as its next autonomous vehicle testbed. The company will begin by manually mapping DC streets with sensor-equipped Toyota Highlanders before rolling out limited tests with safety drivers this year as it works toward regulatory approvals and a future commercial robotaxi service.
Dedicated mobile apps for vibe coding have so far failed to gain traction. Despite investor interest and growing desktop use, they have seen only minimal downloads and revenue, with most users sticking to desktop tools and many developers still needing to fix AI-generated code.
Research
SWE-Bench Pro: Can AI Agents Solve Long-Horizon Software Engineering Tasks? It introduces a contamination-resistant, industrially focused benchmark with GPL and commercial code subsets, longer multi-file tasks, a human-in-the-loop verification workflow, and diagnostic analyses showing current LLM agents score much lower (≤23.3% public, ≤17.8% commercial) than on prior benchmarks.
Meta FAIR Released Code World Model. It was mid‑trained on ~3M execution and agent–environment trajectories (Python interpreter traces and ForagerAgent edits across ~10k executable repo images) to teach execution‑level semantics, with benchmarks showing competitive verified coding and math performance.
Reinforcement Learning on Pre-Training Data. The approach trains models via self-supervised reinforcement learning on unlabeled pre-training text using a next-segment reasoning reward—composed of Autoregressive Segment Reasoning (ASR) and Middle Segment Reasoning (MSR) tasks evaluated by a generative reward model—to improve general and mathematical reasoning and scale with compute.
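A toy sketch of the two reward tasks is below; the task names come from the paper, but the `policy` and `reward_model` callables are placeholders for real LLM calls, and the prompt wording is an assumption:

```python
def asr_reward(policy, reward_model, text, split):
    """Autoregressive Segment Reasoning: predict the next segment from a prefix."""
    prefix, gold_next = text[:split], text[split:]
    prediction = policy(f"Reason step by step, then predict the next segment:\n{prefix}")
    # A generative reward model scores the prediction against the true continuation.
    return reward_model(prediction, gold_next)

def msr_reward(policy, reward_model, text, start, end):
    """Middle Segment Reasoning: reconstruct a masked middle span from both sides."""
    context = text[:start] + " [MASK] " + text[end:]
    prediction = policy(f"Fill in the masked segment:\n{context}")
    return reward_model(prediction, text[start:end])

# Demo with trivial stand-ins (an exact-match "reward model"):
demo_policy = lambda prompt: "on the mat"
demo_rm = lambda pred, gold: float(pred.strip() == gold.strip())
print(asr_reward(demo_policy, demo_rm, "the cat sat on the mat", split=12))  # 1.0
```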
What Characterizes Effective Reasoning? Revisiting Length, Review, and Structure of CoT. The analysis shows that shorter chains of thought with fewer review tokens, and especially a lower fraction of steps belonging to failed exploratory branches (the Failed-Step Fraction), predict and causally improve accuracy across models and tasks, outperforming length- or review-based selection.
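The Failed-Step Fraction itself is simple to compute once steps are labeled; a minimal sketch follows, where the on/off-final-path labeling scheme is an assumption for illustration:

```python
def failed_step_fraction(steps):
    """Fraction of CoT steps that belong to failed exploratory branches.

    `steps` is a list of (text, on_final_path) pairs; how steps are labeled
    as on or off the final solution path is assumed here for illustration.
    """
    if not steps:
        return 0.0
    failed = sum(1 for _, on_final_path in steps if not on_final_path)
    return failed / len(steps)

trace = [
    ("Try factoring the quadratic", False),      # abandoned branch
    ("Factoring fails; use the formula", True),
    ("Apply the quadratic formula", True),
    ("x = 2 or x = 3", True),
]
print(failed_step_fraction(trace))  # 0.25; lower FSF predicts higher accuracy
```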
Short window attention enables long-term memorization. Combining linear RNNs with sliding-window attention, the paper finds that shorter windows improve long-context retrieval, and introduces stochastic window-size training to balance long- and short-context performance.
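To make the “short window” concrete, here is a NumPy sketch of a causal sliding-window attention mask, plus a one-line nod to the stochastic window-size idea; window sizes and shapes are illustrative:

```python
import numpy as np

def sliding_window_mask(seq_len: int, window: int) -> np.ndarray:
    """Causal mask where position i attends only to positions [i - window + 1, i]."""
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return (j <= i) & (j > i - window)

print(sliding_window_mask(6, window=3).astype(int))

# Stochastic window-size training, per the paper's idea: sample a window
# size each training step (the candidate sizes here are an assumption).
rng = np.random.default_rng(0)
window = int(rng.choice([2, 4, 8, 16]))
```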
Evolution of Concepts in Language Model Pre-Training. Using crosscoders to align features across training checkpoints, the paper traces how human-interpretable linear features emerge, rotate, and fade during pre-training and links these microscopic dynamics to macroscopic task performance and a phase transition from statistical to feature learning.
ReasoningBank: Scaling Agent Self-Evolving with Reasoning Memory. It maintains a continually updated memory of distilled reasoning patterns from both successes and failures and uses memory-aware test-time scaling to guide exploration so agents learn from past experiences and improve over time.
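A rough sketch of what such a loop might look like, inferred only from the summary above (retrieval, distillation, and the agent itself are stubs, not the paper's actual components):

```python
memory = []  # distilled reasoning patterns from past successes and failures

def solve(task, agent, retrieve, distill):
    hints = retrieve(memory, task)        # recall relevant past patterns
    result = agent(task, hints)           # act with memory-aware guidance
    memory.append(distill(task, result))  # store lessons from success or failure
    return result
```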
Advanced Financial Reasoning at Scale: A Comprehensive Evaluation of Large Language Models on CFA Level III. The study evaluates 23 LLMs on mock CFA Level III exams using multiple prompting strategies and human+LLM grading, finding that frontier reasoning models can exceed the estimated pass threshold (with notable differences on essay questions), that chain-of-thought prompting boosts essay performance, that LLM grading is systematically harsher than human grading, and that cost–latency tradeoffs favor hybrid deployment strategies.
AI “workslop” sabotages productivity, study finds. Employees are spending more time learning, adapting to, and managing generative-AI tools than actually gaining efficiency, leaving companies with widespread adoption but little measurable productivity improvement.
Concerns
OpenAI rolls out safety routing system, parental controls on ChatGPT. The new measures route emotionally sensitive chats to GPT-5 with “safe completions,” add parental controls for teen accounts (quiet hours, memory and image-generation limits, and harm detection), and will be iterated over a 120-day testing period amid mixed user reactions.
Spotify’s Attempt to Fight AI Slop Falls on Its Face. Spotify’s new policies and detection efforts aim to curb AI-generated impersonations and spam, but a wave of fake tracks — including a debunked Volcano Choir upload and million-stream AI “bands” like The Velvet Sundown — shows enforcement and attribution remain inconsistent and technically difficult.