Last Week in AI #327 - Gemini 3, Opus 4.5, Nano Banana Pro, GPT-5.1-Codex-Max
It's a big week! Lots of exciting releases, plus Nvidia earnings and a whole bunch of cool research.
Google launches Gemini 3 with new coding app and record benchmark scores
Google unveiled Gemini 3, its most capable foundation model to date, now live in the Gemini app and AI Search, with a research-tier Gemini 3 Deepthink coming to AI Ultra subscribers after additional safety testing. Google cites a “massive jump in reasoning,” reflected in record results: 37.4 on Humanity’s Last Exam (topping GPT‑5 Pro’s 31.64) and the top spot on LMArena’s human satisfaction leaderboard.
The release introduces Google Antigravity, a Gemini-powered, agentic coding interface akin to Warp or Cursor 2.0 that blends a prompt window, editor, terminal, and browser into multi-pane workflows for iteratively building and running code. Google also notes the Gemini app has 650 million MAUs and 13 million developers using the model, and positions Gemini 3 as requiring less prompting while handling more complex queries and context.
Markets responded quickly: Alphabet shares rose about 3% on launch day, then climbed more than 5% to a record $315.90, lifting its market cap to roughly $3.82T and putting it near the $4T mark.
Anthropic releases Opus 4.5 with new Chrome and Excel integrations
Just a week after Gemini 3, Anthropic released Opus 4.5, its top Claude model, claiming state-of-the-art results across coding (SWE-Bench, Terminal-bench), tool use (tau2-bench, MCP Atlas), and general reasoning (ARC-AGI 2, GPQA Diamond). It’s the first model to surpass 80% on SWE-Bench Verified, a strong signal of end-to-end code problem solving. The launch includes broader availability of Claude for Chrome and Claude for Excel: the Chrome extension rolls out to Max users, while the Excel-focused product is available to Max, Team, and Enterprise tiers. Anthropic highlights improved “computer use” and spreadsheet workflows as core strengths, positioning Opus 4.5 for hands-on software and data tasks.
Long-context reliability and memory are major focuses. Beyond larger context windows, Anthropic reworked memory management so the model better decides what to retain, enabling an “endless chat” feature that compresses context silently when limits are reached. These upgrades target agentic use cases where Opus orchestrates Haiku-powered sub-agents, requiring robust working memory to explore large codebases, navigate lengthy documents, backtrack, and re-verify results.
Google launches Nano Banana Pro, an updated AI image generator powered by Gemini 3
Google introduced Nano Banana Pro, an upgraded AI image editing and generation tool powered by Gemini 3 Pro, just days after unveiling the new Gemini model. The update goes beyond the original viral Nano Banana by supporting multi-image composition and character consistency: it can accept up to 14 different images or maintain five distinct characters across outputs. According to Google’s Josh Woodward, it’s “incredible at infographics,” and can generate slide decks and visualizations from non-visual inputs such as code snippets and LinkedIn resumes. The product expands the use case from 3D figurine-style edits to structured visual content creation, emphasizing layout, consistency, and data-driven visuals.
OpenAI releases GPT-5.1-Codex-Max to handle engineering tasks that span twenty-four hours
OpenAI launched GPT-5.1-Codex-Max, an “agentic” coding model built for long-running, detailed engineering work and large context handling, replacing GPT-5.1-Codex as the default across Codex interfaces. The model uses 30% fewer “thinking tokens” than its predecessor while running 27–42% faster on real-world tasks; an Extra High reasoning mode is available when latency is less critical. Access is rolling out to ChatGPT Plus, Pro, Team, Edu, and Enterprise now, with Plus capped at 45–225 local messages and 10–60 cloud tasks per 5 hours, and Pro at 300–1,500 local and 50–400 cloud; API access and pricing (previously $1.25/M input, $10/M output for the old model) are pending.
A new “compaction” process enables day-long coding sessions by automatically summarizing and compressing session history when the context window fills, retaining relevant steps across millions of tokens. GPT-5.1-Codex-Max is the first model natively trained to operate across multiple context windows in this way. OpenAI claims the agent can stay focused on a single task for over 24 hours in internal tests, tackling issues like fixing test failures or iterating on implementations.
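The compaction loop described above can be sketched roughly as follows; the function names, token counting, and thresholds here are illustrative stand-ins, not OpenAI's actual implementation:

```python
# Hypothetical sketch of a context-compaction loop: when the running history
# exceeds a token budget, older turns are collapsed into a summary message
# while the most recent turns are kept verbatim.

def summarize(messages):
    """Stand-in for a model call that condenses old turns into a summary."""
    return {"role": "system", "content": f"[summary of {len(messages)} earlier steps]"}

def token_count(messages):
    # Crude proxy: count whitespace-separated words across all messages.
    return sum(len(m["content"].split()) for m in messages)

def compact(history, limit=50, keep_recent=4):
    """If the history exceeds `limit` tokens, replace all but the most
    recent `keep_recent` turns with a single summary message."""
    if token_count(history) <= limit:
        return history
    old, recent = history[:-keep_recent], history[-keep_recent:]
    return [summarize(old)] + recent

history = [{"role": "user", "content": "step " * 10} for _ in range(12)]
compacted = compact(history)
print(len(compacted))  # 5: one summary message + 4 recent turns
```

Repeating this whenever the window fills is what lets a session's effective history span millions of tokens while the live context stays bounded.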
Nvidia CEO predicts ‘crazy good’ fourth quarter after strong earnings calm AI bubble fears
Nvidia CEO Jensen Huang said the company is heading into a “crazy good” fiscal Q4 after delivering stronger-than-expected Q3 results, emphasizing sustained demand for AI infrastructure. Nvidia guided Q4 revenue to $65 billion ±2% versus $61.66 billion expected and forecast an adjusted gross margin of about 75% ±50 bps, with plans to keep margins in the mid‑70% range through fiscal 2027, according to CFO Colette Kress. Q3 sales rose 62%, the first acceleration in seven quarters, driven by data-center revenue of $51.2 billion versus $48.62 billion expected. Huang reiterated Nvidia has roughly $500 billion in bookings for advanced AI chips through 2026 and said the multiyear buildout of “accelerated computing” and AI is modernizing global compute infrastructure.
Responding to concerns of “circular” deals, Huang said Nvidia hasn’t invested any money yet in firms like OpenAI and that none of its projected revenue includes such investments; he added OpenAI, Anthropic, and xAI raise funding independently and their rounds have been oversubscribed. Shares rose about 5% after hours, adding roughly $220 billion in market cap, lifting peers AMD and mega-cap customers Alphabet and Microsoft, and boosting S&P 500 futures by 1%.
Other News
Tools
Meta AI Releases Segment Anything Model 3 (SAM 3) for Promptable Concept Segmentation in Images and Videos. The model can detect, segment, and track every instance of open-vocabulary concepts in images and long videos using text phrases and visual exemplars. It’s supported by a new SA-Co dataset of ~270K evaluated concepts and over 4M auto-annotated examples, plus an 848M-parameter DETR-based detector and tracker with a presence token for improved precision.
Google is introducing its own version of Apple’s private AI cloud compute. Called Private AI Compute, the service routes demanding AI tasks from devices to a secure cloud enclave so users can access more powerful, personalized features while, Google says, sensitive data remains inaccessible to anyone else—including Google.
Google will let users call stores, browse products, and check out using AI. New tools enable conversational product searches, let an AI call local stores for stock and deals, and authorize an AI to automatically buy items when prices hit a set threshold.
Baidu Unveils ERNIE 5.0 and a Series of AI Applications at Baidu World 2025, Ramps Up Global Push. The company showcased upgrades across its AI portfolio—including the natively omni-modal ERNIE 5.0, new and improved digital human and agent products like Famou and GenFlow 3.0, global rollouts for tools such as MeDo and Oreate—and reported Apollo Go has completed over 17 million driverless rides.
Fei-Fei Li’s World Labs speeds up the world model race with Marble, its first commercial product. Marble converts text, images, videos, 3D layouts, or panoramas into persistent, downloadable 3D environments with AI-native editing tools, multi-input support, scene expansion, and export options for game, VFX, VR, and simulation workflows.
ChatGPT launches group chats globally. The feature supports up to 20 invited users collaborating with each other and ChatGPT in a shared conversation—where the AI can search, summarize, react with emojis, and be tagged to respond—while personal settings and memory remain private.
Mozilla announces an AI ‘window’ for Firefox. Mozilla says the opt-in “AI Window” will be a user-controlled browsing mode with a selectable AI assistant/chatbot, built with public feedback and offered alongside private and classic windows.
Business
California DMV expands Waymo's permitted operating areas. New approval lets Waymo run its driverless operations across the full Bay Area, Sacramento, and almost all of Southern California up to the Mexican border.
Waymo enters 3 more cities: Minneapolis, New Orleans, and Tampa. Waymo will begin manually driving and testing its vehicles in those cities as part of validation before aiming to deploy commercial robotaxi services, while facing local challenges like Minneapolis snow and New Orleans’ narrow, pedestrian-heavy streets.
Anthropic announces $50 billion data center plan. The deal funds custom-built Texas and New York facilities coming online in 2026 to handle Claude’s heavy compute needs, complementing Anthropic’s existing cloud partnerships with Google and Amazon.
Jeff Bezos reportedly returns to the trenches as co-CEO of new AI startup, Project Prometheus. The startup, backed with $6.2 billion and staffed by nearly 100 AI researchers from firms like Meta, OpenAI, and DeepMind, will focus on building AI tools that simulate and design for engineering and manufacturing across sectors such as computers, aerospace, and automobiles.
Coding assistant Cursor raises $2.3B 5 months after its previous round. The new funding, led by Accel and Coatue with participation from Nvidia and Google, will support development of Cursor’s Composer model so the company can reduce reliance on third-party AI models amid rising competition from OpenAI and Anthropic.
Warner Music Group Settles AI Infringement Lawsuit With Udio. The settlement paves the way for Udio’s 2026 platform to offer licensed WMG recordings and publishing, includes opt-in artist participation with fingerprinting/filtering safeguards, and follows a similar deal Udio made with Universal while Sony remains in litigation.
ElevenLabs’ new AI marketplace lets brands use famous voices for ads. The marketplace connects brands with rights holders to license and synthesize AI‑replicated celebrity and historical voices through curated, consent‑based deals that promise transparency and compensation.
Baidu teases next-gen AI training, inference accelerators. Baidu says the new M100 inference chip and clustered Tianchi256/Tianchi512 systems (with an M300 training chip due in 2027) aim to cut inference costs, handle MoE and multi-trillion-parameter model workloads, and reduce reliance on Western accelerators.
Research
OpenAI Researchers Train Weight Sparse Transformers to Expose Interpretable Circuits. The team enforces extreme weight sparsity during training (keeping roughly 1 in 1,000 weights) and measures interpretability by finding minimal task-specific subnetworks, showing much smaller, often fully reverse-engineerable circuits for Python next-token tasks compared with dense models.
Watch Google DeepMind’s new AI agent learn to play video games. The agent, called SIMA 2, combines DeepMind’s earlier SIMA multiworld agent with Google’s Gemini to interpret high-level goals, perform complex reasoning, and take skillful actions in unseen games. It’s available as a limited research preview for academics and developers.
Think-at-Hard: Selective Latent Iterations to Improve Reasoning Language Models. The method employs a neural decider, duo-causal attention, and LoRA adapters with an oracle-guided training scheme to apply extra latent iterations only to hard-to-predict tokens, improving reasoning accuracy by ~4–5% (up to ~5.8% with three iterations) while keeping average FLOPs close to the single-iteration baseline.
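The selective-iteration idea can be illustrated with a toy sketch: spend a second refinement pass only on tokens a decider flags as hard. The confidence scores and refine() rule below are placeholder stand-ins for the paper's neural decider and latent iteration, not its actual method:

```python
# Toy sketch of per-token selective compute: refine only low-confidence tokens.

def refine(token):
    """Placeholder for one extra latent iteration on a hard token."""
    return token.upper()

def decode(tokens, confidences, threshold=0.5):
    """Apply refine() only where confidence falls below the threshold,
    keeping average compute close to the single-pass baseline."""
    return [refine(t) if c < threshold else t
            for t, c in zip(tokens, confidences)]

print(decode(["a", "b", "c"], [0.9, 0.2, 0.7]))  # ['a', 'B', 'c']
```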
TiDAR: Think in Diffusion, Talk in Autoregression. The approach combines diffusion-based parallel token drafting with autoregressive rejection-sampled decoding in a single model and forward pass, reusing the KV cache and “free token slots” to achieve much higher throughput with similar or minimally reduced quality.
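The draft-then-verify pattern TiDAR builds on can be sketched with a greedy-acceptance toy: a drafter proposes a block of tokens in one pass, a verifier accepts the longest agreeing prefix, and one verifier token is always appended so each step makes progress. The integer-counting "models" below are illustrative, not the paper's single-model architecture:

```python
# Minimal draft-then-verify decoding sketch (greedy acceptance).

def draft_block(prefix, k=4):
    """Drafter proposes k tokens in one parallel pass. Toy rule: count up
    from the last token, with a deliberate error at position 2."""
    block = [prefix[-1] + i + 1 for i in range(k)]
    block[2] = -1  # deliberate draft mistake, to be caught by the verifier
    return block

def verify_next(prefix):
    """Verifier's greedy choice for the next token (toy rule: last + 1)."""
    return prefix[-1] + 1

def decode_step(prefix, k=4):
    """Accept the longest drafted prefix the verifier agrees with, then
    append one verifier token so each step always advances."""
    accepted = []
    for tok in draft_block(prefix, k):
        if tok != verify_next(prefix + accepted):
            break
        accepted.append(tok)
    accepted.append(verify_next(prefix + accepted))
    return prefix + accepted

print(decode_step([0]))  # [0, 1, 2, 3]: two drafts accepted + one verifier token
```

The throughput win comes from drafting many tokens per forward pass while the verifier preserves (or nearly preserves) the sequential model's output distribution.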
ATLAS: A High-Difficulty, Multidisciplinary Benchmark for Frontier Scientific Reasoning. ATLAS provides a contamination-resistant, expert-crafted set of ~800 high-difficulty, multidisciplinary scientific problems (targeting <20% pass rate) along with a scalable LRM-as-judge evaluation workflow and a plan for a community-driven platform.
LoCoBench-Agent: An Interactive Benchmark for LLM Agents in Long-Context Software Engineering. This benchmark converts 8,000 scenarios into interactive multi-turn environments with specialized tools and nine bias-mitigated metrics to measure agents’ long-context comprehension, tool-usage strategies, efficiency, and error recovery across 10K–1M token contexts.
Adversarial Poetry as a Universal Single-Turn Jailbreak Mechanism in Large Language Models. The study finds that reformulating harmful prompts into poetic verse drastically increases jailbreak success—raising attack-success rates up to threefold and averaging 62% across 25 major models—indicating poetic structure itself reliably undermines safety controls across providers and domains.
Back to Basics: Let Denoising Generative Models Denoise. The authors show that training plain Vision Transformers to directly predict clean images in pixel space (x-prediction) yields strong diffusion models without pretraining, latents, or auxiliary losses, often outperforming ε- and v-prediction and enabling self-contained “Diffusion + Transformer” modeling.
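The difference between prediction targets can be made concrete with a small sketch, assuming a standard variance-preserving noising rule x_t = a·x0 + s·ε (a² + s² = 1); the schedule and scalar "data" here are illustrative, not the paper's setup:

```python
import math
import random

random.seed(0)

def noisy_sample(x0, t):
    """Corrupt a clean sample x0 at noise level t in [0, 1]."""
    a = math.cos(t * math.pi / 2)   # signal scale
    s = math.sin(t * math.pi / 2)   # noise scale (a**2 + s**2 == 1)
    eps = random.gauss(0.0, 1.0)
    return a * x0 + s * eps, eps, a, s

x0 = 1.5
x_t, eps, a, s = noisy_sample(x0, t=0.3)

# x-prediction: the network regresses the clean sample directly.
target_x = x0
# eps-prediction: the network regresses the added noise; the clean sample
# is then recovered as (x_t - s*eps_hat) / a.
target_eps = eps
recovered = (x_t - s * target_eps) / a
print(abs(recovered - x0) < 1e-9)  # True: the two targets carry the same information
```

The targets are algebraically interchangeable; the paper's point is that regressing x0 directly in pixel space trains better for plain ViTs than the ε- or v-parameterizations.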
LeJEPA: Provable and Scalable Self-Supervised Learning Without the Heuristics. LeJEPA introduces Sketched Isotropic Gaussian Regularization as a principled training objective for joint-embedding predictive architectures, improving embedding quality and stability across architectures and datasets.
SAM 3D: 3Dfy Anything in Images. The model reconstructs 3D objects from single images by combining synthetic pretraining with real-world alignment in a multi-stage training pipeline and outperforms baselines in human preference evaluations.
OlmoEarth: Stable Latent Image Modeling for Multimodal Earth Observation. A training method called Latent MIM Lite stabilizes latent-space self-supervised learning for multimodal satellite and sensor data; the team evaluates OlmoEarth across research benchmarks and nonprofit use cases, deploying it in an open platform for conservation and humanitarian partners.
A new AI benchmark tests whether chatbots protect human well-being. The HumaneBench benchmark tested 15 popular models across 800 realistic scenarios and found most models improved when prompted to prioritize well-being, but 67% became actively harmful under adversarial instructions, with only a few (like GPT‑5, GPT‑5.1, Claude 4.1, and Sonnet 4.5) maintaining protections.
Concerns
Hackers use Anthropic’s AI model Claude once again. Anthropic says China-linked hackers used Claude to automate about 30 cyberattacks in September—handling 80–90% of the work and stealing sensitive data from four victims while involving humans only for a few approvals.
OpenAI Locks Down San Francisco Offices Following Alleged Threat From Activist. Employees were ordered to shelter in place and take security precautions after police received a 911 report alleging the named individual—previously linked to Stop AI—threatened violence and had been seen at OpenAI’s San Francisco facilities.
Policy
Europe is scaling back its landmark privacy and AI laws. The Commission’s proposals would loosen GDPR limits on using anonymized and pseudonymized personal data for AI training, delay stricter rules for high-risk AI systems, simplify compliance for smaller firms, and reduce cookie pop-ups while centralizing AI oversight.
Court rules that OpenAI violated German copyright law; orders it to pay damages. The court found OpenAI used licensed musical works to train ChatGPT without permission, awarding damages to GEMA; OpenAI said it disagrees and may appeal.