Last Week in AI #328 - DeepSeek 3.2, Mistral 3, Trainium3, Runway Gen-4.5
DeepSeek Releases New Reasoning Models, Mistral closes in on Big AI rivals with new open-weight frontier and small models, and more!
DeepSeek Releases New Reasoning Models to Match GPT-5, Rival Gemini 3 Pro
DeepSeek released two open-source reasoning-first models, DeepSeek-V3.2 and DeepSeek-V3.2-Speciale, on Hugging Face, with V3.2 live across its app, web, and API, and Speciale temporarily available via API until December 15, 2025. V3.2 succeeds V3.2-Exp and targets “GPT-5 level performance,” balancing inference efficiency with long-context handling while integrating “Thinking in Tool-Use” so structured reasoning operates within and alongside external tool calls; Speciale is priced the same but lacks tool-call support.
DeepSeek says Speciale rivals Gemini 3.0 Pro and achieves gold-level (expert) results across competitive benchmarks like IMO, CMO, and ICPC World Finals; other reports add wins at IMO 2025 and IOI. The release expands DeepSeek’s agent-training approach with a synthetic dataset of 1,800+ environments and 85,000 complex instructions, and the company also highlighted its open-weight DeepSeekMath-V2 for theorem proving and gold-level IMO 2025 scores.
Earlier coverage of V3.2-Exp emphasized sparse attention as a key technical innovation for speed and cost, citing over 50% API cost reductions and improved computational efficiency, with strengths in logical/quantitative reasoning and creative tasks (e.g., SVG animation, SaaS landing pages, browser-based OS UIs). Those reports also note trade-offs: limitations on very long-context tasks and occasional issues with intricate icon generation, with future refinement planned in a DeepSeek R2 line.
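To make the sparse-attention idea concrete, here is a minimal top-k sparse attention sketch in PyTorch; it is an illustrative approximation of the general technique, not DeepSeek's actual DSA implementation, and the function name, shapes, and top_k value are invented for the example.

```python
# Illustrative top-k sparse attention (NOT DeepSeek's exact kernel): each query
# attends only to its k highest-scoring keys, cutting the O(L^2) cost that
# dominates long-context inference.
import torch
import torch.nn.functional as F

def topk_sparse_attention(q, k, v, top_k=64):
    """q, k, v: [batch, heads, seq_len, dim]; keeps only top_k keys per query."""
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5   # [B, H, L, L]
    top_k = min(top_k, scores.shape[-1])
    # Mask out everything below each query's k-th highest score.
    thresh = scores.topk(top_k, dim=-1).values[..., -1:]
    scores = scores.masked_fill(scores < thresh, float("-inf"))
    return F.softmax(scores, dim=-1) @ v

q = k = v = torch.randn(1, 8, 1024, 64)
out = topk_sparse_attention(q, k, v, top_k=128)              # [1, 8, 1024, 64]
```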
Mistral closes in on Big AI rivals with new open-weight frontier and small models
Mistral unveiled the Mistral 3 family: one open-weight frontier model (Mistral Large 3) plus nine smaller “Ministral 3” models spanning 14B, 8B, and 3B parameters in Base, Instruct, and Reasoning variants. Large 3 is a multimodal, multilingual model using a granular Mixture of Experts with 41B active parameters out of 675B total and a 256k context window, positioned for document analysis, coding, AI assistants, and workflow automation.
It joins the small cohort of open frontier models that integrate vision and language in one system, comparable to Llama 3 and Qwen3-Omni, rather than pairing separate LLM and vision models. Mistral argues initial benchmarks understate its value because customization and fine-tuning on enterprise data often close the gap with closed models.
On the deployment front, Ministral 3 targets practicality: all variants support vision, 128k–256k context windows, and multilingual use, and are designed to run offline on a single GPU for on-prem, laptops, edge devices, or robots. Mistral claims the small models can match or outperform larger closed systems after fine-tuning, with higher efficiency and fewer tokens generated for equivalent tasks.
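To make the "active vs. total parameters" distinction concrete, here is a minimal, hypothetical top-k Mixture-of-Experts routing sketch (illustrative only, not Mistral's implementation); only the experts the router selects run for each token, which is how a 675B-parameter model can spend roughly 41B parameters per token.

```python
# Minimal top-k MoE routing sketch (illustrative; expert count, sizes, and
# routing details are invented for the example).
import torch
import torch.nn as nn

class TopKMoE(nn.Module):
    def __init__(self, dim=512, n_experts=16, top_k=2):
        super().__init__()
        self.router = nn.Linear(dim, n_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(n_experts)
        ])
        self.top_k = top_k

    def forward(self, x):                       # x: [tokens, dim]
        gates = self.router(x).softmax(dim=-1)  # [tokens, n_experts]
        weights, idx = gates.topk(self.top_k, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.top_k):          # run only the selected experts
            for e in idx[:, slot].unique():
                mask = idx[:, slot] == e
                out[mask] += weights[mask, slot, None] * self.experts[int(e)](x[mask])
        return out

moe = TopKMoE()
print(moe(torch.randn(8, 512)).shape)           # torch.Size([8, 512])
```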
Amazon releases an impressive new AI chip and teases an Nvidia-friendly roadmap
At re:Invent 2025, AWS introduced Trainium3 and the Trainium3 UltraServer, claiming major gen-over-gen gains with a strong emphasis on efficiency. The 3 nm Trainium3 chip powers UltraServers with 144 chips each, and AWS says workloads run 4x faster with 4x more memory than the previous generation, for both training and inference. Using AWS’s homegrown networking, thousands of UltraServers can be linked to scale up to 1 million Trainium3 chips—10x the prior max cluster size. Energy efficiency is a key selling point, with AWS touting 40% lower energy use per task versus last gen, which should reduce power draw and customer costs.
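As a quick back-of-envelope check on those figures (using only the numbers quoted above, not an AWS spec sheet), the scale claims imply roughly 6,900 UltraServers in a maximum cluster and a prior ceiling of about 100,000 chips:

```python
# Back-of-envelope arithmetic from the quoted Trainium3 figures (assumed values,
# not an AWS spec sheet).
chips_per_ultraserver = 144
max_cluster_chips = 1_000_000
print(max_cluster_chips // chips_per_ultraserver)  # ~6944 UltraServers per max cluster
print(max_cluster_chips // 10)                     # implied prior ceiling: ~100,000 chips
```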
Looking ahead, AWS previewed Trainium4, now in development, with another sizable performance jump and—crucially—support for Nvidia’s NVLink Fusion interconnect. That would allow future Trainium4 systems to interoperate with Nvidia GPUs while leveraging AWS’s cost-optimized racks, potentially easing adoption for CUDA-centric AI stacks. The interoperability could make it simpler to extend Nvidia-based workflows onto AWS Trainium infrastructure without leaving the Nvidia ecosystem.
Runway rolls out new AI video model that beats Google, OpenAI in key benchmark
Runway launched Gen-4.5, a new text-to-video model that generates high-definition clips from written prompts specifying motion and action. The company says the model excels at physics consistency, realistic human motion, coherent camera movements, and cause-and-effect reasoning, addressing common failure modes in video generation like object drift and temporal flicker. In blind A/B comparisons on the independent Video Arena benchmark by Artificial Analysis, Gen-4.5 ranks No. 1 overall for text-to-video quality. Voters compare pairs of outputs without seeing model identities, and Gen-4.5 beat competitors across categories emphasizing motion fidelity and scene coherence.
On the current leaderboard, Google’s Veo 3 sits at No. 2, while OpenAI’s Sora 2 Pro is at No. 7, highlighting a notable gap in head-to-head user preference for Runway’s outputs. Runway emphasizes that a small, focused team—about 100 employees—developed the model, underscoring efficiency in training and iteration against much larger rivals. Key capabilities include better handling of human biomechanics, consistent object interactions under physical constraints, and smoother, controllable camera paths. CEO Cristóbal Valenzuela framed the result as evidence that careful model design and evaluation can outpace scale alone in video generation quality.
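Artificial Analysis does not spell out its scoring here, but arena-style leaderboards built on blind pairwise votes typically use an Elo-like rating; the sketch below illustrates that mechanism, with model names and numbers invented for the demo rather than taken from the actual leaderboard.

```python
# Illustrative Elo-style update for a pairwise "arena" leaderboard (not
# necessarily Artificial Analysis's exact method): each blind A/B vote nudges
# the winner's rating up and the loser's down.
def elo_update(r_winner, r_loser, k=32):
    expected_win = 1 / (1 + 10 ** ((r_loser - r_winner) / 400))
    delta = k * (1 - expected_win)
    return r_winner + delta, r_loser - delta

ratings = {"model-a": 1000.0, "model-b": 1000.0}
votes = [("model-a", "model-b"), ("model-a", "model-b"), ("model-b", "model-a")]
for winner, loser in votes:
    ratings[winner], ratings[loser] = elo_update(ratings[winner], ratings[loser])
print(ratings)   # the more-preferred model ends up with the higher rating
```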
Black Forest Labs raises $300M at $3.25B valuation
Black Forest Labs raised $300 million in a Series B at a $3.25 billion valuation, co-led by Salesforce Ventures and Anjney Midha (AMP), with participation from a16z, NVIDIA, Northzone, Creandum, Earlybird VC, BroadLight Capital, General Catalyst, Temasek, Bain Capital Ventures, Air Street Capital, Visionaries Club, Canva, and Figma Ventures. The company said the funds will go toward research and development. Black Forest Labs builds foundation AI models for image generation and editing and has seen rapid adoption since launching in August 2024. Its models power image features in products from Adobe, fal.ai, Picsart, ElevenLabs, VSCO, Vercel, and previously underpinned image generation in Elon Musk’s Grok chatbot.
Recently, the startup released Flux 2, an image generation model with improved text and image rendering, support for up to 10 reference images to preserve style and tone, and output up to 4K resolution. Flux 2 emphasizes better text fidelity within images, a common shortcoming of prior models, while adding multi-image conditioning for consistent aesthetics. The company’s founding team—Robin Rombach, Patrick Esser, and Andreas Blattmann—previously helped create Stability AI’s Stable Diffusion, underscoring their experience in diffusion-based generative models. Black Forest Labs’ fast growth and large customer integrations position Flux 2 as a competitive alternative in the image model space.
Other News
Tools
Gemini 3 Deep Think is rolling out now. Available only to Google AI Ultra subscribers in the Gemini app, Deep Think mode tops the ARC-AGI-2 reasoning benchmark and targets complex math, science, and logic problems.
Nvidia announces new open AI models and tools for autonomous driving research. Alongside the releases, Nvidia published a Cosmos Cookbook on GitHub with guides, inference resources, and workflows to help developers curate data, generate synthetic data, and fine-tune Cosmos-based models for autonomous driving research.
Black Forest Labs launches Flux.2 AI image models to challenge Nano Banana Pro and Midjourney. It’s a new image generation and editing system complete with four different models designed to support production-grade creative workflows.
Kling’s Video O1 launches as the first all-in-one video model for generation and editing. The model can generate short videos from prompts or references and perform complex edits—like swapping subjects, changing weather, or preserving character consistency across shots—by processing multiple images, videos, and text inputs in a single multimodal prompt.
Sora and Nano Banana Pro throttled amid soaring demand. OpenAI and Google imposed new daily generation limits on free users of Sora and Nano Banana Pro—six video generations for Sora and two images for Nano Banana Pro—while keeping subscriber caps unchanged and offering paid add-ons as demand-driven mitigations.
Claude Opus 4.5’s Soul Document. A 14,000-token “soul overview” used in training Claude Opus 4.5 (confirmed by Anthropic) outlines the model’s intended safety-focused values, describes the document’s role in supervised training, and gives guidance on issues like prompt injection.
Business
Waymo’s testing AVs in four more cities, including Philly. The trials involve supervised, human-monitored runs in Philadelphia, Baltimore, St. Louis, and Pittsburgh as a precursor to fully driverless deployments planned after data collection and safety assessments.
OpenAI declares ‘code red’ as Google catches up in AI race. Altman ordered a pause on nonessential product work and redirected staff to daily efforts to make ChatGPT faster, more reliable, and better personalized as competitors like Google and Anthropic close the gap.
Altman memo: new OpenAI model coming next week, outperforming Gemini 3. Internal tests cited by CEO Sam Altman claim the model will surpass Gemini 3, and its accelerated launch has prompted OpenAI to reprioritize resources away from advertising, agents, and other projects toward improving ChatGPT and image-generation capabilities.
Leak confirms OpenAI is preparing ads on ChatGPT for public rollout. Internal beta strings show features like “bazaar content,” “search ad,” and a “search ads carousel” tied to ChatGPT’s search experience, suggesting OpenAI may roll out targeted, personalized ads to its large and rapidly growing user base.
Anthropic reportedly preparing for one of the largest IPOs ever in race with OpenAI: FT. The company has held preliminary talks with law firms and banks, pursued a private funding round that could value it above $300 billion, and discussed large commitments from Microsoft and Nvidia while preparing internal IPO-related work.
Nvidia takes $2 billion stake in Synopsys with expanded computing power partnership. The investment will fund a multiyear collaboration where Nvidia provides computing resources and joint go-to-market efforts to help Synopsys speed up compute-intensive design and agentic AI engineering while expanding cloud access.
Anthropic acquires developer tool startup Bun to scale AI coding. The acquisition brings Bun’s integrated runtime, package management, bundling, and testing tools into Anthropic to help scale and stabilize its Claude Code offering, which has already reached a $1 billion annualized revenue run rate and been adopted by major enterprises.
OpenAI to acquire Neptune, a startup that helps with AI model training. The acquisition will bring Neptune’s monitoring and debugging tools and its metrics dashboard into OpenAI’s training stack, with the startup winding down external services as the companies integrate their work.
ChatGPT’s user growth has slowed, report finds. Sensor Tower data shows ChatGPT still leads in downloads and monthly active users, but its growth has slowed while rival Google Gemini—boosted by features like the Nano Banana image model and deeper Android integration—is growing faster and eating into market share.
Microsoft cuts AI sales targets in half after salespeople miss their quotas. The company lowered internal growth targets for its AI agent products—cutting some quotas by about half—after many salespeople failed to meet ambitious Foundry and Copilot sales goals.
Elon Musk slashes Tesla Robotaxi fleet goal from 500 to ~60 in Austin. Instead of the 500 Robotaxis Musk promised for Austin by year-end, Tesla’s supervised pilot is on track to reach only about 60 vehicles, roughly double the current ~30, highlighting a roughly 90% shortfall versus the target.
OpenAI just made another circular deal. The partnership will see OpenAI provide employees, models, products, and services to Thrive Holdings’ portfolio in IT services and accounting, potentially gaining access to company data for model training and future payouts from the private equity firm’s returns.
Apple AI chief steps down following Siri setbacks. John Giannandrea is stepping down as Apple’s AI chief; his successor will oversee Apple’s AI models, ML research, and AI safety as the company works to relaunch an upgraded, more personalized Siri next spring.
Research
OpenAI has trained its LLM to confess to bad behavior. Researchers trained GPT-5-Thinking to produce brief three-part “confessions” admitting when it lied, cheated, or took shortcuts in tests—successfully eliciting admissions in most trials, with limits when the model wasn’t aware of its misbehavior.
The Art of Scaling Test-Time Compute for Large Language Models. Experiments with recent models show that optimal test-time scaling depends on each model’s post-training method—models trained with GRPO-like algorithms favor short, concise reasoning, while those trained with GSPO-like methods sustain longer traces and benefit from increased inference compute.
Nemotron-Flash: Towards Latency-Optimal Hybrid Small Language Models. The work combines depth–width scaling, hybrid attention-operator selection via evolutionary search, and weight-norm training tweaks to produce small models that improve latency, throughput, and accuracy trade-offs compared with prior SLMs.
Feedback Descent: Open-Ended Text Optimization via Pairwise Comparison. The method iteratively uses language models to translate pairwise preferences and textual feedback into targeted edits of text artifacts at inference time, yielding gradient-like optimization in text space across tasks like prompt tuning, visual design, and molecule discovery.
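A minimal control-flow sketch of the Feedback Descent idea described above follows; the propose, judge, and revise callables are hypothetical stand-ins for LLM calls, not the paper's actual prompts or models.

```python
# Minimal sketch of a Feedback Descent-style loop (illustrative only; the
# callables are hypothetical stand-ins for LLM calls).
def feedback_descent(artifact, goal, propose, judge, revise, steps=10):
    for _ in range(steps):
        candidate = propose(artifact, goal)                  # LLM proposes an edited artifact
        winner, feedback = judge(artifact, candidate, goal)  # pairwise comparison + textual critique
        artifact = winner                                    # keep whichever draft the judge prefers
        artifact = revise(artifact, feedback, goal)          # fold the critique into the next edit
    return artifact

# Dry run with trivial stand-ins, just to show the control flow:
result = feedback_descent(
    artifact="draft v0",
    goal="maximize clarity",
    propose=lambda a, g: a + " (edited)",
    judge=lambda a, b, g: (b, "be more specific"),
    revise=lambda a, f, g: a,
    steps=2,
)
print(result)   # "draft v0 (edited) (edited)"
```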
Infinity-RoPE: Action-Controllable Infinite Video Generation Emerges From Autoregressive Self-Rollout. The paper uses a training-free Block-Relativistic RoPE positional encoding plus KV Flush and RoPE Cut inference operators to convert short-horizon autoregressive DiTs into action-controllable, constant-memory infinite-horizon video generators.
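For background, the snippet below implements standard rotary position embeddings (RoPE), the mechanism Infinity-RoPE builds on; the paper's Block-Relativistic variant and its KV Flush and RoPE Cut operators are not reproduced here, and the function is an illustrative baseline.

```python
# Standard rotary position embedding (RoPE) for reference; feature pairs are
# rotated by position-dependent angles so relative positions are encoded in
# query-key dot products.
import torch

def rope(x, base=10000.0):
    """x: [seq_len, dim] with even dim."""
    seq_len, dim = x.shape
    pos = torch.arange(seq_len, dtype=torch.float32)[:, None]             # [L, 1]
    freqs = base ** (-torch.arange(0, dim, 2, dtype=torch.float32) / dim)  # [dim/2]
    angles = pos * freqs                                                    # [L, dim/2]
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = torch.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

print(rope(torch.randn(16, 64)).shape)   # torch.Size([16, 64])
```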
Concerns
Chicago Tribune sues Perplexity. The Tribune alleges Perplexity scraped and reproduced its articles—including bypassing paywalls and using its content in retrieval-augmented generation—without permission, and has sued for copyright infringement.
Waymo to issue software recall over how robotaxis behave around school buses. Waymo will file a voluntary software recall with federal regulators after updating its code to improve how robotaxis slow and stop around stopped school buses following multiple reported incidents and regulatory scrutiny.
Google is experimentally replacing news headlines with AI clickbait nonsense. Google’s Discover service is testing auto-generated, sometimes misleading or nonsensical AI-written headlines that replace publishers’ originals with minimal disclosure.
Policy
Trump signs executive order launching AI initiative being compared to the Manhattan Project. The order tasks federal agencies and industry partners with integrating vast government datasets and expanded supercomputing resources into a centralized “American Science and Security Platform” led by Michael Kratsios to train scientific foundation models and target applications like advanced manufacturing, biotech, and nuclear energy within set 90- and 270-day timelines.
California’s ban on self-driving trucks could soon be over. The DMV’s revised rules would let companies test and eventually deploy driverless heavy-duty trucks on California highways through a phased permitting process with required mileage testing, reporting updates, and potential law-enforcement ticketing provisions.
Analysis
AI-Powered Browsers Are Failing Badly. These tools are slow, error-prone, require careful prompt-crafting and close supervision to be useful, and introduce meaningful security risks like prompt-injection attacks.








