Last Week in AI #304 - OpenAI Audio, Ernie 4.5, Claude Websearch
OpenAI Unveils New Audio Models to Make AI Agents Sound More Human Than Ever, Baidu launches two new versions of its AI model Ernie, and more!
Top News
OpenAI Unveils New Audio Models to Make AI Agents Sound More Human Than Ever
OpenAI has introduced a suite of new audio models aimed at making AI voice agents sound more human-like and responsive. The release includes two new speech-to-text models, GPT-4o-transcribe and GPT-4o-mini-transcribe, which outperform previous models in transcription accuracy across multiple languages, even in challenging scenarios such as understanding different accents and filtering background noise. The new GPT-4o-mini-tts text-to-speech model allows developers to control the tone and delivery of the AI's speech, a feature OpenAI refers to as "steerability". Additionally, an updated Agents SDK simplifies the conversion of text agents into voice agents.
Baidu launches two new versions of its AI model Ernie
Chinese tech giant Baidu has introduced two new versions of its artificial intelligence model Ernie: Ernie 4.5 and Ernie X1. The company claims that Ernie X1 performs at the same level as DeepSeek R1 at half the cost, while Ernie 4.5 has a "high EQ" that lets it understand memes and satire. Both models are multimodal, meaning they can process video, images, audio, and text. Despite being an early competitor to OpenAI's ChatGPT, Baidu has struggled to achieve widespread adoption. The company plans to launch Ernie 5 later this year, promising further multimodal enhancements.
Anthropic adds web search to its Claude chatbot
Anthropic's AI chatbot, Claude, has been upgraded with a web search feature, allowing it to scour the internet for information to inform its responses. The feature is currently available for paid users in the U.S., with plans to extend it to free users and other countries. The web search function works with the latest model, Claude 3.7 Sonnet, and provides direct citations for fact-checking. However, the feature does not trigger consistently for questions about current events. This update brings Claude in line with other AI chatbots like OpenAI's ChatGPT, Google's Gemini, and Mistral's Le Chat, despite previous claims that Claude was designed to be self-contained.
Meta AI is finally coming to the EU, but with limitations
Meta has announced the launch of its AI-powered virtual assistant, Meta AI, in the European Union, despite ongoing regulatory issues with European privacy authorities. The tool, which has been available in the U.S. since 2023, will be rolled out across Meta's social platforms, including WhatsApp in the U.K., but with a more limited feature set due to the EU's stringent privacy regulations. Meta AI, which can chat, answer questions, and generate images, has not been trained on EU users' data, so Meta will not be notifying users or seeking their consent. The launch represents Meta's first step in bringing more AI to Europe, despite the company's criticism of Europe's AI regulations.
Other News
Tools
Roblox’s new AI model can generate 3D objects - Roblox's Cube 3D model, which is open-sourced, aims to enhance 3D creation efficiency by generating 3D models from text prompts and will eventually support multimodal inputs like images and videos.
Allen Institute for AI (AI2) Releases OLMo 32B: A Fully Open Model to Beat GPT 3.5 and GPT-4o mini on a Suite of Multi-Skill Benchmarks - OLMo 2 32B, released by the Allen Institute for AI, is a fully open large language model that surpasses GPT-3.5 Turbo and GPT-4o mini on a suite of multi-skill benchmarks.
NVIDIA Launches Family of Open Reasoning AI Models for Developers and Enterprises to Build Agentic AI Platforms - NVIDIA's Llama Nemotron models are open models enhanced for reasoning and decision-making, aimed at developers and enterprises building agentic AI platforms.
Stability AI’s new AI model turns photos into 3D scenes - Stability AI's Stable Virtual Camera model allows users to create immersive 3D videos from 2D images by generating novel views and dynamic camera paths, although it may struggle with complex scenes and certain textures.
Google brings a ‘canvas’ feature to Gemini, plus Audio Overview - Google has introduced a new Canvas feature to its Gemini chatbot, allowing users to collaboratively create and refine writing and coding projects, alongside an Audio Overview feature that generates podcast-style audio summaries of documents.
Canopy Labs Releases Orpheus, a Permissively-Licensed LLM for Convincing Text to Speech - Canopy Labs has launched Orpheus, a family of large language models for text-to-speech generation, capable of conveying emotions and performing zero-shot voice cloning, with the three-billion-parameter model available under an open-source license.
xAI launches an API for generating images - xAI's new image generation API, featuring the "grok-2-image-1212" model, offers competitive pricing and limited customization options as the company seeks to expand its revenue streams and investor interest.
Business
1X will test humanoid robots in ‘a few hundred’ homes in 2025 - 1X plans to test its humanoid robot, Neo Gamma, in a few hundred homes in 2025, using teleoperators to compensate for its current limitations, while addressing privacy concerns and collecting data to improve its AI capabilities.
Mark Zuckerberg says that Meta’s Llama models have hit 1B downloads - Meta's Llama models have reached 1 billion downloads despite facing legal and competitive challenges, with plans for new model releases and significant investment in AI development.
Elon Musk’s AI company, xAI, acquires a generative AI video startup - xAI's acquisition of Hotshot suggests plans to develop competitive video generation models, potentially integrating them into its Grok chatbot platform.
Perplexity is reportedly in talks to raise up to $1B at an $18B valuation - Perplexity, an AI-powered search startup, is reportedly in early talks to raise up to $1 billion, doubling its valuation to $18 billion, amid increasing competition and expansion into new areas like enterprise solutions and an "agentic" browser.
Apple Shuffles AI Executive Ranks in Bid to Turn Around Siri - Apple is restructuring its AI leadership by appointing Vision Pro creator Mike Rockwell to lead Siri development, aiming to address delays and improve its AI technology, which has been lagging behind competitors.
OpenAI’s o1-pro is the company’s most expensive AI model yet - OpenAI's o1-pro model, despite its high cost and increased computational power, has received mixed reviews for its performance improvements over the standard o1 model, particularly in solving complex problems.
BotQ: US firm’s factory where humanoids will build robots, deliver 12,000 units a year - The BotQ factory will use vertical integration and software systems such as MES, PLM, and ERP to manage high-quality, efficient production of humanoid robots.
Research
Measuring AI Ability to Complete Long Tasks - AI performance, measured by the length of tasks it can complete, has been exponentially increasing with a doubling time of around 7 months, suggesting that within a few years, AI could autonomously handle tasks currently requiring weeks of human effort.
EXAONE Deep: Reasoning Enhanced Language Models - EXAONE Deep models, developed by LG AI Research, are fine-tuned for enhanced reasoning tasks using techniques like Supervised Fine-Tuning, Direct Preference Optimization, and Online Reinforcement Learning, outperforming several existing models across different scales.
Vamba: Understanding Hour-Long Videos with Hybrid Mamba-Transformers - Vamba, a hybrid Mamba-Transformer model, enhances hour-long video understanding by reducing computational complexity and memory usage through efficient modules like Mamba-2 blocks and cross-attention layers, achieving superior performance on benchmarks such as LVBench.
FlowTok: Flowing Seamlessly Across Text and Image Tokens - FlowTok introduces a streamlined framework for seamless flow matching between text and image tokens, achieving efficient and state-of-the-art multimodal generation without complex conditioning mechanisms.
CoRe^2: Collect, Reflect and Refine to Generate Better and Faster - CoRe^2 is a novel, plug-and-play sampling framework that enhances generative models' performance by efficiently refining image quality and semantic faithfulness without being architecture-specific, achieving superior results across various benchmarks.
Sample, Scrutinize and Scale: Effective Inference-Time Search by Scaling Verification - Scaling up sampling-based search with random sampling and self-verification enhances model performance, revealing that larger response pools improve verification accuracy and highlighting the need for better out-of-box verification capabilities in frontier models.
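To put the 7-month doubling time from "Measuring AI Ability to Complete Long Tasks" in concrete terms, here is a minimal back-of-the-envelope sketch. Only the 7-month figure comes from the summary above. The 1-hour starting horizon and the 160-hour target (roughly four 40-hour work weeks) are illustrative assumptions, not numbers from the paper, and the function names are hypothetical.

```python
import math

# From the summary above: task length doubles roughly every 7 months.
DOUBLING_MONTHS = 7.0


def growth_factor(months: float, doubling_months: float = DOUBLING_MONTHS) -> float:
    """Multiplier on achievable task length after `months` of steady doubling."""
    return 2.0 ** (months / doubling_months)


def months_until(current_hours: float, target_hours: float,
                 doubling_months: float = DOUBLING_MONTHS) -> float:
    """Months needed to grow from current_hours to target_hours at the given doubling time."""
    return doubling_months * math.log2(target_hours / current_hours)


if __name__ == "__main__":
    # Assumed values for illustration only (not taken from the paper).
    start_hours = 1.0     # hypothetical current task horizon
    target_hours = 160.0  # hypothetical target: about four 40-hour work weeks

    months = months_until(start_hours, target_hours)
    print(f"Growth over 12 months: {growth_factor(12):.2f}x")
    print(f"Months from {start_hours:g}h to {target_hours:g}h tasks: {months:.1f}")
```

Under these assumed numbers the extrapolation lands at a little over four years, which matches the "within a few years" framing, though it only holds if the trend continues unchanged.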
Concerns
ChatGPT hit with privacy complaint over defamatory hallucinations - OpenAI faces a privacy complaint in Europe over ChatGPT's generation of false and defamatory information, highlighting concerns about compliance with GDPR's accuracy requirements and the potential reputational damage caused by AI hallucinations.
Policy
Ben Stiller, Mark Ruffalo and More Than 400 Hollywood Names Urge Trump to Not Let AI Companies ‘Exploit’ Copyrighted Works - Hollywood creative leaders are urging the Trump administration to maintain strong copyright protections against AI companies like OpenAI and Google, which seek to use copyrighted works for AI training without permission or compensation.
A.I. Art Generated With Text Prompts Cannot Be Copyrighted, U.S. Rules - Art generated by artificial intelligence (A.I.) from a text prompt cannot be copyrighted even if an artist uses long, targeted inputs or creates multiple iterations of a work before they are satisfied with the final output, according to new guidance from the U.S. Copyright Office.