Open-weights frontier model crosses MMLU 92
A new community release matches closed-frontier benchmarks while shipping under a permissive license, intensifying the open-versus-closed debate.
A curated weekly digest of what's shipping and shifting.
Providers must now disclose training data summaries and energy usage, with phased compliance starting Q4.
Tool-using agents continue rapid gains on real-world software engineering tasks, narrowing the gap with junior engineers.
GPT-5 merges fast and deep reasoning behind a single endpoint, automatically allocating compute per query and setting new SOTA on AIME and GPQA.
Claude 4 Opus sustains multi-hour autonomous coding sessions and tops SWE-bench Verified at 72%, redefining expectations for agentic workloads.
B200 Ultra and GB300 NVL72 racks begin volume shipments to hyperscalers, with FP4 training cutting frontier-model costs significantly.
Llama 4 family launches with MoE variants up to 2T total parameters, native image and audio understanding, and a more permissive license.
A 671B-parameter MoE matches GPT-4o on many benchmarks while reportedly trained for under $6M, reshaping the cost narrative for frontier AI.
o1 spends compute thinking before answering, dramatically improving math, science, and code reasoning and opening a new scaling axis beyond pretraining.
OpenAI's omni model handles text, vision, and realtime audio in a single network, with sub-300ms voice latency rivaling human conversation.
Gemini 1.5 Pro debuts a sparse MoE architecture and a 1M-token context window (later expanded to 2M), enabling whole-codebase and feature-film reasoning.
Custom GPTs, the Assistants API, and a 128K-context GPT-4 Turbo launch, marking the start of the mainstream agent platform era.
Llama 2 becomes the first frontier-class open-weights model usable commercially, igniting the open-source LLM ecosystem.
GPT-4 debuts with multimodal input, sharply improved reasoning, and professional-exam-level performance, defining the modern LLM benchmark.
OpenAI's free chat interface to GPT-3.5 becomes the fastest-growing consumer app in history and kicks off the generative AI boom.
A 175B-parameter model performs new tasks from prompts alone, validating scaling laws and triggering the LLM era.
Vaswani et al. publish the Transformer, the architecture that underpins nearly every modern LLM, replacing recurrence with self-attention.