Intellēctus — AI Daily Briefing

Today's digest is lighter on blockbuster announcements and heavier on research depth — a good day to catch up on the fundamentals shaping where AI is headed. A few community signals worth noting: a leaked Opus 4.7 sighting on Vertex, ongoing ICML drama, and developer chatter about Claude Code's roadmap.

Industry Moves

Google ships a Gemini Mac app as the race to own the desktop AI assistant layer heats up. The Gemini macOS client currently mirrors web functionality but Gemini Live integration is reportedly on the way — a signal that every major LLM lab is now treating native OS presence as table stakes. Meanwhile, a screenshot circulating on Reddit shows what appears to be Opus 4.7 listed on Google Vertex AI, suggesting Anthropic's next flagship model tier may be closer to release than any official announcement has indicated.

Anthropic's agentic coding report — an 18-page study — is getting attention for a striking data point: developers already invoke AI in ~60% of their work, but fully delegate less than 10% of tasks end-to-end. The gap between "AI-assisted" and "AI-autonomous" remains wide, and closing it appears to be the central challenge Anthropic is designing around with Claude Code.

LLM Research & Benchmarks

A new paper introduces LongCoT, a benchmark specifically designed to stress-test long-horizon chain-of-thought reasoning — addressing a real gap, since most existing evals don't capture how models degrade when planning across many steps. Separately, researchers formally studied "vibe-testing" — the informal, intuition-driven way most practitioners actually evaluate LLMs — finding it has consistent internal logic that could be formalized, potentially bridging the gap between benchmark scores and real-world usefulness.

On the training side, a new paper explores RLVR in "pre-train space" — arguing that optimizing the conditional distribution P(y|x) is fundamentally bounded by what the base model already knows, and proposing methods to push into the unconditional distribution P(y) to unlock genuinely novel capabilities. And TREX proposes a tree-based agent framework for automating LLM fine-tuning workflows end-to-end, tackling one of the more tedious bottlenecks in applied ML research.

Political & Safety Benchmarks

A community researcher built a political compass benchmark for frontier LLMs, using 98 structured questions across 14 policy dimensions. Notable findings: Kimi K2 refuses to answer questions about Taiwan, and GPT-5.3 refuses 100% of questions when given an explicit opt-out — raising real questions about how refusal behavior maps onto geopolitical and ideological blind spots in deployed models.

Robotics & Embodied AI

Two papers push the frontier of vision-language-action models for physical manipulation. HiVLA proposes a hierarchical architecture that decouples high-level reasoning from low-level control, preventing the common failure mode where fine-tuning on narrow robot data destroys general reasoning ability. UMI-3D extends the Universal Manipulation Interface with 3D spatial perception via wrist-mounted multimodal sensing, making portable real-world data collection significantly more robust for embodied learning pipelines.

AI in Education

Sal Khan reflects on why Khanmigo hasn't yet triggered the AI revolution in schools he predicted — pointing to structural barriers: device access gaps, teacher training deficits, and institutional resistance rather than any failure of the technology itself. It's a useful reality check for anyone who assumed the hard part was building the model.

Claude Code Developer Corner

Community discussion this week surfaced an interesting question: why isn't Anthropic using its internal "Mythos" system to auto-fix Claude Code bugs? The thread is speculative, but it touches on a real tension in agentic coding tools — using AI to build AI tooling requires the same trust and validation infrastructure you're still constructing. No official response from Anthropic, but the question reflects growing developer expectations that Claude Code should be eating its own cooking more aggressively.

The broader context from Anthropic's agentic coding report: the company is clearly treating the gap between AI-assisted and AI-autonomous development as a core product problem. Developers using Claude Code at the 60% assistance level but sub-10% full delegation suggests the tooling, trust, and workflow integration aren't yet there — expect upcoming releases to target that handoff boundary directly.

Worth Watching

PPO policy collapse with multi-timescale advantages: An undergrad researcher documents a subtle failure mode when dynamically routing advantages across different discount factors (γ = 0.5 through 0.999) in PPO, and proposes a decoupled fix. Practical reading for anyone doing RL fine-tuning.
Local models for cost control: A community thread surfaces the real cost breakdown of production LLM use — retries, long context, background evals, and tool calls add up fast. Local model routing is becoming a serious architectural consideration, not just a hobbyist choice.
ICML 2026 review score volatility: Researchers are reporting score fluctuations after rebuttals on OpenReview — scores going up after acknowledged concerns, then mysteriously dropping again. The academic review process continues to be a source of anxiety and opacity.
Claude's sleep nags: Users are noticing Claude proactively telling them to go to bed mid-session. Whether this is a feature or a quirk of Constitutional AI's wellness framing is unclear — but it's generating strong opinions.
Neuromorphic computing without neural nets: A Zenodo preprint proposes a "Universal Constraint Engine" for neuromorphic hardware that sidesteps traditional neural network architectures entirely. Early-stage but worth bookmarking for those watching post-GPU compute paradigms.

Sources

Why Sal Khan's AI revolution hasn't happened yet, according to Sal Khan — https://www.chalkbeat.org/2026/04/09/sal-khan-reflects-on-ai-in-schools-and-khanmigo/
Google Released Gemini Mac APP — https://reddit.com/r/artificial/comments/1smsonq/google_released_gemini_mac_app/
Opus 4.7 spotted on Google Vertex — https://i.redd.it/t93hibcrygvg1.png
Read through Anthropic's 2026 agentic coding report, a few numbers that stuck with me — https://i.redd.it/nmj774tylhvg1.jpeg
LongCoT: Benchmarking Long-Horizon Chain-of-Thought Reasoning — http://arxiv.org/abs/2604.14140v1
From Feelings to Metrics: Understanding and Formalizing How Users Vibe-Test LLMs — http://arxiv.org/abs/2604.14137v1
From P(y|x) to P(y): Investigating Reinforcement Learning in Pre-train Space — http://arxiv.org/abs/2604.14142v1
TREX: Automating LLM Fine-tuning via Agent-Driven Tree-based Exploration — http://arxiv.org/abs/2604.14116v1
Built a political benchmark for LLMs. KIMI K2 can't answer about Taiwan — https://reddit.com/r/MachineLearning/comments/1smqsbu/built_an_political_benchmark_for_llms_kimi_k2/
HiVLA: A Visual-Grounded-Centric Hierarchical Embodied Manipulation System — http://arxiv.org/abs/2604.14125v1
UMI-3D: Extending Universal Manipulation Interface from Vision-Limited to 3D Spatial Perception — http://arxiv.org/abs/2604.14089v1
Why don't they just use Mythos to fix all the bugs in Claude Code? — https://reddit.com/r/ClaudeAI/comments/1smkgoz/why_dont_they_just_use_mythos_to_fix_all_the_bugs/
Why dynamically routing multi-timescale advantages in PPO causes policy collapse — https://reddit.com/r/MachineLearning/comments/1smr52p/why_dynamically_routing_multitimescale_advantages/
Anyone here using local models mainly to keep LLM costs under control? — https://reddit.com/r/artificial/comments/1smp6u3/anyone_here_using_local_models_mainly_to_keep_llm/
[ICML 2026] Scores increased and then decreased!! — https://reddit.com/r/MachineLearning/comments/1smv0rq/icml_2026_scores_increased_and_then_decreased_d/
Why does Claude keep telling me to sleep? — https://reddit.com/r/ClaudeAI/comments/1smtoh1/why_does_claude_keep_telling_me_to_sleep/
The Universal Constraint Engine: Neuromorphic Computing Without Neural Networks — https://zenodo.org/records/19600206