Donna AI · Friday, April 17, 2026 · 12:01 PM · No. 200

Intellēctus

Your Daily Artificial Intelligence Gazette



AI Daily Briefing — April 17, 2026

Compute scarcity, murky data provenance, and a wave of Opus 4.7 growing pains are dominating the conversation today. Researchers are stress-testing LLM judges, pushing multimodal agents into the browser, and questioning whether bigger models actually generalize better — while developers wrestle with a new Claude release that's both more powerful and more finicky in practice.


Industry Moves

The beginning of scarcity in AI — Tom Tunguz argues we're entering a new phase where compute constraints, not model quality, become the primary bottleneck for AI deployment. The post frames 2026 as the year the "infinite scale" assumption breaks down, with real implications for how companies architect AI-dependent products.

AI companies are buying the Slack data of failed startups — A quietly alarming data provenance story: as startups fold, their internal Slack archives are apparently being acquired by AI training outfits. This raises serious questions about consent, confidentiality, and the long tail of enterprise data that employees assumed was ephemeral.

Reese Witherspoon Doubles Down on Telling Women to Learn AI — Witherspoon is making the rounds urging women to upskill in AI, citing research that jobs disproportionately held by women are three times more likely to be automated. Whether celebrity-driven awareness translates to systemic change remains an open question, but the signal is reaching mainstream audiences.


Research Papers

MM-WebAgent: A Hierarchical Multimodal Web Agent for Webpage Generation — Researchers introduce a hierarchical agent architecture that integrates AIGC tools to generate full webpages — images, visualizations, and layout — on demand. The multimodal pipeline bridges the gap between design intent and deployable output in a way that prior text-only agents couldn't manage.

Diagnosing LLM Judge Reliability: Conformal Prediction Sets and Transitivity Violations — LLM-as-judge pipelines get a rigorous diagnostic treatment here, using conformal prediction to surface per-instance unreliability and exposing transitivity violations (e.g., the judge prefers A over B and B over C, yet C over A). For teams using automated eval in production, this is required reading — your judge may be less consistent than you think.
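The transitivity check at the heart of the paper is easy to run on your own judge outputs. A minimal sketch (the function and the example preferences are illustrative, not the paper's code): collect the judge's winner for every pair, then scan triples for cycles.

```python
from itertools import combinations

def transitivity_violations(prefs):
    """Find intransitive 3-cycles in pairwise judge preferences.

    prefs: dict mapping (a, b) -> winner, for each judged pair.
    Returns triples (a, b, c) where the judge prefers a over b
    and b over c, yet c over a -- an inconsistency no single
    underlying quality ranking can explain.
    """
    def wins(x, y):
        return prefs.get((x, y), prefs.get((y, x))) == x

    items = sorted({i for pair in prefs for i in pair})
    violations = []
    for a, b, c in combinations(items, 3):
        # Check both orientations of a possible 3-cycle.
        if wins(a, b) and wins(b, c) and wins(c, a):
            violations.append((a, b, c))
        elif wins(b, a) and wins(c, b) and wins(a, c):
            violations.append((a, c, b))
    return violations

# Hypothetical judge verdicts over three responses A, B, C:
prefs = {("A", "B"): "A", ("B", "C"): "B", ("A", "C"): "C"}
print(transitivity_violations(prefs))  # → [('A', 'B', 'C')]
```

If this list is non-empty on a sample of your eval pairs, ranking-based leaderboards built on that judge rest on shaky ground.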

Context Over Content: Exposing Evaluation Faking in Automated Judges — A companion concern to the above: this paper shows LLM judges can be manipulated by contextual framing rather than evaluating actual response quality. The implication is that automated eval pipelines may be systematically gameable, undermining benchmarks that rely on them.

Generalization in LLM Problem Solving: The Case of the Shortest Path — Using shortest-path problems as a controlled testbed, this paper disentangles how training data, training paradigms, and inference strategies each contribute to (or undermine) systematic generalization. The verdict is nuanced: LLMs generalize in some regimes but fail predictably in others, and knowing which is which matters for deployment.
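Shortest-path problems make a good testbed precisely because ground truth is cheap to compute: generate a graph, ask the model for a route, and score its answer against a BFS optimum. A minimal sketch of that harness's scoring side (this is a generic illustration, not the paper's setup):

```python
from collections import deque

def bfs_shortest_path(edges, start, goal):
    """Ground-truth shortest path (fewest hops) in an undirected graph."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, []).append(u)
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in adj.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None  # goal unreachable

# A small probe instance; grading an LLM's answer reduces to
# comparing its path length against the BFS optimum.
edges = [(0, 1), (1, 2), (2, 3), (0, 4), (4, 3)]
print(bfs_shortest_path(edges, 0, 3))  # → [0, 4, 3]
```

Varying graph size, density, or node-label distribution between training and test is what lets the paper separate memorization from systematic generalization.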

CoopEval: Benchmarking Cooperation-Sustaining Mechanisms and LLM Agents in Social Dilemmas — Stronger reasoning models turn out to be worse at cooperation in social dilemma settings — a counterintuitive finding with real implications for multi-agent system design. The CoopEval benchmark gives researchers a structured way to measure this tradeoff.

Agentic Microphysics: A Manifesto for Generative AI Safety — This paper argues that as agents acquire planning, memory, tool use, and persistent identity, safety research needs a fundamentally different methodology — one grounded in the fine-grained dynamics of agent behavior rather than aggregate benchmarks. A provocative framing worth engaging with seriously.


Claude Code Developer Corner

Boris Cherny's 6 Post-4.7 Tips — The creator of Claude Code dropped six new best-practice tips following the Opus 4.7 release, compiled in the community claude-code-best-practice repo. Key themes: context bloat is a real problem with 4.7 (it's burning tokens faster than 4.6), and skill/tool hygiene matters more now than before. If your agent sessions are getting expensive fast, auditing your installed skills and trimming context is the first lever to pull.

Closed-Loop Hardware with MCP: SPICE + Oscilloscope — A compelling real-world demo: Lucas Gerads built MCP servers for his oscilloscope and SPICE simulator, letting Claude Code close the loop between circuit simulation and physical hardware verification autonomously. This is a strong proof-of-concept for Claude Code as an EE lab assistant — the agent runs simulations, reads real scope traces, and iterates without human handholding.

Opus 4.7 Field Reports: Instruction-Following Regression — Multiple developers report that 4.7 adheres to CLAUDE.md rules less reliably than 4.6, with one reproducible test showing the model ignoring explicit do-not-touch directives (e.g., "don't edit alembic migrations without asking"). Separately, another user reports 4.7 failing to follow multi-agent orchestration instructions, skipping the prescribed read-then-delegate workflow. Practical note: if you rely heavily on CLAUDE.md for guardrails, budget extra testing time after upgrading to 4.7 and consider making constraints more explicit or redundant.
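One concrete way to make a constraint "more explicit or redundant" is to state it both as a standalone hard rule and again inside the workflow section it guards, so the model re-encounters it at the point of action. A purely illustrative CLAUDE.md fragment (section names and the `migrations/` path are made up for the example):

```markdown
## Hard constraints

- NEVER edit files under `migrations/` without asking first.

## Database workflow

1. Before touching any schema code, re-read the hard constraints above.
2. If a change requires a migration, STOP and ask — do not edit
   `migrations/` yourself. (Restating the rule here is deliberate.)
```

Redundancy costs a few tokens but, per these field reports, may be what keeps a guardrail intact across an upgrade.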

Token Consumption & Skill Bloat — The Opus 4.7 token burn rate is meaningfully higher than 4.6, and community members confirm that bloated context from installed skills is a primary culprit. Recommendation from the Boris Cherny tips: audit skills ruthlessly, remove anything not actively used, and keep CLAUDE.md lean.
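A quick audit pass can be scripted. The sketch below assumes the default personal skills location (`~/.claude/skills`, with one `SKILL.md` per skill directory) and a rough ~4-bytes-per-token heuristic; adjust the path for project-level skills or a non-standard install.

```shell
# Estimate how much context each installed skill contributes.
# Assumes ~/.claude/skills/<name>/SKILL.md; adjust to your layout.
skills_dir="${1:-$HOME/.claude/skills}"

estimate_tokens() {  # rough heuristic: ~4 bytes per token
  echo $(( $(wc -c < "$1") / 4 ))
}

for f in "$skills_dir"/*/SKILL.md; do
  [ -f "$f" ] || continue
  printf '%8d tokens  %s\n' "$(estimate_tokens "$f")" "$f"
done | sort -rn
```

Anything near the top of that list that you aren't actively using is a candidate for removal.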


Multi-Agent Systems

Chaos Monkey for Agent Systems — A practitioner building multi-agent systems in production describes a chaos monkey framework for agents — deliberately injecting failures to surface brittleness before customers do. The thread is a useful gathering point for anyone hitting real-world reliability issues in agent orchestration; the community is sharing hard-won patterns.
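The core pattern is simple to prototype: wrap each agent tool so that, with some probability, the call fails in a realistic way (timeout, error, or plausible-but-wrong output) instead of succeeding. A minimal sketch under those assumptions — the failure modes and function names are illustrative, not from the thread:

```python
import random

FAILURE_MODES = ("timeout", "error", "garbage")

def chaos_wrap(tool_fn, failure_rate=0.1, rng=None):
    """Wrap an agent tool so calls randomly fail, surfacing brittleness.

    With probability `failure_rate` the wrapper raises or corrupts the
    result instead of calling the real tool; otherwise it passes through.
    """
    rng = rng or random.Random()

    def wrapped(*args, **kwargs):
        if rng.random() < failure_rate:
            mode = rng.choice(FAILURE_MODES)
            if mode == "timeout":
                raise TimeoutError("injected: tool timed out")
            if mode == "error":
                raise RuntimeError("injected: tool returned an error")
            return ""  # garbage: empty/wrong payload, no exception
        return tool_fn(*args, **kwargs)

    return wrapped

# Seed the RNG so injected failures are reproducible in CI runs.
flaky_search = chaos_wrap(lambda q: f"results for {q}",
                          failure_rate=0.5,
                          rng=random.Random(0))
```

Running an orchestration suite against chaos-wrapped tools tells you whether retries, fallbacks, and inter-agent handoffs actually hold up before a customer finds out.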

How to Use AI to Do Real Science — A thoughtful post pushing back on the "AI as answer machine" pattern, arguing that using AI to shortcut to conclusions undermines scientific understanding. The framing — use AI to stress-test hypotheses, not confirm them — is worth internalizing for any serious research workflow.


Worth Watching

  • Mac vs. custom 5090 for ML workloads — A practical hardware decision thread for practitioners doing 70% fine-tuning / 30% from-scratch training on image/video data. The RTX 5090 camp has strong arguments for raw VRAM and CUDA ecosystem; the Mac camp points to unified memory and MPS for certain workloads. Worth reading if you're due for a hardware refresh.

  • RadAgent: AI for CT Interpretation — A tool-using VLM agent for stepwise chest CT interpretation that keeps clinicians meaningfully in the loop rather than black-boxing the diagnosis. An early but serious attempt at agentic medical imaging that doesn't just bolt an LLM onto a DICOM viewer.

  • Apple subscription markup — Routine reminder: Claude subscriptions purchased via Apple are ~30% more expensive due to Apple's cut. Subscribe directly through the web if you're paying out of pocket.

  • Substrate AI hiring Harness Engineers — YC-backed Substrate AI is building out its team for AI workload orchestration infrastructure. Worth a look if you're interested in the picks-and-shovels layer of the AI stack.

  • Prism: Symbolic Superoptimization of Tensor Programs — The first symbolic superoptimizer for tensor programs, using a hierarchical symbolic graph representation (sGraph) to search over large classes of equivalent programs. Potentially significant for inference efficiency if it scales.


Sources

  • The beginning of scarcity in AI — https://tomtunguz.com/ai-compute-crisis-2026/
  • AI companies are buying the Slack data of failed startups — https://twitter.com/_iainmartin/status/2044758204773486925
  • Reese Witherspoon Doubles Down on Telling Women to Learn AI — https://variety.com/2026/tv/news/reese-witherspoon-ai-jobs-women-1236723992/
  • MM-WebAgent: A Hierarchical Multimodal Web Agent for Webpage Generation — http://arxiv.org/abs/2604.15309v1
  • Diagnosing LLM Judge Reliability: Conformal Prediction Sets and Transitivity Violations — http://arxiv.org/abs/2604.15302v1
  • Context Over Content: Exposing Evaluation Faking in Automated Judges — http://arxiv.org/abs/2604.15224v1
  • Generalization in LLM Problem Solving: The Case of the Shortest Path — http://arxiv.org/abs/2604.15306v1
  • CoopEval: Benchmarking Cooperation-Sustaining Mechanisms and LLM Agents in Social Dilemmas — http://arxiv.org/abs/2604.15267v1
  • Agentic Microphysics: A Manifesto for Generative AI Safety — http://arxiv.org/abs/2604.15236v1
  • 06 New Claude Code Tips from Boris Cherny after Opus 4.7 release — https://www.reddit.com/gallery/1snn4ed
  • Show HN: SPICE simulation → oscilloscope → verification with Claude Code — https://lucasgerads.com/blog/lecroy-mcp-spice-demo/
  • 4.7 follows CLAUDE.md rules worse than 4.6 — https://reddit.com/r/ClaudeAI/comments/1snlp17/47_follows_claudemd_rules_worse_than_46_and_i/
  • Disappointed on Opus 4.7, not follow user's instruction — https://reddit.com/r/ClaudeAI/comments/1snq7yq/disappointed_on_opus_47_not_follow_users/
  • Top Claude skills for Opus 4.7 after cleaning up my install — https://reddit.com/r/ClaudeAI/comments/1snreri/top_claude_skills_for_opus_47_after_cleaning_up/
  • Looking for help from people who built multi Agents systems — https://reddit.com/r/MachineLearning/comments/1snsjrp/looking_for_help_from_people_who_built_multi/
  • How to Use AI to Do Real Science — https://reddit.com/r/artificial/comments/1snmqfw/how_to_use_ai_to_do_real_science/
  • Which computer should I buy: Mac or custom-built 5090? — https://reddit.com/r/MachineLearning/comments/1snqzq9/which_computer_should_i_buy_mac_or_custombuilt/
  • RadAgent: A tool-using AI agent for stepwise interpretation of chest computed tomography — http://arxiv.org/abs/2604.15231v1
  • TIL that subscriptions via Apple is 30% more expensive — https://reddit.com/r/ClaudeAI/comments/1snq9s2/til_that_subscriptions_via_apple_is_30_more/
  • Substrate AI Is Hiring Harness Engineers — https://www.ycombinator.com/companies/substrate/jobs/QJU9023-harness-engineer
  • Prism: Symbolic Superoptimization of Tensor Programs — http://arxiv.org/abs/2604.15272v1