LLM Myths, AI Agents in Production, and Rethinking Data ROI

Biweekly Data & Analytics Digest: Cliffside Chronicle

The Truth About Long-Context LLMs

The NOLIMA study reveals what many practitioners have witnessed in real deployments: LLMs struggle significantly with long contexts despite claims to the contrary. Traditional tests allow models to succeed by simply matching words between questions and answers. For example, asking "Who was the first person to land on Mars?" when the text contains "Sarah Chen became the first person to land on Mars" only tests word-matching, not understanding.
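To make that concrete, here is a toy sketch (ours, not from the NOLIMA paper) showing how naive word-overlap scoring finds the "needle" without any understanding; the haystack sentences are invented for illustration:

```python
# Toy demonstration: literal word overlap is enough to find the "needle"
# when the question shares the needle's wording -- no reasoning required.
haystack = [
    "The conference was held in Geneva in 2019.",
    "Sarah Chen became the first person to land on Mars.",
    "Coffee exports rose sharply last spring.",
]
question = "Who was the first person to land on Mars?"

q_words = set(question.lower().rstrip("?").split())
best = max(
    haystack,
    key=lambda s: len(q_words & set(s.lower().rstrip(".").split())),
)
print(best)  # -> the Sarah Chen "needle" wins on pure word overlap

# A NOLIMA-style rewording ("Which person has visited another planet?")
# shares almost no words with the needle, so overlap scoring breaks down
# and the model must actually understand the text.
```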

The results are sobering:

  • Even top models like GPT-4o drop from 99.3% to 69.7% accuracy at 32K tokens

  • Specialized reasoning models fare poorly at length (o3-mini: 18.9%, DeepSeek-R1: 20.7% at 32K)

  • Most models struggle beyond 4K tokens

Practical takeaways:

  • RAG remains essential for reliable AI systems

  • Breaking content into smaller chunks improves performance (see the sketch below)

  • When evaluating LLMs, test their ability to reason, not just match words

This research validates the hybrid approaches many teams have adopted rather than relying solely on expanding context windows.
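To make the chunking takeaway concrete, here is a minimal sketch of splitting documents into small, overlapping pieces that stay well inside the range where models remain reliable; the chunk sizes are illustrative assumptions, not values from the study:

```python
def chunk_text(text: str, chunk_size: int = 300, overlap: int = 50) -> list[str]:
    """Split text into overlapping word-based chunks for RAG indexing."""
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the final chunk already covers the tail
    return chunks
```

Each chunk can then be embedded and indexed so the model reasons over a few hundred tokens at a time instead of a single 32K-token context.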

From Pilots to Production: The Power of AI Agents

Databricks has unveiled a suite of tools designed to help enterprises confidently scale AI agents from pilot to production environments. The new offerings include:

  • Centralized governance for all AI models through Mosaic AI Gateway, with custom LLM provider support

  • Simplified integration into existing workflows via the AI/BI Genie Conversational API (a hypothetical call sketch follows below)

  • Streamlined human-in-the-loop feedback through an upgraded Agent Evaluation Review App

  • Infrastructure-free batch inference

These advancements enable organizations to deploy AI agents in high-value, mission-critical applications while ensuring proper governance and ease of use, ultimately helping businesses overcome the confidence barrier that has prevented many from fully realizing AI's potential beyond pilot phases.
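For a flavor of what the conversational integration might look like, here is a hypothetical sketch of calling the Genie API over REST; the endpoint path, payload fields, and environment variable names below are assumptions for illustration, so check the Databricks REST API reference for the authoritative contract:

```python
import os
import requests

# Placeholders -- set these for your workspace.
host = os.environ["DATABRICKS_HOST"]     # e.g. https://<workspace>.cloud.databricks.com
token = os.environ["DATABRICKS_TOKEN"]   # personal access token
space_id = os.environ["GENIE_SPACE_ID"]  # the Genie space to query

# Hypothetical request shape; verify the path and fields in the API docs.
resp = requests.post(
    f"{host}/api/2.0/genie/spaces/{space_id}/start-conversation",
    headers={"Authorization": f"Bearer {token}"},
    json={"content": "What were last quarter's top five products by revenue?"},
    timeout=30,
)
resp.raise_for_status()
print(resp.json())  # conversation/message identifiers to poll for the answer
```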

Shattering the Data ROI Illusion: Why Traditional Metrics Miss the Mark

In this post, Hex’s McCardel tackles the challenging question many data leaders face: how to measure the ROI of data teams. Drawing on countless conversations with data leaders, he argues that attempting to quantify data team ROI through complex metrics and "insight logs" is ultimately futile and counterproductive.

Instead of trying to calculate ROI directly, McCardel suggests thinking about data team value through what he calls an "ROI as NPS" approach: the true measure of a data team's impact is what its stakeholders say about it. If business partners value the data team's contributions, they should be advocating for more resources on the team's behalf.

McCardel acknowledges this requires embracing a service mindset, which some data leaders resist. However, he emphasizes that data teams exist fundamentally to help other functions succeed through better decision-making. The ROI of data work "rounds to zero" unless it inspires action from stakeholders. When data teams focus on delivering obvious impact, they won't need spreadsheets to justify their existence—their partners will champion them instead.

From GPT-2 to DeepSeek R1: Andrej Karpathy’s Insider Tour of How LLMs Really Work

In this thorough, plain-language walkthrough, Andrej Karpathy—a widely respected AI expert with stints leading Tesla’s Autopilot vision team and shaping OpenAI’s early research—traces the evolution of large language models from GPT-2 all the way to cutting-edge systems like DeepSeek R1.

He offers a “mental model” for understanding LLMs: first they’re trained on colossal internet text datasets (pre-training) to acquire general knowledge, then refined with human-annotated conversation data (supervised fine-tuning), and finally improved through trial-and-error problem solving (reinforcement learning), which can make them appear to “think out loud.”
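Karpathy's talk stays conceptual, but the pre-training objective he describes (predict the next token from raw text) can be illustrated with a toy bigram model; this sketch and its tiny corpus are ours, not his:

```python
import random
from collections import Counter, defaultdict

corpus = "the cat sat on the mat . the dog sat on the rug .".split()

# "Pre-training": count how often each token follows each other token.
bigrams: defaultdict[str, Counter] = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigrams[prev][nxt] += 1

# "Inference": repeatedly sample a likely continuation, one token at a time.
token = "the"
output = [token]
for _ in range(6):
    followers = bigrams[token]
    token = random.choices(list(followers), weights=list(followers.values()))[0]
    output.append(token)
print(" ".join(output))  # e.g. "the cat sat on the mat ."
```

Real LLMs replace the bigram table with a transformer over billions of parameters, but the training objective is the same next-token prediction.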

Throughout, Karpathy shows why these models can oscillate between genius outputs and bizarre errors, stressing that users should see them as powerful—but still imperfect—tools. By highlighting the techniques that mitigate mistakes (e.g. tool use for math or searching the web) and unveiling the trajectory from older GPT-style systems to next-gen models like DeepSeek R1, he paints a credible insider’s view of how LLMs work now and where they’re headed.

Seven Hard-Earned Lessons from a Year of Building AI Agents

In this candid reflection, Maya Murad shares the real-world challenges and pivotal lessons her team learned while developing AI Agents over the course of a year. From discovering the pitfalls of purely end-to-end approaches to figuring out how to “chain” multiple smaller tasks, this post offers valuable firsthand guidance for anyone striving to deploy agents in dynamic, complex environments.

This article is packed with actionable insights into unifying agent states, setting realistic expectations, and creating evaluation frameworks that keep AI projects on track. Whether you’re just starting out or already knee-deep in advanced AI initiatives, these lessons are a must-read for navigating the messy reality of building and scaling AI Agents.
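As a flavor of the "chain smaller tasks" idea, here is a minimal sketch (not Murad's code) of small steps reading and writing one unified state dict; call_llm is a placeholder for whatever model client you actually use:

```python
from typing import Callable

def call_llm(prompt: str) -> str:
    """Placeholder model call; swap in a real client here."""
    return f"<response to: {prompt!r}>"

def plan(state: dict) -> dict:
    state["plan"] = call_llm(f"Break this goal into steps: {state['goal']}")
    return state

def execute(state: dict) -> dict:
    state["result"] = call_llm(f"Carry out this plan: {state['plan']}")
    return state

def review(state: dict) -> dict:
    state["review"] = call_llm(f"Check this result for errors: {state['result']}")
    return state

# The agent is an ordered chain of small steps over one shared state --
# easier to evaluate and debug than a single end-to-end prompt.
steps: list[Callable[[dict], dict]] = [plan, execute, review]
state = {"goal": "summarize last week's support tickets"}
for step in steps:
    state = step(state)
print(state)
```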

Blog Spotlight: Azure Container Apps

Container technology has reshaped the way organizations build, run, and manage applications. Containerized solutions allow teams to develop highly scalable, consistent, and efficient microservices, a major game changer for modern software delivery pipelines. Azure Container Apps (ACA) offers a fully managed, serverless experience that lets you reap the benefits of container architectures without wrestling with the complexities of Kubernetes.

What topics interest you most in AI & Data?

We’d love your input to help us better understand your needs and prioritize the topics that matter most to you in future newsletters.


It is a capital mistake to theorize before one has data.

~ Sherlock Holmes in “A Study in Scarlet” by Arthur Conan Doyle