AI: From the Engine Room
AI Mechanisms
01
I've Heard This Engine Before
"The engine room isn't a better vantage point than the bridge—just a different one, with different things visible."
Introduction to the Series
The View from Below Deck
Most AI commentary comes from the bridge—executives announcing strategy, analysts tracking markets, consultants offering frameworks. This series comes from a different vantage point: the engine room.
The engine room is where you hear the machinery. Where the gap between what's promised and what's practical becomes tangible. It's not a better vantage point than the bridge—just a different one, with different things visible.
The engine room isn't a better vantage point than the bridge—just a different one, with different things visible.
I've spent my career in data engineering and ML research, watching waves of technology hype come and go. Some delivered. Some didn't. The patterns are recognizable if you've seen a few cycles.
Why This Series Now
There's a lot of excellent AI coverage available—from researchers explaining breakthroughs to executives sharing implementation stories. What I've found harder to find is the middle layer: practical explanations of how these systems work that connect to real decisions about data, governance, and interfaces.
That's the gap this series tries to fill. Not because other perspectives are wrong, but because this one might be useful to people navigating similar questions.
There's a lot of excellent AI coverage. What's harder to find is the middle layer: how these systems work, connected to real decisions about data and governance.
What You'll Find Here
The series covers three areas over thirteen articles:
AI Mechanisms (Articles 1-5): How attention, training, and context actually work. The goal is intuition, not exhaustive technical detail.
The Proprietary Data Paradox (Articles 6-9): Why data strategy is harder than it looks. Knowledge architecture, tacit expertise, interface design.
Forward-Looking Governance (Articles 10-13): Hallucination, effective prompting, what it all adds up to, and why AI readiness is a governance question.
A Note on Tone
I'll share what I've observed and what I think it means. I'll try to be clear about what's well-established versus what's my interpretation. Reasonable people will disagree with some of this—AI is a fast-moving field where even experts have honest disagreements.
My goal isn't to be the definitive voice on these topics. It's to offer a practitioner perspective that might help you form your own views.
Understanding how something works changes the questions you ask. That's what I'm hoping this series provides—better questions, not final answers.
What AI decision have you made in the last 90 days based on vendor claims you haven't verified? What would it take to pressure-test those assumptions?
The Attention Mechanism Explained
"Attention lets models compute relevance weights dynamically—deciding which inputs matter for each specific output."
What Attention Actually Does
The Core Idea
Before attention mechanisms, neural networks for language had a bottleneck problem. They processed text sequentially, word by word, compressing everything into a fixed-size representation. By the end of a long sentence, early information had degraded.
Attention addressed this by letting models look back at all previous inputs and compute relevance weights dynamically. For each output, the model calculates how much to weight each input—which parts to "attend to" for this particular task.
Attention lets models compute relevance weights dynamically—deciding which inputs matter for each specific output.
This is genuinely useful. It's also worth understanding what it isn't: the model isn't "paying attention" in the way humans do. It's computing weighted combinations of learned representations.
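To make "weighted combinations" concrete, here is a minimal numpy sketch of scaled dot-product attention, the core operation inside a Transformer layer. It is illustrative only; the sizes and random inputs are arbitrary, and real implementations add multiple heads, masking, and learned projections.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """One attention head: each output row is a weighted mix of the V rows.

    Q, K, V: (sequence_length, d) arrays of learned token representations.
    """
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)        # how relevant each input is to each position
    weights = softmax(scores, axis=-1)   # relevance weights, summing to 1 per position
    return weights @ V, weights          # weighted combination, plus the weights themselves

# Toy example: 4 tokens, 8-dimensional representations.
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
output, weights = scaled_dot_product_attention(Q, K, V)
print(weights.round(2))  # each row shows how much one position "attends to" each input
```

The output is literally a weighted combination of the inputs' learned representations, which is all "attending to" means here.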
The Transformer Architecture
The 2017 "Attention Is All You Need" paper showed you could build powerful language models using attention as the core mechanism. That architecture—the Transformer—underlies GPT, Claude, Gemini, and most of the models in current use.
What made Transformers significant: they process sequences in parallel rather than step-by-step, making them faster to train and better at capturing relationships between distant parts of a text.
The Capabilities and Limits
Attention enables impressive capabilities. Models can track pronouns back to their antecedents, maintain coherence across long passages, and pick up on subtle contextual cues. This explains a lot of what makes modern language models feel responsive and contextually aware.
The limits are equally important. Attention operates on learned patterns from training data. It excels at tasks that resemble what it's seen. It struggles with genuinely novel reasoning that requires going beyond those patterns.
This isn't a criticism—it's just how the mechanism works. Knowing this helps calibrate expectations.
Attention excels at tasks resembling what it's seen in training. It struggles with reasoning that requires going beyond those patterns.
Practical Implications
A few things follow from understanding attention:
Impressive demos may not generalize. If the demo task closely matches training patterns, performance can drop on your specific use case.
Context placement matters. Where you put information in a prompt affects how the model weights it.
Fine-tuning shifts attention, not knowledge. This distinction matters more than most realize. Fine-tuning adjusts what patterns the model prioritizes—it can make the model more likely to respond in a particular style, format, or domain vocabulary. But it doesn't add new reasoning capabilities or factual knowledge that wasn't implicit in the base model. If you fine-tune on customer service transcripts, you get better customer service tone; you don't get knowledge of your product catalog unless that was already in the training data. Many projects stumble on this assumption.
Attention is a powerful pattern-matching mechanism. Understanding this helps predict where models will excel and where they might struggle.
For the AI applications you're considering, how similar are your tasks to what models were trained on? Where might the patterns not transfer?
Training Economics
"Compute costs get the headlines, but data quality, talent, and iteration time often determine whether training is viable."
What Training Costs and What It Buys
The Cost Structure
Training a large language model from scratch involves substantial compute costs—tens to hundreds of millions of dollars for frontier models. But compute is just one component.
You also need large, clean datasets (scarce for most domains), ML engineering talent comfortable with distributed training, infrastructure that can handle the workload, and time for iteration and debugging. Each of these has its own constraints.
Compute costs get the headlines, but data quality, talent, and iteration time often determine whether training is viable.
What Training Actually Produces
Training compresses patterns from data into model weights. The model learns statistical relationships between tokens—what tends to follow what, in what contexts.
This compression is lossy. Information gets generalized, blended, and sometimes lost. What emerges is capability within the distribution of training data: the model becomes skilled at tasks resembling what it saw during training.
Training creates both capability and constraint. What the model learns determines what it can do—and what it can't do without additional training.
The Fine-Tuning Option
Most organizations don't need to train from scratch—they can adapt existing models through fine-tuning. This is more accessible, but comes with its own tradeoffs.
Fine-tuning adjusts weights to shift model behavior toward specific tasks. It works well for adapting style, format, and focus. It's less effective at adding genuinely new knowledge or capabilities that weren't implicit in the base model.
There's also the challenge of catastrophic forgetting: intensive fine-tuning on narrow tasks can degrade general capabilities.
Fine-tuning shifts focus and style effectively. It's less effective at adding genuinely new capabilities that weren't in the base model.
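As a rough illustration of "adjusting weights to shift behavior," here is a minimal PyTorch-style sketch that freezes a pretrained base and trains only a small task head. It is a schematic of the idea under invented shapes and names, not a fine-tuning recipe; full fine-tuning and adapter methods such as LoRA differ in detail but share the principle that existing representations are being steered, not replaced.

```python
import torch
import torch.nn as nn

# Stand-ins: 'base' plays the role of a pretrained model body producing features,
# 'head' is the small task-specific layer we actually train.
base = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 256))
head = nn.Linear(256, 4)                 # e.g. four response-style classes

for p in base.parameters():              # freeze the base: its learned patterns stay fixed
    p.requires_grad_(False)

optimizer = torch.optim.AdamW(head.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

def training_step(inputs, labels):
    # inputs: (batch, 512) float tensor; labels: (batch,) class indices.
    with torch.no_grad():
        features = base(inputs)          # representations from the frozen base
    loss = loss_fn(head(features), labels)
    optimizer.zero_grad()
    loss.backward()                      # gradients flow only into the head
    optimizer.step()
    return loss.item()

# Usage: training_step(torch.randn(8, 512), torch.randint(0, 4, (8,)))
```

The head can only recombine what the base already represents, which is the mechanical version of "style and focus shift; new knowledge doesn't appear."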
The RAG Alternative
Retrieval-Augmented Generation takes a different approach: instead of encoding knowledge in weights, keep it external and retrieve relevant information at query time.
This is often more practical for proprietary knowledge—cheaper than training, easier to update, more controllable. But it introduces different challenges: retrieval quality, context limits, and the complexity of knowing what to retrieve.
Consider: a user asks about "the Q3 policy change." Your retrieval system finds documents mentioning "Q3" and "policy"—but the relevant context might be in a Q2 memo that announced changes taking effect in Q3, or in meeting notes that never used the word "policy" but discussed the same operational shift. The semantic gap between how users ask and how information is stored creates retrieval failures that look like model failures.
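Here is a minimal sketch of the retrieve-then-generate flow, with cosine similarity over toy vectors standing in for a vector database and the final model call left as a stub. The `embed` function, documents, and names are placeholders, not any specific library's API.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Placeholder embedding; a real system would call an embedding model here."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.normal(size=384)
    return v / np.linalg.norm(v)

documents = [
    "Q2 memo: operational changes announced, taking effect in Q3.",
    "Meeting notes: discussion of the shift in the approval workflow.",
    "Unrelated HR announcement about parking.",
]
doc_vectors = np.stack([embed(d) for d in documents])

def retrieve(query: str, k: int = 2) -> list[str]:
    scores = doc_vectors @ embed(query)         # cosine similarity (vectors are unit length)
    top = np.argsort(scores)[::-1][:k]
    return [documents[i] for i in top]

def answer(query: str) -> str:
    context = "\n".join(retrieve(query))
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
    return prompt                               # a real system would send this to a language model

print(answer("What was the Q3 policy change?"))
```

Note where the failure mode lives: everything depends on whether `retrieve` ranks the Q2 memo highly for a "Q3 policy" question, which is exactly the semantic gap described above.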
We'll explore this more deeply in Pillar II, where we examine why structured knowledge often outperforms simple vector similarity.
Ongoing Costs
Training is a one-time cost; inference (running the model) is ongoing. For many deployments, inference costs over the model's lifetime exceed training costs significantly.
This shifts the economics: a more expensive model to train might be cheaper overall if it requires fewer inference calls to achieve the same results.
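A back-of-the-envelope comparison, using entirely invented numbers, shows how inference volume can dominate the total. Nothing here reflects real pricing; the point is the shape of the arithmetic.

```python
# Hypothetical two-year comparison. All figures are invented for illustration.
tasks_per_day = 50_000
days = 365 * 2

# Option A: cheap to adapt, but each task needs several calls (retries, decomposition).
# Option B: costlier up front, one call per task at a slightly higher unit price.
option_a = {"upfront": 20_000,  "calls_per_task": 3, "cost_per_call": 0.002}
option_b = {"upfront": 120_000, "calls_per_task": 1, "cost_per_call": 0.003}

for name, o in (("A", option_a), ("B", option_b)):
    inference = tasks_per_day * days * o["calls_per_task"] * o["cost_per_call"]
    print(f"Option {name}: upfront ${o['upfront']:,}, inference ${inference:,.0f}, "
          f"total ${o['upfront'] + inference:,.0f}")
# With these made-up numbers, the pricier option edges ahead over two years;
# at lower volume it wouldn't. Finding the crossover point is the real exercise.
```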
Training economics shape what's viable. Understanding the full cost structure—not just compute—helps identify realistic approaches.
For your AI initiatives, what's the realistic total cost of ownership over two years? How do training, fine-tuning, and RAG approaches compare?
The Efficiency Illusion
"Benchmarks measure what's measurable, which isn't always what matters for your specific application."
What Efficiency Gains Actually Deliver
The Benchmark Question
AI models are evaluated on standardized benchmarks—tests measuring specific capabilities like reasoning, coding, or factual recall. Benchmarks are useful for comparison, but they measure what's measurable, which isn't always what matters for a given application.
When a model matches benchmark performance at lower cost, it's worth asking: what capabilities might differ outside these specific tests?
Benchmarks measure what's measurable, which isn't always what matters for your specific application.
When Benchmarks Mislead
A concrete example: a team I know evaluated models for technical documentation. The smaller, more efficient model scored nearly identically to the larger one on coding benchmarks and general Q&A. In production, the difference became clear.
The larger model correctly inferred that "the system" in paragraph four referred to a specific microservice mentioned three pages earlier. The efficient model treated it as a generic reference and gave plausible but wrong instructions. The benchmark didn't test for this kind of long-range coreference resolution in domain-specific contexts—but the documentation use case depended on it.
The efficient model wasn't worse in any absolute sense. It was worse for that specific task in ways no benchmark had measured.
Sources of Efficiency Gains
Efficiency improvements come from several sources, each with different tradeoffs:
Quantization reduces precision of weights, making models smaller and faster. This works well up to a point, then subtly degrades capability. (A small numeric sketch of this tradeoff follows the list.)
Distillation trains smaller models to mimic larger ones. Effective, but the student typically doesn't fully match the teacher on edge cases.
Architecture innovations genuinely do more with less. These are real advances, though often incremental rather than transformational.
Hardware optimization improves how models use compute resources. Real gains, applicable across models.
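To make the quantization tradeoff tangible, here is a tiny numpy sketch that rounds float32 weights to int8 and measures the reconstruction error. Real schemes (per-channel scales, outlier handling, quantization-aware training) are more sophisticated, but the shrink-versus-precision trade is the same.

```python
import numpy as np

rng = np.random.default_rng(1)
weights = rng.normal(scale=0.02, size=100_000).astype(np.float32)  # stand-in for a weight tensor

# Symmetric int8 quantization: map the float range onto 255 integer levels.
scale = np.abs(weights).max() / 127
quantized = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
restored = quantized.astype(np.float32) * scale

print(f"size: {weights.nbytes:,} bytes -> {quantized.nbytes:,} bytes")   # 4x smaller
print(f"mean absolute error: {np.abs(weights - restored).mean():.6f}")   # small, but never zero
```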
The Tradeoff Reality
For a given architecture, there's a relationship between model size, training compute, and capability. Smaller models are faster and cheaper. They're also less capable in ways that may or may not matter for your use case.
The question isn't whether efficiency gains are real—they often are. It's whether the tradeoffs matter for what you're trying to do.
The question isn't whether efficiency gains are real. It's whether the tradeoffs matter for what you're trying to do.
Practical Evaluation
Instead of optimizing for benchmarks, consider what actually matters for your use case:
Error tolerance: What's the cost when the model gets something wrong?
Edge case behavior: How does performance hold up on unusual inputs?
Latency requirements: How fast does response need to be?
Volume and budget: What's the realistic query volume and cost constraint?
Efficiency gains are real, but they come with tradeoffs. Evaluate against your actual requirements, not benchmark headlines.
For your most important AI use cases, what would a failure cost? How does that shape the capability-efficiency tradeoff?
Context Windows and Memory
"Models are stateless. Each response is generated fresh from the current context. Previous conversations aren't remembered unless explicitly included."
What Models Actually 'Remember'
How Context Works
A context window is the text the model can process when generating a response. It includes system prompts, conversation history, any documents you've included—everything the model can "see" for this particular response.
The key insight: models are stateless. Each response is generated fresh from whatever's in the current context window. Previous conversations aren't remembered unless explicitly included. The model doesn't learn or accumulate knowledge from interactions.
Models are stateless. Each response is generated fresh from the current context. Previous conversations aren't remembered unless explicitly included.
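Because the model is stateless, any "memory" is something your application reassembles on every call. A minimal sketch, assuming a generic chat-style message format and a rough token budget; the names and the four-characters-per-token heuristic are placeholders, not a specific vendor's SDK.

```python
history: list[dict] = []          # the application owns the memory, not the model
MAX_CONTEXT_TOKENS = 8_000
SYSTEM_PROMPT = "You are a support assistant for internal tools."

def rough_tokens(text: str) -> int:
    return len(text) // 4         # crude heuristic: roughly four characters per token

def build_context(user_message: str) -> list[dict]:
    messages = [{"role": "system", "content": SYSTEM_PROMPT}]
    budget = MAX_CONTEXT_TOKENS - rough_tokens(SYSTEM_PROMPT) - rough_tokens(user_message)
    kept = []
    for msg in reversed(history):             # keep the most recent turns that still fit
        cost = rough_tokens(msg["content"])
        if budget - cost < 0:
            break                             # older turns silently fall out of "memory"
        kept.append(msg)
        budget -= cost
    messages.extend(reversed(kept))
    messages.append({"role": "user", "content": user_message})
    return messages                           # this list, and only this list, is what the model sees
```

Whatever doesn't make it into that returned list does not exist for the model on that call.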
Bigger Windows, Different Tradeoffs
Context windows have grown substantially—from a few thousand tokens to 100K+ in some models. This sounds like pure improvement, but there are considerations.
Longer contexts are slower and more expensive to process. More importantly, research shows models don't use long contexts uniformly—information in the middle tends to get less attention than content at the beginning and end.
This "lost in the middle" phenomenon means dumping everything into a huge context isn't automatically better than thoughtful selection of what to include.
RAG and Context Management
Retrieval-Augmented Generation uses context windows differently: instead of including everything, you retrieve relevant chunks and include only those.
This is often more effective than brute-force context stuffing, but introduces its own challenges. What if relevant information spans multiple chunks? What if retrieval misses something important? Context management becomes a design problem.
Thoughtful selection of what to include often beats dumping everything into a large context window.
Practical Implications
Understanding context mechanics suggests several practices:
Be explicit about what's needed. Don't assume the model remembers anything from outside the current context.
Position matters. Put the most important information at the beginning of your context.
More isn't always better. Selective, relevant context often outperforms comprehensive context.
Design for statelessness. Build systems that explicitly manage what context is included in each interaction.
Context is working memory, not long-term memory. Design systems that explicitly manage what the model sees for each response.
For applications requiring continuity across interactions, how are you managing context? What happens when important information falls outside the window?
The Proprietary Data Paradox
The Multiplier Effect
"Proprietary data provides facts. Reference data provides relationships. Relationships enable computation that creates insight."
How Reference Data Transforms Proprietary Data
The Data Advantage Assumption
Many organizations assume their proprietary data is their AI advantage. "We have twenty years of customer data." "Our operational data is unique." This assumption is understandable—and often incomplete.
Proprietary data frequently has challenges: inconsistent formats, implicit assumptions that made sense to creators but aren't documented, gaps that weren't problems for original use cases but matter for AI applications.
Proprietary data is often an advantage in potential. Reference data is what converts that potential into something usable.
What Reference Data Provides
Reference data provides scaffolding that makes proprietary data interpretable and connectable:
Taxonomies and ontologies provide standard categorization schemes that enable comparison across different data sources.
Entity registries disambiguate references—connecting 'IBM' and 'International Business Machines' to the same entity.
Relationship schemas define standard ways of expressing connections between entities.
When I mapped product data to USDA nutritional standards, suddenly I could compute similarities and relationships that were invisible before. The reference data didn't add proprietary value—it unlocked value that was already latent in the proprietary data.
The Multiplier Mechanism
The pattern: proprietary data provides isolated facts. Reference data provides relationships. Relationships enable computation. Computation creates insight.
A customer database tells you what people bought. Reference data linking products to categories, categories to market segments, segments to trends—that's what enables analysis of why they bought and predictions of what they might buy next.
Proprietary data provides facts. Reference data provides relationships. Relationships enable computation that creates insight.
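A toy illustration of the multiplier: a proprietary purchase log joined to two small pieces of reference data, an entity registry and a product-to-segment mapping, suddenly supports a segment-level question the raw log could not answer. All names and mappings here are invented.

```python
from collections import Counter

# Proprietary data: isolated facts, inconsistent naming.
purchases = [
    {"customer": "acme corp", "product": "Widget-A"},
    {"customer": "ACME Corporation", "product": "Widget-B"},
    {"customer": "Globex", "product": "Widget-A"},
]

# Reference data: an entity registry and a product-to-segment mapping.
entity_registry = {"acme corp": "ACME", "acme corporation": "ACME", "globex": "GLOBEX"}
product_segments = {"Widget-A": "industrial", "Widget-B": "consumer"}

# The join is where facts become connectable and computable.
segment_counts = Counter()
for row in purchases:
    entity = entity_registry[row["customer"].lower()]
    segment = product_segments[row["product"]]
    segment_counts[(entity, segment)] += 1

print(segment_counts)
# Counter({('ACME', 'industrial'): 1, ('ACME', 'consumer'): 1, ('GLOBEX', 'industrial'): 1})
```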
Practical Access
Much useful reference data already exists: government databases, academic ontologies, industry standards, open knowledge bases. The work often isn't creating reference data—it's mapping your proprietary data to existing standards.
That mapping work is harder than it sounds. It requires understanding both your data and the reference data well enough to align them meaningfully. Consider: your internal system calls it a "customer account," the reference ontology calls it an "organizational entity," and the actual concept lives somewhere between those definitions. Semantic drift compounds the problem—terms evolve differently across systems, so "active customer" might mean different things in your CRM versus your billing system versus the reference standard.
Domain-specific disambiguation adds another layer. In our nutritional work, "apple" could mean the fruit, a flavor profile, or an ingredient category depending on context. The reference data had all three; knowing which one to map to required understanding we couldn't automate. This is knowledge architecture work, and it's consistently underestimated.
Reference data multiplies the value of proprietary data by making it connectable and computable. The mapping work is where value gets created.
What reference data sources exist for your domain? How much of your proprietary data is currently mapped to external standards?
The Tacit Knowledge Bottleneck
"Experts know things they can't fully articulate. That tacit knowledge is often their most valuable contribution—and the hardest to capture."
Expert Knowledge Resists Capture
The Articulation Gap
Ask experts how they make decisions and you'll typically get an incomplete picture. They'll describe the factors they consciously consider. What they often can't articulate: the pattern recognition that happens below conscious awareness, the subtle cues that trigger intuition, the exceptions they've learned to recognize without being able to name.
This is tacit knowledge—knowledge demonstrated in practice but not fully expressible. It's often the most valuable thing experts have.
Experts know things they can't fully articulate. That tacit knowledge is often their most valuable contribution—and the hardest to capture.
The AI Training Problem
AI systems learn from data. If knowledge isn't captured in data, models can't learn it. Tacit knowledge, by definition, isn't in the data—it's in the experts who create and interpret the data.
This creates a bottleneck: the most valuable knowledge for AI to learn is often the knowledge that's hardest to capture in training data.
Common Approaches and Their Limits
Standard knowledge capture methods often fall short:
Interviews capture what experts think they do, which differs from what they actually do.
Documentation captures explicit procedures but misses contextual judgment.
Observation shows what experts do but not why—and experts often can't fully explain the 'why' themselves.
More Promising Approaches
Some methods work better with the nature of tacit knowledge:
Decision logging: Capture not just outcomes but the specific inputs that led to them. Build datasets of expert judgment in context. (A minimal sketch follows this list.)
Structured disagreement: When experts disagree, exploring why often surfaces tacit criteria neither would articulate unprompted. In one project, two senior engineers disagreed on an edge case in a classification system. Exploring the disagreement revealed that one was implicitly weighting latency ("this needs to be fast") while the other was weighting reliability ("this needs to be right"). Neither had stated these priorities explicitly—they'd internalized them over years. The disagreement made the tacit explicit, and we captured both criteria for the training data.
Schema co-creation: Building data models with experts forces articulation of what entities and relationships matter.
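As an example of how lightweight decision logging can be, here is a minimal sketch that appends structured records of inputs plus judgment at the moment a call is made. The schema and field names are invented for illustration.

```python
import json
import datetime

def log_decision(path: str, inputs: dict, decision: str, rationale: str, expert: str) -> None:
    """Append one expert judgment, with its context, to a JSONL log."""
    record = {
        "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "expert": expert,
        "inputs": inputs,          # the specific signals the expert was looking at
        "decision": decision,
        "rationale": rationale,    # even a one-line 'why' surfaces tacit criteria over time
    }
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")

log_decision(
    "decisions.jsonl",
    inputs={"ticket_id": "T-1042", "latency_ms": 900, "error_rate": 0.02},
    decision="escalate",
    rationale="latency is tolerable, but this error pattern usually precedes outages",
    expert="j.doe",
)
```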
The next article explores schema co-creation in more depth—how building data models with experts can make tacit knowledge traversable.
Building data models with experts forces articulation. The modeling process itself surfaces knowledge that wouldn't emerge through interviews alone.
Tacit knowledge is real and valuable. Capturing it requires methods designed for knowledge that resists articulation.
Where does tacit knowledge create the most value in your organization? What happens to that knowledge when key people leave?
Why Vectors Aren't Enough
"Vector search finds semantically similar content. But expertise often involves connecting concepts that aren't similar—they're related in ways similarity search can miss."
The Case for Structured Knowledge
The Vector Search Pattern
The current approach to grounding AI in organizational knowledge typically involves vector databases: embed your documents, do similarity search at query time, feed relevant chunks to the language model. This works well for many use cases.
Where it can struggle: when the answer requires connecting concepts that aren't semantically similar, when relationships matter more than content similarity, when expertise involves reasoning across multiple domains.
Vector search finds semantically similar content. But expertise often involves connecting concepts that aren't similar—they're related in ways similarity search can miss.
The Relational Limitation
Relational databases taught us to think in tables—entities with attributes, relationships through foreign keys. This works brilliantly for transactions and structured operations. It works less well for knowledge.
When I talked to domain experts about how concepts in their field related, they didn't think in rows and columns. They thought in patterns: these things share properties, these processes produce these outcomes, these categories overlap in these ways. The relationships were primary. The entities were anchors for those relationships.
What Graphs Add
Knowledge graphs make relationships first-class objects. Instead of inferring connections from attribute matching, you explicitly represent how things relate.
When I built the product graph, the valuable questions became: what products share compound profiles? Which ingredients cluster together across manufacturers? What's the health effect footprint of a category? These are graph traversal questions, not similarity search questions.
Graphs make relationships first-class objects. The valuable questions are often about traversing connections, not finding similar content.
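A small sketch of the difference, using networkx and an invented product graph: "which products share a compound?" is a traversal over explicit, labeled relationships rather than a similarity lookup. The entities and sources here are made up, though the shape mirrors the nutritional work described above. The edges also carry a provenance attribute, which this article returns to below.

```python
import networkx as nx

G = nx.Graph()
# Nodes are typed entities; edges are explicit, labeled relationships with provenance.
G.add_edge("Product:GreenTeaBlend", "Compound:EGCG", relation="contains", source="USDA")
G.add_edge("Product:MatchaPowder", "Compound:EGCG", relation="contains", source="USDA")
G.add_edge("Compound:EGCG", "Effect:Antioxidant", relation="associated_with", source="literature")

def products_sharing_compounds(product: str) -> set[str]:
    """Traverse product -> compound -> other products: a relationship question, not a similarity one."""
    shared = set()
    for compound in G.neighbors(product):
        if compound.startswith("Compound:"):
            for other in G.neighbors(compound):
                if other.startswith("Product:") and other != product:
                    shared.add(other)
    return shared

print(products_sharing_compounds("Product:GreenTeaBlend"))       # {'Product:MatchaPowder'}
print(G.edges["Compound:EGCG", "Effect:Antioxidant"]["source"])  # 'literature': provenance travels with the edge
```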
Complementary Approaches
This isn't an argument that graphs replace vectors—they're complementary. Vector search excels at finding relevant content when you're not sure exactly what you're looking for. Graph traversal excels when you know the relationship structure and need to reason through connections.
For AI agents and more complex RAG implementations, graphs can provide routing: knowing which paths through information are likely to be relevant for different types of questions. The graph provides the map; vector search helps explore each territory.
The Provenance Dimension
One additional benefit of explicit knowledge structures: they can carry provenance. In my work, different data sources had different confidence levels—government databases, academic literature, derived calculations. The graph tracked these distinctions.
When results came back, users could see not just what the answer was, but where it came from and how confident to be in it. A claim backed by FDA data looked different from one derived from our own algorithmic inference. Users learned to calibrate their trust based on provenance signals.
This matters because AI systems speak with uniform confidence. A graph with provenance gives the interface something to work with—material to make uncertainty visible. Without it, you're asking interface designers to create honesty from nothing.
The same principle applies to data governance more broadly. A searchable data dictionary that indexes every column across your medallion architecture—bronze through gold—transforms governance from documentation burden to discovery tool. When someone can find all tables containing 'customer_id' in seconds rather than hours, the structured metadata becomes its own form of organizational knowledge graph.
The next article explores this in depth: how interface design becomes the final guardrail when the underlying system can mislead.
Vectors and graphs solve different problems. Vectors find similar content; graphs traverse relationships. Complex knowledge applications often benefit from both.
For your domain expertise, what are the key relationships between concepts? Could those relationships be made explicit and traversable?
UI as the Ultimate Guardrail
"Systems speak with uniform confidence regardless of how well-grounded their outputs are. Interface design determines whether users can see the difference."
Designing Interfaces for Systems That Can Mislead
The Uniform Confidence Problem
Complex systems tend to present all outputs with equal confidence. A result backed by authoritative data looks the same as one derived from algorithmic inference. Users reasonably trust what's presented—they can't see the uncertainty underneath.
This is the confidence problem: systems speak with a uniform voice regardless of how well-grounded their outputs are. The interface determines whether users can distinguish between high-confidence and speculative results.
Systems speak with uniform confidence regardless of how well-grounded their outputs are. Interface design determines whether users can see the difference.
Designing the Human-in-the-Loop
'Human-in-the-loop' is often invoked as a safety measure. But the loop has to be designed—it doesn't happen automatically.
Some patterns that helped in my work:
Provenance visibility: Every data point traced back to its source. Regulatory data looked different from derived scores. Users could see at a glance what was authoritative versus algorithmic.
Constraints as interface elements: Acceptable ranges were visible, not hidden. Users saw boundaries before they acted, not after.
Drill-down by default: Every aggregate was explorable. Users could click through to see what drove a result, not just the result itself.
Visual confidence encoding: Confidence levels had consistent visual treatment. The eye could learn the language before consciously parsing numbers.
The Visualization Skepticism Principle
Network visualizations are particularly seductive. A beautiful graph makes patterns feel discovered and real—even when those patterns depend on arbitrary parameter choices.
I learned to make thresholds visible and adjustable. If a pattern survives across different threshold settings, it's probably real. If it vanishes when you tweak a parameter, it was probably an artifact.
Visualizations that invite exploration should also invite skepticism.
If a pattern survives across different threshold settings, it's probably real. If it vanishes when you tweak a parameter, it was probably an artifact.
Separating Exploration from Action
The stakes differ between exploring possibilities and acting on recommendations. I found it helpful to make this boundary explicit in the interface.
Exploration was fast and fluid—query, visualize, download. But when the system suggested an action or flagged an insight for decision-making, it showed the reasoning path. Users could audit the logic before acting.
A domain expert can validate or override a transparent recommendation. They can't validate a black box.
This same principle shapes how I think about data quality interfaces. A letter grade—A through F—provides an intuitive, at-a-glance quality indicator that most people can parse without training. But the grade alone isn't enough. Users need to drill into the dimensions that produced it: completeness, uniqueness, recency, documentation. The grade invites attention; the dimensions enable judgment. Without both, you're either overwhelming users with numbers or asking them to trust scores they can't interrogate.
Connection to LLMs
Everything about designing for knowledge graphs applies to large language models—arguably with greater force. If a system with explicit, queryable relationships requires careful design, how much more does a system generating text from statistical patterns?
The principles transfer: surface uncertainty, show provenance where possible, enable drill-down, make constraints visible. The implementations differ, but the design responsibility remains.
Interface design determines whether users can appropriately calibrate trust. Make uncertainty visible, not hidden.
For AI-powered interfaces in your organization, can users tell the difference between high-confidence and speculative outputs? What would change if they could?
Forward-Looking Governance
Hallucination is a Feature, Not a Bug
"The model doesn't have a concept of truth—it has a concept of plausibility. Hallucination isn't a malfunction; it's inherent to probabilistic text generation."
Understanding Why AI Makes Things Up
How Text Generation Works
Language models are next-token predictors. Given a sequence of text, they generate the most likely continuation based on patterns learned from training data. Repeating that prediction token after token, thousands of times per response, produces fluent text.
Here's the important part: 'most likely' doesn't mean 'true.' The model doesn't have a concept of truth—it has a concept of plausibility. What text patterns typically follow this text pattern? That's what it's computing.
When training data is incomplete or when queries push beyond learned patterns, the model generates plausible-sounding text that may not be accurate. This isn't a malfunction—it's how the mechanism works.
The model doesn't have a concept of truth—it has a concept of plausibility. Hallucination isn't a malfunction; it's inherent to probabilistic text generation.
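A toy illustration of "plausible, not true": given a probability distribution over candidate next tokens (the candidates and scores below are invented), sampling produces fluent continuations with no truth check anywhere in the loop.

```python
import numpy as np

rng = np.random.default_rng(7)

# Invented next-token scores after the prompt "The capital of Atlantis is"
candidates = ["Poseidonia", "Atlantis City", "unknown", "Athens"]
logits = np.array([2.1, 1.7, 0.9, 0.4])        # learned plausibility scores, not facts

def sample(logits: np.ndarray, temperature: float = 1.0) -> int:
    probs = np.exp(logits / temperature)
    probs /= probs.sum()
    return rng.choice(len(probs), p=probs)

for _ in range(5):
    print(candidates[sample(logits)])          # fluent answers about a place with no capital
```

Nothing in that loop knows or cares whether Atlantis exists; it only knows which continuations scored as likely.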
Why It Can't Be Eliminated
Hallucination is inherent to probabilistic text generation. Mitigation strategies can reduce frequency but not eliminate it:
RLHF trains models toward outputs human raters prefer. Helpful, but raters can't verify every claim.
RAG grounds responses in retrieved documents. But models can still misinterpret or selectively use retrieved content.
Constitutional approaches train models to follow principles. Reduces certain error types but doesn't address fundamental uncertainty.
Any system that generates novel text from statistical patterns will sometimes generate text that sounds right but isn't.
The Governance Implication
If hallucination can't be eliminated technically, it becomes a governance problem: how do you manage systems that will sometimes confidently produce incorrect outputs?
This reframes the question. Instead of 'how do we fix hallucination?' the more useful question might be 'for which use cases is this error rate acceptable, and what processes do we need around use cases where it isn't?'
Instead of 'how do we fix hallucination?' the better question might be 'for which use cases is this error rate acceptable?'
Questions for Governance Review
Rather than a checklist, consider these questions when evaluating AI applications:
Use case fit: What's the actual error rate we've measured, and what's the cost of an error in this context? Are we deploying in a context where that error rate is tolerable given the stakes?
Verification design: For outputs that matter, who reviews them and how? Is that review realistic given volume and time constraints, or are we assuming verification that won't actually happen?
Monitoring capability: Can we see what's being generated at scale? Would we know if output quality degraded? What would trigger an alert?
Accountability clarity: When an AI output leads to a decision, who's responsible for validating it? Is that person empowered to override, and do they have the context to do so?
These connect to the interface design principles from Article 9—honest interfaces help users know when verification matters.
Hallucination is a property of probabilistic text generation, not a bug being fixed. Governance is about managing that property, not waiting for it to disappear.
For AI applications in your organization, what's the actual cost of incorrect outputs? Does your governance match that risk level?
Working With the Machine
"The most valuable prompting skill isn't crafting better prompts. It's recognizing when prompting isn't the answer."
Prompting as Interface Design—and Knowing When It's Not the Answer
The Real Skill: Knowing When to Stop
Here's the insight that took me longer to learn than it should have: the most valuable prompting skill isn't crafting better prompts. It's recognizing when prompting isn't the answer.
If you're on your fifth iteration of a prompt, trying to get reliable behavior for a critical task, that's a signal. The task might not be well-suited for current LLMs. Fine-tuning might be more appropriate. Traditional software might be the better tool. Or you might need a different architecture entirely—RAG, agents, or hybrid approaches.
Prompt engineering articles rarely mention this. But in practice, knowing when to stop prompting and start reconsidering the approach is underrated.
The most valuable prompting skill isn't crafting better prompts. It's recognizing when prompting isn't the answer.
The Interface Frame
When you write a prompt, you're not programming in the traditional sense—you're not giving precise instructions to a deterministic system. You're providing context to a probabilistic system that will interpret that context based on its training.
This is closer to interface design than coding. Like a good interface, a good prompt makes desired behavior clear, handles edge cases gracefully, and provides useful feedback when things go wrong.
Prompting is closer to interface design than coding. You're designing how human intent gets interpreted by a probabilistic system.
What Tends to Work
When prompting is the right approach, some patterns help (a combined sketch follows the list):
Format specificity: 'Return a JSON object with these fields' eliminates ambiguity about output structure.
Examples over descriptions: Showing what you want often works better than explaining it. The model learns from patterns.
Task decomposition: Breaking complex tasks into steps often improves results compared to asking for everything at once.
Explicit reasoning requests: 'Explain your reasoning' or 'think through this step by step' often improves outputs—and makes errors easier to catch.
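Here is a sketch combining several of those patterns: explicit output format, a worked example, and a decomposed task. The schema and wording are illustrative only, not a recommendation tied to any particular model.

```python
import json

EXAMPLE_OUTPUT = {
    "summary": "Customer reports intermittent login failures after the 2.3 update.",
    "severity": "high",
    "follow_up_needed": True,
}

def build_prompt(ticket_text: str) -> str:
    return (
        "Step 1: Read the support ticket below.\n"
        "Step 2: Identify the core issue and its severity (low, medium, high).\n"
        "Step 3: Return a JSON object with exactly these fields: "
        "summary (string), severity (string), follow_up_needed (boolean).\n\n"
        f"Example output:\n{json.dumps(EXAMPLE_OUTPUT, indent=2)}\n\n"
        f"Ticket:\n{ticket_text}\n\n"
        "Return only the JSON object."
    )

print(build_prompt("Password reset emails arrive 40 minutes late for some users."))
```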
Constraints Beat Personas
Giving the model a persona ('You are an expert marketer') is popular. It works sometimes, but has a failure mode: the model may confidently perform the persona even outside its actual competence.
An alternative: describe the task and constraints directly. Instead of 'You are an expert lawyer,' try 'Review this text for potential issues. Note that this is an initial review and will need professional legal review—I'm looking for areas to focus on.'
Constraints often work better than personas. 'You are an expert' makes the model confident; constraints help it stay appropriately scoped.
System vs. User Prompts
If you're building applications, you're working with two levels: system prompts (persistent instructions) and user prompts (individual inputs).
System prompts establish guardrails, define behavior patterns, and handle recurring edge cases. User prompts are where specific tasks happen. Getting this division right affects both capability and cost—too much in the system prompt hits context limits; too little makes behavior inconsistent.
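In chat-style APIs this division usually shows up as message roles. A minimal, vendor-neutral sketch; the payload shape is generic rather than any specific SDK's.

```python
SYSTEM_PROMPT = (
    "You summarize internal incident reports. "
    "Never speculate about root cause; if information is missing, say so. "
    "Keep summaries under 120 words."
)

def build_messages(user_input: str) -> list[dict]:
    return [
        # Persistent guardrails and behavior, reused across requests.
        {"role": "system", "content": SYSTEM_PROMPT},
        # The specific task for this request.
        {"role": "user", "content": f"Summarize this report:\n{user_input}"},
    ]
```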
Signs You Need a Different Approach
Watch for these signals that prompting has hit its limits:
Brittleness: The prompt works for test cases but fails on slight variations in real inputs.
Iteration spiral: Each fix for one edge case breaks another. You're playing whack-a-mole.
Complexity explosion: The prompt has grown to thousands of tokens of instructions, conditions, and exceptions.
Verification burden: Every output requires careful human review because reliability isn't high enough.
When you see these patterns, consider: is this the right tool for this job?
Prompting is interface design between human intent and machine capability. Be specific about format, use examples, decompose complexity—and know when a different approach is needed.
For critical AI workflows in your organization, how much effort goes into prompting? At what point does that effort suggest a different approach might be better?
What the Engine Room Taught Me
"These systems are more comprehensible than marketing suggests, and more limited than hype implies. Both things are true."
Synthesis and Looking Forward
The Through-Line
If there's one thread connecting these articles, it's this: these systems are more comprehensible than marketing often suggests, and more limited than hype implies. Both things are true simultaneously.
Attention mechanisms, training economics, context windows—these aren't black boxes. They're engineered systems with understandable properties. Understanding them doesn't require a PhD; it requires curiosity and the willingness to look beneath the surface.
At the same time, the limitations are real. Hallucination isn't a bug being fixed. Context isn't memory. Efficiency gains come with tradeoffs. These aren't criticisms—they're properties of how these systems work.
These systems are more comprehensible than marketing suggests, and more limited than hype implies. Both things are true.
What Changes With Understanding
If you've engaged with this series, you have a framework for evaluating claims. When someone promises transformation, you can ask useful questions: What's the training data? How are they handling context? What's the error rate, and what governance wraps around it? Is the interface honest about uncertainty?
You have vocabulary for knowledge architecture discussions. What becomes a node? What relationships matter? Where does data come from, and what's the confidence level?
You understand why prompting is interface design, why proprietary data needs reference data, why tacit knowledge creates bottlenecks.
What Stays Stable
The fundamentals are more stable than headlines suggest. Attention mechanisms work the way they've worked since 2017. Training economics follow predictable patterns. The challenges of tacit knowledge and honest interfaces predate AI entirely—they're as old as expert systems.
This is actually good news. It means understanding the current moment provides durable value. When new models launch and benchmarks break, you can distinguish what's changed fundamentally from what's incremental improvement.
The fundamentals are more stable than headlines suggest. Understanding them provides durable value across hype cycles.
The Infrastructure Question
Here's a pattern I've observed—and I'll state it more directly than I have elsewhere in this series: organizations that move fast by skipping foundations consistently pay for it later. I've seen this enough times that I treat it as a near-certainty rather than a tendency.
Ungoverned data creates dependencies that calcify. Architectures that ignore knowledge structure make retrieval brittle in ways that compound. Interfaces that hide uncertainty train users to trust what shouldn't be trusted, and that trust becomes institutionalized.
The correction is always more expensive than doing it right initially. In my experience, "we'll clean up the data later" and "we'll add governance when we scale" translate to 2-3x cost multipliers—and that's before accounting for the decisions made on bad foundations in the interim.
Building clean data pipelines, thoughtful knowledge structures, honest interfaces—this work isn't glamorous. It also doesn't become obsolete when the next model releases.
Governance as Readiness
The infrastructure question leads to a reframe I've come to believe deeply: AI readiness isn't primarily about AI. It's about governance.
Organizations that can answer basic questions—What data do we have? Can we trust it? How do we know?—are organizations positioned to use AI effectively. Organizations that can't answer those questions will struggle regardless of how sophisticated their models are.
This is why the final article in this series focuses on governance as the foundation for AI readiness. The work of observing your data landscape, orienting around quality and fitness for purpose, deciding on priorities, and acting to improve—this work compounds. It makes every subsequent AI initiative faster, cheaper, and more likely to succeed.
The organizations that will thrive with AI aren't necessarily the ones with the most sophisticated models. They're the ones that have done the unglamorous work of knowing what they have and whether they can trust it.
The Engine Room
The engine room metaphor has been useful to me not because it's superior to the bridge, but because it's different. Different things are visible from different vantage points.
What's visible from the engine room: the gap between demos and deployment. The unglamorous work that makes systems actually useful. The limitations that don't make it into marketing materials. The slow accumulation of understanding through building.
That's the perspective I've tried to share in this series. Not the final word—AI is moving too fast for final words. But hopefully a useful contribution to navigating what's here and what's coming.
The slow accumulation of understanding through building—that's what the engine room offers. Not the final word, but a useful perspective.
Understanding the machinery helps you navigate hype cycles, ask better questions, and make decisions grounded in how these systems actually work.
What foundations is your organization building now that will still matter regardless of which models dominate in two years?
AI Readiness Is a Governance Question
"AI readiness isn't primarily about AI. It's about whether you can answer a simple question: What data do we have, and can we trust it?"
Why the Work Before AI Is the Work That Matters
The Readiness Misconception
"AI readiness" typically conjures images of model selection, prompt engineering, and integration architecture. These matter. But they're late-stage concerns that assume something more fundamental: that you know what data you have and whether you can trust it.
Most organizations can't answer those questions reliably. Not because they lack data—they're drowning in it—but because they lack visibility. Tables proliferate across schemas. Columns duplicate with inconsistent names. Quality degrades silently. Documentation exists in someone's head until that someone leaves.
This is the governance gap, and it's where AI initiatives go to struggle.
AI readiness isn't primarily about AI. It's about whether you can answer a simple question: What data do we have, and can we trust it?
The OODA Loop for Data Governance
John Boyd's OODA loop—Observe, Orient, Decide, Act—was developed for fighter pilots making split-second decisions. The framework translates surprisingly well to data governance, where the challenge is also about maintaining situational awareness in a complex, fast-changing environment.
Each phase addresses a distinct governance question:
OBSERVE: What data exists? Before you can govern data, you need to see it. This means automated metadata collection, asset discovery, and maintaining a searchable inventory of tables and columns across your environment. The goal is visibility that doesn't depend on institutional memory. (A minimal inventory sketch follows this list.)
ORIENT: Can I trust it? Knowing data exists isn't enough—you need to know its quality. This means assessing completeness, uniqueness, recency, and other dimensions that determine fitness for purpose. Quality isn't binary; it's contextual. A table that's "good enough" for one use case may be inadequate for another.
DECIDE: What should I prioritize? With visibility and quality assessment in place, you can make informed decisions about where to invest remediation effort. Not all data quality issues matter equally. Some tables are critical paths; others are rarely touched. Governance resources are finite.
ACT: How do I improve it? Finally, governance requires action—standardizing schemas, fixing quality issues, documenting business logic. The key is that action flows from understanding, not guesswork.
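For the Observe phase, most platforms expose catalog or information-schema views you can crawl. Here is a minimal sketch using SQLite's built-in metadata as a stand-in; the specific catalog tables and queries will differ by platform, and the database path is hypothetical.

```python
import sqlite3

def inventory(db_path: str) -> list[dict]:
    """List every table with basic metadata: column count and row count."""
    conn = sqlite3.connect(db_path)
    tables = [r[0] for r in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'")]
    assets = []
    for t in tables:
        columns = conn.execute(f'PRAGMA table_info("{t}")').fetchall()
        row_count = conn.execute(f'SELECT COUNT(*) FROM "{t}"').fetchone()[0]
        assets.append({"table": t, "columns": len(columns), "rows": row_count})
    conn.close()
    return assets

# for asset in inventory("warehouse.db"):
#     print(asset)   # e.g. {'table': 'orders', 'columns': 12, 'rows': 48210}
```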
The Metadata-Only Principle
Effective governance doesn't require exposing sensitive data. Quality assessment can operate on statistical aggregates: null percentages, distinct counts, row counts, freshness timestamps. You can know that a column is 73% complete without ever seeing the values in that column.
This matters for two reasons. First, it simplifies compliance—governance metadata doesn't carry the sensitivity of the underlying data. Second, it enables broader access. A data dictionary that anyone can search is more valuable than one locked behind restrictive permissions.
The metadata-only approach also creates a natural audit trail. Timestamped quality assessments show how data health has changed over time—useful for demonstrating improvement and diagnosing regressions.
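To show that quality signals really are just aggregates, here is a small pandas sketch that profiles a table and returns only statistics; no underlying values leave the function. The column names are examples.

```python
import pandas as pd

def profile(df: pd.DataFrame) -> pd.DataFrame:
    """Return per-column aggregates only: null rate, distinct count, row count."""
    out = pd.DataFrame({
        "null_pct": df.isna().mean().round(3) * 100,
        "distinct_count": df.nunique(),
    })
    out["row_count"] = len(df)
    return out

customers = pd.DataFrame({
    "customer_id": [1, 2, 2, None],
    "email": ["a@example.com", None, None, "d@example.com"],
})
print(profile(customers))
#              null_pct  distinct_count  row_count
# customer_id      25.0               2          4
# email            50.0               2          4
```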
Quality Dimensions That Matter
Not all quality dimensions are equally important, and importance varies by use case. But a reasonable starting set includes:
Completeness: What percentage of expected values are present? Null rates matter, but so does the pattern—random nulls differ from systematic gaps.
Uniqueness: Are keys actually unique? Are there unexpected duplicates? Uniqueness violations often signal upstream data issues.
Recency: How fresh is the data? A table last updated six months ago might be fine for reference data, problematic for operational data.
Consistency: Do values conform to expected formats and ranges? Are there outliers that suggest data entry errors or system bugs?
Documentation: Are tables and columns described? Can someone unfamiliar with the data understand what it represents?
These dimensions can be combined into a composite score—letter grades work well for at-a-glance assessment—while preserving the ability to drill into specifics when needed.
Quality dimensions can combine into intuitive grades while preserving drill-down to specifics. The grade invites attention; the dimensions enable judgment.
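One way to turn dimension scores into a grade, with invented weights and cutoffs; the specific thresholds are a policy choice, not a standard.

```python
def letter_grade(scores: dict[str, float], weights: dict[str, float]) -> tuple[str, float]:
    """Combine 0-100 dimension scores into a weighted composite and a letter grade."""
    composite = sum(scores[d] * w for d, w in weights.items()) / sum(weights.values())
    for cutoff, grade in [(90, "A"), (80, "B"), (70, "C"), (60, "D")]:
        if composite >= cutoff:
            return grade, composite
    return "F", composite

weights = {"completeness": 0.3, "uniqueness": 0.2, "recency": 0.2, "consistency": 0.2, "documentation": 0.1}
scores = {"completeness": 95, "uniqueness": 100, "recency": 60, "consistency": 88, "documentation": 40}
grade, composite = letter_grade(scores, weights)
print(grade, round(composite, 1))   # B 82.1: the grade invites attention, the dimensions enable judgment
```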
Building Before the Platform Arrives
Many organizations are waiting for their next-generation data platform—Unity Catalog, Fabric, or equivalent—before investing in governance. This is backwards.
Governance practices established now carry forward into new platforms. The data dictionary you build today will migrate. The quality baselines you establish become comparison points for measuring whether the new platform actually improves things. The muscle memory your team develops around quality assessment doesn't reset when infrastructure changes.
More importantly, governance work surfaces problems you'll want to fix before migration, not after. Discovering that a critical table has a 40% null rate is better learned during a governance assessment than during a failed AI deployment on the new platform.
The gap between current tooling and future platforms isn't a reason to wait. It's an opportunity to establish foundations that make the transition smoother and the eventual AI applications more reliable.
The Compounding Effect
Governance work compounds in ways that aren't immediately visible.
A searchable data dictionary reduces the time to answer "where does this data come from?" from hours to seconds. That time savings multiplies across every analyst, every project, every quarter.
Quality baselines transform vague concerns ("I don't trust that data") into specific, actionable issues ("that table has dropped from A to C grade over three months—here's which dimensions degraded"). Specific problems get fixed. Vague concerns persist.
Historical quality records enable trend analysis. You can demonstrate improvement to stakeholders, identify seasonal patterns, and catch regressions before they impact downstream systems.
Each of these benefits individually might seem modest. In combination, they transform an organization's relationship with its data from reactive to proactive.
What This Enables for AI
Return to the AI readiness question with governance in place:
RAG implementations become more reliable because you know which tables to retrieve from—and crucially, which ones to avoid. A column-level index means you can answer "where is customer_id used?" before building retrieval logic, not after discovering inconsistencies in production.
Fine-tuning decisions become more informed because you understand data quality and coverage. A table with 95% completeness and stable schema history is a better training candidate than one with quality issues you'd otherwise discover through model failures.
Confidence in outputs becomes grounded because you can trace claims back through provenance. When a model cites internal data, you can verify whether that data was current, complete, and appropriate for the query.
The pattern from Article 9 applies: AI systems speak with uniform confidence. Governance provides the substrate that lets you distinguish warranted confidence from unwarranted.
Governance provides what AI systems lack: the ability to distinguish warranted confidence from unwarranted.
Starting Points
If governance work feels overwhelming, start with observability: can you list all tables in your environment with basic metadata? Row counts, column counts, last-modified dates? This alone provides value and establishes infrastructure for more sophisticated assessment.
Next, establish quality baselines for critical tables—the ones that feed reports, power applications, or would be candidates for AI applications. You don't need to assess everything at once. Start where the impact is highest.
Build in automation early. Manual governance doesn't scale and doesn't persist. Scheduled crawlers, automated quality checks, alerts on degradation—these convert governance from a project into a capability.
Finally, make governance visible. A quality dashboard that stakeholders can access changes the conversation from "we should care about data quality" to "this specific table needs attention." Visibility creates accountability.
The Real Readiness
AI readiness isn't a destination—it's a capability. And that capability rests on foundations that have nothing to do with models, prompts, or inference endpoints.
Organizations that can answer "what data do we have?" and "can we trust it?" are ready for AI in ways that no amount of model selection can substitute for. Organizations that can't answer those questions will find AI initiatives consistently more difficult, more expensive, and less reliable than expected.
The work of governance isn't glamorous. It doesn't make for exciting demos or impressive announcements. It's the engine room work—the machinery that makes everything else possible.
And it compounds. Every month of governance investment makes the next AI initiative more likely to succeed.
The organizations that will thrive with AI aren't necessarily the ones with the most sophisticated models. They're the ones that have done the unglamorous work of knowing what they have and whether they can trust it.
AI readiness is governance readiness. The work of observing your data landscape, assessing quality, and building reliable metadata infrastructure isn't preparation for AI—it's the foundation that determines whether AI initiatives succeed.
If someone asked you today to list every table containing customer data with its quality grade and last-updated timestamp, how long would that take? That gap—between asking and knowing—is your governance opportunity.