When Specificity Breaks Intelligence … and the Bank
Stop Calling It Intelligence When It’s Really Just Scripting
In our recent stress-testing of AI for enterprise-scale financial workflows, we uncovered a recurring pattern that underscores both the promise — and the peril — of today’s fixation on “prompt engineering.”
At Charli, we’ve been tackling some of the most technically complex automation challenges for normalizing fragmented financial data across hundreds of employees and thousands of files and systems. These workflows are major industry bottlenecks that are error-prone, slow, and directly tied to lost revenue opportunities.
The challenge isn’t just messy data; it’s fractured data. Spreadsheets with inconsistent headers, embedded tables buried in Word or PDF files, CSV exports with schema drift, and internal databases with mismatched field definitions. Even skilled financial analysts and administrators struggle to reconcile, copy-and-paste, and calculate across this variability at scale. And scale magnifies the pain when many are forced to rinse-and-repeat these tasks across hundreds of different clients, each with its own quirks.
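Schema drift is easy to underestimate until you see how quietly it corrupts a pipeline. The sketch below uses invented data (the column names and figures are illustrative, not Charli’s) to show the same report exported in two quarters with renamed and reordered columns: a pipeline keyed on one quarter’s headers silently drops the other’s.

```python
# Hypothetical illustration of schema drift: the same report exported in
# two quarters with renamed, reordered columns. All names are invented.
import csv
import io

q1 = "Client,Revenue,Qtr\nAcme,1200,Q1\n"
q2 = "Quarter,Client Name,Total Rev\nQ2,Acme,1500\n"

def rows(text):
    """Parse a CSV string into a list of header-keyed dicts."""
    return list(csv.DictReader(io.StringIO(text)))

# A naive pipeline keyed on the Q1 header "Revenue" finds nothing in Q2.
revenue = [r.get("Revenue") for r in rows(q1) + rows(q2)]
print(revenue)  # ['1200', None] — half the data vanishes without an error
```

Note that nothing fails loudly: the drifted export simply yields `None`, which is exactly the kind of silent loss that turns hours of work into weeks of reconciliation.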
The outcome is predictable. Processes the business expects in hours or days routinely stretch into weeks or even months.
As any responsible engineering team would, our team turned to the latest generation of ML models and large language models (LLMs) to benchmark whether they could accelerate extraction and normalization. And yes, with very carefully engineered prompts, the LLMs could sometimes retrieve some of the values we needed from a given file. On the surface, this feels like a breakthrough.
But here’s the kicker: this is not intelligence. And it’s certainly not scalable automation.
These experiments highlight the brittleness of prompt-driven approaches. Change the column order, vary the table layout, or introduce nested fields, and the entire chain collapses. What looks like success in a controlled test is nothing more than scripted pattern-matching that is both fragile and dependent on human-crafted specificity. In production environments, where schema drift, multimodal inputs, and edge-case variability are the norm, this approach simply doesn’t hold.
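The failure mode is the same whether the layout assumption lives in a prompt or in code. This toy example (invented data and function names) shows a rule “tuned” to one column order: it looks correct on the layout it was built against, then returns garbage the moment the columns shuffle — with no error raised.

```python
# Sketch of why layout-specific extraction is fragile (illustrative data).
# The "tuned" rule assumes revenue is always the second column.
def extract_revenue(row):
    return row[1]  # correct only for the layout it was tuned on

original = ["Acme", "1200", "Q1"]   # [client, revenue, quarter]
reordered = ["Q1", "Acme", "1200"]  # same data, columns shuffled

print(extract_revenue(original))   # 1200 — looks like success
print(extract_revenue(reordered))  # Acme — silently wrong, no exception
```

A hyper-specific prompt encodes exactly this kind of positional assumption, just in natural language instead of an index.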
Our team has lived through enough enterprise automation cycles to spot the false signals. These prototypes are not solutions — they’re warnings. Over-reliance on LLM demoware creates the illusion of progress, while masking the deeper requirement of systems that can adapt to structural variation and infer context without constant human intervention.
The Illusion of Intelligence in Prompt Engineering
Hyper-specific prompting doesn’t scale. It is brittle, fragile, and dependent on the exact phrasing, formatting, and quirks of the input. The moment the source data shifts even slightly — a new table layout, a misaligned column header, a different file structure — the carefully tuned prompt collapses.
That’s not artificial intelligence. That’s human intelligence smuggled into the system through elaborate scripting. Prompt engineering at this level of specificity is just coding in disguise, only with a fuzzier syntax.
It makes for great demoware, but it doesn’t survive the brutal variability of production environments.
What Real Intelligence Looks Like
Real intelligence is not about telling a model exactly where to look, what to extract, and how to calculate. That’s simply encoding instructions in a new form. True intelligence means the system can infer how to handle data automatically regardless of format, source, or content type, and deliver the right outcome without constant human intervention.
And this goes far beyond the surface-level task of classifying “this is an Excel file” or “that’s a PDF.” Enterprise-grade intelligence demands multimodal reasoning that can:
Read and interpret embedded layouts and structures: not just plain text, but nested tables, headers, and footnotes across heterogeneous file types.
Parse complex or degraded inputs: tables inside scanned documents, images with mixed formatting, or files with OCR errors.
Reconcile inconsistent schemas: aligning mismatched field names, column orders, and export formats that shift from version to version.
Infer context, not follow instructions: distinguishing between revenue, bookings, or backlog in a table by understanding the semantic context, not because a prompt told it which column to use.
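To make the contrast with positional extraction concrete, here is a minimal sketch of reconciling columns by meaning rather than position. The synonym table and field names are illustrative assumptions, not Charli’s implementation — a production system would infer these mappings from semantic context rather than a hand-built dictionary — but it shows the shape of the idea: unknown fields are flagged rather than guessed.

```python
# Minimal sketch: map drifted headers to canonical fields by meaning,
# not position. The synonym table below is an invented, illustrative stand-in
# for learned semantic matching.
CANONICAL = {
    "revenue": {"revenue", "total rev", "net sales", "turnover"},
    "client": {"client", "client name", "account", "customer"},
}

def canonicalize(header):
    """Return the canonical field name for a raw header, or None if unknown."""
    key = header.strip().lower()
    for canonical, synonyms in CANONICAL.items():
        if key in synonyms:
            return canonical
    return None  # unknown field: surface for review instead of guessing

headers = ["Client Name", "Total Rev", "Quarter"]
print([canonicalize(h) for h in headers])  # ['client', 'revenue', None]
```

Even this crude version survives a column reorder or a renamed header, which is more than a position-tuned prompt can say; the real requirement is doing the same with no dictionary at all.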
At enterprise scale, these challenges compound. The system must consistently deliver accurate outputs across thousands of variations and edge cases — without engineers rewriting prompts for each new scenario. Anything else quickly collapses under production load.
The distinction is critical. Prompt-driven “hacks” can demo well but accrue massive technical debt, locking teams into brittle, human-maintained scripts. Systems with genuine adaptive intelligence, by contrast, have staying power: they adapt quickly to changing inputs, reduce operational overhead, and scale across the full diversity of enterprise workflows.
Real intelligence is resilient, transferable, and self-adapting. Prompt hacks are not.
The true power of enterprise-grade, scalable AI is handing an agent a customer file with a simple instruction — ‘here are the financials’ — and having the system figure it out on its own. Gaps, fragments, inconsistencies, quarters, months, breakdowns, messy footnotes — all handled automatically. No fuss. No babysitting.
The Cost of Getting This Wrong
For CIOs and IT leaders, the danger is clear: investing heavily in “prompt-driven solutions” creates brittle infrastructure that can’t withstand enterprise realities. It locks the organization into human-maintained scripts, creates hidden operational debt, and offers little long-term ROI.
Worse, it distracts from the real challenge: building adaptive systems that are robust against the full chaos of enterprise data.
Yes, LLMs can be coaxed into extracting what you want — with enough hand-holding. But specificity breaks intelligence. The “intelligence” in those cases lies in the human crafting the prompt, not in the machine.
The real frontier is not better prompts. It’s systems that exhibit resilience, adaptability, and true multimodal understanding — systems that can handle whatever the enterprise throws at them without breaking.
That is the difference between AI as a demo and AI as infrastructure.