
271 | The AI Test Law Firms Aren’t Running (But Should Be)

Brainyacts #271

Hello to all 9625+ of you from around the globe. Thank you for reading and sharing!

TL;DR

Most law firms rolled out AI without ever defining the exam it needs to pass.

External benchmarks are useful, but they reflect someone else’s idea of “good legal work.” If you don’t articulate and test your own standards (by practice, partner, and matter type), you’re relying on hope, release notes, and vendor demos.

The real frontier in legal AI isn’t bigger models. It’s practitioner-calibrated systems that reflect how lawyers actually exercise judgment.

Deep Dive

A few months ago, I asked a senior partner how their firm had tested the AI tools they’d rolled out firmwide. 

She paused. Not because the question was unfamiliar, but because no one had asked it quite that way.

There had been demos. Vendor pilots. Security and privacy reviews. A handful of impressive examples shown in partner meetings. The firm was comfortable that the technology was capable. What no one had done, however, was define explicitly what acceptable meant for their lawyers, their clients, and their work.

That omission is more common than most firms realize. And it helps explain why AI adoption in elite legal organizations has been slower and more uneven than the hype would suggest.

This isn’t because lawyers are resistant to AI! It’s because we are asking AI to operate in one of the few professions where judgment is not a feature of the work. It is the product.

The Wrong Question: “Is This AI Any Good?”

Most legal AI conversations still revolve around a largely useless question: Is this model good?

We point to benchmarks. Accuracy scores. Comparisons between frontier models. Occasionally, we cite impressive results on standardized legal tasks.

Those benchmarks are not meaningless. In fact, organizations like OpenAI have invested heavily in legal benchmarks precisely because they want more realistic evaluations of model performance, including complex, economically meaningful legal reasoning tasks rather than toy problems.

But even the best public benchmarks answer the wrong question for practicing lawyers.

Knowing that a model performs well on a generalized legal task does not tell you whether it can draft a memo your partners would actually send. Or analyze a fiduciary-duty problem the way your firm would. Or exercise restraint where judgment, not completeness, matters. 

Lawyers don’t accept work product because it is “generally correct.” We accept it because it meets our standards. Standards shaped by experience, risk tolerance, jurisdiction, client expectations, and professional judgment.

Those standards are highly specific. And they vary dramatically even within the same firm. 

Why “Firmwide AI” Keeps Quietly Stalling

Most law firms are partnerships. Many AI strategies, however, are designed as if firms were factories. 

Enterprise AI assumes common workflows, shared language, documented processes, and a willingness to adapt behavior to tools. Law firms have none of those things by default, and for good reason. They run on individual heuristics: the accumulated, often undocumented ways experienced lawyers decide what matters, what doesn’t, and how far to push.

When a firm rolls out one model, one interface, and one “best practice” workflow, the burden shifts to individual lawyers to translate their thinking into the tool’s logic and then translate the output back into their own mental model.

That translation cost is subtle but real. Over time, it overwhelms any efficiency gain.

If lawyers can’t see themselves in the tool, adoption stalls not because the tool is bad, but because it’s exhausting.

This is not a culture problem. It’s an abstraction problem. 

The Test Lawyers Already Run (But Never Write Down)

Every experienced lawyer already has a private acceptance test. It's in their head.

They know when a draft is “basically fine” versus “I’d never send this.” They know which clauses are non-negotiable, which risks are tolerable for this client but not that one, and which mistakes are annoying versus unacceptable. They know when tone matters more than technical precision and when the reverse is true. 

None of this lives in a KM system. Almost none of it is meaningfully templated. And very little of it is shared, even inside the same practice group.

This is not a bug in legal practice. It is legal practice.

The mistake firms are making with AI is assuming that this implicit acceptance test can remain implicit while letting machines generate increasingly lawyer-like work.

Owning the Exam Before Trusting the Student

In other industries, teams have responded to this problem by moving away from generic benchmarks and toward custom, task-specific evaluations often called “golden datasets” or “evals.”

The name doesn’t matter. The habit does.

Before trusting an AI system with real work, they assemble a small but representative set of real examples: the kinds of inputs the system will actually see, paired with outputs that their best humans would accept as “good enough to use.”

For lawyers, this doesn’t require building a massive testing program. It might mean something as concrete as the following: 

Take a complex, fact-heavy Delaware fiduciary-duty scenario, the kind a partner might give to a senior associate, and ask the model to draft a neutral client memo analyzing board authority, stockholder agreements, and conflicted investors. Specify the jurisdiction. Provide primary sources. Impose realistic constraints. Then judge the result the same way you would judge a human draft.

Not: Is it impressive?

But: Would we send this to the client?

That is a benchmark lawyers actually understand.
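For readers who want to see the mechanics, here is a minimal sketch of what recording a few of these examples might look like in practice. The file layout, the field names, and the sample entry are illustrative assumptions on my part, not a standard; the point is simply that the inputs, the constraints, and a partner-approved definition of “acceptable” get written down in one reviewable place instead of living in someone’s head.

```python
# A minimal "golden dataset" sketch: a handful of realistic tasks, each paired
# with its constraints and the acceptance criteria a reviewing partner would apply.
# Field names and the sample entry are illustrative assumptions, not a standard.
import json

golden_examples = [
    {
        "id": "de-fiduciary-001",
        "task": (
            "Draft a neutral client memo analyzing board authority, stockholder "
            "agreements, and conflicted investors under Delaware law."
        ),
        "jurisdiction": "Delaware",
        "inputs": ["fact_pattern.txt", "stockholders_agreement.pdf"],  # primary sources supplied, not assumed
        "constraints": [
            "Cite only the primary sources provided",
            "Flag uncertainty explicitly rather than guessing",
            "Neutral memo tone; no advocacy",
        ],
        # What "good enough to use" means, written down by the partner who signs off.
        "acceptance_criteria": [
            "Identifies the conflicted-investor issue without prompting",
            "Distinguishes board authority from stockholder-level consent rights",
            "Would need only light edits before going to the client",
        ],
    },
]

# Writing it to a file makes the firm's private acceptance test explicit and reusable.
with open("golden_dataset.json", "w") as f:
    json.dump(golden_examples, f, indent=2)
```

A practice group could keep a handful of these, one per recurring matter type, and pull them out every time a tool or model changes.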

The Calibration Prompt: A Simple, Repeatable Discipline 

This is where calibration prompts come in.

A calibration prompt is not about creating the “perfect” instruction. It’s about creating a prompt you can realistically reuse over time, across models and across updates, so you notice changes that matter to your work.

A good calibration prompt has four traits lawyers already care about: 

  1. It defines a real task, not a generic one.

  2. It constrains scope: jurisdiction, time frame, subject matter.

  3. It requires structure: categorization, issue-spotting, prioritization.

  4. It sets expectations around sourcing and uncertainty. 

Run the same prompt:

  • across different models,

  • again after model updates,

  • and periodically over time.

What you’re watching for isn’t perfection. You’re watching for improvement, regression, drift, and changes in behavior that affect reliability.
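If someone on the team wants to automate the repetition, a minimal sketch might look like the following. The run_model function, the model names, and the prompt text are placeholders I am assuming for illustration; wire them to however your firm actually reaches each tool. The only idea that matters is that the prompt stays fixed while the outputs are dated and kept, so drift becomes something you can read side by side rather than something you vaguely remember.

```python
# A calibration-prompt harness sketch: the same fixed prompt, run across models
# and over time, with dated outputs kept for side-by-side comparison.
# run_model(), MODELS, and the prompt text are hypothetical placeholders.
from datetime import date
from pathlib import Path

CALIBRATION_PROMPT = """\
You are assisting a Delaware corporate practice group.
Task: draft a neutral client memo on board authority and conflicted investors.
Jurisdiction: Delaware. Time frame: current law as of the date of this prompt.
Structure: issues presented, short answers, analysis, open questions.
Sourcing: rely only on the attached primary sources; state uncertainty explicitly.
"""

MODELS = ["model-a", "model-b"]  # whatever systems the firm actually licenses


def run_model(model_name: str, prompt: str) -> str:
    # Placeholder: call the vendor tool or API your firm uses and return its text.
    raise NotImplementedError("wire this to your own model access")


def run_calibration(output_dir: str = "calibration_runs") -> None:
    today = date.today().isoformat()
    run_folder = Path(output_dir) / today
    run_folder.mkdir(parents=True, exist_ok=True)
    for model in MODELS:
        response = run_model(model, CALIBRATION_PROMPT)
        # One dated file per model per run; later runs can be diffed against these.
        (run_folder / f"{model}.txt").write_text(response)

# run_calibration() would be invoked on a schedule, and again after any announced model update.
```

Run it weekly, or whenever a vendor announces an update, and keep the folders.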

If you don’t do this yourself, you won’t notice those changes until something goes wrong.

Why External Benchmarks Aren’t Enough (Even the Good Ones)

To be clear, this is not an argument against external benchmarks. In fact, the frontier labs deserve credit for moving beyond toy problems and abstract scores.

A good example is OpenAI’s GDP-style legal evaluations, which test models on realistic, economically meaningful legal tasks rather than multiple-choice questions. If you review the lawyer examples they publish, consisting of complex fact patterns, jurisdiction-specific analysis, and real doctrinal tradeoffs, they’re genuinely quite good. They look much closer to the kind of work lawyers actually do.

And that’s precisely the point.

Even the best public benchmarks are still someone else’s version of “realistic.”

They reflect:

  • someone else’s assumptions about what matters,

  • someone else’s tolerance for risk and ambiguity,

  • someone else’s idea of what a “good” legal memo looks like.

 Those examples are not your clients, your partners, your practice norms, or your internal red lines.

A Delaware fiduciary-duty problem evaluated by an external benchmark may be excellent and still fail your firm’s standards for tone, emphasis, or judgment. A memo that passes a public eval may still be something your partners would heavily rewrite before it ever reached a client.

That gap is unavoidable. And it’s not a flaw in the benchmark.

It’s simply the reality that professional judgment is contextual.

Which leads to the uncomfortable but necessary conclusion:

external benchmarks can tell you whether a model is capable, BUT only you (the lawyer) can decide whether it is acceptable.

That burden does not transfer to vendors. It does not disappear because a benchmark looks sophisticated. And it does not get solved by waiting for better models.

If you don’t define and test your own standards, you will eventually discover the mismatch the hard way after a model update, a prompt change, or a quiet regression that no release note ever mentions. 

AI as Living Work Product

Software engineers use unit tests to ensure new code doesn’t break old features. Custom benchmarks serve the same function for AI.

You tweak a prompt. Change a model. Add a feature. Suddenly something subtle breaks.

Without a benchmark, firms often discover this only after a partner complains, a client notices, or worse, no one notices at all.

At that point, the issue isn’t innovation. It’s risk management.
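To make the unit-test analogy concrete, here is a toy sketch of a regression-style check, assuming the dated output files from the calibration harness above. The specific checks are deliberately crude stand-ins (string matching is no substitute for a partner’s read); the point is only that a failing check surfaces a change before a partner, or a client, does.

```python
# A toy regression check in the spirit of a unit test: read the most recent
# calibration outputs and assert a few minimal, explicit expectations.
# The phrase lists are crude illustrative proxies, not a substitute for partner review.
from pathlib import Path

REQUIRED_PHRASES = [
    "conflicted",    # the memo must at least reach the conflicted-investor issue
    "stockholder",   # and engage with stockholder-level rights
]
FORBIDDEN_PHRASES = [
    "as an ai",      # tone regressions tend to show up in small ways like this
]


def latest_run(output_dir: str = "calibration_runs") -> Path:
    # Folders are named by date, so lexicographic sort puts the newest run last.
    return sorted(Path(output_dir).iterdir())[-1]


def test_latest_outputs_meet_minimum_bar():
    for memo_file in latest_run().glob("*.txt"):
        memo = memo_file.read_text().lower()
        for phrase in REQUIRED_PHRASES:
            assert phrase in memo, f"{memo_file.name}: missing expected issue '{phrase}'"
        for phrase in FORBIDDEN_PHRASES:
            assert phrase not in memo, f"{memo_file.name}: tone regression '{phrase}'"
```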

The Real Frontier for Legal AI

The next phase of legal AI will not be defined by bigger models or more agents.

It will be defined by whether firms are willing to articulate their own standards and insist that AI systems meet them.

That likely means moving away from “one model per firm” toward systems calibrated where judgment actually lives: by practice, by partner, by matter type. Centralized governance remains essential, but calibration cannot be averaged without erasing the very thing clients pay for.

Instead of asking whether an AI tool works, firms should be asking a more familiar question:

Would we accept this work product, in this context, from a junior lawyer without rework or embarrassment?

If the answer is “sometimes,” that’s not a reason to abandon AI. It’s a reason to define the test.

The future of legal AI won’t belong to the firms with the most tools. It will belong to the firms that insist on owning the exam before trusting the student.

That is it for now. Talk soon again.

To read previous editions, click here.


Who is the author, Josh Kubicki?

I am a lawyer, entrepreneur, and teacher. Not a theorist, I am an applied researcher and former Chief Strategy Officer, recognized by Fast Company and Bloomberg Law for my work. Through this newsletter, I offer you pragmatic insights into leveraging AI to inform and improve your daily life in legal services.

DISCLAIMER: None of this is legal advice. This newsletter is strictly educational and is not legal advice or a solicitation to buy or sell any assets or to make any legal decisions. Please be careful and do your own research.