Thought Leadership

The Data Problem at the Center of Benefits AI

The most valuable data in employer-sponsored benefits has never been in a database. Here's why that's true, what it took to fix it, and what eleven years of verified plan data actually unlocks.

4 min read · By Andrew Kimmel

The most valuable data in employer-sponsored benefits has never been in a database. It's been sitting in PDFs, buried in broker files, carrier portals, and HR shared drives, formatted differently by every carrier, never standardized, never aggregated at scale.

That's just how this industry developed, and it's largely stayed that way. Which means most analytics products being built in this space are sitting on weaker data than their builders realize.

I've spent eleven years solving this problem. Here's what I learned.

The data you think exists doesn't

When we started Bnchmrk in 2015, I assumed structured benefits data was out there somewhere. A government database, an industry clearinghouse, something. It wasn't.

Every employer negotiates their own group health plan through a broker or directly with a carrier. The result is a contract that lives in a PDF. There's no central registry. The Form 5500, which developers find quickly because it's public and searchable, is a financial disclosure, not a benefits disclosure. It confirms a plan exists and reports aggregate spend. Deductibles, contribution splits, plan design: none of it. The one field it captures with any real precision is broker compensation.

The closest thing to a standardized document is the Summary of Benefits and Coverage. Carriers are required to produce them, but they're only loosely templated. One carrier formats cost-sharing one way, another does it differently. And contribution data, what the employer pays versus what the employee pays, is never in the SBC. You'll find it in a separate rate sheet or benefit guide, if you can find it at all.

The industry settled for surveys. They're not enough.

Ask HR teams to self-report their plan details, aggregate the responses, publish benchmarks. That's been the standard for decades. It has three problems that matter a lot if you're building on top of it.

Self-reported data is wrong at the source. The HR manager filling out the survey isn't cross-referencing the SPD. The details that matter most, like contribution tiers, out-of-pocket structures, and stop-loss attachment points, are exactly what gets approximated or misremembered.

It's also stale the day it's published. Benefits change at renewal. Last year's survey doesn't reflect what's in force today.

And it's skewed. The employers who complete surveys are larger companies with dedicated HR teams. The dataset doesn't represent the market; it represents the part of the market that had time to fill out the form.

For a benchmarking report, that's manageable. For a model making recommendations, these problems compound.

What it actually takes

The only alternative is source documents: SBCs, rate sheets, benefit guides, processed at scale.

Most people assume it's a parsing problem. The parsing is the easy part. It's a validation problem.

Every document is formatted differently, and fields conflict within the same document: deductibles that don't reconcile with out-of-pocket maximums, contribution structures where the tier math doesn't add up, plan-type classifications that contradict the benefit design. You have to catch those, resolve them against the source, and maintain that discipline continuously as new documents flow in.
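The kinds of checks described above can be sketched as simple cross-field rules. This is a minimal illustration, not the actual pipeline: the field names and the HDHP threshold here are hypothetical, chosen only to show the shape of document-level validation.

```python
def validate_plan(plan: dict) -> list[str]:
    """Return human-readable conflicts found in one parsed plan record.

    The schema (``deductible``, ``oop_max``, ``contributions``, ``plan_type``)
    is invented for illustration.
    """
    issues = []

    # A deductible higher than the out-of-pocket maximum can't be right.
    if plan["deductible"] > plan["oop_max"]:
        issues.append("deductible exceeds out-of-pocket maximum")

    # Contribution tiers should be internally consistent: employer and
    # employee shares must sum to the full premium for each tier.
    for tier, rates in plan["contributions"].items():
        if abs(rates["employer"] + rates["employee"] - rates["premium"]) > 0.01:
            issues.append(f"tier '{tier}' contribution math doesn't add up")

    # A plan-type label should agree with the benefit design, e.g. an HDHP
    # label implies a high deductible (threshold here is illustrative only).
    if plan["plan_type"] == "HDHP" and plan["deductible"] < 1600:
        issues.append("labeled HDHP but deductible is below the HDHP range")

    return issues
```

Each rule is trivial on its own; the hard part is running thousands of them against every incoming document and resolving every flagged conflict back to the source.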

We've been doing this since 2015. It's why anyone who has tried to build this from scratch has either given up or ended up with something much narrower than they planned.

What we built, and what it means for you

We started Bnchmrk in 2015 to solve this for benefits professionals. Consultants needed accurate benchmarking data and the only way to get it was from source documents. That's still true and we're still doing it.

But what we built to power benchmarking is infrastructure. A continuously maintained, document-verified dataset of employer benefit plans covering every state, every major industry, updated daily. And that infrastructure has value well beyond the reports we generate from it.

Here's what that means practically. A model that tells a broker their client is trending 12% higher at renewal is useful. A model that adds "your deductible is at the 25th percentile for manufacturers in Texas, here's what moving to the 50th costs" is what drives a decision. That second sentence requires verified plan design data at the peer group level. That's what the Bnchmrk API delivers.
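The percentile comparison itself is simple arithmetic once verified peer data exists; the data is the hard part. A rough sketch, with made-up peer deductibles standing in for a real peer group (this is not the Bnchmrk API, whose endpoints aren't shown here):

```python
from bisect import bisect_left

def percentile_rank(value: float, peer_values: list[float]) -> int:
    """Where a value falls within a peer distribution, as a percentile."""
    peers = sorted(peer_values)
    return round(100 * bisect_left(peers, value) / len(peers))

# Hypothetical peer-group deductibles for a segment like
# "manufacturers in Texas" (illustrative numbers only):
peer_deductibles = [500, 750, 1000, 1500, 2000, 2500, 3000, 4000]

percentile_rank(1000, peer_deductibles)  # → 25
```

The function is a one-liner; everything interesting lives in `peer_deductibles`, which has to come from verified plan documents for the comparison to mean anything.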

We're talking to companies building in this space about API access, dataset licensing, and data delivery. The best partnerships work both ways: you get access to the dataset, and what you're seeing in your platform feeds back into it. The pool gets better for everyone. If that's relevant to what you're building, let's talk.
