Welcome to AI Governance Insider

A major AI company is being sued for training models on pirated books.

Meanwhile, procurement teams at Fortune 500s are quietly adding a new question to their vendor security reviews: "Where did your training data come from?"

The answer to that question is now killing deals worth millions.

This week: Why training data provenance just became the new GDPR compliance check, and what Singapore's agentic AI guidance means for anyone deploying autonomous agents.

The Lawsuit That Changed Procurement

Authors are suing NVIDIA for allegedly training its NeMo models on a dataset sourced from shadow libraries (repositories of pirated books).

The claim isn't just "you used our content." It's "you knew the content was stolen, and you used it anyway."

Why this matters beyond NVIDIA:

The lawsuit exposes a dirty secret most AI vendors won't admit: They don't actually know where all their training data came from.

Third-party datasets. Scraped web content. "Publicly available" data that isn't actually licensed for commercial use.

Most vendors inherited training pipelines from research teams who weren't thinking about licensing. They were thinking about benchmarks.

The procurement fallout:

I'm seeing this play out in real vendor reviews right now.

CISOs are asking: "Can you prove your training data is legally licensed?"

Most vendors can't. Or they deflect: "We used publicly available datasets." "We follow industry best practices." "Our legal team reviewed it."

That's not an answer. That's a liability transfer.

What's changing:

Training data provenance is becoming a contract requirement — the same way GDPR compliance became non-negotiable in 2018.

If your vendor can't document the licensing chain for their training data, you're inheriting their legal risk.

And unlike a data breach (which might happen), copyright infringement litigation is starting to feel inevitable.

The vendor question nobody's ready for:

"Can you provide documentation showing that all training data used in this model was obtained with proper licensing or permission from copyright holders?"

If your vendor hesitates, that's your answer.
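
What would a credible answer even look like? At minimum, a per-dataset record of the licensing chain. Here's a minimal sketch of such a record in Python; the schema is my own illustration for discussion, not an industry standard:

  from dataclasses import dataclass
  from datetime import date

  @dataclass
  class DatasetProvenance:
      """One link in a training-data licensing chain (illustrative schema)."""
      dataset_name: str
      source: str                 # where the data was obtained
      license: str                # e.g. "CC-BY-4.0" or a contract reference
      permits_ml_training: bool   # does the license cover commercial training?
      opt_outs_honored: bool      # robots.txt / TDM reservations respected?
      acquired_on: date
      review_artifact: str        # pointer to the legal review on file

  record = DatasetProvenance(
      dataset_name="licensed-news-corpus-v2",
      source="rights holder (direct agreement)",
      license="commercial license, 2024-07 agreement",
      permits_ml_training=True,
      opt_outs_honored=True,
      acquired_on=date(2024, 7, 15),
      review_artifact="legal/reviews/news-corpus-v2.pdf",
  )

A vendor that keeps even this much per dataset can answer the question in minutes. A vendor that can't is telling you something.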

What Regulators Actually Said About AI Agents

While everyone's been focused on training data, Singapore and the UK just released guidance on the thing that actually keeps me up at night: autonomous AI agents making decisions without human oversight.

Singapore's IMDA and PDPC dropped a practical guide specifically for agentic AI systems, making Singapore the first major jurisdiction to address this directly.

The core message: If your AI agent can take actions autonomously, you need accountability mechanisms, transparency about what it's doing, and human oversight on critical decisions.

Not "nice to have." Required.

The UK's ICO followed up with a warning about AI agents making autonomous decisions that affect people's data rights.

Their concern: How do you exercise "the right to human review" when the AI agent already executed the action?

Why this matters now:

Agentic AI isn't a future problem. Companies are already deploying AI agents that:

  • Respond to customer support tickets without human review

  • Make procurement decisions based on vendor data

  • Triage security alerts and auto-remediate

  • Schedule meetings and prioritize tasks based on email content

The promise is efficiency. The risk is that accountability disappears.

The three questions Singapore wants you to answer:

  1. Accountability: Who's responsible when the agent makes a mistake? (Spoiler: It's still you, not the AI.)

  2. Transparency: Can you explain what the agent did and why? Or is it a black box making decisions in your name?

  3. Human oversight: Where are your kill switches? What decisions require human approval? (See the sketch below.)
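
Those three questions map onto one engineering pattern: default-deny permissions, an audit log for every action, and an approval gate on high-stakes actions. Here's a minimal sketch; the action names and the autonomy split are illustrative assumptions, not anything prescribed by IMDA or the PDPC:

  import json
  import logging
  from datetime import datetime, timezone

  logging.basicConfig(level=logging.INFO)
  audit = logging.getLogger("agent.audit")

  AUTONOMOUS = {"draft_reply", "tag_ticket"}          # low-stakes: agent may act
  NEEDS_APPROVAL = {"issue_refund", "close_account"}  # high-stakes: human gate

  def run_action(action: str, payload: dict, approver=None) -> bool:
      """Execute an agent action with audit logging and a human-in-the-loop gate."""
      entry = {"ts": datetime.now(timezone.utc).isoformat(),
               "action": action, "payload": payload}
      if action in NEEDS_APPROVAL and (approver is None or not approver(action, payload)):
          entry["outcome"] = "held for human approval"
      elif action not in AUTONOMOUS and action not in NEEDS_APPROVAL:
          entry["outcome"] = "blocked: not on allowlist"  # default-deny kill switch
      else:
          entry["outcome"] = "executed"
          # ... perform the real side effect here ...
      audit.info(json.dumps(entry))  # transparency: every decision is recorded
      return entry["outcome"] == "executed"

  # No approver supplied, so the refund is held rather than executed:
  assert run_action("issue_refund", {"ticket": 123, "amount": 50}) is False

The point isn't this exact code. It's that "who approved what, and when" becomes a queryable log instead of a shrug.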

What I'm seeing go wrong:

Most companies deploying agentic AI don't have answers to these questions because they didn't design the system with regulatory expectations in mind.

They optimized for "automate everything" without asking "what happens when it automates the wrong thing?"

The result: AI agents with broad permissions, minimal logging, and no human-in-the-loop controls on high-stakes actions.

When regulators audit that setup, it won't go well.

Stat of the Week: The EU AI Act Training Data Mandate

The EU AI Act requires general-purpose AI model providers to publish a summary of the content used to train their models, and starting August 2026 the Commission can enforce it with penalties of up to 15 million euros or 3% of annual worldwide turnover, whichever is higher.

This isn't optional guidance. It's law.

Translation: If your AI vendor operates in or sells to the EU, they'll need to document training data provenance. If they can't, you're either losing that vendor or inheriting their compliance risk.

The procurement teams who add this question now will avoid scrambling in Q3.
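
For a sense of what a published summary might contain: the EU's AI Office has issued an official template, and the sketch below is only my rough approximation of the idea, not that template.

  # Illustrative sketch of a public training-data summary entry.
  # Field names and values are invented; NOT the official AI Office template.
  training_data_summary = {
      "provider": "ExampleAI GmbH",
      "model": "example-model-1",
      "modalities": ["text"],
      "data_sources": [
          {"category": "licensed third-party datasets"},
          {"category": "publicly available web data (crawled)"},
          {"category": "user-provided data"},
      ],
      "copyright_measures": "TDM opt-outs (robots.txt, metadata) honored at crawl time",
      "period_covered": "2023-01 to 2025-06",
  }

  # Procurement can diff this public summary against the vendor's internal
  # provenance records (see the earlier sketch) to spot gaps.
  for src in training_data_summary["data_sources"]:
      print(src["category"])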

The AI Wire

NIST is seeking input on AI agent security. The Center for AI Standards and Innovation (CAISI) published a Request for Information on securing AI agent systems, with comments due March 9, 2026. Focus areas: indirect prompt injection, data poisoning, and models taking harmful actions even without adversarial inputs. If you're deploying AI agents, this RFI signals where security requirements are heading. [Link: NIST RFI]

EU AI Act training data requirements hit in August. All general-purpose AI model providers must publish training data summaries and respect copyright opt-outs, and separate transparency rules will require labeling AI-generated content. This is the regulatory forcing function that will make training data provenance a standard procurement question globally.

The Bottom Line

Training data provenance isn't just a legal problem. It's a procurement problem.

The NVIDIA lawsuit exposed what everyone already suspected: Most AI vendors can't fully document where their training data came from.

That uncertainty is now killing enterprise deals.

Meanwhile, Singapore and the UK are the first regulators to say out loud what CISOs have been worrying about: Agentic AI without accountability mechanisms is a compliance time bomb.

What to do this week:

  • If you're buying AI: Add training data provenance to your vendor security questionnaire. If they can't answer, that's a risk you need to quantify.

  • If you're deploying AI agents: Map out where your agents have autonomy, what they can do without human approval, and whether you can explain their decisions in an audit.

  • If you're a vendor: Start documenting your training data sources now. This question isn't going away, and "we'll get back to you" kills trust.

Not sure what to ask your AI vendor?

I built a checklist, AI Vendor Risk Check: the questions your procurement team should be asking (but probably isn't).

Reply: What's the hardest AI governance question you've gotten from your legal or compliance team? I read everything.

Stay compliant. Stay curious.

Anson

P.S. Next week: Why your AI policy doesn't cover what's actually happening in Slack.

1,020+ CISOs & compliance leaders subscribe weekly. Sponsor this newsletter: [email protected]
