AI Product Manager — Interview Prep Guide

Interview Prep Guide

Your interview is coming.
Here's the fastest path.

The complete AI PM prep guide — from first principles to high-probability interview questions. Walk in sharp, not just prepared.

→ opens The AI PM Role

Pick your prep mode

Interview tomorrow

30-min crash course. Hit the essentials and go.

1 The AI PM Role

2 Crafting Your Narrative

3 Mock Q&A Bank

~30–45 min

1 week out

Full structured path. Everything in order, nothing skipped.

1 AI PM Role + Narrative

2 Agile & ADO

3 AI / ML + Agents

4 Managing AI + Metrics

5 Mock Q&A Bank

~4–5 hours total

Just brushing up

Spot-check your vocabulary. Sharpen specific answers.

1 Glossary

2 Mock Q&A Bank

3 Questions to Ask

~20–30 min

Suggested learning path

AI PM
Role

Start here

Agile &
ADO

Frameworks

AI / ML
& Agents

AI Knowledge

Managing
AI Products

Practice

Mock
Q&A Bank

Finish here

Full path: ~4–5 hours Crash path: ~30–45 min Sections marked deep = longer deep dives

How to use this guide

Say answers out loud

Reading Q&As silently isn't preparation. Speak each answer until it flows naturally — that's the moment you're ready.

Do the Glossary before the interview

Read every term once. Interviewers notice candidates who use precise vocabulary — it signals depth, not just prep.

Personalise your narrative

Use "Crafting Your Narrative" to put your own story into every model answer. Generic answers are forgettable. Specific ones land.

Ask AI to explain any term

Highlight any word on the page and click Ask AI. The coach on the right explains it in PM context — no tab-switching needed.

What this guide covers

Agile & Scrum deep

4 values, 3 roles, 5 ceremonies, DoD, velocity, sprint planning — with interview angles at every step

ADO & User Stories

Epic → Feature → Story → Task/Bug hierarchy. INVEST criteria. Given/When/Then AC with all 4 scenario types

AI / ML / LLM deep

AI vs ML vs GenAI, transformers, RAG, embeddings, fine-tuning, prompt engineering — explained for PMs

Agents & MCP

What AI agents are, multi-agent orchestration, Model Context Protocol — the frontier of AI product design

Managing AI Products

UAT for AI features, monitoring drift, human-in-the-loop design, confidence thresholds, model specs in stories

40+ Mock Q&As

High-probability interview questions with model answers — covering strategy, execution, AI craft, and stakeholder management

Getting Started

The AI PM Role

What separates an AI Product Manager from a traditional PM — and what interviewers are really evaluating.

PM vs AI PM — the key differences

Traditional PM

Translates user needs into features. Manages backlog, roadmap, and stakeholders. Measures success with engagement, adoption, and revenue metrics. Works primarily with engineering and design.

AI Product Manager

All of the above — plus: defines model quality requirements, understands data pipelines, writes AI-specific acceptance criteria, manages confidence thresholds and human-in-the-loop decisions, monitors for model drift, and balances automation with human oversight.

What interviewers are evaluating

1. Can you think in outcomes, not features?

AI PMs who succeed define the problem precisely before reaching for AI as the solution. Interviewers test this by asking "why" repeatedly — why this feature, why this model, why this metric.

2. Do you understand AI's limitations?

Confident AI PMs acknowledge uncertainty. They know AI models fail, drift, and hallucinate. They design products with these limitations in mind — not in spite of them.

3. Can you bridge technical and business?

The core AI PM superpower: translating "our model has 87% precision" into "3 in 10 customers will get a wrong recommendation" — and then deciding what to do about it.

4. Do you think about humans in the loop?

The best AI PMs never fully remove humans from consequential decisions. They design escalation paths, review queues, and feedback mechanisms as first-class product features.

5. Are you comfortable with ambiguity?

AI development is non-deterministic. Requirements change when the model behaves unexpectedly. Interviewers want to see how you adapt and make decisions without complete information.

The AI PM competency stack

Foundation (must have): Product thinking, stakeholder management, Agile delivery, user story writing, data literacy

AI layer (differentiator): Understanding of ML lifecycle, prompt engineering, AI metrics (precision/recall/F1), model evaluation, responsible AI principles

Advanced (stand-out): Agentic AI design, multi-model orchestration, RAG architecture, AI governance frameworks, MLOps awareness

Step 1 of 5 · AI PM Role

Getting Started

Crafting Your Narrative

How to build a compelling "About Me" for an AI PM interview — structured, confident, and specific to your background.

The 3-part structure

Part 1 — Foundation

Where did you start? What core skill did you build? Frame your early career as the foundation for what you do now. Example framing: "I spent [X years] in [domain], where I learned how to [core skill]. That shaped how I approach product problems today."

Part 2 — AI progression

How did you get into AI product work? What have you built or shipped? Anchor with 2–3 specific, quantified achievements. Frame your AI experience as intentional progression, not accidental. Use numbers wherever possible.

Part 3 — Why this role

What specifically draws you to this company and role? Reference something real — a product they've built, a strategy they've announced, a problem in the industry you care about. Generic enthusiasm is forgettable. Specific insight is memorable.

Tips for delivery

Keep it to 2.5–3 minutes

Leave space for follow-up questions. If they want more they'll ask. If you fill all available time, you've lost the dialogue.

Pause between parts

A brief pause after each section signals confidence and lets the interviewer absorb before you continue.

Lead with impact, not chronology

Don't just describe a timeline. Lead with what you achieved and why it matters — then give context for how you got there.

End with energy toward them

Your closing should point forward — toward this company, this product, this problem. Not backward at your resume.

Common mistake: Spending too long on early career and rushing through recent AI work. Interviewers care most about what you've done in the last 2–3 years. Front-load your AI-relevant experience.

Core Framework

Agile & Scrum

The operational backbone of AI product delivery. Know this inside out — interviewers test both theory and practical application.

Core distinction: Agile is the philosophy. Scrum is the framework. ADO/Jira is the tool. Don't conflate them.

The 4 Agile Values (Manifesto, 2001)

1. Individuals and interactions — over processes and tools

People and communication beat rigid workflows. A quick conversation resolves what a 3-page document cannot. A great PM builds relationships that make this possible.

2. Working software — over comprehensive documentation

Shipping something usable beats exhaustive specs. Documentation serves delivery — it doesn't replace it. Write docs that help people build, not docs that prove you thought about building.

3. Customer collaboration — over contract negotiation

Ongoing engagement with users beats fixed-scope contracts. Requirements evolve — and that's expected. The best PMs stay in continuous discovery even after a product ships.

4. Responding to change — over following a plan

A plan is a hypothesis. Reality will differ. The team that adapts fastest wins. Agile doesn't mean no planning — it means building the ability to change the plan when needed.

Agile vs Waterfall

Waterfall

Linear: Requirements → Design → Build → Test → Deploy. Phases are sequential. No working software until the end. Change is expensive — every phase depends on the prior one. Works for construction; rarely works for software.

Agile

Iterative and incremental. Plan, build, test, release in short cycles. Each cycle = working software. Change is expected. Feedback is continuous. Essential for AI products — model accuracy, edge cases, and data quality can't be fully specced upfront.

Product Owner

Owns the product backlog. Defines and prioritises what gets built. Represents the customer and business. Accountable for maximising product value.

As an AI PM, you typically fill this role.

Scrum Master

Facilitates Scrum ceremonies. Removes team blockers. Coaches the team on Agile practices. Not a project manager — serves the team, doesn't direct it.

Development Team

Cross-functional, self-organising group of 3–9 people (developers, QA, data scientists, designers). Commits to sprint goals and owns the "how."

Interview framing: "As PM I act as Product Owner in Scrum — I prioritise the backlog, define acceptance criteria, attend all ceremonies, and am the single accountability point for product decisions in the sprint."

Product Backlog

The ordered list of everything the product might need — features, fixes, improvements, tech debt, research spikes. The PO owns and prioritises it. Never "complete" — evolves as the product and market evolve. Top items are refined and ready; bottom items are rough ideas.

Sprint Backlog

The subset of product backlog items selected for the current sprint, plus the team's plan to deliver them. The team owns this. The PO should not change it mid-sprint — doing so is a Scrum anti-pattern that destroys team trust.

Increment

The sum of all completed work in a sprint — plus all previous increments. Must meet the team's Definition of Done. Should be potentially shippable every sprint, even if the PO decides not to release it publicly yet.

Critical distinction: Acceptance Criteria (AC) is story-specific — it defines "done" for one story. Definition of Done (DoD) is team-wide — it applies to every story and includes code review, testing, documentation, deployment to staging, etc.

Sprint Planning

When: Start of sprint · ~2 hrs per sprint week

Part 1 — What: PO presents refined backlog items. Team asks clarifying questions. Team selects what to commit to.
Part 2 — How: Team breaks stories into tasks, estimates effort, flags dependencies.
Output: Sprint Goal + committed Sprint Backlog.

PM responsibility: Arrive with refined, AC-complete stories. Never bring an unrefined story to planning — it derails the entire ceremony.

Daily Standup

When: Every day · 15-minute timebox

Three questions per team member:
1. What did I complete yesterday?
2. What will I work on today?
3. What is blocking me?

Not a status report to management. The team synchronising with itself. Blockers get resolved offline — not solved in the standup.

PM responsibility: Listen for blockers needing your action — missing requirements, pending decisions, cross-team dependencies.

Sprint Review

When: End of sprint · ~1 hr per sprint week

Team demos completed work to stakeholders. Working software only — no slides about what "will be" built. Stakeholders give feedback. PO updates backlog based on what they learn.

PM responsibility: Facilitate the demo, articulate business value of what was delivered, gather structured feedback, translate feedback into backlog updates.

Sprint Retrospective

When: After Sprint Review · ~45 min per sprint week

Team inspects its own process — not the product. Three questions:
1. What went well?
2. What didn't go well?
3. What will we improve next sprint?

Output: 1–3 concrete, actionable improvements.
PM responsibility: Participate fully. Own retro actions that involve requirements quality, documentation, or stakeholder communication.

Backlog Refinement / Grooming

When: Mid-sprint · 1–2 hrs/week (not an official Scrum event but universally practiced)

PO and team review upcoming backlog items — clarify requirements, split large stories, estimate effort, identify dependencies. Goal: the top of the backlog is always sprint-ready.

PM responsibility: Own this session. Come prepared with written stories, AC drafted, mockups or data samples ready. This is where your PM craft actually happens.

Definition of Done (DoD)

A shared, team-wide checklist that defines when a story is truly complete. Typical DoD items: code written and reviewed, unit tests passing, integration tested, documentation updated, acceptance criteria verified, deployed to staging environment, PO sign-off received.

For AI features, add to the DoD: model evaluated on held-out test data, confidence threshold validated, human review queue tested, monitoring/alerting instrumented, data drift baseline recorded.

Story Points & Velocity

Story Points

A relative measure of effort, complexity, and uncertainty. Uses Fibonacci sequence: 1, 2, 3, 5, 8, 13, 21. A 5-point story is roughly twice as complex as a 3-pointer. The team calibrates together — a "1" is the simplest possible story for that specific team.

Velocity

Average story points completed per sprint across the last 3–5 sprints. Used for forecasting only — not a performance metric. If velocity is 40 points and the refined backlog has 200 points, you have ~5 sprints (~10 weeks) of work ahead.

Interview framing: "I don't treat velocity as a measure of team performance. I use it as a forecasting tool to set realistic stakeholder expectations — and as an early signal when something external is blocking the team."

Common question

Describe how you've applied Agile in your work.

Anchor on the PM/PO role and concrete practices. Mention backlog grooming, sprint planning, demos with stakeholders, and retrospective-driven improvement. If you have an AI context, highlight how Agile was essential for iterating on model quality — requirements that can only be fully validated once the model runs on real data.

Common question

What do you do when a story isn't ready for sprint planning?

"I pull it out — full stop. An unrefined story in a sprint leads to mid-sprint clarification storms, scope creep, and missed commitments. My rule: no story enters a sprint without clear acceptance criteria, defined scope, and no unresolved dependencies. That's what refinement sessions are for. If something urgent comes up at the last minute, I either time-box a spike to investigate it, or defer it to the next sprint."

Common question

How do you balance speed with documentation in Agile?

"Documentation serves a reader. I ask: who reads this, when, and what decision does it support? That determines the appropriate level of detail. For a fast-moving PoC, a one-page problem statement and success criteria is enough. For a production AI feature in a regulated industry, I write a full BRD with model specs, data governance notes, and audit trail requirements. Agile doesn't mean less documentation — it means right-sized documentation delivered at the right moment."

Step 2 of 5 · Frameworks

Core Framework

ADO & User Stories

The work item hierarchy that turns strategy into shipped software. Know it cold — and be able to write great stories in the interview itself.

Work item hierarchy

Epic

Large business initiative spanning multiple sprints or quarters. Tied to a strategic objective. Example: "AI-powered customer self-service — reduce support contact volume by 25%"

Feature

Distinct, deliverable capability within an Epic. Can span 2–4 sprints. Has its own business value. Example: "Intent detection — classify customer queries into 12 categories with ≥90% accuracy"

User Story

Small unit of user value. Fits in one sprint. Written from user's perspective. Has Acceptance Criteria. Example: "As a customer, I want my billing query routed to the right team automatically so I don't have to repeat myself."

Task

Specific piece of work within a story. Owned by an individual. Tracked in hours. Examples: design intent detection prompt, build routing API, write unit tests, UAT with 5 users.

Bug

Deviation from expected behavior. Must include: steps to reproduce, expected result, actual result, severity, environment. Example: "Query classified as billing when it is a technical issue — Severity: High — Reproducible 4/5 attempts."

Writing user stories — the INVEST test

I — Independent

Can be delivered without waiting for another unfinished story

N — Negotiable

Details evolve through conversation — not a fixed contract

V — Valuable

Delivers standalone value to the user or business

E — Estimable

Team can size it — if not, needs more refinement

S — Small

Fits within one sprint — if not, split it

T — Testable

Acceptance Criteria can be written to verify it

Acceptance criteria — Given / When / Then

Write 4 scenario types — not just 1

Most PMs only write the happy path. Writing all four scenarios is what separates a junior PM from a senior one.

// Type 1 — Happy path (expected behavior) Given a document is uploaded in a supported format When the AI extraction model processes it Then all required fields are extracted with ≥90% confidence within 3 seconds // Type 2 — Edge case (unusual but valid input) Given the model extracts a field with confidence below the threshold When the result is generated Then the field is flagged for human review — NOT auto-populated // Type 3 — Failure case (graceful degradation) Given the uploaded file is an unsupported format When extraction is attempted Then the system returns a clear error message and does not process the file // Type 4 — Non-functional (performance, security, privacy) Perf pipeline handles 100 concurrent uploads without response time degradation Data no document content is stored beyond the processing window per data governance policy

Story splitting techniques

By workflow step

Split "upload, process, download" into three separate stories — one per step

By data type

Split "extract from documents" into Story 1: PDFs, Story 2: scanned images, Story 3: Word docs

Happy path first

Ship the happy path in Sprint 1. Handle edge cases and errors in Sprint 2.

By user role

Split by persona when different users have different experiences of the same feature

Core Framework

Prioritisation

How to make defensible decisions about what to build next — using frameworks, not instinct.

RICE scoring

Reach × Impact × Confidence ÷ Effort

Reach: How many users affected per sprint/quarter?
Impact: How much does this move the needle? (1=minimal, 2=low, 3=medium, 4=high, 5=massive)
Confidence: How confident are you in your estimates? (0–100%)
Effort: How many person-months to build?

Higher RICE score = higher priority. Compare items on the same scale — absolute numbers matter less than relative ranking.

Kano Model

Basic needs (Must-haves)

Features users expect as table stakes. Absence causes dissatisfaction. Presence doesn't delight — it just avoids upset. Prioritise these first, always.

Performance needs

Features where more is better — linearly. More accuracy, faster speed, lower cost. Score these with RICE to rank relative value.

Delighters

Unexpected features that create excitement. Users didn't know they wanted them. Time-box these to innovation sprints — don't over-invest before validating.

Interview tip: "I use Kano to categorise first, then RICE to rank within categories. Basic needs get prioritised regardless of RICE score — the business can't function without them."

MoSCoW — for release scoping

Must have

Non-negotiable for the release. Without these, the release fails or doesn't ship.

Should have

Important but not critical for this release. Can be deferred to the next sprint with minimal impact.

Could have

Nice to have if time allows. Won't cause significant issues if dropped.

Won't have (this time)

Explicitly out of scope for this release. Not forever — just now. Documenting "Won't have" is as important as documenting "Must have."

OKRs — outcome-driven prioritisation

Objective: Qualitative, inspirational goal — "Become the most trusted AI assistant for enterprise finance teams"
Key Results: 3–5 measurable outcomes — "Reduce time-to-answer by 40%", "Achieve NPS > 50 among power users", "95% of responses pass accuracy audit"

PM discipline: Before adding anything to the sprint backlog, ask: "Which Key Result does this move?" If the answer is none, the feature needs a stronger case to exist.

Common trap: Treating delivery OKRs ("ship 3 features") as outcome OKRs ("reduce user effort by 30%"). Delivery is a means, not the goal.

Core Frameworks

Estimation Methods

How Agile teams estimate work — from sprint-level story points to roadmap-level T-shirt sizing. Interviewers test both your understanding of the techniques and your judgement about when to use each.

Why estimation matters in Agile

Estimation in Agile is not about predicting the future with precision — it is about creating a shared understanding of effort and complexity. Good estimates enable sprint planning, capacity management, roadmap forecasting, and trade-off conversations with stakeholders. The goal is relative accuracy, not absolute precision.

Key rule: Never estimate a story that isn't refined. If the team can't estimate it, that is a signal the story needs more refinement — not more pressure to guess.

Story Points & the Fibonacci series

Why Fibonacci and not 1, 2, 3, 4, 5...?

The Fibonacci sequence (1, 2, 3, 5, 8, 13, 21...) grows non-linearly — gaps between numbers get bigger as they increase. This reflects how estimation uncertainty works: the difference between a 1-point and 2-point story is meaningful, but the difference between a 20 and 21-point story is not. Large stories are inherently uncertain — Fibonacci forces the team to acknowledge that with wider gaps at the top.

1 — 2

Trivial / Very small
Well-understood, minimal complexity, few unknowns.

3 — 5

Small / Medium
Clear requirements, some complexity. Standard feature work.

8 — 13

Large / Very large
Significant complexity. 13+ = must split before entering a sprint.

What story points actually measure

Effort — how much work is required?
Complexity — how difficult is the problem to solve?
Uncertainty — how much don't we know yet?

Story points are relative and team-specific. You cannot compare points across different teams.

Common mistake: Converting story points to hours. Points include complexity and uncertainty which don't convert to time linearly. Use velocity for sprint capacity forecasting only — never as a performance target.

Planning Poker — the estimation ceremony

Step 1 — PO reads the story

Product Owner reads the story aloud, answers clarifying questions. Only proceed when questions are resolved.

Step 2 — Everyone estimates privately

Each team member selects a Fibonacci card silently. No one reveals yet — prevents anchoring bias where the first speaker pulls everyone else's estimate.

Step 3 — Simultaneous reveal

All cards revealed at once. Close estimates (3, 3, 5) → take consensus and move on. Spend no more than 5 minutes per story.

Step 4 — Discuss divergence

Wide divergence (2 and 13) means different assumptions. The lowest and highest estimators explain their reasoning — this surfaces hidden dependencies, technical risks, or missing requirements.

The "?" card

Means "I don't have enough information to estimate." The story goes back to refinement. Never force an estimate on a story that isn't understood.

PM insight: Wide divergence in Planning Poker is not a problem — it is a feature. It surfaces hidden assumptions. A story that gets a 2 and a 13 simultaneously needs more refinement, not more voting.

T-shirt sizing — for Features and Epics

T-shirt sizing is a high-level estimation technique for Features, Epics, and roadmap items — before they are broken into User Stories. It trades precision for speed. Instead of debating 13 vs 21 points, you simply ask: "Is this a small, medium, or large piece of work?" Set a reference Medium first — everything else is sized relative to it.

XS — Extra Small

1–2 days. Tiny scope. Can be done as one story in a single sprint.

S — Small

3–5 days. Clear scope, low complexity. One sprint, 2–3 stories.

M — Medium (the anchor)

1–2 weeks. The reference size — everything is sized relative to this.

L — Large

2–4 weeks. Multiple sprints. Needs breakdown before entering a sprint.

XL — Extra Large

1–2 months. Complex, multi-team. Needs a spike before estimation.

XXL — Epic-scale

Quarter or more. This is an Epic — must be decomposed into Features first.

How they connect: A Feature sized "M" in T-shirt sizing typically breaks into 3–5 User Stories totalling 15–25 story points. T-shirt sizing for roadmap planning; story points for sprint planning.

Other estimation techniques

Three-point estimation (for high-risk tasks)

Optimistic (O) + 4 × Most Likely (M) + Pessimistic (P) ÷ 6
Used for tasks with significant uncertainty — integrations, new technology, poorly understood domains. Gives a weighted average that accounts for worst-case scenarios.

Affinity / Bucket sizing (for large backlogs)

Write each story on a card. Team silently sorts into Fibonacci buckets. Discuss only items people placed differently. Can estimate 50+ stories in under an hour — great for early-stage backlog sizing.

Interview Q&A

Common question

How do you run a story point estimation session?

"I use Planning Poker. Stories must be refined with clear acceptance criteria before we estimate — we never estimate vague stories. During the session, I read the story, take questions, then everyone reveals their Fibonacci card simultaneously. Close estimates (3, 3, 5) → consensus and move on. Wide divergence (2 and 13) → the outliers explain their reasoning. That conversation almost always surfaces a hidden dependency or missing requirement. The '?' card means the story goes back to refinement. Five minutes maximum per story."

Common question

What is the difference between story points and T-shirt sizing?

"T-shirt sizing is for high-level, early-stage estimation of Features and Epics on the roadmap — fast and directional. Story points are for sprint-ready User Stories during Planning Poker — relative, Fibonacci-based, team-specific. They operate at different levels. I use T-shirt sizes to plan the quarter and story points to manage sprint capacity. A feature sized 'Medium' in T-shirt terms typically breaks down into 15–25 story points once refined into User Stories."

Core Frameworks

Product & Team KPIs

Two categories: product KPIs tell you whether you're building the right thing — team KPIs tell you whether you're building it well. Know both, and know the difference between them.

Interview framing: "I track two categories: product KPIs that measure whether we're delivering value to users, and delivery KPIs that measure whether our team process is healthy. Product KPIs are the goal. Delivery KPIs are the engine."

Product KPIs — are we building the right thing?

Adoption Rate

Formula: Active users ÷ eligible users
% of eligible users actively using the feature. High accuracy but low adoption = UX or trust problem, not a model problem. Critical for AI features where users often resist AI assistance even when it's accurate.

Retention Rate

Formula: Users retained ÷ users at start of period
% of users who continue using the product over time. Rising retention = ongoing value delivered. Dropping retention = solved once but not sticky.

Net Promoter Score (NPS)

Formula: % Promoters − % Detractors (scale: −100 to +100)
Would users recommend the product? A leading indicator of churn and growth. Scores above 50 are excellent for enterprise software. Detractors (0–6) are most important to understand.

Customer Satisfaction Score (CSAT)

Formula: Satisfied responses ÷ total responses (1–5 stars)
How well the product met expectations in a specific interaction. More granular and moment-specific than NPS. Used heavily in CX and contact centre products.

Churn Rate

Formula: Users lost ÷ total users per period
% who stopped using the product. A lagging indicator — rising churn means something went wrong weeks or months earlier. Monitor NPS and session frequency as leading indicators to catch churn before it appears.

Time to Value (TTV)

Formula: Time from onboarding → first meaningful outcome
How long until a new user or client experiences core product value? Shorter TTV = lower churn risk. For enterprise AI products, TTV includes data ingestion, model calibration, and user training.

Revenue Metrics

ARR / MRR: Annual / Monthly Recurring Revenue — the business engine and growth tracker
ARPU: Average Revenue Per User — how much each user contributes
LTV: Lifetime Value — total revenue expected over the customer relationship. Compare to CAC (Customer Acquisition Cost) to assess unit economics.

Team / Delivery KPIs — are we building it well?

Velocity

Formula: Average story points completed per sprint (last 3–5 sprints)
How much work the team can reliably complete per sprint. Use for forecasting only — never as a performance target. Pressuring a team to increase velocity leads to inflated estimates, not more output.

Sprint Burndown

Remaining work plotted daily vs. an ideal completion line. Actual line above ideal = behind schedule. Flat lines = blocked — needs immediate attention. A consistently flat burndown mid-sprint is a blocker signal, not a team performance issue.

Sprint Goal Completion Rate

Formula: Sprints where sprint goal was met ÷ total sprints
More meaningful than velocity. A team that meets its sprint goal with 80% of planned points is healthier than one hitting 100% of points but missing the goal every sprint. The goal is the point — points are the means.

Cycle Time

Formula: Time from work started → work done (per story)
How long a story flows through the system once work begins. Long cycle times signal blockers, context switching, or stories that are too large. Compare cycle time to story point size — large cycle time on small stories = hidden friction.

Lead Time

Formula: Time from requirement created → work done
Total end-to-end responsiveness including backlog wait time, refinement, planning, and development. Lead time > cycle time. The gap between them is backlog wait time — often the biggest opportunity to reduce.

Defect Rate / Bug Escape Rate

Formula: Bugs found in production ÷ stories shipped
A rising defect rate after a velocity increase = team is moving fast but cutting quality corners. For AI products, also track model error rate — % of AI outputs that are incorrect or flagged by human reviewers.

Interview Q&A

Common — any interviewer

How do you measure whether your product is successful?

"I separate product KPIs from delivery KPIs. Product KPIs — adoption, retention, CSAT, NPS — tell me whether we're building something users value. Delivery KPIs — velocity, sprint goal completion, cycle time, defect rate — tell me whether the team is working effectively. I define product KPIs at requirements stage, before we build, so everyone agrees on what success looks like. Post-launch I track them weekly and set threshold alerts so I'm not discovering problems in a quarterly review."

Common

What is the difference between velocity and sprint goal completion rate?

"Velocity measures output — story points completed per sprint. Sprint goal completion rate measures outcome — did the team achieve what it set out to achieve? A team can hit 100% of planned points but miss the sprint goal if they over-indexed on low-priority stories. I care more about sprint goal completion because it measures whether we're moving in the right direction, not just moving fast. Velocity is a forecasting tool; sprint goal completion is a health indicator."

Core Frameworks

Documentation Types

Every document serves a specific audience and decision. Before writing any document, ask: "Who reads this, when, and what decision does it enable?" That determines the right format and level of detail.

Key principle: Agile does not mean no documentation. It means right-sized documentation delivered to the right person at the right time. Write documents that people actually use — not documents that prove you thought about building.

Strategic & discovery documents

BRD — Business Requirements Document

Owner: BA / PM · Audience: Business stakeholders, project sponsors
Purpose: Captures the business need and context — why the project exists, what problem it solves, who the stakeholders are. More business-facing than a PRD. Written before product design begins.

Contains: Business objectives · Stakeholder analysis · Current state pain with evidence · Desired future state · Business constraints · ROI / benefit case · High-level scope · Success criteria from a business perspective

PRD — Product Requirements Document

Owner: PM · Audience: Product, engineering, design, data science
Purpose: The PM's primary artifact. Defines what to build and why — the user problem, proposed solution, success metrics, constraints. Does not specify how to build it — that belongs to engineering.

Contains: Problem statement · User personas · Goals & non-goals · Feature requirements · Success metrics · Dependencies · Open questions · Out of scope · For AI features: Model requirements section with confidence thresholds, input/output schema, fallback design, monitoring requirements

Blueprint — Solution Blueprint

Owner: PM / Solution Architect · Audience: Cross-functional team, technical leadership
Purpose: Bridges business requirements and technical design. Shows how the solution will be structured — key components, data flows, integrations, architecture — without being a full technical spec. Used in enterprise AI engagements to align all parties before development begins.

Contains: Solution overview · Architecture diagram · Key components · Integration map · Data flows · Technology choices · Assumptions & constraints · Risk areas

Delivery & execution documents

User Story (+ Acceptance Criteria)

Owner: PM / PO · Audience: Development team, QA
Purpose: The atomic unit of delivery in Agile. "As a [persona], I want [action], so that [outcome]." Must fit in one sprint with testable AC in Given/When/Then format covering all four scenario types.

Contains: Persona · User need · Business outcome · AC: happy path, edge case, failure case, non-functional · Story points · Dependencies · DoD reference · For AI: model quality AC and monitoring AC

FSD — Functional Specification Document

Owner: BA / PM · Audience: Engineering, QA, UAT team
Purpose: Describes how the system should behave from a functional perspective without specifying technical implementation. Sits between the BRD and the SDD. Used in regulated industries and large enterprise projects requiring a full behavioural specification before development.

Contains: Functional requirements per use case · System behaviour · UI/UX requirements · Data inputs & outputs · Error handling · Validation rules · Business rules · Reporting requirements

SDD — System Design Document

Owner: Engineering / Architect · Audience: Development team, DevOps, security
Purpose: The engineering team's artifact — describes how the system will be built technically. PMs review but do not author this. Your role is to verify it accurately reflects the functional requirements you specified.

Contains: Technical architecture · Database schema · API contracts · Infrastructure design · Security design · Performance considerations · Deployment approach

PDD — Process Design Document

Owner: BA / Operations / PM · Audience: Operations, change management, training
Purpose: Documents the business process — current state (as-is) and future state (to-be). Identifies where automation or AI can intervene. Critical for enterprise AI products where workflows must be mapped before automation can be designed.

Contains: Current-state process map · Pain points & bottlenecks · Future-state process map · Role & responsibility changes · AI automation opportunities · Exception handling · Change impact assessment

AI-specific documents

Model Spec — AI Model Specification

Owner: PM + Data Scientist (co-authored) · Audience: Data science, engineering, QA
Purpose: Defines the AI model's requirements — inputs, outputs, quality thresholds, evaluation criteria, fallback behaviour. PM writes the business requirements; data scientist fills in the technical design. Must be created before model development begins.

Contains: Input data schema · Output format · Confidence threshold · Precision/recall targets · Evaluation dataset spec · Fallback design · Human review routing logic · Retraining trigger criteria · Drift monitoring requirements

SOP — Standard Operating Procedure

Owner: Operations / PM · Audience: Operations team, human reviewers, quality managers
Purpose: Step-by-step instructions for the human review process in AI products — what reviewers do when the model routes a case to the review queue, how they make decisions, and how they record corrections. Without an SOP, the human-in-the-loop design breaks down in practice.

Contains: Step-by-step review process · Decision criteria for AI outputs · Escalation paths · How to record corrections (feeds retraining) · Quality checklist · SLA for review turnaround

The document hierarchy — how they connect

Strategy: BRD → defines the business problem and scope
Product: PRD + Blueprint → defines what to build and how it fits together
Process: PDD → maps the workflows the product will change
Delivery: FSD → functional behaviour · User Stories → sprint delivery units
Technical: SDD → how engineering builds it (PM reviews, doesn't author)
AI-specific: Model Spec → AI quality requirements (PM co-authors)
Operations: SOP → how humans operate and govern the AI system in production

As an AI PM you will: Author BRD, PRD, User Stories, PDD. Co-author Model Spec. Review FSD and SDD. Ensure SOP exists before any AI feature goes to production — without it, the human review queue has no process and the HITL loop breaks down silently.

Interview Q&A

Very common — BA-background interviewers

What is the difference between a BRD and a PRD?

"A BRD is business-facing — it captures the business problem, stakeholders, constraints, and the case for why the project exists. It answers 'why are we doing this?' A PRD is product-facing — it captures the proposed solution, user personas, feature requirements, and success metrics. It answers 'what are we building and for whom?' The BRD comes first and informs the PRD. For AI products I extend the PRD with a Model Spec section covering confidence thresholds, input schema, fallback design, and monitoring requirements — because these are product decisions that must be defined before model development begins."

Common

Walk me through the documentation you produce on an AI feature.

"My documentation follows the delivery lifecycle. In discovery I produce a BRD — problem statement, stakeholder analysis, current state pain, desired outcome. In solution design I write the PRD with a Model Spec section for AI requirements, and a PDD to map current and future-state workflows. In delivery I write User Stories with Given/When/Then AC across all four scenario types. Before production I ensure an SOP exists for the human review queue — without it, reviewers have no process and the HITL loop breaks down. I review the SDD but don't author it — I verify it accurately reflects what I specified in the functional requirements."

AI Knowledge

AI / ML / LLM Fundamentals

What every AI PM must understand — explained for product thinkers, not data scientists.

Artificial Intelligence (AI)

The broad field of building systems that simulate human intelligence — reasoning, learning, perception, and decision-making. Everything below is a subset of AI.

Machine Learning (ML)

Systems that learn patterns from data rather than following hand-coded rules.
Supervised learning: learns from labeled data (classification, regression)
Unsupervised learning: finds hidden structure without labels (clustering, anomaly detection)
Reinforcement learning: learns by maximising reward (pricing, game AI, routing optimisation)

Deep Learning

A subset of ML using neural networks with many layers. Powers modern AI — image recognition, speech, natural language understanding. Requires large datasets and compute.

Generative AI (GenAI)

AI that generates new content — text, images, code, audio, video — rather than just classifying or predicting. Powered by foundation models trained on enormous datasets. Examples: GPT-4, Claude, Gemini, Stable Diffusion.

NLP — Natural Language Processing

AI that understands and generates human language. Underpins chatbots, document extraction, sentiment analysis, intent classification, translation — core capabilities of most enterprise AI products.

Artificial Intelligence (AI)

The broadest field — building systems that simulate intelligent behaviour: reasoning, perception, decision-making, planning. Everything below is a subset.

Machine Learning (ML)

Systems that learn patterns from data rather than following hand-coded rules. Three main paradigms: supervised, unsupervised, reinforcement learning.

Deep Learning (DL)

Multi-layer neural networks that learn representations automatically. Powers image recognition, speech, and natural language. Requires large datasets and significant compute.

DL ∩ NLP

LLM

Large Language
Models

Conv. AI

Conversational
AI

Natural Language Processing (NLP)

AI for understanding and generating human language. Underpins chatbots, translation, document extraction, sentiment analysis — core to most enterprise AI products.

Also under AI → Expert Systems Computer Vision Robotics & Planning Knowledge Representation

AI ⊃ ML ⊃ Deep Learning · NLP overlaps ML · LLMs and Conversational AI live at the intersection of Deep Learning and NLP

What is an LLM?

A Large Language Model is trained on vast text using transformer architecture. It learns to predict the next token in a sequence. Doing this well, at massive scale, produces models capable of writing, reasoning, coding, and more.

Key concepts for PMs

Tokenisation: Text is split into tokens (roughly words/subwords) before processing
Context window: How much text the model can "see" at once — larger = better for long documents
Temperature: Controls randomness — 0 = deterministic, 1+ = creative/unpredictable
Fine-tuning: Adapting a pre-trained model for a specific domain with labeled examples
Embeddings: Numerical vector representations of text capturing semantic meaning — powers search and RAG

The attention mechanism (simplified)

The core innovation in transformers. The model learns which parts of the input to "attend to" when generating each output token. This is why LLMs understand context across long documents — they're not just looking at the words immediately before; they're weighing the entire context.

How to explain LLMs in interviews: "An LLM learns statistical patterns across billions of text examples. When you prompt it, you're activating those learned patterns to generate a contextually relevant response. The quality depends on training data quality, model size, and how well you've framed the prompt."

How an LLM processes your prompt: tokenisation converts text to IDs, embeddings map them to vector space, attention layers compute relationships, then output tokens are decoded back to text.

RAG — Retrieval-Augmented Generation

A pattern that combines an LLM with real-time knowledge retrieval. Instead of relying on the model's training data alone, RAG retrieves relevant documents from a vector database and feeds them as context to the LLM at query time.

Why it matters for PMs: RAG is how enterprise AI products stay accurate and current. It's the architecture behind intelligent search, knowledge assistants, and real-time agent guidance — key capabilities in most AI PM roles.

Prompt Engineering

Designing inputs to LLMs to reliably produce desired outputs. Key techniques:
Zero-shot: Just the instruction, no examples
Few-shot: Instruction + 2–5 labeled examples
Chain-of-thought: Ask the model to reason step-by-step before answering
System prompt: Instructions that define the model's role, constraints, and persona

PM responsibility: Define the prompt architecture and evaluation criteria. You don't write every prompt — but you own what "good" looks like.

Model Drift

When a model's performance degrades over time because real-world data patterns shift away from the training distribution. PMs must instrument monitoring from Day 1 — define drift alerts as part of the production Definition of Done, not as a post-launch afterthought.

Hallucination

When an LLM confidently produces factually incorrect or fabricated information. A fundamental limitation of current LLMs. PMs must design products with hallucination in mind — through RAG (grounding in real data), human review gates for high-stakes outputs, and user-facing confidence indicators.

RAG augments the LLM with retrieved context: the query is embedded, similar docs are fetched from a vector DB, and both are passed to the LLM — giving grounded, up-to-date answers.

1. Problem Definition

Is ML the right solution? Define the prediction task precisely. What are you predicting? What's the input data? What's the output? What does "correct" mean?

2. Data Collection & Labeling

Gather training data. Label it (for supervised learning). Data quality here determines model quality ceiling — garbage in, garbage out. PMs often under-invest here and over-invest in model selection.

3. Model Training & Evaluation

Train model on labeled data. Evaluate on held-out test set. Measure precision, recall, F1. Iterate on features, architecture, hyperparameters. PMs define the evaluation criteria and the acceptable performance thresholds.

4. Deployment & Monitoring

Deploy to production with canary/shadow rollout. Instrument monitoring for accuracy drift, latency, error rates. Set alerts. Plan the retraining cadence. PMs own the production success criteria.

5. Feedback Loop & Retraining

Human corrections, user feedback, and production data feed back into retraining. This is where AI products compound in value — the product gets smarter as it's used. Design for this loop from Day 1.

Common question

How do you decide when to use AI vs a simpler rule-based solution?

"I ask three questions: Is the pattern too complex for rules? Do we have enough data to train reliably? Is the cost of errors acceptable given the use case? If rules can solve 90% of cases cleanly and the remaining 10% are low-stakes, rules are probably better. AI shines when the pattern space is genuinely complex, when volume is high enough for statistical learning, and when we have a feedback loop to improve over time."

Common question

A data scientist says the model is ready. You disagree. What do you do?

"I'd first make sure we're measuring the same thing against the same success criteria — which should have been defined at requirements stage. If the model meets the technical metrics but I'm concerned about real-world behavior on edge cases or corner cases the test set doesn't cover, I'd push for an expanded UAT on a broader sample. The technical metric and the business outcome aren't always the same thing — my job is to make sure we're optimising for the right one."

Common question

How do you explain AI capabilities and limitations to a non-technical stakeholder?

"I use analogies grounded in their domain. Instead of 'the model has 87% precision', I say 'for every 100 recommendations it makes, roughly 13 will be off the mark — here's how we've designed the product to catch and correct those before they reach the customer.' I always pair a limitation with the mitigation design, so stakeholders understand we've planned for the failure mode."

Step 3 of 5 · AI Knowledge

AI Knowledge

Agents & MCP

The frontier of AI product design. AI agents are moving from demos to enterprise deployment — AI PMs need to understand how to design, scope, and govern them.

What is an AI Agent?

An AI agent is a system that perceives its environment, reasons about a goal, takes actions (uses tools, calls APIs, searches the web, writes and runs code), evaluates the result, and iterates until the goal is achieved — without requiring a human to direct every step.

The key difference from a chatbot: a chatbot responds once. An agent loops — observe → think → act → observe → think → act — until done or until it reaches a defined stopping condition.

Agent anatomy

Brain (LLM): Reasoning and planning
Tools: APIs, search, code execution, databases
Memory: Short-term (context window) and long-term (vector store)
Orchestrator: Manages the reasoning loop and tool calls

Chatbot vs Agent

A chatbot tells you your flight is delayed.
An agent checks alternatives, cross-references your calendar, books the best option, and notifies your hotel — then reports back with what it did.

Multi-Agent Systems

Multiple specialised agents collaborating — each with a defined role. An orchestrator delegates tasks to worker agents, collects results, and assembles the final output.

Example — customer service resolution:
Agent 1 (Classifier) → detects issue type and urgency
Agent 2 (Retriever) → fetches account data and relevant KB articles
Agent 3 (Drafter) → generates response recommendation
Agent 4 (QA Checker) → validates for accuracy and compliance
Agent 5 (Router) → decides: send to human or auto-resolve?

Key PM consideration: In multi-agent systems you must design the handoff protocol — what data passes between agents, what constitutes a failure at each stage, and when to escalate to human review. This is a product design decision, not an engineering implementation detail.

MCP — Model Context Protocol

What is MCP?

An open standard that allows AI models to securely connect to external tools, databases, and services through a standardised interface. Think of it as USB for AI integrations.

Before MCP, every AI tool integration required custom API code — build once for Salesforce, rebuild for Jira, rebuild again for Gmail. With MCP, any compatible tool exposes a standardised interface — one integration protocol, many tools. Agents become composable.

MCP Server

The service that exposes a tool's capabilities through the MCP protocol. A Salesforce MCP server exposes "read contact", "create opportunity", "update deal stage" — all through a standard interface the agent can discover and call.

MCP Client (the agent)

The AI system that connects to MCP servers to discover available tools and call them. It doesn't need to know the implementation details of each tool — just what the tool can do and what parameters it accepts.

Designing agentic AI — PM considerations

Define the autonomy boundary

What can the agent do without human approval? What requires confirmation? Where must a human always be in the loop? This is a product requirement — define it explicitly before engineering starts.

Design graceful failure

Agents can loop indefinitely, hallucinate tool calls, or take unintended actions. Define: what does a graceful failure look like? What's the rollback path? What gets logged when the agent fails? These are DoD items.

Auditability by design

Enterprise clients need a complete trace of every agent action — what decision was made, by which model, based on what data, at what confidence level. Design for auditability from Day 1, especially in regulated industries.

Cost and latency guardrails

Agents make multiple LLM calls per task. Without guardrails, a single user request can trigger hundreds of API calls. PMs must define maximum step counts, timeout thresholds, and cost-per-task budgets as product requirements.

AI Knowledge

Responsible AI

AI PMs are the last line of defence before a harmful or biased AI product reaches users. This isn't a compliance checkbox — it's a core product design discipline.

The six pillars of responsible AI

Fairness & Bias mitigation

AI models can learn and amplify biases present in training data. A hiring model trained on historical data may disadvantage certain groups. A lending model may discriminate by proxy. PMs must define fairness metrics, run bias audits across demographic slices, and treat disparate impact as a bug — not an acceptable outcome.

Explainability (XAI)

Can we explain why the model made this decision? Critical in regulated industries. "The model said so" is not acceptable when the outcome is a loan denial, a content removal, or a hiring decision. Design explainability into the product — audit logs, confidence breakdowns, contributing factors shown to users or reviewers.

Human oversight

High-stakes decisions should have human review in the loop. Not just as a safety net — human corrections are training signals that improve the model over time. Design review queues and escalation paths as first-class product features, not afterthoughts.

Data governance & Privacy

Users have rights over their data. PMs must understand GDPR, CCPA, and sector-specific regulations. Data minimisation (only collect what you need), retention limits, consent management, and the right to deletion must be in the product requirements — not just the legal documents.

Security & Adversarial robustness

AI systems can be attacked — through prompt injection (manipulating LLM behaviour via malicious inputs), data poisoning (corrupting training data), and model extraction (stealing model weights). PMs should include adversarial testing in the product's security requirements.

Transparency

Users interacting with AI should know they're interacting with AI. Outputs should communicate uncertainty. Limitations should be disclosed. Trust is built through honesty about what the system can and cannot do.

Responsible AI in requirements

As a PM, responsible AI isn't a separate workstream — it's embedded in every story. Add to your story template: a bias evaluation criterion, an explainability requirement, a data retention note, and a human review gate definition. These are DoD items.

Common question

How do you ensure responsible AI practices in your products?

"I treat responsible AI as a product requirement, not a compliance review. At requirements stage I ask: could this model produce disparate outcomes for different user groups? How do we explain a model decision to a user who disputes it? What data are we collecting and for how long? I write these as explicit acceptance criteria — fairness thresholds, explainability hooks, data retention limits. They're in the Definition of Done alongside performance and accuracy criteria."

AI PM Practice

Managing AI Products

The practical craft of being an AI PM — from writing model specs to running UAT to monitoring production.

AI-specific additions to user stories

The 3-layer AI user story

Layer 1 — Functional spec: Standard user story format — As a [persona], I want [action], so that [outcome]

Layer 2 — Model spec: Input data schema, expected output format, confidence threshold, fallback behavior, evaluation dataset, precision/recall targets

Layer 3 — Prompt spec (for LLM features): Instruction architecture, few-shot examples, output format constraints, guardrails against harmful outputs

Most PMs write only Layer 1. Senior AI PMs own all three.

UAT for AI features — two phases

Phase 1 — Model validation

Does the AI output meet the defined quality metrics on real-world test data?

Run the model on a held-out evaluation dataset representative of production. Measure precision, recall, F1. Validate confidence threshold distribution. Check for bias across demographic slices. Pass/fail against predefined thresholds.

Phase 2 — Workflow validation

Does the end-to-end product flow work as intended?

Test the full workflow including human review queues, escalation paths, error states, and UI. Involve operational SMEs — they know the edge cases. Run adversarial tests (inputs the model should reject). Define explicit exit criteria before UAT begins.

Production monitoring checklist

Accuracy monitoring

Track prediction accuracy against ground truth labels (from human corrections, downstream outcomes, or sampling). Alert when accuracy drops below threshold.

Data drift detection

Track statistical distribution of input data over time. Alert when it diverges significantly from training distribution — this predicts model performance degradation before it's visible in accuracy metrics.

Confidence distribution monitoring

Track the distribution of confidence scores. If the average confidence drops, the model is encountering unfamiliar inputs. If it rises too high, it may be overfitting.

Human review queue health

Track volume, resolution time, and correction rate in the human review queue. Rising correction rate = model degrading. Rising volume = automation rate falling.

Step 4 of 5 · Practice

AI PM Practice

Discovery & Requirements

The upstream work that determines whether you build the right thing — before a single sprint begins.

The discovery mindset

Good product discovery answers four questions before any solution is proposed:

1. Is this a real problem? How often does it occur? Who experiences it? What's the cost?
2. Is the problem worth solving? Does it align with business objectives? Is the addressable impact meaningful?
3. Can AI solve it? Is the pattern learnable from data? Is there a feedback mechanism? Are the failure modes acceptable?
4. What does success look like? Define measurable outcomes before solution design begins.

Discovery techniques

Stakeholder interviews

Structured conversations to understand business objectives, pain points, and constraints. Use open-ended questions. Listen for the problem behind the problem — stakeholders often request solutions, not problems.

Process mapping

Document the current-state workflow in detail — every step, every decision point, every handoff. Identify where delays, errors, and manual effort concentrate. These are your AI opportunity zones.

Data analysis

Analyse operational data to quantify pain. How often does the problem occur? How long does it take? What's the error rate? Numbers turn anecdotes into requirements.

Jobs-to-be-done framing

"When [situation], I want to [motivation], so I can [expected outcome]." JTBD strips away solution assumptions and keeps focus on what the user is actually trying to accomplish.

BRD structure for AI features

1. Problem Statement — not the solution; the specific pain, with evidence
2. User Persona — who experiences this, in what context, how often
3. Current State — how it works today, including pain points and manual steps
4. Desired Outcome — what success looks like for the user and the business
5. Constraints — regulatory, technical, data availability, budget
6. AI/Model Requirements — input schema, output format, confidence thresholds, fallback, evaluation criteria
7. Success Metrics — how you'll measure whether the feature achieves its goal
8. Out of Scope — explicitly what will not be built in this release

AI PM Practice

AI Metrics & KPIs

Define success before you build — and measure the right things after you ship.

Model quality metrics

Accuracy

% of all predictions that are correct. Simple, but misleading with imbalanced datasets. If 95% of inputs belong to one class, a model that always predicts that class has 95% accuracy but is useless.

Precision

Of all positive predictions, how many were actually positive? High precision = low false positive rate. Important when false positives are costly (auto-approving a wrong transaction, triggering a false fraud alert).

Recall

Of all actual positives, how many did the model catch? High recall = low false negative rate. Important when missing a true positive is costly (missing a fraud signal, failing to detect a safety issue).

F1 Score

Harmonic mean of Precision and Recall. Use when you need to balance both. F1 = 2 × (P × R) / (P + R). Ranges from 0 to 1 — higher is better.

Confidence Threshold

Minimum probability score for a prediction to be acted on automatically. This is a product decision, not just a model parameter — it determines the automation rate vs. human review volume trade-off.

AUC-ROC

Area Under the ROC Curve — measures how well the model distinguishes between classes across all threshold settings. AUC = 1.0 is perfect; 0.5 is random. Useful for comparing models before choosing a deployment threshold.

Precision vs Recall trade-off (PM framing): "Higher recall means we catch more true cases but also flag more false ones — that means more human review volume. Higher precision means less review work but we might miss real cases. The right balance depends on the cost of each error type in this specific use case."

Business / product KPIs for AI features

Automation rate

% of tasks handled by AI without human intervention. Higher is not always better — defines the trade-off between scale and quality.

Time to resolution

How long it takes to complete a task end-to-end. AI should reduce this — measure before and after deployment.

Human correction rate

% of AI outputs that are corrected by human reviewers. Rising correction rate = model degrading or distribution shifting.

Adoption rate

% of eligible users actively using the AI feature. High accuracy but low adoption signals a UX or trust problem — not a model problem.

Error cost

The business cost of a model error — financial, reputational, regulatory. Must be quantified to set appropriate quality thresholds.

Latency (p95)

The response time at the 95th percentile — not the average. Averages hide tail latency that kills user experience. Define latency SLAs in requirements.

Interview Prep

Mock Q&A Bank

40+ high-probability interview questions with model answer frameworks. Customise each answer with your own examples — never deliver a generic answer in an interview.

Very common

What does good product management look like to you?

"Good product management starts with ruthless clarity on the problem — not the solution. I spend disproportionate time in discovery: who is experiencing the pain, how often, at what cost, and why existing solutions fall short. Then I translate that into a prioritised, outcome-driven roadmap — not a feature list. And I stay accountable through delivery: monitoring adoption, measuring outcome metrics, and iterating. For AI products specifically, good PM means defining success metrics before you build — not after deployment."

Very common

How do you define your product vision?

"A product vision should answer: what world are we trying to create for our users, and why does it matter? I work backward from the ideal end state — what does the user's life look like when our product is working perfectly? Then I identify the biggest gap between today and that state, and that becomes the strategic focus. The vision should be ambitious enough to inspire but specific enough to make trade-off decisions easy."

Common

How do you decide what NOT to build?

"I look for three signals: the feature doesn't move any of our outcome KPIs; it serves one client's need but fragments our platform for everyone else; or the cost of building and maintaining it outweighs the value. The hardest 'no' is to a high-revenue client who wants a custom feature. I explain the trade-off transparently, offer an alternative that meets the underlying need within the product strategy, and document the decision."

Common

How do you build and communicate a product roadmap?

"I use a now/next/later format — not a Gantt chart with fake precision. 'Now' is committed sprint work. 'Next' is prioritised but not scheduled. 'Later' is directional intent, not promises. Each item is linked to an outcome OKR so stakeholders understand why it's on the roadmap, not just what it is. I review the roadmap with stakeholders quarterly, and I make clear that the roadmap reflects current knowledge — it will change as we learn more."

Very common

Walk me through how you manage a product backlog.

"I use a combination of RICE scoring and Kano categorisation. First I categorise: basic needs get prioritised regardless of score; performance features get ranked by RICE; delighters get time-boxed to innovation capacity. For AI features I add a data readiness gate — a high-RICE feature that lacks labeled training data still can't enter the sprint. I run refinement sessions mid-sprint to ensure the top of the backlog is always sprint-ready. Nothing enters sprint planning without clear AC and no unresolved dependencies."

Very common

Describe a time a sprint didn't go as planned. What did you do?

Use the STAR format. Set up a real scenario: a technical dependency discovered mid-sprint, a model performing below threshold during UAT, a stakeholder changing requirements on Day 4. Describe: how you triaged, what you descoped and why, how you communicated to stakeholders, and what process change you introduced in the retrospective to prevent recurrence. End with what you learned.

Common

How do you handle scope creep?

"I prevent it upstream rather than managing it mid-sprint. Clear AC, a defined DoD, and a Sprint Goal that everyone has agreed to are my primary defences. When new requests come in mid-sprint, I acknowledge them, log them in the backlog, and explain why adding them now would put the Sprint Goal at risk. The only exception is a critical production issue — and even then I work with the SM to descope something of equivalent size."

Common

How do you work with data science and engineering teams?

"I treat data scientists as full product partners in discovery — not just executors. I involve them early when evaluating whether ML is the right approach. I write model specs (not just functional specs) so they have clear quality targets and evaluation criteria. With engineering, I'm specific about what I need, open about what I don't know, and I never over-specify the 'how'. My job is to describe the problem and the success criteria precisely — their job is to find the best technical path."

Very common

How do you measure the success of an AI feature?

"I define success metrics at requirements stage — never after deployment. For automation features: automation rate, confidence threshold distribution, human correction rate. For classification models: precision, recall, F1 calibrated to the cost of each error type. For user-facing features: task completion rate, time-to-resolution, user satisfaction. And I define a monitoring requirement in the DoD: how will we detect drift, what triggers a review, and what is the retraining cadence?"

Very common

Walk me through how you write acceptance criteria for an AI feature.

"I write AC in Given/When/Then format across four scenario types: happy path, edge case, failure case, and non-functional. For AI features specifically I add a fifth: model quality AC — the precision or recall threshold the model must meet on a defined evaluation dataset before the story is accepted. And a sixth: monitoring AC — how will we detect in production if the model's performance degrades. Most teams skip the last two. That's where production AI features break silently."

Common

How do you decide where to set the confidence threshold?

"It's a product decision rooted in the cost of each error type. I ask: what happens when the model is wrong and we auto-acted? What's the cost of that error versus the cost of routing to human review? I map this on a simple matrix: high-consequence decisions (financial, legal, medical) get conservative thresholds — lower automation, higher human review. Lower-stakes decisions can tolerate higher automation with occasional errors. I review the threshold quarterly as we gather production data."

Common

How do you handle it when an AI model performs worse in production than in testing?

"First I investigate the gap: is the production input distribution different from the test set? Are there data quality issues in production? Are edge cases that weren't in the test set appearing at high frequency? Then I act based on the severity — if the performance gap is critical, I roll back to a rules-based fallback or increase the human review threshold temporarily. I then work with data science to collect and label production examples and retrain. And I add distribution monitoring to the production observability stack to detect this earlier next time."

Very common

How do you handle competing priorities from different stakeholders?

"I use business value and user impact as the anchor — not seniority or volume. I map competing requests against the current OKRs and ask: which item moves the needle on an outcome we've committed to? I document the trade-offs and present a clear 'Option A vs Option B' to the decision-maker — with the business case for each, not just a preference. If two items genuinely tie, the one that unblocks more downstream work wins."

Common

How do you manage a stakeholder who keeps changing requirements?

"I first understand why requirements are changing — is it new information, evolving business context, or unclear initial discovery? If it's the latter, the root cause is upstream and I fix my discovery process. For mid-sprint changes, I explain the impact transparently: 'Adding this now means descoping X and delaying Y — is that a trade-off you want to make?' Documenting the decision and its rationale creates accountability on both sides and reduces future scope churn."

Common

How do you communicate AI limitations to business stakeholders?

"I translate technical metrics into business language. Instead of 'the model has 87% precision,' I say 'for every 100 recommendations, roughly 13 will be off-target — here's how we catch and correct those.' I always pair a limitation with the mitigation design. And I set expectations proactively at launch — stakeholders who understand the limitation upfront are partners in improvement; stakeholders who discover it after the fact are complainants."

Very common

Tell me about a product you shipped that didn't perform as expected. What did you learn?

Use STAR. Pick a real example. The lesson should be about your discovery process, success metric definition, or assumption validation — not about blaming engineering or external factors. Interviewers are looking for intellectual honesty and learning agility.

Very common

How do you make decisions under ambiguity?

"I structure the ambiguity first — what do I know, what don't I know, and what's the cheapest way to find out? If I need to make a decision before I can gather data, I make the reversible bet and monitor closely. For irreversible decisions I invest more in upfront research. I document my assumptions explicitly so I can validate them early and course-correct before the cost of being wrong becomes too high."

Common

What's the hardest "no" you've said in your PM career?

Pick a specific example where you declined a significant feature request or killed a project mid-stream. The answer should show you defended a principle (product coherence, user trust, technical sustainability) over short-term pressure, communicated the decision respectfully, and offered an alternative path that served the underlying need.

Step 5 of 5 · Final step

Interview Prep

Questions to Ask Them

Strong questions signal seniority. They show you've done the homework, think strategically, and care about the right things. Always prepare at least 3 per interviewer.

Strategic questions (for senior/VP interviewers)

How does your product team stay connected to customer problems at scale — is discovery embedded in delivery, or is it a separate function?

Why it works: Shows you think about discovery as an ongoing discipline, not a phase that ends when development starts.

What does the product team's relationship with data science look like — are they embedded in squads, or are they a centralised resource teams request from?

Why it works: Reveals how AI products actually get built here, and whether the PM has real influence over model decisions.

Where do you see the biggest unsolved product problem in your AI portfolio right now?

Why it works: Shows ambition and curiosity. Gives you signal on what the role will actually focus on.

Operational questions (for director/manager level)

What does the first 90 days look like for someone in this role — is it more discovery and listening, or are there immediate delivery expectations?

Why it works: Shows you think in outcomes from Day 1. Also gives you the actual scorecard they'll use to evaluate you.

How does the product team handle the tension between platform scalability and client-specific customisation?

Why it works: This is the central tension in any B2B AI platform. Asking it signals you understand enterprise product management.

What are the most common failure modes you've seen in AI feature delivery here — where do requirements tend to break down between product, data science, and engineering?

Why it works: Invites honesty. Shows you know AI PM is harder than traditional PM and you're already thinking about how to avoid known pitfalls.

Universal close — for every session

Best closing question: "Based on what you've heard from me today — is there anything you'd want me to expand on, or any area where you'd like more evidence of fit?"

This invites objections while you're still in the room. It signals confident self-awareness — a senior PM trait. Use it in every session without exception.

Never end with: "So, what are the next steps?" — let them offer that. Your close should be substantive, not administrative.

Reference

Glossary

Essential vocabulary for AI PM interviews. Read through this once before any interview — precise language signals depth.

Agile

An iterative approach to software delivery based on the 2001 Agile Manifesto. Values working software, customer collaboration, and responding to change over rigid planning.

Scrum

The most widely used Agile framework. Organises work into time-boxed Sprints with defined roles (PO, SM, Dev Team), artifacts (Product Backlog, Sprint Backlog, Increment), and ceremonies.

Sprint

A fixed-length iteration (typically 2 weeks) during which the team builds and delivers a potentially shippable product increment.

Epic

A large body of work that can be broken down into Features and User Stories. Typically spans multiple sprints and ties to a strategic objective.

User Story

A small, deliverable unit of user value written as: "As a [persona], I want [action], so that [outcome]." Must fit within one sprint and have clear Acceptance Criteria.

Acceptance Criteria (AC)

Specific, testable conditions that must be met for a User Story to be accepted as complete. Written in Given/When/Then format covering happy path, edge case, failure case, and non-functional requirements.

Definition of Done (DoD)

A team-wide checklist that defines when any piece of work is truly complete — code review, testing, documentation, deployment, and for AI features: model evaluation, monitoring setup, and human review gate testing.

Velocity

The average story points a team completes per sprint. Used for forecasting — not as a performance measure. Treat changes in velocity as signals, not targets.

Backlog Refinement

A mid-sprint session where the PO and team review, clarify, estimate, and prioritise upcoming backlog items to ensure the top of the backlog is always sprint-ready.

Story Points

A relative estimation unit for story complexity, effort, and uncertainty. Uses Fibonacci sequence (1, 2, 3, 5, 8, 13, 21). Team-calibrated — not equivalent to hours.

LLM (Large Language Model)

A deep learning model trained on vast text data to predict and generate text. Powers modern generative AI products. Examples: GPT-4, Claude, Gemini.

RAG (Retrieval-Augmented Generation)

A pattern that connects an LLM to a real-time knowledge retrieval system, grounding responses in current, factual data rather than training data alone.

Prompt Engineering

The practice of designing inputs to LLMs to reliably produce desired outputs. Techniques include zero-shot, few-shot, chain-of-thought, and system prompts.

Fine-tuning

Adapting a pre-trained foundation model for a specific domain or task using labeled examples. Produces better performance on the target task at lower inference cost than prompting alone.

Embeddings

Numerical vector representations of text that capture semantic meaning. Similar texts have similar embeddings. Powers semantic search, RAG, and recommendation systems.

Hallucination

When an LLM confidently produces factually incorrect or fabricated information. A fundamental limitation — design products with mitigation in mind (RAG, human review, confidence indicators).

Precision

Of all positive predictions, how many were actually positive? High precision = low false positive rate.

Recall

Of all actual positives, how many did the model find? High recall = low false negative rate.

F1 Score

Harmonic mean of Precision and Recall. The balanced metric when both error types matter.

Model Drift

Performance degradation over time as real-world data patterns shift away from the training distribution. Requires monitoring and retraining pipelines.

Human-in-the-loop (HITL)

A design pattern where human judgment is incorporated into an AI system's decision process — for review, correction, or approval of model outputs.

Agentic AI

AI systems that autonomously pursue goals through multi-step reasoning and action, using tools and external services to accomplish tasks without human direction at each step.

MCP (Model Context Protocol)

An open standard for connecting AI models to external tools and services through a standardised interface — enabling composable, interoperable AI agent tooling.

Confidence Threshold

The minimum probability score a model prediction must achieve to be acted on automatically. Below this threshold, the decision is routed to human review. A product decision with business consequences.

Vector Database

A database optimised for storing and searching embeddings (high-dimensional vectors). Powers semantic search and RAG systems. Examples: Pinecone, Weaviate, Chroma. A PM building RAG-based products needs to understand this as infrastructure.

RLHF (Reinforcement Learning from Human Feedback)

A training technique where human raters evaluate model outputs and those ratings are used to fine-tune the model toward preferred behaviour. Used to align LLMs with human values and reduce harmful outputs.

Constitutional AI

An alignment technique (Anthropic) where a model is trained to critique and revise its own outputs according to a set of written principles — reducing reliance on human feedback at scale.

Data Flywheel

A virtuous cycle where more users generate more data, which improves the model, which attracts more users. A core competitive moat for AI products — companies with proprietary usage data can compound their model advantage over time.

Shadow Mode

A deployment pattern where the model runs in production and generates predictions, but those predictions are not shown to users or used to make decisions. Used to validate production performance before going live.

Canary Deployment

Releasing a new model version to a small percentage of traffic (e.g., 5%) before rolling out to everyone. Allows real-world validation without full exposure — a standard AI PM release practice.

A/B Model Testing

Running two model versions simultaneously on split traffic and comparing performance on business metrics. The AI equivalent of A/B testing features — essential for validating model improvements before full rollout.

Ground Truth

The correct, verified answer that a model's prediction is compared against during evaluation. Defining ground truth is a product decision — who labels it, at what quality, and how disagreements are resolved.

Synthetic Data

Artificially generated data used to train or augment models — useful when real data is scarce, sensitive, or imbalanced. An AI PM should understand when synthetic data is appropriate and what risks it introduces (distributional shift).

Model Card

A standardised document describing a model's intended use, performance metrics, limitations, training data, and known failure modes. The AI PM's equivalent of a product requirements document — mandatory for responsible AI deployment.

Product Roadmap

A strategic communication tool showing what will be built, approximately when, and why. Best maintained as a now/next/later format — not a Gantt chart. Living document, not a contract.

OKR (Objectives and Key Results)

A goal-setting framework. Objective = qualitative, inspirational goal. Key Results = measurable outcomes that indicate progress. Separates outcomes (what we're trying to achieve) from outputs (what we're building).

RICE Score

A prioritisation formula: (Reach × Impact × Confidence) ÷ Effort. Used to rank backlog items relative to each other. Higher score = higher priority.

Kano Model

A framework for categorising features by how they affect user satisfaction: basic needs (must-haves), performance needs (more is better), and delighters (unexpected value creators).

MoSCoW

A release scoping framework: Must have, Should have, Could have, Won't have (this time). Used to negotiate scope when time or resources are constrained.

Discovery

The upstream product work of understanding user problems, validating assumptions, and defining opportunity before any solution is built. Good discovery prevents building the wrong thing.

North Star Metric

The single metric that best captures the core value the product delivers to users. Everything else is either an input to this metric or a health metric that shouldn't decline while pursuing it.

MVP (Minimum Viable Product)

The smallest version of a product that delivers enough value to test a hypothesis with real users. Not the smallest thing you can ship — the smallest thing that generates learning.

PI Planning

Program Increment Planning (from SAFe). A large-group planning event where multiple teams align on a 10-week increment of work, resolving dependencies and setting shared objectives.

EU AI Act

The world's first comprehensive AI regulation (EU, 2024). Classifies AI systems into 4 risk tiers: Unacceptable (prohibited), High (strictly regulated), Limited (transparency required), and Minimal (no obligation). High-risk systems require conformity assessments, human oversight, and auditability.

High-Risk AI System

Under the EU AI Act: AI used in healthcare, credit scoring, recruitment, law enforcement, education assessment, or critical infrastructure. Requires technical documentation, human oversight mechanisms, data governance, and registration before deployment.

GDPR (General Data Protection Regulation)

EU regulation governing personal data processing. Key AI implications: data minimisation, right to explanation of automated decisions, right to object to automated processing, and consent requirements for training data use.

Right to Explanation

Under GDPR: individuals can request a meaningful explanation of automated decisions made about them. Forces AI PMs to design explainability into the product — not as a feature, but as a user right.

Model Card

A standardised documentation format for AI models — intended use, performance metrics, limitations, training data characteristics, known failure modes, and ethical considerations. Required for responsible AI governance and increasingly expected by enterprise customers.

SR 11-7

US Federal Reserve guidance on model risk management for financial institutions. Requires independent model validation, documentation of assumptions and limitations, and ongoing performance monitoring. Sets the de facto standard for AI governance in US financial services.

Conformity Assessment

Under the EU AI Act: the process by which a high-risk AI system is evaluated against regulatory requirements before market placement. Third-party assessment is required for the highest-risk categories (biometrics, critical infrastructure).

Privacy by Design

The principle that data protection should be built into a system's architecture from the outset — not added as a compliance layer at the end. A core expectation of GDPR-compliant AI product development.

Algorithmic Accountability

The principle that organisations are responsible for the outcomes produced by their AI systems — including unintended harms, bias, and errors. An AI PM is a key accountability node: they define the system's goals, acceptance criteria, and monitoring strategy.

Bias Audit

A systematic evaluation of an AI model's outputs across demographic groups to detect disparate impact or discriminatory patterns. Increasingly required by regulation (EU AI Act, NYC Local Law 144) and increasingly expected by enterprise buyers.

Interview Prep

Practice Mode

Three practice modes to prepare for your AI PM interview. Start with Quick Drill for fast repetition, use Mock Interview for realistic practice, or tackle Scenario Challenges for situational judgment.

Choose a category

📚

All Cards

54 cards

🏗️

Foundations

8 cards

Role, narrative, Agile, user stories

💡

AI Knowledge

10 cards

LLMs, RAG, agents, responsible AI

🚀

Product Craft

8 cards

Prioritisation, metrics, roadmapping

📦

Delivery

8 cards

Scoping, estimation, UAT, monitoring

🤝

Stakeholder & Comms

6 cards

Trade-offs, difficult conversations

🎤

Behavioural

8 cards

STAR stories, leadership, failure

🧩

Scenario Challenges

6 cards

Full situational judgment questions

Advanced

AI PM Practice

AI Tools & Prompting

The LLM landscape, prompt engineering techniques, anti-patterns, and how to evaluate AI outputs. These topics appear frequently in AI PM interviews — especially for roles involving LLM-powered products.

The LLM landscape — what a PM needs to know

As an AI PM you don’t need to know which model has the highest benchmark score — you need to know how to make the build vs buy vs configure decision, what trade-offs matter for your product, and how to evaluate model outputs for your use case.

Foundation models (GPT-4, Claude, Gemini, Llama)

Large pretrained models with broad capability. Available via API (GPT-4, Claude) or open-weight for self-hosting (Llama, Mistral). API models: faster to start, vendor dependency. Open-weight: more control, higher infrastructure cost.

Specialised / fine-tuned models

Foundation models adapted for a specific domain or task via fine-tuning. Use when: you have labelled domain data, the base model consistently underperforms on your use case, or you need consistent output format/style.

Embedding models

Convert text into numerical vectors for semantic search, clustering, or retrieval. The backbone of RAG systems. Not generative — they don’t produce text. Key metric: retrieval accuracy on your specific domain.

Build vs buy vs configure decision

Buy (API): commodity capability, speed matters, no proprietary data advantage.
Configure (fine-tune/RAG): good base model but needs domain grounding or consistent style.
Build: unique capability, proprietary data moat, regulatory reason to self-host. Rare in most enterprise contexts.

Prompt engineering techniques

Zero-shot prompting

Give the model a task with no examples. Works for well-defined tasks the model has seen in training. Fast to iterate. Fails when the task is ambiguous or requires specific output format.

Few-shot prompting

Provide 2–5 examples of input-output pairs before the task. Dramatically improves consistency and format adherence. Use when zero-shot produces variable outputs or when the task has a specific structure.

Chain-of-thought (CoT)

Ask the model to reason step-by-step before giving an answer: “Think through this step by step.” Improves accuracy on reasoning tasks. Trade-off: more tokens = higher latency and cost.

System prompts

A persistent instruction that shapes the model’s behaviour for all turns in a conversation. Use for: persona, tone, output format, constraints, guardrails. The most important lever for consistent behaviour in production.

Retrieval Augmented Generation (RAG)

Augment the prompt with retrieved context from a knowledge base before generating. Reduces hallucination, keeps responses grounded in your data. PM responsibility: define retrieval quality requirements and evaluate whether retrieved chunks are relevant.

Structuring a production prompt

1. Role / persona — “You are a [role] helping [user type]...”
2. Task — what you want the model to do, precisely
3. Context — relevant background or retrieved documents
4. Format — output structure (JSON, bullet list, max length)
5. Constraints — what not to do, guardrails, tone restrictions
6. Examples — 1–3 good examples if format consistency is critical

Common prompt engineering anti-patterns

Vague instructions

“Be helpful and concise” means nothing to a model. “Respond in 3 bullet points, each under 20 words” is specific. Vague prompts produce variable outputs — variability in production is a reliability problem.

Prompt injection vulnerability

When user input is concatenated directly into a system prompt, malicious users can override instructions: “Ignore all previous instructions and...” Mitigation: validate/sanitise user input, use structured formats (JSON), separate system instructions from user content, add explicit injection defence in your system prompt.

Over-engineering before testing

Building a complex multi-step prompt pipeline before validating that a simple prompt works. Start with zero-shot, measure on your eval set, then add complexity only where it demonstrably improves results.

No fallback for model refusal

Models sometimes refuse to answer or produce unexpected outputs. Production systems need a fallback: a default response, an error message, or a human escalation path. Never assume the model will always produce the expected output.

Evaluating LLM outputs

Why evals matter for PMs

You can’t A/B test a model change in production the same way you test a UI change. Evals are your test suite — a fixed dataset of inputs with expected outputs or quality criteria. Without evals, every model update is a leap of faith.

Types of evals

Automated: rule-based checks (format, length, required fields), embedding similarity to reference answers.
Model-graded: use another LLM as a judge — scores for relevance, accuracy, tone, helpfulness.
Human eval: experts rate outputs on a rubric. Expensive but the ground truth for novel use cases.

What to include in your eval dataset

Representative inputs · Edge cases · Adversarial examples (jailbreak attempts, injection attempts) · Failure modes from production monitoring. Update continuously — a stale eval set is worse than no eval set.

Responsible AI considerations for LLMs

Content safety: filter harmful outputs. Privacy: don’t include PII in prompts. Bias: test across demographic groups. Explainability: can you explain why the model gave that output? These are PM requirements, not just engineering concerns.

Common — LLM product roles

How would you defend an LLM product against prompt injection?

“I treat it as a security requirement, not an afterthought. At the prompt level: explicitly instruct the model to ignore attempts to override its instructions, use structured input formats that separate user content from system instructions, and add injection defence language. At the application level: validate and sanitise all user inputs before they reach the model, implement content filtering on outputs, and log anomalous patterns for review. I include prompt injection test cases in the eval dataset and test every prompt change against them.”

Common

How do you decide between fine-tuning and RAG?

“They solve different problems. RAG is better when you need the model to access specific, up-to-date, or proprietary information at inference time — it grounds the response in retrieved context. Fine-tuning is better when you need the model to behave consistently in a specific style or domain, and the knowledge is stable. I start with RAG because it’s faster to iterate and easier to update. I consider fine-tuning when RAG retrieval quality is the bottleneck or when the model needs to adopt a very specific output format or persona that prompting alone can’t reliably achieve.”

AI PM Practice

Process Optimisation

AI PMs often get involved before development — mapping current processes, identifying automation opportunities, and designing the future state. These skills appear in discovery and stakeholder interviews.

Why process mapping matters for AI PMs

Before you can automate a process with AI, you need to understand exactly how it works today. Skipping the as-is mapping leads to automating the wrong steps, missing edge cases, and building AI that creates new problems faster than it solves existing ones.

SIPOC — the fast-frame technique

Suppliers: who provides inputs to the process?
Inputs: what information or materials enter?
Process: the high-level steps (5–7 boxes)
Outputs: what does the process produce?
Customers: who receives the output?
SIPOC is used in the early discovery phase — before you go deep — to agree on scope with stakeholders.

Swim lane diagram

Shows who does what at each step. Each horizontal lane = a role or system. Flow arrows show handoffs. Handoffs are where delays, errors, and frustration concentrate — these are your AI opportunity zones.

As-Is → To-Be process design

As-Is (current state): Map exactly what happens today — every step, every handoff, every decision point, every system. Don’t idealise it. Include the workarounds and manual steps people do informally.

Pain point analysis: For each step ask: How long does this take? What goes wrong here? What’s the error rate? How much human judgement is required?

To-Be (future state): Design the optimised process — which steps are automated, which are eliminated, which are transformed. Show the roles and systems in the new state. The gap between as-is and to-be is your product scope.

Questions to ask during as-is mapping workshops

• Walk me through exactly what you do when [trigger event] happens
• What information do you use to make that decision?
• How often does this step fail or require rework?
• What happens when [edge case]?
• Where do you spend the most time that feels like it shouldn’t be manual?

Value Stream Mapping

Value Stream Mapping (VSM) extends process mapping by adding time data. For each step you capture: process time (how long the actual work takes) and wait time (how long before work starts). This makes the biggest time-sink visible — usually it’s queue time between steps, not the steps themselves.

Key VSM metrics

Lead time: total time from trigger to output (process + wait)
Process time: time actually spent working (value-add)
Efficiency: process time ÷ lead time × 100%
Most enterprise processes have 5–15% efficiency — 85–95% of elapsed time is waiting.

What VSM reveals

The biggest AI opportunity is rarely the step that takes the most process time — it’s the step that creates the most wait time downstream. A model that eliminates a 2-minute task that blocks a 2-day queue is worth more than one that saves an hour at the end of the process.

Identifying AI automation opportunities

Not every step should be automated. Evaluate each process step against three criteria: volume (how often does it happen?), structure (is the input/output well-defined?), and tolerance for error (what happens if the AI is wrong?).

High-value AI targets

• High-volume, repetitive classification tasks (routing, triaging, categorising)
• Document extraction and summarisation
• Draft generation from structured inputs
• Anomaly detection in data streams
• Recommendation based on historical patterns

Low-fit for AI automation

• Novel situations with no historical precedent
• High-stakes decisions where error consequence is severe
• Tasks requiring relationship or emotional intelligence
• Creative work requiring originality and context
• Processes where the rules change frequently

The automation spectrum

Assist: AI suggests, human decides — lowest risk, highest trust
Automate with review: AI decides, human reviews flagged cases
Automate with monitoring: AI decides, exceptions escalate automatically
Full automation: AI decides end-to-end — only appropriate for low-stakes, high-confidence use cases

Change management: AI products fail not because the model is bad but because the humans who use it don’t trust it or change their behaviour. Build adoption into your requirements: explainability, feedback mechanisms, and a trust-building rollout plan.

Common in discovery-heavy roles

How do you identify where to apply AI in an existing business process?

“I start with the as-is process — map every step, measure volume and time, identify where decisions happen and what information drives them. Then I apply three filters: is this high-volume and repetitive? Is the input and output well-structured? What’s the cost of an error? The intersection of high volume, structured data, and tolerable error rate is where AI delivers reliable value. I prioritise the step that creates the biggest downstream wait — not necessarily the one that takes the most process time.”

Common

How do you handle change management when introducing AI automation to a team?

“I treat adoption as a product requirement, not an afterthought. Three things I do: First, involve the people who do the current process in the design — they surface the edge cases and their buy-in matters. Second, design the AI to assist rather than replace, at least initially — ‘AI suggests, you decide’ builds trust faster than full automation. Third, make the AI’s reasoning visible where possible — when people can see why the model made a recommendation, they trust it more and can correct it when it’s wrong.”

Strategy

AI Governance & Regulation

The regulatory landscape every AI PM must navigate. Knowing this separates safe operators from liability risks — and signals seniority in interviews.

Why it matters: The EU AI Act is the world's first comprehensive AI law. If your product operates in Europe or serves EU users, it applies to you — regardless of where your company is based.

The 4 Risk Tiers

Unacceptable Risk — Prohibited

Banned outright. Includes: real-time biometric surveillance in public spaces, government social scoring systems, AI that exploits psychological vulnerabilities or manipulates users subliminally. An AI PM must never build these.

High Risk — Strictly Regulated

Can be built but requires conformity assessment, transparency, human oversight, and registration in an EU database. Includes: AI in healthcare decisions, credit scoring, recruitment screening, law enforcement tools, critical infrastructure, and education assessment.

Limited Risk — Transparency Required

Must disclose that users are interacting with AI. Includes: chatbots, deepfakes, emotion recognition systems. Users must know they're not talking to a human.

Minimal Risk — No Legal Obligation

Most AI applications — spam filters, recommendation engines, inventory optimisation. Best practice still applies, but no mandatory compliance requirements beyond existing consumer law.

What High-Risk compliance requires from a PM

Risk management system — ongoing identification and mitigation throughout the AI lifecycle, not just at launch

Data governance — training data must be relevant, representative, and as free from errors and bias as possible

Technical documentation — a model card: capabilities, limitations, intended use, performance metrics, known failure modes

Human oversight — humans must be able to monitor, intervene, and override the system at any point

Accuracy & robustness — the system must perform reliably and be tested against adversarial inputs and edge cases

Logging & auditability — all events must be logged to enable post-hoc review and incident investigation

Finance — SR 11-7 (US Federal Reserve)

What it is

US supervisory guidance on model risk management for financial institutions. Requires independent model validation, documentation of assumptions and limitations, and ongoing performance monitoring for any model used in credit, risk, or trading decisions.

What it means for a PM

Any AI model used in lending, fraud detection, or risk management must be validated by an independent team before deployment. Build model validation gates into your release process — not as a blocker, but as a first-class sprint milestone with its own acceptance criteria.

Healthcare — FDA AI/ML Guidance (US)

What it is

The FDA regulates AI-enabled Software as a Medical Device (SaMD). Continuously-learning algorithms face additional scrutiny — any significant change to an algorithm used in clinical decision-making may require re-submission and re-approval.

What it means for a PM

In healthcare AI, your model update process is a regulatory event — not just a deployment. Plan for Predetermined Change Control Plans (PCCPs) that pre-approve the specific conditions under which a model may update autonomously without triggering a new review cycle.

What interviewers actually want to see

You don't need to be a lawyer. Interviewers want to see that you:

1. Know that regulation exists and varies by industry, use case, and geography
2. Proactively involve legal, compliance, and data protection stakeholders early — in discovery, not post-launch
3. Design products with compliance built in: explainability, audit logs, human review paths
4. Know how to identify your product's risk tier and what obligations flow from it

High-Probability Question

"How do you ensure your AI product complies with regulations?"

Answer in 3 parts:

Map the landscape early. In discovery, I identify applicable regulatory frameworks — EU AI Act risk tier, GDPR if personal data is involved, and any sector-specific rules (FDA for healthcare, SR 11-7 for finance). I don't wait for legal to flag this. I bring compliance into the conversation at the requirements stage.

Design compliance in, not on. For high-risk systems, that means explainability by default, human override mechanisms, audit logging from day one, and a model card in the technical documentation. These are acceptance criteria — not features to be added later.

Gate the release process. Model validation, privacy impact assessments, and compliance sign-off are milestone gates in my release plan. They're scheduled sprints, not surprises before launch.

Follow-Up Question

"Your AI system made a decision that harmed a user. What do you do?"

Show structured incident thinking:

Immediately: Trigger the human escalation path — which should already exist in the product design. If the failure mode appears systemic, pause automated decisions in that category. Document everything for the audit trail.

Short-term: Root cause analysis with data science — was this a model failure, data drift, an edge case, or a threshold misconfiguration? Scope the blast radius: how many other users were affected?

Systemic fix: Adjust the confidence threshold, retrain or re-evaluate the model, add this case to the evaluation set, and update the model card. Communicate transparently to affected users per your legal obligations.

Strategy

Stakeholder Management

The skill most interviewers probe without naming it. AI PMs operate at the intersection of data science, engineering, legal, and business — managing each relationship differently.

The Power / Interest Grid

The most useful stakeholder framework. Plot each stakeholder on two axes — Power (ability to affect the project) and Interest (how much they care about the outcome).

High power, high interest → Manage closely. Co-create with them. These are your executive sponsors and key decision-makers.

High power, low interest → Keep satisfied. Inform them at key milestones. Don't overwhelm them with detail — escalate only when a decision requires them.

Low power, high interest → Keep informed. These are often the end-users or subject matter experts. They're your advocates if engaged well.

Low power, low interest → Monitor. Minimum effort. Don't ignore — a disengaged stakeholder can become a blocker if circumstances change.

AI-specific stakeholders to map

Data Science / ML Engineering

High power over model feasibility and timelines. Engage early and continuously. They'll tell you what's possible — listen to the constraints, then push back on what matters.

Legal & Compliance

High power to stop or reshape the product. Bring them into discovery — not sprint review. The earlier they're involved, the less expensive their input.

Executive Sponsors

High power, often low day-to-day interest. Communicate in business outcomes, not model metrics. They want to know: what does this do for revenue, cost, or risk?

Engineering & Platform

High interest in technical decisions. Work with them on infrastructure, integration, and MLOps — not just the feature. They'll flag dependencies you won't see from the product side.

End Users / Operations

The people who live with the output every day. High interest, often low formal power. Critical for UAT, edge case discovery, and post-launch monitoring. Under-engage them at your peril.

Data Protection / DPO

Power to block any feature that processes personal data. Involve at requirements stage. Their job is to find problems — frame that as value, not obstruction.

The core challenge: Executives want certainty. AI development produces uncertainty. Your job is to communicate honestly without losing confidence.

How to communicate AI uncertainty upward

Translate model metrics into business language

"87% precision" means nothing to a CFO. "13 in every 100 recommendations will be wrong — here's our plan to catch them before they affect customers" means everything. Always translate before you present.

Lead with the outcome, not the model

Don't open with "we're fine-tuning a transformer." Open with "we're building a system that will reduce manual review time by 40%." The model is the implementation detail — the outcome is the story.

Give a range, not a number

AI timelines and performance targets are genuinely uncertain. Give a calibrated range: "We expect 80–90% accuracy by Q3, with a production decision gate at 85%." Ranges signal honesty — not weakness.

Define the go/no-go criteria in advance

Before you start, agree with your exec sponsor: at what model performance threshold do we launch? What defines success at 3 months post-launch? Pre-agreement prevents uncomfortable conversations later.

Reporting cadence for AI products

Weekly (team): Model performance metrics, drift monitoring, current confidence thresholds

Bi-weekly (stakeholders): Progress against milestone gates, blockers, upcoming decisions that need input

Monthly (executives): Business outcome progress (adoption, error rate reduction, cost impact), risks with mitigation plans, next major milestone

Ad-hoc (immediately): Any incident where the model caused a user-facing error, any significant performance degradation, any change to the confidence threshold

Working with Data Science

What they need from you

Clear problem definition, labelled data requirements, evaluation criteria before they start building, and protection from scope creep mid-sprint. The worst thing you can do is change the success metric after training has begun.

What you need from them

Honest performance estimates with uncertainty ranges, early flagging of data quality issues, and a model card for every model that reaches production. Build a working relationship where they tell you bad news early — not after the sprint review.

Working with Engineering

MLOps is a product conversation

Model deployment, monitoring, retraining pipelines, and rollback procedures are product decisions, not just infrastructure. A PM who treats MLOps as someone else's problem will be blindsided by production failures.

Define the integration contract early

How does the model output get consumed? As an API? A batch job? A stream? What's the latency budget? These decisions affect model architecture — get engineering into the design conversation before any code is written.

Working with Legal & Compliance

The golden rule

Involve them in discovery. Every week you delay their involvement costs you more. A legal blocker discovered in sprint 12 is ten times more expensive than one discovered in sprint 1.

Frame it as risk management, not permission-seeking

Don't ask "can we do this?" Ask "here's what we're building and here are the risks we've identified — what are we missing?" That's a collaborative conversation, not a gate-keeping one.

Setting AI expectations with clients

Under-promise, then explain why

AI performance is probabilistic, not deterministic. Never promise "it will do X" — promise "it will do X in approximately Y% of cases, and here's how we handle the rest." Clients who understand the limitation are more forgiving of failures than clients who were promised perfection.

Show them the error mode, not just the success mode

In demos and pilots, deliberately show the model getting something wrong — then show how the escalation path handles it. This builds more trust than a curated demo that only shows the best case.

Define success metrics together

Before the pilot begins, agree in writing: what does success look like at 30, 60, and 90 days? What's the threshold at which we continue vs. revisit? Clients who help define success are far more likely to declare it.

Handling disappointment

Don't defend the model — acknowledge the gap between expectation and reality first. "I hear you — this isn't performing the way we expected it to." Then diagnose.

Diagnose before you fix — is this a model issue, a data issue, an integration issue, or a scope mismatch? Don't promise a fix until you know what's broken.

Come back with a plan, not just an apology — "Here's what we found, here's what we're changing, and here's how we'll know it's working." Clients can accept problems — they can't accept silence or vague reassurance.

High-Probability Question

"Tell me about a time you managed a difficult stakeholder."

Use STAR, but lead with the complexity:

Structure: Situation (who was the stakeholder, what was at stake) → Task (what you needed from them, or what they were blocking) → Action (how you approached the relationship — meetings, framing, escalation) → Result (what changed and what you learned).

Strong signals: You identified the real concern (often not the stated concern), you changed your approach based on what motivated them, and you turned a blocker into a collaborator.

Weak signals: You escalated immediately, you worked around them, or you "won" by outranking them. Interviewers want influence, not force.

High-Probability Question

"How do you explain AI limitations to a non-technical executive?"

Demonstrate translation skill:

"I translate model behaviour into business language and consequences. If precision is 87%, I say: 13 in every 100 recommendations will be wrong. Then I immediately answer the follow-up question they haven't asked yet: here's our plan to catch those errors before they reach the customer.

I also set up the conversation upfront — before the executive ever sees the model output, I've aligned them on what good looks like and what the failure modes are. That way the first time they see an error is never a surprise."

Strategy

Roadmapping for AI Products

AI roadmaps are different. Model performance is uncertain, dependencies are non-linear, and stakeholders expect the certainty of a software roadmap from a system that doesn't behave like software.

Now / Next / Later — applied to AI

The standard Now/Next/Later format works for AI — but each horizon needs to account for model uncertainty.

Now (current sprint/quarter): Features backed by a model that has already met its performance threshold in evaluation. Confidence is high. Commitments are firm.

Next (next quarter): Features dependent on a model currently in training or evaluation. Commit to the outcome — "reduce manual review by 30%" — not the implementation. Flag the performance gate explicitly.

Later (beyond 6 months): Features that depend on model capabilities that don't yet exist or haven't been validated at scale. Label these as "directional" — not commitments. Revisit each quarter.

What makes AI roadmaps fail

Treating model training like feature development

A feature takes as long as it takes to build. A model takes as long as it takes to reach a performance threshold — which is genuinely uncertain. Never put a model performance milestone on the same timeline as a feature milestone without explicit uncertainty buffers.

Roadmapping the model, not the outcome

"Ship intent detection model" is an output. "Reduce support ticket misrouting by 40%" is an outcome. Roadmap the outcome — the model is an implementation detail that may change. Stakeholders care about results, not architectures.

Not accounting for data dependency

Every AI feature is blocked by data until it isn't. Map your data acquisition, labelling, and pipeline readiness on the roadmap as first-class work items — not invisible dependencies.

Dealing with model dependency

The performance gate

Every AI feature on the roadmap should have an explicit performance gate: the minimum model quality threshold at which you will ship. Define this in advance — precision, recall, latency, or whichever metric matters — so the "go" decision isn't made under launch pressure.

Build in a fallback path

If the model doesn't hit the gate in time, what happens? A good AI roadmap has a fallback for every AI-dependent feature: a rule-based approach, a manual process, or a simplified model. The fallback is not the plan B — it's the plan that keeps the product alive while the model catches up.

Stage your releases around model confidence

Shadow mode → limited beta → controlled rollout → full release. Each stage tests a broader population at a lower confidence threshold. Roadmap these stages explicitly — they are milestones, not testing phases.

Typical AI feature lifecycle on a roadmap

1. Data readiness — labelled data acquired, pipeline validated → gate: training set complete
2. Model training & evaluation — initial model built, evaluated against baseline → gate: performance threshold met
3. Shadow mode — model runs in parallel, predictions not shown to users → gate: production parity with evaluation metrics
4. Limited beta — shown to a small user cohort, human review on all decisions → gate: user feedback positive, no safety incidents
5. Controlled rollout — expanding cohort, automated decisions with monitoring → gate: drift monitoring stable
6. Full release — full production, ongoing monitoring, retraining schedule established

Platform vs Feature — the core AI PM decision

Build a platform when: multiple teams or products will need the same AI capability. Example: a shared intent classification service used by support, sales, and onboarding. The investment in a reusable platform pays back when the second consumer arrives.

Build a feature when: the use case is specific, timelines are tight, and reuse is speculative. Don't over-engineer for scale you may never need. A tightly-scoped feature that ships is worth more than a platform that's 6 months away.

Questions to ask before choosing

Who else needs this capability?

If you can name two other teams who would use this capability in the next 12 months, platform investment is worth discussing. If it's speculative, ship the feature.

How fast is the underlying model evolving?

Building a platform on top of a rapidly-evolving model is risky — the abstraction may need to be rebuilt as the model capabilities shift. Prefer platform investment when the model layer is relatively stable.

What's the retraining and update cadence?

A platform shared across products must handle model updates that don't break consumers. That versioning and compatibility contract is an additional engineering investment — factor it into the platform decision.

Communicating uncertainty in AI roadmaps

Use confidence levels, not just dates

Label roadmap items: High confidence (model validated, shipping next sprint), Medium confidence (model in evaluation, performance gate defined), Low confidence (model not started, directional intent only). This is more honest and more useful than a date that everyone knows is a guess.

Separate outcome commitments from implementation commitments

Commit to the business outcome ("reduce misrouting by 30% by Q3") but be explicit that the implementation path may change based on model performance. Stakeholders can accept implementation pivots when the outcome commitment is clear.

Make the performance gate visible

Put the go/no-go criteria on the roadmap itself: "Ship when precision ≥ 90% on evaluation set." This makes the decision criteria transparent and removes subjectivity from the launch conversation.

The stakeholder conversation you need to have upfront

"I want to set expectations about how this roadmap works differently from a traditional software roadmap.

The features that depend on model performance have two kinds of uncertainty: when the model will be ready, and whether it will reach the threshold we need. I've built in performance gates and fallback paths for each of those items.

What I'm committing to is the outcome and the process — not a specific implementation date for AI-dependent features. I'll give you a confident date once the model clears evaluation. Is that a framing you can work with?"

High-Probability Question

"How do you build a roadmap when you don't know how well the model will perform?"

Show structured thinking under uncertainty:

"I separate what I can commit to from what I can only direct. I commit to the business outcome and the process — discovery, data readiness, evaluation, staged rollout. I don't commit to a ship date for an AI-dependent feature until the model has cleared its performance gate.

I use a Now/Next/Later format with explicit confidence levels, and I define fallback paths for every model-dependent item — so the roadmap stays credible even if the model takes longer than expected.

The thing I do above all else is agree on the go/no-go criteria with stakeholders before development starts. That removes the subjective pressure at launch time and replaces it with a shared, data-driven decision."

Follow-Up Question

"A stakeholder wants a firm ship date for an AI feature. How do you handle that?"

Show you can hold your ground while being collaborative:

"I understand the pressure — a firm date is easier to plan around. But giving a date I can't defend doesn't help either of us.

What I'll do is give you two things: a date by which we'll have a go/no-go decision — that I can commit to firmly — and a range for the ship date conditional on the model hitting its threshold. If the model clears evaluation by sprint 6, we ship in sprint 8. If it doesn't, here's our fallback and the revised timeline.

That way you have a planning anchor and I'm not setting us both up for a missed commitment."