Part 11: From Idea to AI Product
Chapter 36 · Section 36.3

Risk and Feasibility Assessment

"The demo worked perfectly. Then we tried it on real data and discovered that 'summarize this document' is not one problem but forty-seven problems wearing a trench coat."

Deploy, Trench-Coat-Detecting AI Agent
Big Picture

AI feasibility is not software feasibility. In traditional product development, if you can describe a feature clearly, an engineer can almost certainly build it. AI inverts this assumption. A feature that sounds trivially simple ("summarize this legal contract") may be infeasible at the quality level the domain demands. This section equips you with structured tools to assess feasibility before committing engineering resources: an error tolerance framework, a technical feasibility matrix, a data readiness checklist, a regulatory pre-screen, and a reusable Feasibility Scorecard that forces explicit scoring across every dimension.

Prerequisites

This section builds on the AI Role Canvas from Section 36.2. It assumes familiarity with evaluation and observability (Chapter 29), safety and regulation (Chapter 32), and prompt engineering (Chapter 11). Prior exposure to AI strategy (Chapter 33) will help contextualize the organizational dimensions of feasibility.

A traffic light with risk, assessment, and feasibility icons in its three lights, with a product manager holding a scorecard beside it.
Figure 36.3.1: Before committing engineering resources, score every AI product idea across technical, data, regulatory, and economic dimensions. The Feasibility Scorecard makes hidden risks visible.

1. Why Feasibility Comes First

Traditional software product design follows a familiar sequence: identify the user need, design the feature, estimate the engineering effort, build it. Feasibility is rarely in doubt because the relationship between specification and implementation is predictable. If you can describe the business logic precisely, a competent team can implement it.

Fun Fact

In traditional software, "Can we build it?" is almost always yes. In AI products, "Can we build it well enough?" is the question that kills projects. The graveyard of AI startups is full of teams that could build the feature but could not build it at the quality the domain demanded.

AI changes this assumption at its root. The relationship between specification and implementation is probabilistic. You can describe "summarize this document accurately" with perfect clarity, but whether a model can actually do it depends on the document type, the required accuracy threshold, the domain vocabulary, the length distribution, and a dozen other variables that you cannot determine from the specification alone. A feature that works flawlessly on blog posts may hallucinate critical details when applied to medical discharge summaries.

This means feasibility assessment must move from the middle of the product cycle (where it lives in traditional development) to the very beginning. You must verify that the AI can do what you need it to do before you commit to building the product around it.

Key Insight

Feasibility-first product design is not pessimism; it is resource efficiency. Teams that validate feasibility early kill bad ideas cheaply and redirect effort toward features that can actually deliver value. Teams that skip feasibility assessment discover infeasibility after months of engineering, when the sunk cost makes it politically difficult to pivot. The Feasibility Scorecard introduced later in this section is designed to make that early validation systematic rather than ad hoc.

2. Error Tolerance as a Design Constraint

Not all errors are created equal. A writing assistant that occasionally suggests an awkward phrase is mildly annoying. A medical triage system that misclassifies a chest pain case is potentially lethal. A financial approval system that incorrectly greenlights a fraudulent transaction costs real money. The acceptable error rate for an AI feature is not a technical parameter; it is a product design constraint that must be set before any model selection or prompt engineering begins.

The following table provides a starting framework for thinking about error tolerance across domains. These are not rigid thresholds; they are conversation starters that force product teams to make the implicit explicit.

Table 36.3.1: Error Tolerance by Domain
| Domain | Example Feature | Error Tolerance | Consequence of Error |
|---|---|---|---|
| Creative writing | Blog post draft | High (5-15% error acceptable) | Human editor catches issues; low stakes |
| Customer support | Ticket classification | Moderate (2-5%) | Misrouted tickets delay resolution |
| Legal | Contract clause extraction | Low (0.5-2%) | Missed clauses create liability |
| Healthcare | Symptom triage | Very low (<0.1%) | Misclassification risks patient safety |
| Finance | Transaction approval | Very low (<0.1%) | False approvals cause direct financial loss |

Notice the pattern: as the cost of a single error increases, the acceptable error rate drops by orders of magnitude. This has direct implications for model selection, system architecture, and whether the feature is feasible at all. A domain that requires <0.1% error may need a multi-stage pipeline (model generates, second model verifies, human reviews edge cases) rather than a single model call. The evaluation framework from Chapter 29 provides the tooling to measure whether your system actually meets the target error rate in production.
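Tighter tolerances also raise the evaluation bar: verifying a very low error rate takes far more test data than teams expect. A standard statistical rule of thumb (the "rule of three," not specific to this chapter) gives a quick estimate of how many error-free test samples you need to bound the true error rate below a target:

```python
import math

def samples_to_verify(max_error_rate: float, confidence: float = 0.95) -> int:
    """Rule-of-three style estimate: if you observe zero errors in n samples,
    the true error rate is below roughly -ln(1 - confidence) / n with the
    given confidence. Solve for n given a target error rate."""
    # -ln(1 - 0.95) is approximately 3.0, hence the "rule of three"
    return math.ceil(-math.log(1 - confidence) / max_error_rate)

# A 5% tolerance (creative writing) is cheap to verify;
# a 0.1% tolerance (triage, finance) needs thousands of clean samples.
print(samples_to_verify(0.05))    # 60
print(samples_to_verify(0.001))   # 2996
```

This is why the <0.1% rows in the table usually imply not just a multi-stage architecture but also a substantial labeled evaluation set before launch.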

Fun Note

There is an old joke in aviation: "If builders built buildings the way programmers write programs, the first woodpecker to come along would destroy civilization." The AI version is more pointed: if product managers shipped AI features the way they ship traditional features, the first edge case would destroy the quarterly revenue forecast. Error tolerance scoring exists to prevent the woodpecker scenario.

3. The Technical Feasibility Matrix

For each candidate AI feature, rate the following five dimensions on a 1-5 scale (1 = major blocker, 5 = no concern). A feature that scores below 3 on any single dimension requires a mitigation plan before proceeding. A feature with two or more dimensions below 3 should be reconsidered entirely.

  1. Model capability. Can current models perform this task at the required quality level? Run a quick benchmark with 50 to 100 representative examples before scoring. Do not rely on demos or anecdotal impressions. The evaluation techniques from Chapter 29 apply even at this early stage.
  2. Data availability. Does the training or retrieval data exist? Is it accessible to your team? Is it labeled or can it be labeled at reasonable cost? Section 4 below provides a detailed data readiness checklist.
  3. Latency budget. What response time does the user experience demand? Real-time chat requires sub-second responses. Batch document processing can tolerate minutes. Some model and architecture choices are eliminated by latency alone.
  4. Cost ceiling. What is the maximum per-request cost the business model supports? A feature that costs $0.50 per invocation is viable for a $200/month enterprise subscription but not for a free consumer app. Include model inference, retrieval, and any post-processing in the cost estimate.
  5. Regulatory constraints. Does this feature fall under specific regulatory requirements (EU AI Act, HIPAA, SOC 2, industry-specific rules)? Section 5 below covers regulatory pre-screening in detail. See also Chapter 32 for a comprehensive treatment of AI safety and regulation.
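Dimension 1 can be sanity-checked with a tiny harness before anyone argues about scores. The sketch below is illustrative: `run_model` and the labeled examples are placeholders for your own model call and benchmark set, and exact-match grading is a stand-in for a task-appropriate grader (see Chapter 29):

```python
from typing import Callable

def benchmark_capability(examples: list[tuple[str, str]],
                         run_model: Callable[[str], str],
                         error_tolerance: float) -> dict:
    """Score 'model capability' against labeled (input, expected) pairs.
    Exact-match comparison is a simplification; real tasks need a grader
    suited to the task (rubric scoring, semantic match, etc.)."""
    errors = sum(1 for inp, expected in examples
                 if run_model(inp).strip() != expected.strip())
    error_rate = errors / len(examples)
    return {"n": len(examples),
            "error_rate": error_rate,
            "meets_tolerance": error_rate <= error_tolerance}

# Toy stand-in "model" for illustration only; use 50+ real examples.
examples = [("2+2", "4"), ("3+3", "6"), ("5+5", "10")]
result = benchmark_capability(examples, lambda x: str(eval(x)), 0.05)
print(result["meets_tolerance"])  # True
```

The point of running even a rough harness like this is to replace demo impressions with a measured error rate before the scoring session.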
A large funnel filled with dozens of colorful lightbulb ideas at the top, passing through five progressively finer filter layers. Most lightbulbs bounce off with red X marks at each stage, while only two or three emerge at the bottom into a green viable zone. A product manager stands below catching the survivors.
Figure 36.3.2: The feasibility funnel in action. Dozens of promising ideas enter the top, but each filter layer (model capability, data readiness, latency, cost, regulatory compliance) eliminates ideas that cannot survive production conditions.
Real-World Scenario: Scoring a Contract Analysis Feature

Who: A product manager at a legal-tech startup planning an AI feature that extracts key clauses from commercial contracts.

Situation: The team ran a feasibility matrix across five dimensions: model capability (3/5), data availability (2/5), latency budget (5/5), cost ceiling (4/5), and regulatory constraints (2/5). Enterprise pricing at $500/month supported the cost, and batch processing met latency expectations.

Problem: Two dimensions scored below the viability threshold of 3. The startup had only 200 contracts, all unlabeled and covered by client NDAs. Client data also included PII and confidential business terms requiring SOC 2 compliance that the team had not yet secured.

Decision: Rather than pushing ahead with a full build, the team allocated a two-week spike to negotiate a data-use agreement with three pilot clients and evaluate SOC 2-compliant hosting providers.

Result: The spike confirmed that data access was achievable (two of three clients agreed) but SOC 2 hosting added $1,200/month to operating costs, which the team factored into revised pricing before committing to the build.

Lesson: Scoring feasibility dimensions numerically turns "we think this will work" into a structured decision with clear blockers and mitigation plans.

4. Data Readiness Assessment

Model capability is necessary but not sufficient. Even the most powerful model cannot perform well without appropriate data for retrieval, fine-tuning, or evaluation. The following checklist covers the four questions every team must answer before committing to an AI feature.

4.1 Does the Data Exist?

This sounds obvious, but teams frequently assume data availability. "We have thousands of customer support tickets" may be true, but if those tickets are in a legacy system with no API, or stored as screenshots rather than text, the data effectively does not exist for your purposes.

4.2 Is It Accessible?

Data may exist but be locked behind organizational, legal, or technical barriers. Cross-department data sharing agreements, vendor API rate limits, and data residency requirements all affect accessibility. Map these barriers early.

4.3 Is It Labeled?

For evaluation and fine-tuning, you need labeled examples. If labels do not exist, budget for the annotation effort. A common rule of thumb: 200 to 500 labeled examples for initial evaluation, 1,000 or more for fine-tuning. The synthetic data techniques from Chapter 13 can supplement human annotation, but they cannot fully replace it for high-stakes domains.
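A back-of-envelope annotation estimate is often enough to score this question. The sketch below is illustrative; the throughput and pay rates are assumptions you should replace with your own annotators' numbers:

```python
def annotation_budget(n_examples: int,
                      minutes_per_label: float,
                      hourly_rate: float,
                      labels_per_example: int = 1) -> dict:
    """Rough annotation cost estimate. Set labels_per_example > 1 when
    you need multiple annotators per item for agreement checks."""
    hours = n_examples * labels_per_example * minutes_per_label / 60
    return {"hours": round(hours, 1), "cost": round(hours * hourly_rate, 2)}

# 500 evaluation examples at 3 minutes each, $30/hour (assumed rates):
print(annotation_budget(500, 3, 30))  # {'hours': 25.0, 'cost': 750.0}
```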

4.4 What Are the Privacy Constraints?

Does the data contain personally identifiable information (PII)? Is it subject to GDPR, CCPA, HIPAA, or other privacy regulations? Can it be sent to third-party model providers, or must inference run on-premises? These constraints directly affect architecture choices and cost. The AI Role Canvas from Section 36.2 includes a privacy field precisely because privacy constraints must be surfaced at the design stage, not discovered during implementation.
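A quick screen of a data sample for obvious PII can inform this question early. The patterns below are illustrative only; production PII detection should use a vetted library or service, not a handful of regexes:

```python
import re

# Illustrative patterns only: real PII (names, addresses, medical data)
# will not be caught by simple regexes.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}

def pii_screen(text: str) -> list[str]:
    """Flag which PII categories appear in a sample document."""
    return [name for name, pat in PII_PATTERNS.items() if pat.search(text)]

sample = "Contact jane.doe@acme.com or 555-867-5309 re: SSN 123-45-6789."
print(pii_screen(sample))  # ['email', 'us_ssn', 'phone']
```

Even a crude screen like this, run over a representative sample, tells you whether third-party inference is off the table before architecture decisions are made.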

5. Regulatory Pre-Screening

Regulatory compliance is not a checkbox you tick at launch; it is a constraint that shapes your architecture from day one. Two frameworks are especially relevant for AI product teams in 2025 and beyond.

5.1 EU AI Act Risk Tiers

The EU AI Act classifies AI systems into four risk tiers, each with different compliance obligations:

  1. Unacceptable risk. Practices banned outright, such as social scoring by public authorities and certain manipulative or exploitative systems.
  2. High risk. Systems in sensitive domains (medical devices, credit scoring, employment, critical infrastructure) that require conformity assessments, documentation, and human oversight.
  3. Limited risk. Systems subject to transparency obligations, such as disclosing that a user is interacting with an AI.
  4. Minimal risk. Everything else, with no obligations beyond existing law.

Determine your feature's risk tier early. A feature classified as "high risk" adds months of compliance work and ongoing audit costs. This may not make the feature infeasible, but it must be factored into the timeline and budget. Chapter 32 provides a comprehensive guide to navigating these requirements.

5.2 OWASP LLM Top 10

The OWASP Top 10 for Large Language Model Applications catalogs the most common security vulnerabilities in LLM-based systems, including prompt injection, insecure output handling, sensitive information disclosure, and excessive agency. During feasibility assessment, scan your proposed feature against this list to identify which vulnerabilities are relevant.

You do not need to solve every vulnerability at the feasibility stage, but you must identify which ones apply and estimate the mitigation effort. A feature where prompt injection could lead to unauthorized data access requires fundamentally different architecture than a feature where the worst outcome is a poorly worded summary.

6. Cross-Functional Decision-Making

In traditional product development, a product manager can assess feasibility by consulting with engineers about implementation complexity. AI feasibility requires a broader coalition. The product manager understands the user need and business constraints. The data scientist or ML engineer understands model capabilities and limitations. The data engineer knows what data exists and how to access it. The legal or compliance team understands regulatory requirements.

The Feasibility Scorecard (introduced below) is deliberately designed as a cross-functional artifact. No single person can fill it out completely. This is by design: if a product manager can fill out every field without consulting anyone else, the scorecard is not doing its job.

Key Insight

The Feasibility Scorecard is a communication tool as much as an assessment tool. Its primary value is not the final scores but the conversations it forces. When the product manager discovers that the data scientist rates "model capability" at 2 while the PM assumed it was a 4, that gap in understanding is the most important finding of the entire assessment. Surfacing these gaps before engineering begins prevents the far more expensive discovery during a failed sprint review. This principle echoes the cross-functional collaboration patterns discussed in Chapter 33 on AI strategy.

7. Deliverable: The Feasibility Scorecard

The Feasibility Scorecard brings together every dimension discussed in this section into a single, structured artifact. Like the AI Role Canvas from Section 36.2, it is designed to be filled out before any implementation begins. The scorecard produces a composite feasibility score and, more importantly, surfaces any dimension that falls below the viability threshold.

The scorecard is a structured document (a spreadsheet, a Notion table, or a YAML file in your repository). Each dimension gets a 1-to-5 score, a rationale, and, if the score falls below 3, a mandatory mitigation plan. The decision rules are straightforward: if no dimension falls below 3, the verdict is GO. If exactly one dimension falls below 3, the verdict is CONDITIONAL, and the feature proceeds only after a time-boxed spike resolves the blocker. If two or more dimensions fall below 3, the verdict is NO_GO until the blockers are mitigated and the feature is re-scored.
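Encoded as a small function, the decision logic looks like the sketch below. The dimension names and the below-3 threshold follow this section's conventions, not any standard schema:

```python
THRESHOLD = 3  # any dimension scored below this is a blocker

def scorecard_verdict(scores: dict[str, int]) -> dict:
    """Apply the decision rules: no blockers -> GO, exactly one blocker ->
    CONDITIONAL (run a spike, then re-score), two or more -> NO_GO."""
    blockers = [dim for dim, s in scores.items() if s < THRESHOLD]
    composite = sum(scores.values()) / len(scores)
    verdict = ("GO" if not blockers
               else "CONDITIONAL" if len(blockers) == 1
               else "NO_GO")
    return {"composite": round(composite, 1),
            "blockers": blockers,
            "verdict": verdict}

# The contract clause extraction scores from the scenario below:
scores = {"model_capability": 3, "data_availability": 2,
          "latency_budget": 5, "cost_ceiling": 4, "regulatory": 2}
print(scorecard_verdict(scores))
# composite 3.2, two blockers, verdict NO_GO
```

Keeping the logic this explicit (in a script or spreadsheet formula) prevents teams from quietly averaging away a blocker with strong scores elsewhere.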

Real-World Scenario: Contract Clause Extraction Scorecard

Who: A cross-functional team (product manager, ML engineer, legal counsel) at a legal-tech startup evaluating a clause extraction feature.

Situation: The team used the Feasibility Scorecard to assess extracting key clauses (termination, liability, IP assignment) from commercial contracts. Error tolerance: 1%. EU AI Act tier: Limited.

Problem: The scorecard revealed two blockers. Data availability scored 2/5 (200 contracts on hand, unlabeled and covered by client NDAs). Regulatory scored 2/5 (client data includes PII and confidential business terms requiring SOC 2 compliance). The remaining dimensions were healthy: model capability 3/5, latency 5/5, cost ceiling 4/5.

| Dimension | Score | Rationale | Mitigation |
|---|---|---|---|
| Model Capability | 3/5 | GPT-4 class models handle standard clauses well but struggle with unusual structures and jurisdiction-specific language. | |
| Data Availability | 2/5 [BLOCKER] | 200 contracts on hand, but unlabeled and covered by client NDAs. | Negotiate data-use agreement with 3 pilot clients. Budget 4 weeks for annotation. |
| Latency Budget | 5/5 | Batch processing; users expect results within 30s. | |
| Cost Ceiling | 4/5 | Enterprise pricing at $500/mo supports ~$0.20 per analysis. | |
| Regulatory | 2/5 [BLOCKER] | Client data includes PII and confidential business terms. SOC 2 compliance required. | Evaluate SOC 2 hosting providers. Prepare data processing agreements. ~6 weeks. |

Decision: With a composite score of 3.2/5.0 and two blockers, the scorecard returned a NO_GO verdict. The team chose to run a 6-week spike to resolve data access and compliance before committing to full build.

Result: The spike secured data-use agreements with two pilot clients and identified a SOC 2-compliant hosting provider. A re-score after the spike yielded 4.0/5.0 with zero blockers, upgrading the verdict to GO.

Lesson: A structured scorecard turns a subjective "should we build this?" debate into a traceable decision with explicit blockers and re-score criteria.

8. Integrating the Scorecard into Your Workflow

The Feasibility Scorecard is most effective when it becomes a gate in your product development process, not an optional exercise. Here is a recommended workflow:

  1. After the AI Role Canvas. Once you have defined the model's role using the canvas from Section 36.2, immediately fill out a Feasibility Scorecard for that role. The canvas tells you what the model should do; the scorecard tells you whether it can.
  2. Cross-functional scoring session. Gather the product manager, ML engineer, data engineer, and legal or compliance representative in the same room (or call). Each person scores the dimensions they own. Discuss disagreements explicitly.
  3. Decision gate. GO means proceed to prototyping. CONDITIONAL means run a time-boxed spike (typically one to four weeks) to resolve the identified blocker, then re-score. NO_GO means pivot: change the feature scope, change the model's role, or drop the feature entirely.
  4. Re-score after spikes. When a mitigation spike completes, update the relevant dimension scores and re-run the decision logic. The scorecard's version history becomes a record of how feasibility evolved.
Fun Note

One team we interviewed printed their Feasibility Scorecards on large poster paper and hung them next to the team's sprint board. Within a week, engineers started checking the scorecard before picking up AI-related tickets, asking "did we actually validate this dimension?" The scorecards became the team's immune system against premature feature commitments.

9. Common Feasibility Traps

Even with a structured scorecard, teams fall into predictable traps during feasibility assessment:

  1. The demo delusion. The model performs impressively on five hand-picked examples, so the team scores "model capability" at 5. In reality, demo performance correlates poorly with production performance. Always score model capability based on a systematic benchmark of 50 or more representative examples, including edge cases. This connects directly to the evaluation philosophy from Chapter 29.
  2. The "data exists somewhere" assumption. The team assumes the data they need is available because the organization has lots of data in general. In practice, the specific data they need may be in a different department's system, in an incompatible format, or encumbered by legal restrictions that take months to resolve.
  3. Ignoring cost at scale. A feature that costs $0.05 per request during prototyping seems cheap. At 100,000 daily active users making 10 requests each, that is $50,000 per day. Always project cost at target scale, not prototype scale.
  4. Treating regulation as a launch-day concern. Teams discover regulatory requirements after building the feature, then face a choice between an expensive retrofit and scrapping months of work. Regulatory pre-screening at the feasibility stage prevents this.
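The cost-at-scale trap in particular yields to simple arithmetic. A minimal projection, using the numbers from the trap above:

```python
def daily_cost(cost_per_request: float,
               daily_users: int,
               requests_per_user: float) -> float:
    """Project per-request cost to target scale, not prototype scale."""
    return cost_per_request * daily_users * requests_per_user

# $0.05/request looks cheap in a prototype...
# ...but at 100,000 DAU making 10 requests each, it is $50,000 per day.
print(daily_cost(0.05, 100_000, 10))
```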
Warning: The Sunk Cost Trap

The most dangerous moment in AI product development is when a team has spent three months building a feature, discovers a feasibility blocker, and decides to "push through" rather than pivot. The Feasibility Scorecard exists precisely to move this discovery to week one rather than month three. If you find yourself arguing that "we've invested too much to stop now," that is the sunk cost fallacy talking, and it is the strongest possible signal that you should stop.

Key Takeaways

  - AI feasibility must be assessed before engineering begins: a clearly specified feature can still be infeasible at the quality level the domain demands.
  - Error tolerance is a product design constraint set per domain before model selection; tighter tolerances push toward multi-stage architectures and larger evaluation sets.
  - Score five dimensions (model capability, data availability, latency budget, cost ceiling, regulatory constraints) on a 1-5 scale; any dimension below 3 is a blocker that needs a mitigation plan.
  - Data readiness means answering four questions: does the data exist, is it accessible, is it labeled, and what privacy constraints apply.
  - The Feasibility Scorecard is a cross-functional decision gate: GO, CONDITIONAL (spike, then re-score), or NO_GO. Its greatest value is the conversations it forces.

What Comes Next

With feasibility validated (or blockers identified and mitigated), Section 36.4: Case Studies: Role Assignment in Practice walks through three real-world examples showing how teams applied the AI Role Canvas and the Feasibility Scorecard to make concrete product decisions.

Self-Check
Q1: A healthcare startup wants to build an AI feature that triages patient symptoms into urgency categories. What error tolerance range is appropriate, and what does that imply for the system architecture?
Answer:
Symptom triage is a safety-critical domain requiring very low error tolerance (<0.1%). This implies a multi-stage architecture: the model performs initial classification, a second model or rule-based system verifies the classification, and a human clinician reviews all cases flagged as high-urgency or low-confidence. A single model call is almost certainly insufficient for this error tolerance. Additionally, the feature likely falls under "high risk" in the EU AI Act, requiring conformity assessments and human oversight by design.
Q2: Name the five dimensions of the Technical Feasibility Matrix and explain why a score below 3 on any single dimension is treated differently from a low composite score.
Answer:
The five dimensions are: (1) model capability, (2) data availability, (3) latency budget, (4) cost ceiling, and (5) regulatory constraints. A single dimension below 3 is treated as a blocker (rather than relying on the composite) because feasibility dimensions are not interchangeable. A perfect score on latency and cost cannot compensate for data that does not exist or a regulatory prohibition. Each dimension represents an independent prerequisite, so a critical failure in any one of them can make the entire feature infeasible regardless of how well the other dimensions score.
Q3: Why is the Feasibility Scorecard described as a "cross-functional communication tool" rather than simply an assessment template?
Answer:
No single person has the expertise to accurately score all five dimensions. The product manager understands user needs and business constraints but may overestimate model capability. The ML engineer understands model limitations but may underestimate regulatory complexity. The data engineer knows what data is accessible. The legal or compliance team knows the regulatory landscape. By requiring input from all these roles, the scorecard surfaces gaps in understanding (such as when the PM rates model capability at 4 while the ML engineer rates it at 2) that would otherwise remain hidden until implementation. These gaps are the most valuable output of the assessment process.

Bibliography

AI Product Design

Lovejoy, J. and Holbrook, J. (2024). "People + AI Guidebook." Google PAIR. pair.withgoogle.com/guidebook

Google's practical guide to designing human-AI experiences. Directly relevant to error tolerance design and the principle that AI capability must match user expectations before shipping.

Narayanan, D. and Kapoor, S. (2024). "AI Changes the Most Basic Assumption of Software Product Design." AI Snake Oil (Substack). aisnakeoil.com

The article that motivates this section's core thesis: traditional software feasibility assumptions do not apply to AI features. Introduces the concept that a clearly specified feature can still be infeasible due to the probabilistic nature of model outputs.
Safety and Regulation

European Commission (2024). "Regulation (EU) 2024/1689: Artificial Intelligence Act." Official Journal of the European Union. EUR-Lex

The full text of the EU AI Act, including the risk tier classification system discussed in Section 5.1. Essential reference for any team building AI products that serve European users.

OWASP Foundation (2025). "OWASP Top 10 for Large Language Model Applications." owasp.org

Catalogs the ten most critical security vulnerabilities in LLM-based systems, including prompt injection, insecure output handling, and excessive agency. A practical security checklist for the regulatory pre-screening step of feasibility assessment.
Evaluation and Testing

Ribeiro, M. T., Wu, T., Guestrin, C., and Singh, S. (2020). "Beyond Accuracy: Behavioral Testing of NLP Models with CheckList." Proceedings of ACL 2020. doi:10.18653/v1/2020.acl-main.442

Introduces systematic behavioral testing for NLP models. Directly relevant to the "model capability" dimension of the feasibility matrix: structured testing with representative examples is essential for accurate capability scoring, as opposed to relying on demos or anecdotal results.