The Prompt Codex
A practical, role-specific guide to writing effective prompts for every function in a modern data organization. From data engineers to governance leads. Zero fluff.
Build & Ship
Data Engineering, ML Engineering, Data Science, QA. You build pipelines, train models, and write code. These prompts help you leverage LLMs for code generation, debugging, documentation, and testing.
Analyze & Decide
Product Manager, BI, Data Analyst, Data Governance. You analyze data, define requirements, and set policy. These prompts help you extract insights, write specs, and enforce standards.
Prompt Engineering Foundations
The core principles that separate effective prompts from wasted tokens. Master these before learning any technique.
The Four Pillars
Clarity
Be unambiguous. State exactly what you want the model to do. Remove words that could be interpreted multiple ways.
Specificity
Constrain the output format, length, and scope. Tell the model what you want and what you do not want.
Context
Provide the background the model needs. Include tech stack, data volumes, constraints, and domain knowledge.
Structure
Use delimiters, sections, and formatting to organize complex prompts. Numbered steps, XML tags, and markdown all work.
Anatomy of a Prompt
Common Pitfalls
| Pitfall | What Happens | Fix |
|---|---|---|
| Vague instructions | Model guesses intent, produces generic output | State the exact task, audience, and purpose |
| No output format | Inconsistent structure across responses | Specify JSON, markdown, table, or bullet format |
| Missing context | Toy-level answers that ignore your constraints | Include tech stack, data volume, and constraints |
| Too many tasks | Model prioritizes some, drops others | One prompt, one task. Chain for multi-step work |
| No examples | Model invents its own format | Add 1-2 input/output examples for complex tasks |
| Leading language | Model confirms your bias instead of analyzing | Ask open-ended questions, request counterarguments |
Prompt Quality Checklist
- Clear task instruction: does the prompt have a single, unambiguous task?
- Output format specified: JSON, markdown, table, numbered list, etc.
- Relevant context provided: tech stack, data volumes, constraints, domain info
- Constraints defined: what to avoid, boundaries, limitations
- Examples included (if complex): 1-2 input/output examples for non-trivial formats
- Single focused task: one prompt, one job. Multi-step? Use chaining.
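The checklist above can be partially automated. Below is a minimal, purely heuristic "prompt linter" sketch; the keyword lists and thresholds are illustrative assumptions, not a real standard, and a keyword match is no substitute for human review.

```python
# Heuristic prompt "linter" mirroring the quality checklist.
# Keyword lists below are illustrative assumptions only.
FORMAT_HINTS = ("json", "markdown", "table", "bullet", "numbered list", "csv")
CONTEXT_HINTS = ("stack", "schema", "rows", "constraint", "version", "volume")

def lint_prompt(prompt: str) -> list[str]:
    """Return a list of checklist items the prompt appears to miss."""
    text = prompt.lower()
    issues = []
    if not any(h in text for h in FORMAT_HINTS):
        issues.append("no output format specified")
    if not any(h in text for h in CONTEXT_HINTS):
        issues.append("no context (stack, volume, constraints)")
    if text.count("?") + text.count(".") > 15:
        issues.append("possibly multiple tasks; consider chaining")
    return issues
```

Running `lint_prompt("Summarize this report.")` flags both the missing format and the missing context, while a prompt that names its output table and tech stack passes clean.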
Chatbot vs Agent Prompts
There is a fundamental difference between prompting a chatbot (one-shot Q&A) and configuring an AI agent (persistent, tool-using, acting on your behalf). Most people write prompts for chatbots when they actually need agent instructions.
Chatbot prompt: "Summarize this quarterly report."
Works fine for quick tasks where you review every output before acting on it.
Agent instructions: "You are a data quality monitor. Check the daily_transactions table every morning. Flag anomalies >2 sigma. If critical, alert #data-oncall in Slack. Never fabricate data points."
Requires: role, scope, tools, escalation rules, and failure instructions.
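The five required fields can be enforced mechanically. Here is a small sketch that assembles agent instructions from those fields; the field contents and function name are invented for illustration.

```python
def build_agent_prompt(role, scope, tools, escalation, on_failure):
    """Assemble agent instructions covering all five required fields."""
    return "\n".join([
        f"You are {role}.",
        f"Scope: {scope}",
        f"Tools available: {', '.join(tools)}",
        f"Escalation: {escalation}",
        f"If you cannot complete the task: {on_failure}",
    ])

prompt = build_agent_prompt(
    role="a data quality monitor",
    scope="check the daily_transactions table every morning; flag anomalies >2 sigma",
    tools=["warehouse_query", "slack_alert"],
    escalation="if critical, alert #data-oncall in Slack",
    on_failure="say so explicitly; never fabricate data points",
)
```

The point of the function is not the string formatting; it is that a missing argument fails loudly at build time instead of silently shipping an agent with no escalation rule.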
Audience-Aware Prompting
The same task often needs different prompts depending on who will consume the output. A data report for an executive looks nothing like one for an engineer.
For an engineering audience, for example, the prompt might specify: "Format: markdown with code blocks. Be technical. Include query IDs."
Techniques & Patterns
The toolkit every prompt engineer needs — from basic zero-shot to advanced chaining strategies.
Zero-Shot Prompting
Difficulty: Basic. Best for: classification, summarization, and formatting when the task is unambiguous.
How: no examples needed; just describe the task clearly with constraints.
Strength: minimal tokens, quick to write, easy to iterate.
Limitation: output format may vary without examples to anchor it.
Few-Shot Prompting
Difficulty: Intermediate. Best for: when the format or style is specific and hard to describe in words alone.
How: provide 2-5 input/output pairs, then the actual input.
Strength: examples anchor the format, tone, and level of detail.
Limitation: each example consumes tokens; balance quality vs. cost.
Chain-of-Thought (CoT)
Difficulty: Advanced. Best for: math, logic, multi-step analysis, debugging, root cause analysis.
How: ask the model to show its reasoning before giving the final answer.
Strength: dramatically improves performance on reasoning-heavy tasks.
Limitation: more tokens in the response; may need parsing to extract the final answer.
Role / Persona Prompting
Difficulty: Intermediate. Best for: when you need responses calibrated to a specific skill level or perspective.
How: set the persona in the system prompt or at the start of the user message.
Strength: responses match the expertise level and vocabulary of the assigned role.
Limitation: the model may roleplay too hard; keep the persona relevant and bounded.
Structured Output
Request machine-parseable output (JSON, CSV, or a fixed table) so downstream code can consume the response without manual cleanup.
Iterative Refinement
Treat the first response as a draft: ask the model to critique and improve its own output over successive turns until it meets your bar.
Technique Comparison
| Technique | Best For | Complexity | Token Cost |
|---|---|---|---|
| Zero-Shot | Simple, well-defined tasks | Low | Minimal |
| Few-Shot | Format-specific output | Medium | Moderate |
| Chain-of-Thought | Reasoning & analysis | Medium | Moderate-High |
| Role Prompting | Domain expertise | Low | Minimal |
| Structured Output | Machine-readable results | Medium | Low |
| Prompt Chaining | Multi-step workflows | High | High |
Try All Three
- Pick a task from your daily work (e.g., writing SQL, reviewing a PR, drafting a spec)
- Write it 3 ways: zero-shot, few-shot, and chain-of-thought
- Run all 3 and compare the outputs for accuracy, format consistency, and usefulness
- Note which technique worked best and why — this builds your intuition
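To make the exercise concrete, here is one sketch of the same task written all three ways, as plain prompt strings. The task and ticket text are invented examples.

```python
TASK = "Classify this support ticket as bug, feature_request, or question."
TICKET = "The export button does nothing when I click it."

# Zero-shot: describe the task, constrain the output.
zero_shot = f"{TASK}\nTicket: {TICKET}\nAnswer with the label only."

# Few-shot: two input/output pairs anchor the format, then the real input.
few_shot = (
    f"{TASK}\n"
    "Ticket: 'Please add dark mode.' -> feature_request\n"
    "Ticket: 'How do I reset my password?' -> question\n"
    f"Ticket: '{TICKET}' ->"
)

# Chain-of-thought: ask for reasoning first, then a parseable final line.
chain_of_thought = (
    f"{TASK}\nTicket: {TICKET}\n"
    "First, reason step by step about what the user is reporting. "
    "Then give the final label on its own line prefixed with 'Label:'."
)
```

Run all three against the same model and compare: the few-shot version is the easiest to parse, the chain-of-thought version is the most reliable on ambiguous tickets, and the zero-shot version is the cheapest.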
Role-Based Examples
Real-world bad → good prompt transformations for every role in a data organization. Each section shows what not to do, why, and how to fix it.
Pipeline Builders & Infrastructure Owners
Builds and maintains data pipelines, ETL/ELT processes, data infrastructure. Owns data movement, transformation, and platform reliability.
Suggest 3 optimization strategies ranked by implementation effort. For each, explain the mechanism and expected impact.
a) Extracts data from PostgreSQL (connection_id='pg_source')
b) Transforms using a Python function that deduplicates on email column
c) Loads to BigQuery dataset 'analytics.users'
Requirements:
- Retry logic: 3 retries, 5-min delay
- SLA miss callback that sends to Slack
- Schedule: daily at 06:00 UTC
- Use type hints and docstrings
- Include DAG-level tags: ['etl', 'production']
1. Purpose (1-2 sentences)
2. Source tables and their owners
3. Transformation logic (step by step)
4. Output schema (table with column name, type, description)
5. Data quality assumptions
6. Downstream dependencies
7. SLA and refresh schedule
Model SQL:
[paste your SQL here]
Requirements Definers & Stakeholder Translators
Defines product requirements, prioritizes features, writes PRDs, and communicates across engineering, design, and business stakeholders.
- 5,000 Confluence pages, 200 DAU
- Current keyword search has 35% query failure rate
- Engineering team: 3 backend, 1 ML engineer
Include these sections:
1. Problem Statement (with metrics)
2. User Personas (3 types)
3. Proposed Solution (PM-level architecture)
4. Success Metrics with targets
5. Risks and Mitigations
6. Phased Rollout Plan (3 phases)
Keep it under 2 pages. Use data-driven language.
Feature: AI-powered data quality alerts in our warehouse monitoring tool
Roles: data engineer, data analyst, data governance lead
For each story include:
- 3-5 acceptance criteria
- Edge cases
- Non-functional requirements (latency, accuracy)
- Story points estimate (S/M/L)
Context: Team of 5 engineers, 10K monthly active users, goal is to reduce churn by 15%.
Features:
1. Email notification preferences
2. Dashboard sharing via link
3. Scheduled report exports
4. Single sign-on (SSO)
5. In-app onboarding tour
6. API rate limit dashboard
Model Builders & MLOps Operators
Trains, deploys, and monitors ML models. Builds inference pipelines, manages model lifecycle, and optimizes for latency and cost.
- Dataset: 500K rows, 45 features, 8% positive class
- Current AUC-ROC: 0.72 after tuning (max_depth=6, n_estimators=500, lr=0.1)
- I suspect class imbalance and possible feature leakage
Provide a systematic debugging approach:
1. Diagnostic checks to run (with Python code)
2. Resampling strategies to try (with tradeoffs)
3. Feature importance analysis steps
4. Validation strategy review
For each suggestion, explain WHY it might help and what to look for in the results.
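The first two diagnostic checks (class balance and feature leakage) can be sketched in plain Python. The synthetic data below stands in for the 500K-row dataset; `refund_issued` is an invented post-outcome field planted to show what a leaky feature looks like.

```python
import random

random.seed(0)

# Synthetic stand-in for the real dataset: ~8% positive class, with one
# feature ("refund_issued") that leaks the label because it is recorded
# after the outcome occurs.
rows = []
for _ in range(10_000):
    y = 1 if random.random() < 0.08 else 0
    rows.append({"spend": random.gauss(100, 20), "refund_issued": y, "label": y})

# Check 1: class balance.
pos_rate = sum(r["label"] for r in rows) / len(rows)

# Check 2: leakage screen. A feature that agrees with the label on nearly
# every row in-sample is suspect and should be traced back to its source.
def agreement(feature):
    return sum((r[feature] > 0.5) == (r["label"] == 1) for r in rows) / len(rows)

leaky = [f for f in ("spend", "refund_issued") if agreement(f) > 0.99]
```

In a real project you would run the same screen over all 45 features and follow up on anything above the threshold before touching resampling or hyperparameters, since leakage inflates validation AUC and masks the real problem.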
Pipeline steps:
1. Data preprocessing (tokenization, train/val/test split 80/10/10)
2. Training with W&B experiment tracking
3. Evaluation (F1-macro, per-label precision/recall)
4. Model registry upload to MLflow
Constraints:
- GPU node pool (T4 GPUs)
- Training must complete in <2 hours
- Pipeline must be idempotent
Output: Python pipeline DSL with comments explaining each step.
Analysts & Experiment Designers
Performs statistical analysis, builds predictive models, designs experiments, and communicates insights to stakeholders.
The script should:
a) Compute summary statistics per product category
b) Plot return rate trends over time (monthly, line chart)
c) Segment users by purchase frequency using RFM analysis
d) Test whether mobile vs desktop has a significantly different average order value (Mann-Whitney U, alpha=0.05)
Output: clean, commented Python code. Use seaborn for styling. Include print statements explaining each finding.
Details:
- Binary outcome (click/no-click)
- Sample: 50K users per group
- Baseline CTR: 3.2%
- Minimum detectable effect: 0.5 percentage points
- We need 80% power at alpha=0.05
Questions:
1. Which test is appropriate and why?
2. Are the assumptions met?
3. Calculate required sample size
4. Provide Python code using scipy.stats
5. How should we handle multiple comparisons if we add a third variant?
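For question 3, the prompt asks for scipy.stats; as a stdlib-only sanity check, here is the classic two-proportion sample-size formula worked through with the numbers above (baseline 3.2% vs 3.7%, alpha 0.05 two-sided, 80% power). The bisection inverse-CDF is a stand-in for `scipy.stats.norm.ppf`.

```python
from math import sqrt, erf

def z_quantile(p, lo=-10.0, hi=10.0):
    """Inverse standard-normal CDF by bisection (stdlib stand-in for norm.ppf)."""
    cdf = lambda x: 0.5 * (1 + erf(x / sqrt(2)))
    for _ in range(200):
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if cdf(mid) < p else (lo, mid)
    return (lo + hi) / 2

def n_per_group(p1, p2, alpha=0.05, power=0.80):
    """Two-proportion z-test sample size per group (two-sided)."""
    za, zb = z_quantile(1 - alpha / 2), z_quantile(power)
    pbar = (p1 + p2) / 2
    num = (za * sqrt(2 * pbar * (1 - pbar))
           + zb * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return num / (p1 - p2) ** 2

n = n_per_group(0.032, 0.037)  # baseline CTR vs +0.5 percentage points
```

The result lands around 21K users per group, so the planned 50K per group comfortably exceeds the requirement and leaves headroom for a third variant with a multiple-comparison correction.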
Dashboard Builders & KPI Definers
Builds dashboards, writes SQL for reporting, defines KPIs, and translates business questions into data queries.
Tables:
- subscriptions (id, user_id, plan_id, start_date, end_date, status)
- plans (id, name, monthly_price)
Requirements:
a) Only include active subscriptions (status='active', end_date IS NULL or > CURRENT_DATE)
b) Handle mid-month upgrades/downgrades by prorating
c) Output columns: month, total_mrr, new_mrr, churned_mrr, expansion_mrr
d) Use CTEs for readability
e) Include comments explaining the logic
f) Handle NULLs in end_date gracefully
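The prompt should produce SQL, but it helps to have an executable oracle for reviewing what the model returns. This Python stand-in computes total MRR per month on toy data (proration and the new/churned/expansion split are omitted); field names mirror the schema above.

```python
from datetime import date

# Toy subscriptions mirroring the schema in the prompt.
subscriptions = [
    {"user_id": 1, "monthly_price": 50.0,
     "start": date(2024, 1, 10), "end": None},
    {"user_id": 2, "monthly_price": 20.0,
     "start": date(2024, 1, 5), "end": date(2024, 2, 15)},
]

def mrr_for_month(subs, year, month):
    """Sum monthly_price of subscriptions active at the start of the month.

    Simplification: activity is judged by dates only; the status column
    and mid-month proration from the full spec are left out.
    """
    first = date(year, month, 1)
    total = 0.0
    for s in subs:
        started = s["start"] <= first
        not_ended = s["end"] is None or s["end"] > first
        if started and not_ended:
            total += s["monthly_price"]
    return total
```

Spot-checking the generated SQL against a few hand-computed months like these catches the most common bug: off-by-one handling of `end_date` at month boundaries.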
KPIs to include: ARR, Net Revenue Retention, CAC Payback Period, DAU/MAU Ratio, Support Ticket Resolution Time.
For each KPI specify:
1. Visualization type (number, line, bar, gauge)
2. Comparison benchmark (WoW, MoM, vs target)
3. Drill-down dimension
4. Alert threshold (red/yellow/green)
Layout: 3 rows, 2 tiles each. Top row = revenue metrics. Middle = engagement. Bottom = operations. Include filter bar at top (date range, region, product).
Question Answerers & Insight Framers
Answers ad-hoc business questions, performs cohort analysis, creates reports, and supports decision-making with data.
Generate a diagnostic analysis plan as a numbered list of SQL queries to run. For each query include:
a) The hypothesis being tested
b) The SQL query (BigQuery syntax)
c) How to interpret the result
Cover these dimensions:
1. Channel mix shift
2. Cohort retention change
3. Product mix change
4. Pricing/discount impact
5. Seasonality vs. anomaly
Input: CSV with columns user_id, signup_date, activity_date
Output:
a) Cohort pivot table showing retention % for months 0-12
b) Seaborn heatmap visualization with percentage annotations
c) Summary identifying the cohort with best/worst M1 retention
d) Overall trend line (are newer cohorts retaining better or worse?)
Include data cleaning for: duplicate rows, missing dates, users with activity before signup.
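The core of the cohort computation (before any pandas pivot or seaborn heatmap) fits in a few lines of stdlib Python. This sketch uses invented toy events and includes the "activity before signup" cleaning rule from the prompt.

```python
from datetime import date

# Toy stand-in for the CSV: (user_id, signup_date, activity_date).
events = [
    (1, date(2024, 1, 3), date(2024, 1, 20)),   # month 0
    (1, date(2024, 1, 3), date(2024, 2, 9)),    # month 1
    (2, date(2024, 1, 15), date(2024, 1, 16)),  # month 0 only
    (3, date(2024, 2, 2), date(2024, 3, 1)),    # month 1
]

def month_diff(signup, activity):
    return (activity.year - signup.year) * 12 + (activity.month - signup.month)

def retention(events):
    """cohort (signup year, month) -> {month offset -> distinct active users}."""
    table = {}
    for user, signup, activity in events:
        if activity < signup:          # cleaning rule: drop impossible rows
            continue
        cohort = (signup.year, signup.month)
        offset = month_diff(signup, activity)
        table.setdefault(cohort, {}).setdefault(offset, set()).add(user)
    return {c: {m: len(u) for m, u in offs.items()} for c, offs in table.items()}

pivot = retention(events)
```

Dividing each offset's count by the cohort's month-0 count turns this into the retention percentages the heatmap needs.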
Quality Gatekeepers & Test Designers
Designs test plans, writes test cases, performs regression testing, and validates data quality across the stack.
Input schema:
- name: string (required, 1-100 chars)
- email: string (required, valid email format)
- role: enum ['admin', 'viewer', 'editor']
Cover:
a) Happy path for each role
b) Validation errors (missing fields, invalid email, name too long, invalid role)
c) Duplicate email handling (expect 409 Conflict)
d) Rate limiting (100 req/min, expect 429)
e) Auth: requires Bearer token with 'user:create' scope
Format as table: Test ID | Category | Input | Expected Status | Expected Response Body
Columns:
- transaction_id: UUID, must be unique
- amount: decimal, 0.01-999999.99
- currency: ISO 4217, only USD/EUR/GBP
- created_at: timestamp, not future, not older than 90 days
- user_id: foreign key to users table
Also include:
- Row count expectation: between 10K-500K per day
- Cross-column: if currency=GBP then amount < 100000
- Null checks on all columns
Output as Python code using the Great Expectations API.
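Before asking for the Great Expectations version, it is worth knowing what "correct" looks like. This is a plain-Python stand-in (not the Great Expectations API) implementing the same rules, useful for grading the model's output against known-good and known-bad rows.

```python
import uuid
from datetime import datetime, timedelta, timezone

def validate_row(row, now=None):
    """Return a list of rule violations for one transactions row."""
    now = now or datetime.now(timezone.utc)
    errors = []
    try:
        uuid.UUID(str(row["transaction_id"]))
    except ValueError:
        errors.append("transaction_id: not a UUID")
    if not (0.01 <= row["amount"] <= 999_999.99):
        errors.append("amount: out of range")
    if row["currency"] not in {"USD", "EUR", "GBP"}:
        errors.append("currency: not in allowed set")
    if not (now - timedelta(days=90) <= row["created_at"] <= now):
        errors.append("created_at: future or older than 90 days")
    if row["currency"] == "GBP" and row["amount"] >= 100_000:
        errors.append("cross-column: GBP amount must be < 100000")
    return errors
```

Uniqueness, null checks, the foreign-key rule, and the daily row-count expectation need the whole table rather than one row, so they live in a separate pass (or in the suite the model generates).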
Policy Makers & Compliance Guardians
Defines data policies, manages data catalogs, ensures compliance, handles PII classification, and enforces data quality standards.
Cover these data categories:
a) Customer PII (name, email, SSN, phone)
b) Transaction records
c) Application logs
d) Marketing analytics data
For each category specify:
1. Retention period with legal justification
2. Storage requirements (encryption, access controls)
3. Deletion procedure (soft delete vs. hard delete, verification)
4. Exceptions (litigation hold, regulatory audit)
5. Responsible data steward role
Format as a numbered policy document. Reference SOC 2, CCPA, and GLBA where applicable.
Classification rules:
- Direct PII (name, email, SSN, phone) = Restricted
- Indirect PII (zip code, birth year, IP address) = Confidential
- Business metrics (revenue, user count) = Internal
- Product metadata (feature names, categories) = Public
Table schema: [paste your schema here]
Output as a table: Column Name | Data Type | Sensitivity Tier | Justification | Required Controls (encryption, masking, access level)
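The classification rules above are mechanical enough to encode directly. This sketch is name-based only; real classification must also inspect column values, and unknown columns default to Internal pending steward review rather than Public (a deliberately conservative assumption on my part).

```python
# Name-based tiering per the rules above. Column-name sets are
# illustrative; extend them to match your own catalog.
DIRECT_PII = {"name", "email", "ssn", "phone"}
INDIRECT_PII = {"zip_code", "birth_year", "ip_address"}
BUSINESS = {"revenue", "user_count"}

def sensitivity_tier(column: str) -> str:
    col = column.lower()
    if col in DIRECT_PII:
        return "Restricted"
    if col in INDIRECT_PII:
        return "Confidential"
    if col in BUSINESS:
        return "Internal"
    # Unknown columns are NOT assumed Public: default to Internal
    # until a data steward classifies them.
    return "Internal"
```

Mapping `sensitivity_tier` over a schema produces the first two columns of the requested table; justification and required controls still need human (or LLM-assisted) judgment per column.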
Additional Roles
The patterns above apply to every data role. Here are quick examples for three more common positions.
Analytics Engineer
[paste model SQL]
Platform Engineer
Data Product Manager
Advanced Strategies
Multi-step workflows, prompt chaining, RAG patterns, and evaluation frameworks for production-grade prompt engineering.
Multi-Step Workflows
Complex tasks should be broken into atomic prompts. Each step has one clear objective. The output of step N becomes the input to step N+1.
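The step-N-feeds-step-N+1 pattern looks like this in code. The `call_llm` function is a stub standing in for a real model client, and the three incident-analysis prompts are invented examples.

```python
def call_llm(prompt: str) -> str:
    """Stub standing in for a real model API call."""
    return f"<output for: {prompt[:30]}...>"

def run_chain(steps, initial_input):
    """Run atomic prompts in sequence; step N's output feeds step N+1."""
    payload = initial_input
    transcript = []
    for template in steps:
        prompt = template.format(input=payload)
        payload = call_llm(prompt)
        transcript.append((prompt, payload))
    return payload, transcript

steps = [
    "Extract the key incidents from this log:\n{input}",
    "For each incident, propose a root-cause hypothesis:\n{input}",
    "Summarize the hypotheses as a two-paragraph report:\n{input}",
]
final, transcript = run_chain(steps, "2024-06-01 03:12 ETL job failed")
```

Keeping the transcript matters in production: when a chain produces a bad final answer, the per-step record tells you which atomic prompt to fix.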
RAG-Aware Prompting
When your prompt includes retrieved context (documents, wiki pages, previous outputs), you need special instructions to keep the model grounded.
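A minimal sketch of those grounding instructions: wrap each retrieved chunk in an identifiable tag, demand citations, and give the model an explicit out when the answer is absent. The tag format and wording are one reasonable choice, not a standard.

```python
def build_rag_prompt(question: str, docs: list[str]) -> str:
    """Wrap retrieved chunks with grounding instructions."""
    context = "\n\n".join(
        f'<doc id="{i}">\n{d}\n</doc>' for i, d in enumerate(docs)
    )
    return (
        "Answer using ONLY the documents below. "
        "Cite doc ids for every claim. "
        "If the answer is not in the documents, say 'not found in context'.\n\n"
        f"{context}\n\nQuestion: {question}"
    )

prompt = build_rag_prompt(
    "What is the pipeline SLA?",
    ["The daily pipeline must complete by 06:00 UTC."],
)
```

The explicit "not found in context" escape hatch is the load-bearing part; without it, models tend to answer from parametric memory and present it as grounded.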
Prompt Evaluation Framework
| Dimension | What to Measure | How to Test |
|---|---|---|
| Accuracy | Is the output factually correct? | Compare against known-good answers on a test set |
| Completeness | Does it cover all required elements? | Checklist scoring: did it include all sections? |
| Format | Does output match the specified structure? | Schema validation (JSON), regex matching |
| Relevance | Is it focused on the question asked? | Evaluate ratio of useful vs. filler content |
| Consistency | Same prompt → same quality across runs? | Run 5x and compare variance in output quality |
| Safety | No PII leakage, no harmful content? | Adversarial input testing, PII scanning |
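The Format and Consistency rows of the table can be automated with a simple validity check over repeated runs. The five canned outputs below are invented stand-ins for five runs of the same prompt.

```python
import json

def format_valid(output: str, required_keys=("label", "confidence")) -> bool:
    """Format check: is the output JSON containing the required keys?"""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    return all(k in data for k in required_keys)

# Pretend outputs from running the same prompt 5x.
runs = [
    '{"label": "bug", "confidence": 0.9}',
    '{"label": "bug", "confidence": 0.8}',
    'Sure! The label is bug.',                  # format failure
    '{"label": "question", "confidence": 0.6}',
    '{"label": "bug"}',                         # missing key
]
consistency = sum(format_valid(r) for r in runs) / len(runs)
```

A 3/5 format-validity rate like this one is a signal to tighten the output-format instructions (or add a few-shot example) before the prompt goes anywhere near production.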
Guardrails & Safety
The Four Agent Safety Principles
When your prompt powers an autonomous agent (not just a one-shot chatbot), these four principles are non-negotiable.
Minimal Access
Grant only the permissions the agent needs. An email drafting agent should not have access to your calendar, contacts, and file system. Scope tools and data access to the minimum required for the task.
Plan for Failure
Define explicit fallback behavior. Every agent prompt must answer: "What do you do when you don't know?" The answer should never be "guess." Include: "If you cannot find the information, respond with 'I don't have enough information' and flag for human review."
Human Oversight
Keep approval gates on consequential actions. An agent can draft an email, but a human should send it. An agent can recommend a schema change, but a human should execute it. Define the line explicitly in your prompt.
Honest Limitations
Agents must flag uncertainty, not fabricate confidence. Include constraints like: "Do not fill in gaps with assumptions. If data is missing or ambiguous, say so explicitly and explain what additional information would be needed."
Pre-Deployment Checklist
Before any prompt goes into production (powering an agent, API, or automated workflow), run through this checklist.
- Goal defined: can you state in one sentence what this prompt should accomplish?
- Permissions scoped: does the agent have access only to what it needs? No more.
- Failure mode tested: what happens with missing data, bad input, or ambiguous requests?
- Escalation path defined: when should the agent stop and ask a human? Is that path clear?
- Adversarial inputs tested: what happens with prompt injection, SQL injection, or PII in unexpected fields?
- Output validated on 10+ examples: run the prompt against real-world inputs and grade each output.
- Human approval gate in place: for any action with consequences (sending, deleting, modifying), is there a confirmation step?
Design a Prompt Chain
- Pick your team's most common multi-step workflow (e.g., incident analysis, data onboarding, report generation)
- Break it into 3-5 atomic steps with clear input/output contracts
- Write the prompt for each step including the handoff format
- Test the chain end-to-end with real data
- Add guardrails at steps where PII or sensitive data might flow through
Templates & Quick Reference
Copy-paste prompt templates and decision aids for daily use. Bookmark this chapter.
Universal Prompt Template
Role-Specific Quick Templates
Data Engineering
Product Manager
ML Engineer
Data Scientist
Business Intelligence
Data Analyst
QA Engineer
Data Governance
Decision Tree: Which Technique?
Audience-Aware Prompt Template
Use this when the same underlying task needs different outputs for different stakeholders.
Agent Pre-Launch Checklist Template
Before deploying any prompt that powers an autonomous agent or automated workflow.
Prompt Quality Checklist
- Single, clear task: one prompt = one job
- Output format specified: JSON, table, markdown, code
- Context provided: tech stack, data volume, constraints
- Constraints defined: what NOT to do, boundaries, limitations
- Examples included: for complex or non-obvious formats
- Role assigned: if domain expertise matters
- Tested with edge cases: what happens with bad input?
- Guardrails in place: PII handling, safety, error conditions
- Evaluated on test set: run 3-5x and check consistency
- Version controlled: tracked in your team's prompt library
Token Optimization Tips
| Technique | Description |
|---|---|
| Remove filler | Cut "please", "I would like you to", "could you kindly". Direct instructions use fewer tokens. |
| Use abbreviations | In structured prompts, use short keys: "ctx:" instead of "context:", "fmt:" instead of "format:" |
| Minimize examples | Start with 1-2 examples. Only add more if output quality is inconsistent. |
| Reference, don't repeat | Say "use the same format as above" instead of restating the entire format specification. |
| Compress context | Summarize large documents before including them. Send the summary, not the full text. |
| Use system prompts | Move static instructions (role, constraints) to the system prompt. They persist across turns. |
Resources & Next Steps
Where to go from here — tools, further reading, and how to build a team prompt library.
Recommended Tools
Claude (Anthropic)
Strong reasoning, large context window, excellent for code and analysis tasks. System prompts for persistent instructions.
ChatGPT (OpenAI)
Broad knowledge, code interpreter, image understanding. Custom GPTs for team-specific workflows.
LangChain / LangGraph
Build multi-step prompt chains with tooling. Great for RAG pipelines and agent workflows.
LlamaIndex
Purpose-built for RAG applications. Handles document ingestion, indexing, and retrieval-augmented generation.
Promptfoo / Braintrust
Test prompts against eval sets. Track quality metrics across versions. CI/CD for prompts.
Cursor / GitHub Copilot
Code-focused AI assistants that use prompt engineering principles under the hood. Learn from how they construct prompts.
Further Reading
| Resource | What You'll Learn |
|---|---|
| Anthropic Docs | Official prompt engineering guide. Best practices for Claude, system prompts, and structured output. |
| OpenAI Cookbook | Production patterns for GPT-based applications. Code examples for common tasks. |
| DAIR.AI Prompt Guide | Comprehensive academic reference covering all major prompting techniques with research citations. |
| Chain-of-Thought Paper | The original research (Wei et al., 2022) showing how step-by-step reasoning improves LLM accuracy. |
| Few-Shot Learners Paper | GPT-3 paper (Brown et al., 2020) establishing in-context learning and few-shot patterns. |
Building a Team Prompt Library
The most effective data teams maintain a shared prompt library. Here's how to start one.
Prompt Review Process
- Peer review: have a colleague run the prompt blind and evaluate the output
- Edge case testing: test with unusual inputs, empty data, adversarial queries
- A/B comparison: compare your new prompt against the current version on 10+ examples
- Cost analysis: measure token usage and compare against simpler alternatives
- Documentation: update the library entry with new eval results and learnings
Build Your Team's First 5-Prompt Library
- Identify the 5 tasks your team performs most frequently with LLMs
- Write a prompt template for each using the principles from this playbook
- Test each prompt on 3 real-world examples and document the results
- Get peer review from at least one colleague per prompt
- Store in version control with the Library Entry Template above
- Set a calendar reminder to review and update the library quarterly