The Prompt Codex
A practical, role-specific guide to writing effective prompts for every function in a modern data organization. From data engineers to governance leads. Zero fluff.
Build & Ship
Data Engineering, ML Engineering, Data Science, QA. You build pipelines, train models, and write code. These prompts help you leverage LLMs for code generation, debugging, documentation, and testing.
Analyze & Decide
Product Manager, BI, Data Analyst, Data Governance. You analyze data, define requirements, and set policy. These prompts help you extract insights, write specs, and enforce standards.
Prompt Engineering Foundations
The core principles that separate effective prompts from wasted tokens. Master these before learning any technique.
The Four Pillars
Clarity
Be unambiguous. State exactly what you want the model to do. Remove words that could be interpreted multiple ways.
Specificity
Constrain the output format, length, and scope. Tell the model what you want and what you do not want.
Context
Provide the background the model needs. Include tech stack, data volumes, constraints, and domain knowledge.
Structure
Use delimiters, sections, and formatting to organize complex prompts. Numbered steps, XML tags, and markdown all work.
Anatomy of a Prompt
Common Pitfalls
| Pitfall | What Happens | Fix |
|---|---|---|
| Vague instructions | Model guesses intent, produces generic output | State the exact task, audience, and purpose |
| No output format | Inconsistent structure across responses | Specify JSON, markdown, table, or bullet format |
| Missing context | Toy-level answers that ignore your constraints | Include tech stack, data volume, and constraints |
| Too many tasks | Model prioritizes some, drops others | One prompt, one task. Chain for multi-step work |
| No examples | Model invents its own format | Add 1-2 input/output examples for complex tasks |
| Leading language | Model confirms your bias instead of analyzing | Ask open-ended questions, request counterarguments |
Prompt Quality Checklist
- Clear task instruction: does the prompt have a single, unambiguous task?
- Output format specified: JSON, markdown, table, numbered list, etc.
- Relevant context provided: tech stack, data volumes, constraints, domain info
- Constraints defined: what to avoid, boundaries, limitations
- Examples included (if complex): 1-2 input/output examples for non-trivial formats
- Single focused task: one prompt, one job. Multi-step? Use chaining.
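The checklist above can be partially automated. Below is a minimal, purely heuristic "prompt linter" sketch; the keyword lists and thresholds are illustrative assumptions, not a real standard, and a keyword match is no substitute for human review.

```python
# Heuristic prompt "linter" mirroring the quality checklist.
# Keyword lists below are illustrative assumptions only.
FORMAT_HINTS = ("json", "markdown", "table", "bullet", "numbered list", "csv")
CONTEXT_HINTS = ("stack", "schema", "rows", "constraint", "version", "volume")

def lint_prompt(prompt: str) -> list[str]:
    """Return a list of checklist items the prompt appears to miss."""
    text = prompt.lower()
    issues = []
    if not any(h in text for h in FORMAT_HINTS):
        issues.append("no output format specified")
    if not any(h in text for h in CONTEXT_HINTS):
        issues.append("no context (stack, volume, constraints)")
    if text.count("?") + text.count(".") > 15:
        issues.append("possibly multiple tasks; consider chaining")
    return issues
```

Running `lint_prompt("Summarize this report.")` flags both the missing format and the missing context, while a prompt that names its output table and tech stack passes clean.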
Chatbot vs Agent Prompts
There is a fundamental difference between prompting a chatbot (one-shot Q&A) and configuring an AI agent (persistent, tool-using, acting on your behalf). Most people write prompts for chatbots when they actually need agent instructions.
Chatbot prompt: "Summarize this quarterly report."
Works fine for quick tasks where you review every output before acting on it.
Agent instructions: "You are a data quality monitor. Check the daily_transactions table every morning. Flag anomalies >2 sigma. If critical, alert #data-oncall in Slack. Never fabricate data points."
Requires: role, scope, tools, escalation rules, and failure instructions.
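The five required fields can be enforced mechanically. Here is a small sketch that assembles agent instructions from those fields; the field contents and function name are invented for illustration.

```python
def build_agent_prompt(role, scope, tools, escalation, on_failure):
    """Assemble agent instructions covering all five required fields."""
    return "\n".join([
        f"You are {role}.",
        f"Scope: {scope}",
        f"Tools available: {', '.join(tools)}",
        f"Escalation: {escalation}",
        f"If you cannot complete the task: {on_failure}",
    ])

prompt = build_agent_prompt(
    role="a data quality monitor",
    scope="check the daily_transactions table every morning; flag anomalies >2 sigma",
    tools=["warehouse_query", "slack_alert"],
    escalation="if critical, alert #data-oncall in Slack",
    on_failure="say so explicitly; never fabricate data points",
)
```

The point of the function is not the string formatting; it is that a missing argument fails loudly at build time instead of silently shipping an agent with no escalation rule.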
Audience-Aware Prompting
The same task often needs different prompts depending on who will consume the output. A data report for an executive looks nothing like one for an engineer.
For an engineering audience, for example, the prompt might specify: "Format: markdown with code blocks. Be technical. Include query IDs."
Techniques & Patterns
The toolkit every prompt engineer needs — from basic zero-shot to advanced chaining strategies.
Zero-Shot Prompting
Difficulty: Basic. Best for: classification, summarization, and formatting when the task is unambiguous.
How: no examples needed; just describe the task clearly with constraints.
Strength: minimal tokens, quick to write, easy to iterate.
Limitation: output format may vary without examples to anchor it.
Few-Shot Prompting
Difficulty: Intermediate. Best for: when the format or style is specific and hard to describe in words alone.
How: provide 2-5 input/output pairs, then the actual input.
Strength: examples anchor the format, tone, and level of detail.
Limitation: each example consumes tokens; balance quality vs. cost.
Chain-of-Thought (CoT)
Difficulty: Advanced. Best for: math, logic, multi-step analysis, debugging, root cause analysis.
How: ask the model to show its reasoning before giving the final answer.
Strength: dramatically improves performance on reasoning-heavy tasks.
Limitation: more tokens in the response; may need parsing to extract the final answer.
Role / Persona Prompting
Difficulty: Intermediate. Best for: when you need responses calibrated to a specific skill level or perspective.
How: set the persona in the system prompt or at the start of the user message.
Strength: responses match the expertise level and vocabulary of the assigned role.
Limitation: the model may roleplay too hard; keep the persona relevant and bounded.
Structured Output
Request machine-parseable output (JSON, CSV, or a fixed table) so downstream code can consume the response without manual cleanup.
Iterative Refinement
Treat the first response as a draft: ask the model to critique and improve its own output over successive turns until it meets your bar.
Technique Comparison
| Technique | Best For | Complexity | Token Cost |
|---|---|---|---|
| Zero-Shot | Simple, well-defined tasks | Low | Minimal |
| Few-Shot | Format-specific output | Medium | Moderate |
| Chain-of-Thought | Reasoning & analysis | Medium | Moderate-High |
| Role Prompting | Domain expertise | Low | Minimal |
| Structured Output | Machine-readable results | Medium | Low |
| Prompt Chaining | Multi-step workflows | High | High |
Try All Three
- Pick a task from your daily work (e.g., writing SQL, reviewing a PR, drafting a spec)
- Write it 3 ways: zero-shot, few-shot, and chain-of-thought
- Run all 3 and compare the outputs for accuracy, format consistency, and usefulness
- Note which technique worked best and why — this builds your intuition
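To make the exercise concrete, here is one sketch of the same task written all three ways, as plain prompt strings. The task and ticket text are invented examples.

```python
TASK = "Classify this support ticket as bug, feature_request, or question."
TICKET = "The export button does nothing when I click it."

# Zero-shot: describe the task, constrain the output.
zero_shot = f"{TASK}\nTicket: {TICKET}\nAnswer with the label only."

# Few-shot: two input/output pairs anchor the format, then the real input.
few_shot = (
    f"{TASK}\n"
    "Ticket: 'Please add dark mode.' -> feature_request\n"
    "Ticket: 'How do I reset my password?' -> question\n"
    f"Ticket: '{TICKET}' ->"
)

# Chain-of-thought: ask for reasoning first, then a parseable final line.
chain_of_thought = (
    f"{TASK}\nTicket: {TICKET}\n"
    "First, reason step by step about what the user is reporting. "
    "Then give the final label on its own line prefixed with 'Label:'."
)
```

Run all three against the same model and compare: the few-shot version is the easiest to parse, the chain-of-thought version is the most reliable on ambiguous tickets, and the zero-shot version is the cheapest.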
Role-Based Examples
Real-world bad → good prompt transformations for every role in a data organization. Each section shows what not to do, why, and how to fix it.
Pipeline Builders & Infrastructure Owners
Builds and maintains data pipelines, ETL/ELT processes, data infrastructure. Owns data movement, transformation, and platform reliability.
Suggest 3 optimization strategies ranked by implementation effort. For each, explain the mechanism and expected impact.
a) Extracts data from PostgreSQL (connection_id='pg_source')
b) Transforms using a Python function that deduplicates on email column
c) Loads to BigQuery dataset 'analytics.users'
Requirements:
- Retry logic: 3 retries, 5-min delay
- SLA miss callback that sends to Slack
- Schedule: daily at 06:00 UTC
- Use type hints and docstrings
- Include DAG-level tags: ['etl', 'production']
1. Purpose (1-2 sentences)
2. Source tables and their owners
3. Transformation logic (step by step)
4. Output schema (table with column name, type, description)
5. Data quality assumptions
6. Downstream dependencies
7. SLA and refresh schedule
Model SQL:
[paste your SQL here]
Requirements Definers & Stakeholder Translators
Defines product requirements, prioritizes features, writes PRDs, and communicates across engineering, design, and business stakeholders.
- 5,000 Confluence pages, 200 DAU
- Current keyword search has 35% query failure rate
- Engineering team: 3 backend, 1 ML engineer
Include these sections:
1. Problem Statement (with metrics)
2. User Personas (3 types)
3. Proposed Solution (PM-level architecture)
4. Success Metrics with targets
5. Risks and Mitigations
6. Phased Rollout Plan (3 phases)
Keep it under 2 pages. Use data-driven language.
Feature: AI-powered data quality alerts in our warehouse monitoring tool
Roles: data engineer, data analyst, data governance lead
For each story include:
- 3-5 acceptance criteria
- Edge cases
- Non-functional requirements (latency, accuracy)
- Story points estimate (S/M/L)
Context: Team of 5 engineers, 10K monthly active users, goal is to reduce churn by 15%.
Features:
1. Email notification preferences
2. Dashboard sharing via link
3. Scheduled report exports
4. Single sign-on (SSO)
5. In-app onboarding tour
6. API rate limit dashboard
Model Builders & MLOps Operators
Trains, deploys, and monitors ML models. Builds inference pipelines, manages model lifecycle, and optimizes for latency and cost.
- Dataset: 500K rows, 45 features, 8% positive class
- Current AUC-ROC: 0.72 after tuning (max_depth=6, n_estimators=500, lr=0.1)
- I suspect class imbalance and possible feature leakage
Provide a systematic debugging approach:
1. Diagnostic checks to run (with Python code)
2. Resampling strategies to try (with tradeoffs)
3. Feature importance analysis steps
4. Validation strategy review
For each suggestion, explain WHY it might help and what to look for in the results.
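The first two diagnostic checks (class balance and feature leakage) can be sketched in plain Python. The synthetic data below stands in for the 500K-row dataset; `refund_issued` is an invented post-outcome field planted to show what a leaky feature looks like.

```python
import random

random.seed(0)

# Synthetic stand-in for the real dataset: ~8% positive class, with one
# feature ("refund_issued") that leaks the label because it is recorded
# after the outcome occurs.
rows = []
for _ in range(10_000):
    y = 1 if random.random() < 0.08 else 0
    rows.append({"spend": random.gauss(100, 20), "refund_issued": y, "label": y})

# Check 1: class balance.
pos_rate = sum(r["label"] for r in rows) / len(rows)

# Check 2: leakage screen. A feature that agrees with the label on nearly
# every row in-sample is suspect and should be traced back to its source.
def agreement(feature):
    return sum((r[feature] > 0.5) == (r["label"] == 1) for r in rows) / len(rows)

leaky = [f for f in ("spend", "refund_issued") if agreement(f) > 0.99]
```

In a real project you would run the same screen over all 45 features and follow up on anything above the threshold before touching resampling or hyperparameters, since leakage inflates validation AUC and masks the real problem.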
Pipeline steps:
1. Data preprocessing (tokenization, train/val/test split 80/10/10)
2. Training with W&B experiment tracking
3. Evaluation (F1-macro, per-label precision/recall)
4. Model registry upload to MLflow
Constraints:
- GPU node pool (T4 GPUs)
- Training must complete in <2 hours
- Pipeline must be idempotent
Output: Python pipeline DSL with comments explaining each step.
Analysts & Experiment Designers
Performs statistical analysis, builds predictive models, designs experiments, and communicates insights to stakeholders.
The script should:
a) Compute summary statistics per product category
b) Plot return rate trends over time (monthly, line chart)
c) Segment users by purchase frequency using RFM analysis
d) Test whether mobile vs desktop has a significantly different average order value (Mann-Whitney U, alpha=0.05)
Output: clean, commented Python code. Use seaborn for styling. Include print statements explaining each finding.
Details:
- Binary outcome (click/no-click)
- Sample: 50K users per group
- Baseline CTR: 3.2%
- Minimum detectable effect: 0.5 percentage points
- We need 80% power at alpha=0.05
Questions:
1. Which test is appropriate and why?
2. Are the assumptions met?
3. Calculate required sample size
4. Provide Python code using scipy.stats
5. How should we handle multiple comparisons if we add a third variant?
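For question 3, the prompt asks for scipy.stats; as a stdlib-only sanity check, here is the classic two-proportion sample-size formula worked through with the numbers above (baseline 3.2% vs 3.7%, alpha 0.05 two-sided, 80% power). The bisection inverse-CDF is a stand-in for `scipy.stats.norm.ppf`.

```python
from math import sqrt, erf

def z_quantile(p, lo=-10.0, hi=10.0):
    """Inverse standard-normal CDF by bisection (stdlib stand-in for norm.ppf)."""
    cdf = lambda x: 0.5 * (1 + erf(x / sqrt(2)))
    for _ in range(200):
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if cdf(mid) < p else (lo, mid)
    return (lo + hi) / 2

def n_per_group(p1, p2, alpha=0.05, power=0.80):
    """Two-proportion z-test sample size per group (two-sided)."""
    za, zb = z_quantile(1 - alpha / 2), z_quantile(power)
    pbar = (p1 + p2) / 2
    num = (za * sqrt(2 * pbar * (1 - pbar))
           + zb * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return num / (p1 - p2) ** 2

n = n_per_group(0.032, 0.037)  # baseline CTR vs +0.5 percentage points
```

The result lands around 21K users per group, so the planned 50K per group comfortably exceeds the requirement and leaves headroom for a third variant with a multiple-comparison correction.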
Dashboard Builders & KPI Definers
Builds dashboards, writes SQL for reporting, defines KPIs, and translates business questions into data queries.
Tables:
- subscriptions (id, user_id, plan_id, start_date, end_date, status)
- plans (id, name, monthly_price)
Requirements:
a) Only include active subscriptions (status='active', end_date IS NULL or > CURRENT_DATE)
b) Handle mid-month upgrades/downgrades by prorating
c) Output columns: month, total_mrr, new_mrr, churned_mrr, expansion_mrr
d) Use CTEs for readability
e) Include comments explaining the logic
f) Handle NULLs in end_date gracefully
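The prompt should produce SQL, but it helps to have an executable oracle for reviewing what the model returns. This Python stand-in computes total MRR per month on toy data (proration and the new/churned/expansion split are omitted); field names mirror the schema above.

```python
from datetime import date

# Toy subscriptions mirroring the schema in the prompt.
subscriptions = [
    {"user_id": 1, "monthly_price": 50.0,
     "start": date(2024, 1, 10), "end": None},
    {"user_id": 2, "monthly_price": 20.0,
     "start": date(2024, 1, 5), "end": date(2024, 2, 15)},
]

def mrr_for_month(subs, year, month):
    """Sum monthly_price of subscriptions active at the start of the month.

    Simplification: activity is judged by dates only; the status column
    and mid-month proration from the full spec are left out.
    """
    first = date(year, month, 1)
    total = 0.0
    for s in subs:
        started = s["start"] <= first
        not_ended = s["end"] is None or s["end"] > first
        if started and not_ended:
            total += s["monthly_price"]
    return total
```

Spot-checking the generated SQL against a few hand-computed months like these catches the most common bug: off-by-one handling of `end_date` at month boundaries.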
KPIs to include: ARR, Net Revenue Retention, CAC Payback Period, DAU/MAU Ratio, Support Ticket Resolution Time.
For each KPI specify:
1. Visualization type (number, line, bar, gauge)
2. Comparison benchmark (WoW, MoM, vs target)
3. Drill-down dimension
4. Alert threshold (red/yellow/green)
Layout: 3 rows, 2 tiles each. Top row = revenue metrics. Middle = engagement. Bottom = operations. Include filter bar at top (date range, region, product).
Question Answerers & Insight Framers
Answers ad-hoc business questions, performs cohort analysis, creates reports, and supports decision-making with data.
Generate a diagnostic analysis plan as a numbered list of SQL queries to run. For each query include:
a) The hypothesis being tested
b) The SQL query (BigQuery syntax)
c) How to interpret the result
Cover these dimensions:
1. Channel mix shift
2. Cohort retention change
3. Product mix change
4. Pricing/discount impact
5. Seasonality vs. anomaly
Input: CSV with columns user_id, signup_date, activity_date
Output:
a) Cohort pivot table showing retention % for months 0-12
b) Seaborn heatmap visualization with percentage annotations
c) Summary identifying the cohort with best/worst M1 retention
d) Overall trend line (are newer cohorts retaining better or worse?)
Include data cleaning for: duplicate rows, missing dates, users with activity before signup.
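The core of the cohort computation (before any pandas pivot or seaborn heatmap) fits in a few lines of stdlib Python. This sketch uses invented toy events and includes the "activity before signup" cleaning rule from the prompt.

```python
from datetime import date

# Toy stand-in for the CSV: (user_id, signup_date, activity_date).
events = [
    (1, date(2024, 1, 3), date(2024, 1, 20)),   # month 0
    (1, date(2024, 1, 3), date(2024, 2, 9)),    # month 1
    (2, date(2024, 1, 15), date(2024, 1, 16)),  # month 0 only
    (3, date(2024, 2, 2), date(2024, 3, 1)),    # month 1
]

def month_diff(signup, activity):
    return (activity.year - signup.year) * 12 + (activity.month - signup.month)

def retention(events):
    """cohort (signup year, month) -> {month offset -> distinct active users}."""
    table = {}
    for user, signup, activity in events:
        if activity < signup:          # cleaning rule: drop impossible rows
            continue
        cohort = (signup.year, signup.month)
        offset = month_diff(signup, activity)
        table.setdefault(cohort, {}).setdefault(offset, set()).add(user)
    return {c: {m: len(u) for m, u in offs.items()} for c, offs in table.items()}

pivot = retention(events)
```

Dividing each offset's count by the cohort's month-0 count turns this into the retention percentages the heatmap needs.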
Quality Gatekeepers & Test Designers
Designs test plans, writes test cases, performs regression testing, and validates data quality across the stack.
Input schema:
- name: string (required, 1-100 chars)
- email: string (required, valid email format)
- role: enum ['admin', 'viewer', 'editor']
Cover:
a) Happy path for each role
b) Validation errors (missing fields, invalid email, name too long, invalid role)
c) Duplicate email handling (expect 409 Conflict)
d) Rate limiting (100 req/min, expect 429)
e) Auth: requires Bearer token with 'user:create' scope
Format as table: Test ID | Category | Input | Expected Status | Expected Response Body
Columns:
- transaction_id: UUID, must be unique
- amount: decimal, 0.01-999999.99
- currency: ISO 4217, only USD/EUR/GBP
- created_at: timestamp, not future, not older than 90 days
- user_id: foreign key to users table
Also include:
- Row count expectation: between 10K-500K per day
- Cross-column: if currency=GBP then amount < 100000
- Null checks on all columns
Output as Python code using the Great Expectations API.
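Before asking for the Great Expectations version, it is worth knowing what "correct" looks like. This is a plain-Python stand-in (not the Great Expectations API) implementing the same rules, useful for grading the model's output against known-good and known-bad rows.

```python
import uuid
from datetime import datetime, timedelta, timezone

def validate_row(row, now=None):
    """Return a list of rule violations for one transactions row."""
    now = now or datetime.now(timezone.utc)
    errors = []
    try:
        uuid.UUID(str(row["transaction_id"]))
    except ValueError:
        errors.append("transaction_id: not a UUID")
    if not (0.01 <= row["amount"] <= 999_999.99):
        errors.append("amount: out of range")
    if row["currency"] not in {"USD", "EUR", "GBP"}:
        errors.append("currency: not in allowed set")
    if not (now - timedelta(days=90) <= row["created_at"] <= now):
        errors.append("created_at: future or older than 90 days")
    if row["currency"] == "GBP" and row["amount"] >= 100_000:
        errors.append("cross-column: GBP amount must be < 100000")
    return errors
```

Uniqueness, null checks, the foreign-key rule, and the daily row-count expectation need the whole table rather than one row, so they live in a separate pass (or in the suite the model generates).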
Policy Makers & Compliance Guardians
Defines data policies, manages data catalogs, ensures compliance, handles PII classification, and enforces data quality standards.
Cover these data categories:
a) Customer PII (name, email, SSN, phone)
b) Transaction records
c) Application logs
d) Marketing analytics data
For each category specify:
1. Retention period with legal justification
2. Storage requirements (encryption, access controls)
3. Deletion procedure (soft delete vs. hard delete, verification)
4. Exceptions (litigation hold, regulatory audit)
5. Responsible data steward role
Format as a numbered policy document. Reference SOC 2, CCPA, and GLBA where applicable.
Classification rules:
- Direct PII (name, email, SSN, phone) = Restricted
- Indirect PII (zip code, birth year, IP address) = Confidential
- Business metrics (revenue, user count) = Internal
- Product metadata (feature names, categories) = Public
Table schema: [paste your schema here]
Output as a table: Column Name | Data Type | Sensitivity Tier | Justification | Required Controls (encryption, masking, access level)
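The classification rules above are mechanical enough to encode directly. This sketch is name-based only; real classification must also inspect column values, and unknown columns default to Internal pending steward review rather than Public (a deliberately conservative assumption on my part).

```python
# Name-based tiering per the rules above. Column-name sets are
# illustrative; extend them to match your own catalog.
DIRECT_PII = {"name", "email", "ssn", "phone"}
INDIRECT_PII = {"zip_code", "birth_year", "ip_address"}
BUSINESS = {"revenue", "user_count"}

def sensitivity_tier(column: str) -> str:
    col = column.lower()
    if col in DIRECT_PII:
        return "Restricted"
    if col in INDIRECT_PII:
        return "Confidential"
    if col in BUSINESS:
        return "Internal"
    # Unknown columns are NOT assumed Public: default to Internal
    # until a data steward classifies them.
    return "Internal"
```

Mapping `sensitivity_tier` over a schema produces the first two columns of the requested table; justification and required controls still need human (or LLM-assisted) judgment per column.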
Additional Roles
The patterns above apply to every data role. Here are quick examples for three more common positions.
Analytics Engineer
[paste model SQL]
Platform Engineer
Data Product Manager
Advanced Strategies
Multi-step workflows, prompt chaining, RAG patterns, and evaluation frameworks for production-grade prompt engineering.
Multi-Step Workflows
Complex tasks should be broken into atomic prompts. Each step has one clear objective. The output of step N becomes the input to step N+1.
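The step-N-feeds-step-N+1 pattern looks like this in code. The `call_llm` function is a stub standing in for a real model client, and the three incident-analysis prompts are invented examples.

```python
def call_llm(prompt: str) -> str:
    """Stub standing in for a real model API call."""
    return f"<output for: {prompt[:30]}...>"

def run_chain(steps, initial_input):
    """Run atomic prompts in sequence; step N's output feeds step N+1."""
    payload = initial_input
    transcript = []
    for template in steps:
        prompt = template.format(input=payload)
        payload = call_llm(prompt)
        transcript.append((prompt, payload))
    return payload, transcript

steps = [
    "Extract the key incidents from this log:\n{input}",
    "For each incident, propose a root-cause hypothesis:\n{input}",
    "Summarize the hypotheses as a two-paragraph report:\n{input}",
]
final, transcript = run_chain(steps, "2024-06-01 03:12 ETL job failed")
```

Keeping the transcript matters in production: when a chain produces a bad final answer, the per-step record tells you which atomic prompt to fix.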
RAG-Aware Prompting
When your prompt includes retrieved context (documents, wiki pages, previous outputs), you need special instructions to keep the model grounded.
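A minimal sketch of those grounding instructions: wrap each retrieved chunk in an identifiable tag, demand citations, and give the model an explicit out when the answer is absent. The tag format and wording are one reasonable choice, not a standard.

```python
def build_rag_prompt(question: str, docs: list[str]) -> str:
    """Wrap retrieved chunks with grounding instructions."""
    context = "\n\n".join(
        f'<doc id="{i}">\n{d}\n</doc>' for i, d in enumerate(docs)
    )
    return (
        "Answer using ONLY the documents below. "
        "Cite doc ids for every claim. "
        "If the answer is not in the documents, say 'not found in context'.\n\n"
        f"{context}\n\nQuestion: {question}"
    )

prompt = build_rag_prompt(
    "What is the pipeline SLA?",
    ["The daily pipeline must complete by 06:00 UTC."],
)
```

The explicit "not found in context" escape hatch is the load-bearing part; without it, models tend to answer from parametric memory and present it as grounded.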
Prompt Evaluation Framework
| Dimension | What to Measure | How to Test |
|---|---|---|
| Accuracy | Is the output factually correct? | Compare against known-good answers on a test set |
| Completeness | Does it cover all required elements? | Checklist scoring: did it include all sections? |
| Format | Does output match the specified structure? | Schema validation (JSON), regex matching |
| Relevance | Is it focused on the question asked? | Evaluate ratio of useful vs. filler content |
| Consistency | Same prompt → same quality across runs? | Run 5x and compare variance in output quality |
| Safety | No PII leakage, no harmful content? | Adversarial input testing, PII scanning |
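The Format and Consistency rows of the table can be automated with a simple validity check over repeated runs. The five canned outputs below are invented stand-ins for five runs of the same prompt.

```python
import json

def format_valid(output: str, required_keys=("label", "confidence")) -> bool:
    """Format check: is the output JSON containing the required keys?"""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    return all(k in data for k in required_keys)

# Pretend outputs from running the same prompt 5x.
runs = [
    '{"label": "bug", "confidence": 0.9}',
    '{"label": "bug", "confidence": 0.8}',
    'Sure! The label is bug.',                  # format failure
    '{"label": "question", "confidence": 0.6}',
    '{"label": "bug"}',                         # missing key
]
consistency = sum(format_valid(r) for r in runs) / len(runs)
```

A 3/5 format-validity rate like this one is a signal to tighten the output-format instructions (or add a few-shot example) before the prompt goes anywhere near production.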
Guardrails & Safety
The Four Agent Safety Principles
When your prompt powers an autonomous agent (not just a one-shot chatbot), these four principles are non-negotiable.
Minimal Access
Grant only the permissions the agent needs. An email drafting agent should not have access to your calendar, contacts, and file system. Scope tools and data access to the minimum required for the task.
Plan for Failure
Define explicit fallback behavior. Every agent prompt must answer: "What do you do when you don't know?" The answer should never be "guess." Include: "If you cannot find the information, respond with 'I don't have enough information' and flag for human review."
Human Oversight
Keep approval gates on consequential actions. An agent can draft an email, but a human should send it. An agent can recommend a schema change, but a human should execute it. Define the line explicitly in your prompt.
Honest Limitations
Agents must flag uncertainty, not fabricate confidence. Include constraints like: "Do not fill in gaps with assumptions. If data is missing or ambiguous, say so explicitly and explain what additional information would be needed."
Pre-Deployment Checklist
Before any prompt goes into production (powering an agent, API, or automated workflow), run through this checklist.
- Goal defined: can you state in one sentence what this prompt should accomplish?
- Permissions scoped: does the agent have access only to what it needs? No more.
- Failure mode tested: what happens with missing data, bad input, or ambiguous requests?
- Escalation path defined: when should the agent stop and ask a human? Is that path clear?
- Adversarial inputs tested: what happens with prompt injection, SQL injection, or PII in unexpected fields?
- Output validated on 10+ examples: run the prompt against real-world inputs and grade each output.
- Human approval gate in place: for any action with consequences (sending, deleting, modifying), is there a confirmation step?
Design a Prompt Chain
- Pick your team's most common multi-step workflow (e.g., incident analysis, data onboarding, report generation)
- Break it into 3-5 atomic steps with clear input/output contracts
- Write the prompt for each step including the handoff format
- Test the chain end-to-end with real data
- Add guardrails at steps where PII or sensitive data might flow through
Templates & Quick Reference
Copy-paste prompt templates and decision aids for daily use. Bookmark this chapter.
Universal Prompt Template
Role-Specific Quick Templates
Data Engineering
Product Manager
ML Engineer
Data Scientist
Business Intelligence
Data Analyst
QA Engineer
Data Governance
Decision Tree: Which Technique?
Audience-Aware Prompt Template
Use this when the same underlying task needs different outputs for different stakeholders.
Agent Pre-Launch Checklist Template
Before deploying any prompt that powers an autonomous agent or automated workflow.
Prompt Quality Checklist
- Single, clear task: one prompt = one job
- Output format specified: JSON, table, markdown, code
- Context provided: tech stack, data volume, constraints
- Constraints defined: what NOT to do, boundaries, limitations
- Examples included: for complex or non-obvious formats
- Role assigned: if domain expertise matters
- Tested with edge cases: what happens with bad input?
- Guardrails in place: PII handling, safety, error conditions
- Evaluated on test set: run 3-5x and check consistency
- Version controlled: tracked in your team's prompt library
Token Optimization Tips
| Technique | Description |
|---|---|
| Remove filler | Cut "please", "I would like you to", "could you kindly". Direct instructions use fewer tokens. |
| Use abbreviations | In structured prompts, use short keys: "ctx:" instead of "context:", "fmt:" instead of "format:" |
| Minimize examples | Start with 1-2 examples. Only add more if output quality is inconsistent. |
| Reference, don't repeat | Say "use the same format as above" instead of restating the entire format specification. |
| Compress context | Summarize large documents before including them. Send the summary, not the full text. |
| Use system prompts | Move static instructions (role, constraints) to the system prompt. They persist across turns. |
Resources & Next Steps
Where to go from here — tools, further reading, and how to build a team prompt library.
Recommended Tools
Claude (Anthropic)
Strong reasoning, large context window, excellent for code and analysis tasks. System prompts for persistent instructions.
ChatGPT (OpenAI)
Broad knowledge, code interpreter, image understanding. Custom GPTs for team-specific workflows.
LangChain / LangGraph
Build multi-step prompt chains with tooling. Great for RAG pipelines and agent workflows.
LlamaIndex
Purpose-built for RAG applications. Handles document ingestion, indexing, and retrieval-augmented generation.
Promptfoo / Braintrust
Test prompts against eval sets. Track quality metrics across versions. CI/CD for prompts.
Cursor / GitHub Copilot
Code-focused AI assistants that use prompt engineering principles under the hood. Learn from how they construct prompts.
Further Reading
| Resource | What You'll Learn |
|---|---|
| Anthropic Docs | Official prompt engineering guide. Best practices for Claude, system prompts, and structured output. |
| OpenAI Cookbook | Production patterns for GPT-based applications. Code examples for common tasks. |
| DAIR.AI Prompt Guide | Comprehensive academic reference covering all major prompting techniques with research citations. |
| Chain-of-Thought Paper | The original research (Wei et al., 2022) showing how step-by-step reasoning improves LLM accuracy. |
| Few-Shot Learners Paper | GPT-3 paper (Brown et al., 2020) establishing in-context learning and few-shot patterns. |
Building a Team Prompt Library
The most effective data teams maintain a shared prompt library. Here's how to start one.
Prompt Review Process
- Peer review: have a colleague run the prompt blind and evaluate the output
- Edge case testing: test with unusual inputs, empty data, adversarial queries
- A/B comparison: compare your new prompt against the current version on 10+ examples
- Cost analysis: measure token usage and compare against simpler alternatives
- Documentation: update the library entry with new eval results and learnings
Build Your Team's First 5-Prompt Library
- Identify the 5 tasks your team performs most frequently with LLMs
- Write a prompt template for each using the principles from this playbook
- Test each prompt on 3 real-world examples and document the results
- Get peer review from at least one colleague per prompt
- Store in version control with the Library Entry Template above
- Set a calendar reminder to review and update the library quarterly