Text-to-SQL Pipelines: Avoiding Hallucinations in Database Queries

LLMs are great at writing code, but they are notoriously bad at writing SQL for complex, production-grade schemas. I’ve spent the last six months building internal data-querying tools, and I’ve learned the hard way that if you simply feed a schema to GPT-4 or Claude and ask for a query, you’re going to get hallucinations—column names that don't exist, joins on the wrong keys, and non-existent aggregate functions.

The problem isn't the model's reasoning; it's the lack of context and the ambiguity of natural language. Here is how I’ve been structuring my pipelines to minimize these errors and keep my database safe.

The Semantic Layer Strategy

Instead of letting the LLM look at raw DDL (Data Definition Language) exports, I build a semantic layer. If you expose 200 tables to an LLM, it will get distracted. I create a "context file" that is essentially a condensed version of the schema.

I strip out all the noise—no triggers, no indexes, no constraints. I only include table names, relevant column names, and, most importantly, a description field for every column. If a column is named status_code, the LLM needs to know that 1 means "Active" and 0 means "Archived."

Implementation: The Guarded Query Builder

I prefer using a Pydantic-based approach to enforce structure. I don't let the LLM return raw strings. I force it to return a JSON object that I can validate before it ever touches my database.

from pydantic import BaseModel, Field
from typing import List

# We define the structure we expect back from the LLM
class SQLQuery(BaseModel):
    reasoning: str = Field(description="Step-by-step logic for the query")
    sql: str = Field(description="The executable SQL query")
    columns_used: List[str] = Field(description="List of columns used to verify existence")

def generate_safe_query(user_prompt: str, schema_context: str):
    # Use a system prompt that explicitly forbids joins on non-existent keys
    system_prompt = f"""
    You are a SQL expert. Use the following schema context:
    {schema_context}
    
    If the user asks for data not present in the schema, return an empty query string.
    Always use qualified column names (table.column).
    Only use columns explicitly listed in the context.
    """
    
    # In practice, I use instructor or LangChain's structured output
    # to ensure the LLM returns the SQLQuery model above.
    return client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "system", "content": system_prompt},
                  {"role": "user", "content": user_prompt}],
        response_model=SQLQuery
    )

Operational Trade-offs

The biggest trade-off I’ve faced is latency versus accuracy. Providing a massive schema context increases tokens and cost, but it lowers the hallucination rate.

My current workflow involves a two-step retrieval process:

Schema Retrieval: I use a vector store (like Pinecone or pgvector) to find only the tables relevant to the user’s question. If the user asks about "monthly revenue," I only inject the invoices and payments tables into the prompt.
Validation: Before execution, I run a "dry run" using EXPLAIN in PostgreSQL. If the query fails to parse or references a column that doesn't exist in my actual metadata store, I don't execute it. I feed the error back to the LLM and ask for a correction.

Debugging Tips for Production

If you’re seeing weird behavior, stop looking at the SQL and start looking at the prompt.

The "Case Sensitivity" Trap: I’ve had models fail because they assumed user_id was User_ID. I now normalize all schema names to lowercase before they hit the prompt.
The "Limit" Problem: LLMs love to dump data. I force a LIMIT 100 in the system prompt to prevent a user from accidentally querying 50 million rows and blowing up the memory of my application server.
Human-in-the-loop: For anything involving UPDATE or DELETE statements, I never let the LLM execute directly. I generate the SQL, display it to the user in the UI, and require a manual "Confirm" click before the transaction starts.

Building these pipelines is less about the model's raw intelligence and more about how much guardrail you place around its input and output. Keep the schema tight, validate the JSON, and always use an EXPLAIN dry-run. That’s how you move from a prototype to a reliable production tool.

The Semantic Layer Strategy

Implementation: The Guarded Query Builder

Operational Trade-offs

Debugging Tips for Production

Aditya Shenvi