# Search pipeline

> **Full-text first. Vectors only rerank. If it's not there, we say so.**

luplo's search is deliberately unromantic. It is a four-stage pipeline where Postgres `tsquery` does the retrieving, the glossary does the rewriting, and vectors, when present, only re-order the candidates that tsquery already found. If retrieval finds nothing, the answer is **nothing**, not a synthesized guess.

## Query dialect

The search surface accepts a tiny web-search-style dialect. The grammar is intentionally small; nothing else parses.

| Syntax | Meaning | Example |
|---|---|---|
| `word` | required term (AND-joined with siblings) | `auth budget` |
| `"exact phrase"` | adjacent-word match | `"JWT rotation"` |
| `word OR word` | disjunction (the literal word `OR`, uppercase) | `auth OR password` |
| `-word` | negated term | `auth -session` |
| `-"exact phrase"` | negated phrase | `auth -"session cookie"` |

Clauses are AND-joined by default. An `OR` run folds its neighbours into a disjunction group. Glossary expansion is applied only to **required** and **OR-group** terms; phrases and negated tokens pass through literally (expanding a negation would silently re-include the concept the user excluded).

### What is not supported

- **Parentheses** for grouping. Nested tsquery-native syntax like `(!A & B) & !C` does not parse. Use [De Morgan's laws](https://en.wikipedia.org/wiki/De_Morgan%27s_laws) to rewrite: `(!A & B) & !C` becomes `B -A -C`; `(A | B) & !(C | D)` becomes `A OR B -C -D`.
- **Operators as literals.** `&`, `|`, `!`, `(`, `)` are passed through as part of the word and almost always return zero hits under the `simple` dictionary. Avoid typing them unless you actually want the character in the lexeme.
- **Regex / fuzzy matching.** Out of scope; see the philosophy doc's honesty-over-coverage commitment.
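To make the dialect concrete, here is a minimal Python sketch of how the five supported forms could be tokenized and rendered as a tsquery string. It is not luplo's actual parser: the names `Clause`, `parse`, and `to_tsquery` are hypothetical, rendering phrases with the `<->` operator and quoting every lexeme are assumptions, and normalization and glossary expansion are left out entirely.

```python
import re
from dataclasses import dataclass

# A token is an optional leading "-", then a "quoted phrase" or a bare word.
_TOKEN = re.compile(r'(-?)(?:"([^"]*)"|(\S+))')


@dataclass(frozen=True)
class Clause:
    text: str      # the word or phrase, without quotes or the leading "-"
    phrase: bool   # True for "exact phrase"
    negated: bool  # True for -word and -"exact phrase"


def parse(query: str) -> list[list[Clause]]:
    """Split a query into AND-joined groups; each group is one OR run."""
    groups: list[list[Clause]] = []
    join_next = False  # True right after a literal uppercase OR
    for m in _TOKEN.finditer(query):
        negated, phrase, word = m.group(1) == "-", m.group(2), m.group(3)
        if word == "OR" and not negated:
            join_next = bool(groups)  # a leading OR has nothing to join
            continue
        text = phrase if phrase is not None else word
        if not text.strip():
            continue  # ignore empty phrases such as ""
        clause = Clause(text, phrase is not None, negated)
        if join_next:
            groups[-1].append(clause)  # fold into the previous OR run
        else:
            groups.append([clause])
        join_next = False
    return groups


def _lexeme(word: str) -> str:
    # Quote the lexeme; embedded single quotes are doubled (simplified).
    return "'" + word.replace("'", "''") + "'"


def _render(clause: Clause) -> str:
    if clause.phrase:
        words = clause.text.split()
        body = " <-> ".join(_lexeme(w) for w in words)
        if clause.negated and len(words) > 1:
            body = f"({body})"  # "!" binds tighter than "<->"
    else:
        body = _lexeme(clause.text)
    return f"!{body}" if clause.negated else body


def to_tsquery(query: str) -> str:
    """Render the parsed groups as a tsquery string (no glossary expansion)."""
    parts = []
    for group in parse(query):
        rendered = " | ".join(_render(c) for c in group)
        parts.append(f"({rendered})" if len(group) > 1 else rendered)
    return " & ".join(parts)
```

Under these assumptions, `to_tsquery('A OR B -C -D')` renders as `('A' | 'B') & !'C' & !'D'`, which is the De Morgan rewrite of `(A | B) & !(C | D)` from the list above. Because parentheses are never tokenized as operators, an input such as `(!A & B)` simply becomes a handful of odd-looking words, matching the "operators as literals" behaviour.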
### Worked example

The filter "women who are not men and not in their 50s" (여자 = woman, 남자 = man, 50대 = 50s):

```
# Literal tsquery you might type:
(!남자 & 여자) & !50대

# luplo dialect equivalent:
여자 -남자 -50대
```

Both describe the same set. The dialect version parses; the literal tsquery version does not.

If the negated-AND case (`!(A & B)`, which cannot be rewritten with plain negation) starts coming up, open an issue: parens support is a declared v0.7 candidate, not a forever "no".

## The four stages

```
user query
         │
         ▼
┌───────────────────┐
│ 1. Context router │  project_id auto-filled from .luplo,
│                   │  optional system filter
└────────┬──────────┘
         ▼
┌───────────────────┐
│ 2. Glossary       │  normalise → strict alias expansion:
│    expansion      │  "auth" → (auth | authentication | sign-in)
└────────┬──────────┘
         ▼
┌───────────────────┐
│ 3. tsquery        │  Postgres full-text search over
│    retrieval      │  items.ts (GIN index scan)
└────────┬──────────┘
         ▼
┌───────────────────┐
│ 4. Vector rerank  │  OPTIONAL. pgvector cosine similarity
│    (optional)     │  reorders the tsquery candidate set.
└────────┬──────────┘
         ▼
      results (with explicit match reasons:
               which aliases matched which fields)
```

### 1. Context routing

Every search carries a `project_id`. In the CLI it comes from `.luplo`; in MCP it comes from the tool argument. The router can also apply a system filter so that "auth" queries do not drag in items from the "rendering" system. Nothing magical, just scoped WHERE clauses.

### 2. Glossary expansion

The glossary is the mechanism that lets `"auth"` find items indexed under `"authentication"` or `"sign-in"` without requiring the caller to know every alias. The pipeline is **strict-first**, with three layers:

1. **Deterministic normalization.** Lowercase, whitespace collapse, Korean morpheme splitting. Zero false positives: this step is purely mechanical.
2. **Strict LLM matching.** Only translation-grade synonyms (`sign-in` ↔ `login`) make it into a glossary group.
   The prompt is tuned so that **NONE** is a better answer than a wrong grouping.
3. **Human curation queue.** Candidates the LLM is unsure about go into `glossary_terms` with `status='pending'`, visible via `lp glossary pending` and `luplo_page_sync`.

The query is rewritten into a tsquery expression that ORs the aliases within each matched group:

```text
"auth rate limit"
  → (auth | authentication | sign-in) & ("rate limit" | throttle | quota)
```

Every alias that fires is recorded with the result, so downstream tooling can show **why** a match was returned.

#### `is_protected`: when clustering should stay away

The LLM flags terms as `is_protected=true` whenever they look like identifiers that must never be pulled into a synonym cluster:

- Upper-case acronyms (`JWT`, `API`, `OTP`, `TLS`, `RPC`)
- Programming identifiers (`snake_case`, `camelCase`)
- Proper nouns not in the general dictionary, such as project codenames like `Prometheus`, `Sentinel`, `Gatekeeper`

Protected terms participate in exact match and deterministic normalization, but the strict LLM step refuses to cluster them with anything else. This is the guardrail that keeps `OTP` from becoming an alias of `opt` or `Sentinel` from being merged with `guard`.

### 3. tsquery retrieval

Retrieval runs against the generated `items.ts` column (a concatenation of title, body, rationale, alternatives, tags, and, when present, a few `context` fields), indexed by GIN. The result limit at this stage is intentionally generous (default: fetch `limit × 4` candidates) so the vector reranker, when enabled, has room to re-order.

### 4. Vector reranking (optional)

When the `vector-local` extra is installed and pgvector is available, each candidate's `embedding` is compared to the query embedding via cosine similarity. The top `limit` results after reranking are returned.

**Vector search never originates candidates.** If tsquery returns zero rows, vector rerank has nothing to do, and the search returns empty.
This is the honesty rule (see {doc}`philosophy`) encoded in code.

#### Embedding backends

Three drop-in backends exist:

| Backend | Dimensions | When to use |
|---|---|---|
| `null` (default) | n/a | Don't want the ML dependency. Search is still glossary + tsquery, just no rerank. |
| `local` | 1024 | `uv sync --extra vector-local`; runs sentence-transformers locally (~500MB). |
| `remote` | 1024 | Call an external embedding service. For deployments where the worker can afford it but clients can't. |

Switching backends does not re-embed history automatically: new writes get embeddings under the new backend; older rows keep whatever they had (including NULL).

## Why this shape, not RAG

luplo's domain is **engineering decisions**: rationale, alternatives, policy constraints. Semantic proximity is useful, but **traceable retrieval** is the job. When a future maintainer asks "why did we decide X?" the answer must come with a receipt:

- Exact terms that matched
- Glossary groups that fired
- Item ids and supersedes chain

A pure vector search fabricates relevance from embedding distance and cannot produce that receipt. luplo keeps vectors in a ranking role so retrieval always has a defensible reason to show you what it showed.

## How to tune it

- **Missing matches?** Add aliases via `lp glossary pending` → approve.
- **False positives in the glossary?** Reject them; they'll land in `glossary_rejections` and never be suggested again.
- **Need closer-to-semantic ranking?** Install `vector-local`. Existing writes will rerank from the next worker pass onward.
- **Want to restrict to a system?** `lp items search "jwt" --system auth` (CLI) or `system_ids=['']` (MCP).

## Next

- {doc}`../guides/mcp-client`: hooking MCP clients to these tools.
- {doc}`../reference/semantic-impact`: how item edits are categorised (related but distinct from search).