The main magic takes place in the preparation of the query strategy directly in LLM. The whole prompt can be found here, but I'll "reprint" part of it here:
Your task is to create a query strategy for a Retrieval Augmented Generation system. The strategy should:
# Instructions
- Analyze the user query and decide whether to use semantic search, full_text search or graph search.
- Identify one or more traits (e.g., Movie, Genre, Character, Setting, Theme, Series) to target.
- Rewrite the user query:
- For keyword search, extract ONLY the essential identifying terms (names, entities, specific descriptors) from the user's question. DO NOT include any additional words or reformulate the query. Extract only the raw keywords themselves.
- GOOD EXAMPLES:
- For "Who directed Inception?" → "Inception" (just the movie title)
- For "Tell me about characters in The Godfather" → "The Godfather" (just the film title)
- For "Are there any horror movies set in Alaska?" → "Alaska horror" (just location and genre)
- BAD EXAMPLES:
- "Director of Inception" (contains unnecessary word "Director")
- "Characters in The Godfather" (includes unnecessary words "characters in")
- "Horror movies set in Alaska" (includes unnecessary words "movies set in")
- For semantic search, rewrite the user query to semantically match articles that might be in the knowledge base.
- Provide one, preferably two or more strategies. Make sure each strategy is distinct.
- All SQL searches must return **no more than 20 results**. Use various techniques to select best ones if query matches more, eg. reorder results using semantic search (vector similarity).
- For graph search, use the `MATCH` clause to find related nodes and edges. Use `WITH` clause to create temporary tables for intermediate results.
- For keyword search, use `CONTAINS` clause to find related nodes and edges.
- Use `TOP 20` in the final result to limit the number of results returned.
# Data model
- There are multiple node tables: `Movie`, `Character`, `Setting`, `Theme`, `Series`, and `Genre`. with following columns with names defined exactly the same in all tables:
- `Id`: unique identifier for each node.
- `Name`: name of the node (e.g., movie title, character name, etc.).
- `Content`: text content of the node. This must be included in all outputs.
- `Embedding`: vector representation of the node for semantic search.
- There are multiple edge tables: `IN_GENRE`, `SET_IN`, `INCLUDES_THEME`, `FEATURES_CHARACTER`, `PART_OF_SERIES`.
## SQL schema of used tables
...
# Example queries:
- Similarity search on Movie:
SELECT TOP 10 Id, Content, vector_distance('cosine', Embedding, @q) AS Similarity
FROM [dbo].[Movie]
WHERE Embedding IS NOT NULL AND Content IS NOT NULL
ORDER BY Similarity ASC
- Fulltext search on Movie together with semantic search on results to get top 10 results:
SELECT TOP 10 Id, Content, vector_distance('cosine', Embedding, @q) AS Similarity
FROM [dbo].[Movie]
WHERE CONTAINS (Content, '"Indiana Jones" AND "Temple"')
ORDER BY Similarity ASC
- Semantic search on Character, graph search to find Movie related to that Character ans lastly get top 10 results ordered by amount of shared relations (edges) with the first one:
WITH topcharacters AS (
SELECT TOP 50 Id, vector_distance('cosine', Embedding, @q) AS Similarity
FROM [dbo].[Character]
WHERE Embedding IS NOT NULL AND Content IS NOT NULL
ORDER BY Similarity ASC
),
MovieMatches AS (
SELECT TOP 10 m.Content as Content, COUNT(c.Id) AS MatchedCharacterCount
FROM [dbo].[Movie] m, [dbo].[FEATURES_CHARACTER] fc, [dbo].[Character] c
WHERE MATCH (m-(fc)->c)
AND c.Id IN (SELECT Id FROM topcharacters)
GROUP BY m.Content
)
SELECT TOP 10 Content
FROM MovieMatches
ORDER BY MatchedCharacterCount DESC;
After the model, we want it to figure out how best to formulate SQL queries in order to obtain data relevant to the user's query. In order to be able to process it well, I use Structured Outputs, that is, I prescribe the structure of the answer to the model, including the description of the fields. This can be done the Pydantic way and here is the output structure:
class QueryStrategy(BaseModel):
id: str = Field(
description="Unique random identifier for this strategy"
)
message_to_user: str = Field(
description="Message to be displayed to the user, explaining the purpose of this query. Include type of each query step and its relation to other steps. Wording should be about what you are doing, example: 'Searching Movie using keyword search for Star Wars followed by graph query to get related genres and semantic movie search among results."
)
query: str = Field(
description="Rewrite query to better match knowledge base using semantic search. This field is used to create embedding and store as @q variable for command to use."
)
command: str = Field(
description="Command to be executed in the database. This should be a SQL query that can be executed directly against the database. All text-based queries should be hardcoded, just vector will be referenced as %q."
)
class QueryStrategies(BaseModel):
strategies: List[QueryStrategy] = Field(
description="List of strategies, each containing a series of steps to be executed sequentially. Each strategy is a sequence of steps, where each step depends on the results of previous steps."
)
It works surprisingly well and LLM serves up some pretty good ideas on how to access the data.