Overview

Test Sets allow you to create and execute bulk collections of prompts to validate the accuracy and reliability of the Jedify Semantic Fusion™ Model. By comparing current model outputs against a known “Golden Standard,” you can ensure Jedify’s responses remain trustworthy. For each question in a test set, you can see which semantic entities (concepts and metrics) it uses; this applies to all question types, including those from the Semantic Catalog. If an entity has been deleted from the semantic model, it still appears in the list so you can see what the question referenced. Test sets also indicate whether any question uses a deleted entity and which runs are currently pending or running.

Purpose & Objectives

  • Continuous Monitoring: Automatically monitor the accuracy of the Semantic Fusion™ Model over time to detect drift or anomalies.
  • Regression Testing: Safely deploy updates or changes to the semantic model by validating that existing questions still produce the correct SQL and data outputs.
  • Consistency Checks: Verify that the model produces deterministic results across multiple execution cycles.
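
The consistency check above can be sketched in code. This is a hypothetical illustration, not Jedify’s actual implementation: run the same question for several cycles and treat it as deterministic only if every cycle returns an identical result.

```python
# Hypothetical sketch: classify a question as consistent or inconsistent
# by comparing the results of multiple execution cycles.
# (Illustration only -- not Jedify's internal logic.)

def check_consistency(run_question, iterations=3):
    """Run the same question several times and compare the results."""
    results = [run_question() for _ in range(iterations)]
    # Deterministic means every cycle returned exactly the same output.
    consistent = all(r == results[0] for r in results)
    return consistent, results

# Example with a deterministic stand-in for the model:
consistent, _ = check_consistency(lambda: [("EMEA", 1200), ("APAC", 950)])
```

Any nondeterminism across cycles (different rows, different SQL) would flip `consistent` to `False`.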

Step 1: Creating a New Test Set

  1. To begin validating your data, navigate to the Test Sets page in the main menu under the Semantic Fusion™ Model.
  2. Click the + New Test Set button.
  3. Enter a descriptive Name and Description for the set.
  4. Click Create.

Step 2: Adding Questions to a Test Set

A Test Set comprises individual questions. Each question requires a “Golden Standard” — the confirmed correct response against which the model is tested.
  1. To add a question to the test set, open a previously asked question through your History or ask a new one from the main homepage.
  2. Click on the (⋮) on the bottom left of the response, and select Add To Test Set.
  3. Choose the test set you wish to add this question to and click Done.
  4. Repeat for additional questions to capture a comprehensive test set.
You can add both regular (agent) questions and Semantic Catalog questions; in both cases, the semantic entities each question uses are shown.

Step 3: Running a Test Set

Once your set is populated, you can execute it on demand or on a schedule (coming soon).
  1. Select the Test Set you wish to evaluate.
  2. Click the Run All Tests button on the top right corner.
  3. Configure Iterations: You will be prompted to select the number of execution cycles (1, 2, or 3).
  4. Confirm to begin execution.
While a run is in progress, the test set will show that run as pending or running. You can filter or sort test sets by status, including Not evaluated for sets that have not been run yet.

Step 4: Interpreting Results

After execution, Jedify marks each test as pass, fail, or inconsistent by comparing the actual results of each question with the expected results. For each question, Jedify compares:
  • The output data
  • The SQL generated
  • The entities selected (the semantic entities — concepts and metrics — used for that question)
For any inconsistencies detected, Jedify highlights possible reasons and recommended actions to resolve them within the semantic model.
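
The grading described above can be sketched as follows. The field names (`data`, `sql`, `entities`) and the exact rules are illustrative assumptions for this example, not Jedify’s internal format:

```python
# Hypothetical sketch of grading one test question against its Golden
# Standard. Field names and grading rules are illustrative assumptions.

def grade_question(golden, cycles):
    """Compare each execution cycle against the Golden Standard."""
    matches = [
        cycle["data"] == golden["data"]
        and cycle["sql"] == golden["sql"]
        and set(cycle["entities"]) == set(golden["entities"])
        for cycle in cycles
    ]
    if all(matches):
        return "pass"
    if not any(matches):
        return "fail"
    # Some cycles matched and some did not: the result is nondeterministic.
    return "inconsistent"

golden = {"data": [("2024-10", 500)], "sql": "SELECT ...", "entities": ["Revenue"]}
print(grade_question(golden, [golden, golden]))  # → pass
```

With more than one iteration configured in Step 3, the “inconsistent” outcome becomes possible: some cycles match the Golden Standard while others do not.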

Semantic entities and deleted entities

  • Each question displays the semantic entities (concepts and metrics) that were used. This list is kept in sync with your current semantic model.
  • If an entity has been deleted from the model, it still appears in the list for that question and is marked as deleted. That way you can see what the question was using and decide whether to update or remove the question from the test set.
  • A test set may show that it uses a deleted entity when one or more of its questions reference an entity that has since been deleted. Use this to prioritize cleanup or updates.
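
As a rough illustration of the deleted-entity flag (the data model here is invented for the example, not Jedify’s):

```python
# Hypothetical sketch: flag test-set questions that reference semantic
# entities no longer present in the current model.

current_entities = {"Revenue", "Region", "Customer"}  # still in the model

questions = [
    {"text": "Total revenue by region", "entities": ["Revenue", "Region"]},
    {"text": "Gross margin by quarter", "entities": ["Gross Margin"]},  # deleted
]

def deleted_entities(question):
    """Return the entities a question uses that no longer exist."""
    return [e for e in question["entities"] if e not in current_entities]

flagged = [q["text"] for q in questions if deleted_entities(q)]
print(flagged)  # → ['Gross margin by quarter']
```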

Filtering and sorting test sets

  • By status: You can filter test sets by the status of their last run (e.g. completed, failed, running). Use Not evaluated to see only test sets that have never been run.
  • Pending or running runs: For each test set, you can see which runs are currently pending or running. When a run finishes (completed, stopped, or failed), it no longer appears in that list.

Tips on Defining a Comprehensive Test Set

To ensure your test sets effectively catch regressions and drift, aim for variety and depth rather than just volume. A well-constructed test set should cover the full spectrum of your data model’s capabilities.

1. Vary Query Complexity

Don’t rely solely on simple lookups. Your test set should include a mix of difficulty levels to test the model’s logic:
  • Simple: Single table selects (e.g., “List all active customers”).
  • Intermediate: Aggregations and groupings (e.g., “Total revenue by region for 2024”).
  • Complex: Multiple metrics, filtering, and advanced logic (e.g., “Show me the top 3 products by sales for each quarter compared to the previous year”).
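
For instance, the first two difficulty levels might correspond to SQL like the following, run here against a tiny in-memory SQLite database (the table and column names are invented for this demo; your semantic model’s schema will differ):

```python
# Illustrative SQL for the "simple" and "intermediate" levels, executed
# against a throwaway in-memory SQLite database with an invented schema.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (name TEXT, active INTEGER);
    CREATE TABLE sales (region TEXT, year INTEGER, revenue REAL);
    INSERT INTO customers VALUES ('Acme', 1), ('Globex', 0);
    INSERT INTO sales VALUES ('EMEA', 2024, 500), ('EMEA', 2024, 250),
                             ('APAC', 2024, 300);
""")

# Simple: single-table select ("List all active customers")
active = conn.execute(
    "SELECT name FROM customers WHERE active = 1").fetchall()

# Intermediate: aggregation and grouping ("Total revenue by region for 2024")
by_region = conn.execute("""
    SELECT region, SUM(revenue) FROM sales
    WHERE year = 2024 GROUP BY region ORDER BY region
""").fetchall()

print(active)     # → [('Acme',)]
print(by_region)  # → [('APAC', 300.0), ('EMEA', 750.0)]
```

A “complex” question would layer window functions or per-group rankings on top of this kind of aggregation.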

2. Target “Golden” KPIs

Ensure that your organization’s most critical metrics are heavily represented. If your team relies on “Gross Margin” or “Daily Active Users,” create multiple test questions that derive these metrics in different ways to ensure the semantic definition remains stable.

3. Test for Edge Cases & Nulls

The model needs to handle data anomalies gracefully. Include questions such as:
  • Zero Results: Queries that should correctly return no data (e.g., “Sales in Antarctica”).
  • Ambiguity: Queries using synonyms or business slang to ensure the semantic layer maps them correctly (e.g., using “bookings” vs. “orders”).
  • Null Handling: Queries on columns that may contain partial data.
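
The zero-result and null-handling cases can be demonstrated concretely. This uses an invented schema in an in-memory SQLite database purely for illustration:

```python
# Illustrative check of zero-result queries and null handling against a
# tiny in-memory SQLite database (invented schema, demonstration only).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (region TEXT, amount REAL);
    INSERT INTO sales VALUES ('EMEA', 100), ('EMEA', NULL), ('APAC', 50);
""")

# Zero results: a query that should correctly return no rows.
antarctica = conn.execute(
    "SELECT * FROM sales WHERE region = 'Antarctica'").fetchall()

# Null handling: SUM ignores NULLs, but COUNT(*) counts all rows, so
# averages computed two different ways can silently disagree.
total, rows, non_null = conn.execute(
    "SELECT SUM(amount), COUNT(*), COUNT(amount) FROM sales").fetchone()

print(antarctica)             # → []
print(total, rows, non_null)  # → 150.0 3 2
```

Golden Standards for questions like these confirm the model returns an empty result (rather than an error or fabricated rows) and aggregates partial data correctly.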

4. Test Fixed Time Dimensions

For regression testing and consistent validation, all time-based queries must use fixed, non-moving dates. This ensures the Expected Data Output is stable over time, allowing for a reliable comparison.
  • Avoid: Dynamic phrases like “last month,” “yesterday,” or “Year to Date.”
  • Use fixed parameters: Specify exact date ranges in your prompts and reference SQL.
    • Prompt example: “Show total sales for the entire month of October 2024.”
  • Granularity: Test aggregation by day, week, month, and year within these fixed periods to ensure the semantic model is generating the correct aggregation or grouping logic.
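
The fixed-date principle above can be shown with a small example. The schema and SQL here are invented for illustration; the point is that a pinned date range yields the same expected output no matter when the test runs:

```python
# Illustrative fixed-date aggregation in SQLite: the range is pinned to
# October 2024 so the expected output never drifts. Schema is invented.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (sale_date TEXT, amount REAL);
    INSERT INTO sales VALUES
        ('2024-10-05', 100), ('2024-10-20', 200), ('2024-11-02', 999);
""")

# Fixed range, not "last month": stable regardless of the run date.
october_total, = conn.execute("""
    SELECT SUM(amount) FROM sales
    WHERE sale_date BETWEEN '2024-10-01' AND '2024-10-31'
""").fetchone()

# Monthly granularity within fixed periods.
by_month = conn.execute("""
    SELECT strftime('%Y-%m', sale_date) AS month, SUM(amount)
    FROM sales GROUP BY month ORDER BY month
""").fetchall()

print(october_total)  # → 300.0
print(by_month)       # → [('2024-10', 300.0), ('2024-11', 999.0)]
```

Had the filter been “last month,” the same question would produce a different result every month, making the Golden Standard comparison meaningless.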
Pro Tip: Start small with a “Smoke Test” set containing your top 10 most used queries. Once that is stable, expand into a “Deep Regression” set covering the edge cases above.