Base Datasets
Base Datasets are a high-performance, scalable dataset type in Qrvey, designed to solve slow data loads when a dataset acts as a source for other datasets. Instead of requiring extraction from Elasticsearch indexes, Base Datasets store their output as files in a data lake (such as Amazon S3 or Azure Blob Storage), allowing fast, direct retrieval for joins, unions, and Managed Dataset creation. Base Datasets cannot be used directly for analytics, dashboards, or reports; they serve only as sources for Managed Datasets.
Why Base Datasets?
Base Datasets are pre-processed datasets optimized for:
- Fast, scalable data loading from files (not indexes)
- Efficient joins and unions when creating Managed Datasets
- Handling large data volumes and complex transformations
Key Features:
- Data is stored in a lake as files, enabling direct, high-speed access
- All dataset capabilities (formatting, sync, transformation) are applied before storage
- Improved performance for Managed Dataset creation and syncs
- Support for large-scale data loads (e.g., 1M+ records)
- Robust error handling and data integrity during reloads
How Base Datasets Work
Technical Motivation
Traditionally, Qrvey stored dataset information in Elasticsearch indexes, which made extraction slow whenever a dataset was used as a source for other datasets. Base Datasets solve this by storing their output as files in a data lake, which is the fastest data processing path in Qrvey.
Performance Enhancements
- Data is split into files and "baskets" for parallel processing and fast loading.
- Managed Datasets created from Base Datasets can join or union multiple sources directly from the lake, eliminating slow extraction steps.
- Future syncs and reloads are faster due to optimized file storage and incremental updates.
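
To make the idea of splitting data into files and baskets for parallel loading more concrete, here is a minimal Python sketch. It is not Qrvey's implementation: the folder layout, the JSON Lines file format, and the names `write_baskets`, `read_file`, and `load_parallel` are all assumptions made for illustration.

```python
import json
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def write_baskets(records, root, rows_per_file=2, files_per_basket=2):
    """Split records into small JSON Lines files grouped into basket folders.

    The layout (basket_NNN/part_NNNNN.jsonl) is hypothetical.
    """
    root = Path(root)
    paths = []
    for start in range(0, len(records), rows_per_file):
        file_index = start // rows_per_file
        basket = root / f"basket_{file_index // files_per_basket:03d}"
        basket.mkdir(parents=True, exist_ok=True)
        path = basket / f"part_{file_index:05d}.jsonl"
        with path.open("w", encoding="utf-8") as handle:
            for record in records[start:start + rows_per_file]:
                handle.write(json.dumps(record) + "\n")
        paths.append(path)
    return paths


def read_file(path):
    """Read one JSON Lines file into a list of dicts."""
    with Path(path).open(encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


def load_parallel(paths, max_workers=4):
    """Read all files concurrently and flatten the results."""
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # executor.map yields results in the same order as the input paths.
        chunks = list(executor.map(read_file, paths))
    return [record for chunk in chunks for record in chunk]


if __name__ == "__main__":
    rows = [{"id": i, "value": i * 10} for i in range(7)]
    with tempfile.TemporaryDirectory() as lake:
        files = write_baskets(rows, lake)
        loaded = load_parallel(files)
    print(len(files), "files,", len(loaded), "records")
```

Because each file is independent, the reads can run concurrently, and the loaded records keep their original sequence because `executor.map` returns results in input order.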
Create a Base Dataset
- Go to Data > Datasets.
- Click Create New Dataset > New Base Dataset.
- Select a data source (existing connection or new connection).
- Configure data sync and transformation options as needed.
- Click Save to create the Base Dataset.
- Load the dataset to begin processing data. The output is stored in the data lake as files.
Base Datasets are not available for direct analytics or reporting. They are only used as sources for Managed Datasets.
Save as Base Dataset
You can also use the Save as Base Dataset feature to create a Base Dataset from an existing dataset. This allows you to optimize data storage and performance for downstream Managed Datasets.
Use Base Datasets to Create Managed Datasets
- Managed Datasets can be created by joining or unioning one or more Base Datasets.
- The join process is highly optimized for speed and efficiency, as data is read directly from the lake files.
- Future syncs and reloads of Managed Datasets are faster, as the underlying Base Datasets are optimized for fast loading and incremental updates.
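
As a rough illustration of joining and unioning sources by reading their files directly, the following Python sketch performs a hash join and a union over CSV files. The CSV format, the column names, and the helpers `write_csv`, `read_csv_files`, `inner_join`, and `union` are hypothetical; the sketch shows the general technique, not how Qrvey builds Managed Datasets internally.

```python
import csv
import tempfile
from pathlib import Path


def write_csv(path, fieldnames, rows):
    """Write rows (dicts) to a CSV file with a header line."""
    with Path(path).open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)


def read_csv_files(paths):
    """Yield rows from each file in turn, one row at a time."""
    for path in paths:
        with Path(path).open(newline="", encoding="utf-8") as handle:
            yield from csv.DictReader(handle)


def inner_join(left_paths, right_paths, key):
    """Hash join: index the right side by key, then stream the left side."""
    index = {}
    for row in read_csv_files(right_paths):
        index.setdefault(row[key], []).append(row)
    joined = []
    for row in read_csv_files(left_paths):
        for match in index.get(row[key], []):
            # On a column name clash, the left-side value wins.
            joined.append({**match, **row})
    return joined


def union(*path_groups):
    """Union: concatenate the rows of every source, in order."""
    return [row for paths in path_groups for row in read_csv_files(paths)]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        lake = Path(tmp)
        write_csv(lake / "orders.csv", ["order_id", "customer_id"],
                  [{"order_id": "1", "customer_id": "a"}])
        write_csv(lake / "customers.csv", ["customer_id", "name"],
                  [{"customer_id": "a", "name": "Ada"}])
        print(inner_join([lake / "orders.csv"], [lake / "customers.csv"],
                         "customer_id"))
```

Both operations work on the files as they are stored, so no separate extraction step is needed before the join or union starts.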
Limitations & Considerations
- Base Datasets are available in Qrvey v9.1 and later.
- In Qrvey v9.2 and later, JLO (Join Lake Optimization) improves file maintenance and reduces duplicate reads for Base Datasets.
- This dataset type cannot be used for analytics, dashboards, or reports.
- Data lake management is required for storage, especially for full reloads and compaction.
- Append and update sync options are supported, but consider the overhead of maintaining many files and of duplicate reads.
- Column discovery may differ when consuming data directly from files.
- The Analyze tab is hidden for Base Datasets, but the data can still be reviewed in a table view (only the first records are shown, not the entire dataset).
- The following features are not available for Base Datasets:
  - Advanced Tab
  - Visualization format
  - Geolocations
  - Share data with my organization
  - Internationalization
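
The consideration above about append/update syncs, many files, and duplicate reads can be illustrated with a small Python sketch. It assumes, purely for illustration, that each sync writes a new JSON Lines file and that a later version of a record replaces an earlier one with the same key. The file naming and the functions `write_sync_file`, `read_latest`, and `compact` are hypothetical and do not describe Qrvey's storage format.

```python
import json
import tempfile
from pathlib import Path


def write_sync_file(root, sequence, records):
    """Write the records from one sync run to a new, ordered file."""
    path = Path(root) / f"sync_{sequence:05d}.jsonl"
    with path.open("w", encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(record) + "\n")
    return path


def read_latest(root, key="id"):
    """Read every sync file in order; later versions of a key win.

    Returns the deduplicated records and the number of rows actually read.
    """
    latest = {}
    rows_read = 0
    for path in sorted(Path(root).glob("sync_*.jsonl")):
        with path.open(encoding="utf-8") as handle:
            for line in handle:
                record = json.loads(line)
                latest[record[key]] = record
                rows_read += 1
    return list(latest.values()), rows_read


def compact(root, key="id"):
    """Rewrite all sync files as one file holding a single row per key."""
    old_files = sorted(Path(root).glob("sync_*.jsonl"))
    if not old_files:
        return None
    records, _ = read_latest(root, key)
    next_sequence = int(old_files[-1].stem.split("_")[1]) + 1
    new_path = write_sync_file(root, next_sequence, records)
    for path in old_files:
        path.unlink()
    return new_path


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as lake:
        write_sync_file(lake, 1, [{"id": 1, "status": "new"},
                                  {"id": 2, "status": "new"}])
        write_sync_file(lake, 2, [{"id": 2, "status": "paid"}])
        print(read_latest(lake))
        compact(lake)
        print(read_latest(lake))
```

In this sketch, a record that was updated is read once for every file it appears in until compaction runs, which is why the number of files and the amount of duplicate reading tend to grow together.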