Dataset Design
The Design tab enables you to transform and customize your dataset so it is ready for analysis.
Dataset Options
- Name and dataset status — The dataset name, status, last load time, number of columns, and number of records are displayed in the upper left corner.
- To change the name of the dataset, click its name and type a new one.
- To mark a dataset as a favorite, click on the star icon next to its name.
- Load/Reload Dataset
- Click Load Dataset to begin using a new dataset.
- Click Reload Dataset after making changes and applying changes to an existing dataset.
- Inspect Data Source — Click the 'refresh' icon to the left of the Data Source Pill to check for any changes made to the dataset.
- Discard Changes and Apply Changes — Changes on this page are not saved until applied.
- Data Source Pill — Click the three-dot menu to display the following options:
- Hover over the pill to display a tooltip with the Source Name, Connection Name, and Connection Type.
- Edit Connection Settings
- Edit Query
- Join to Another Data Source
- Union with Another Data Source
- Switch to Another Data Source
- Rename
- Dataset Tabs
- Columns — View each column in the dataset and modify its settings as needed. For details, see Managed Datasets.
- View Source Data — Provides a read-only preview of your data as a simple table.
- Data Sync — Set a schedule to automatically synchronize the Dataset with the source data.
- Advanced Settings — See Advanced Settings below.
Transformations
Use Transformations to manipulate data within existing columns or create new columns so the dataset is analysis-ready.
Create a Transformation
- Click the Transformations button in the Design Page. A modal will appear.
Predefined transformation functions are presented in a list on the right. - Drag a transformation type onto the workspace.
When working with a single Data Source, you will have a single workspace. - Configure as desired.
- Once a transformation is complete, click outside the transformation panel. It will save automatically.
Please note:
- A a data load/reload is required for any transformation to take effect.
- If transformations are added or modified, but the data is not reloaded, it will only apply to future records or loads.
For example, this situation may arise, when a data sync process is initiated after the transformations have been added. - Click and drag to reorganize them. They run in sequence, and a transformation may depend on results of previous ones.
- Within the Design > Columns tab, any resulting columns(s) can be identified with the Fx icon to the right of the column.
Configuration
The following configurations are standard across most transformations.
- Enable/Disable — Toggle the checkbox to the left of the Transformation name to enable or disable it.
- Transformation Name — Defaults to "Untitled". Click and type to edit. We recommend using the same as your resulting output column.
- Special Configuration Options — Configurations are unique for each transformation type, shown below.
- Output Choices — When creating a transformation, you will be able to decide if you want the result to update existing columns or create new ones by selecting either the Update Existing or Create New option.
- Output Column Name — If the Create New option is selected, you need to give your new column a name.
Note: Column names must be unique across the entire dataset.
- In Case of Record Error — This feature allows you to decide how possible errors should be handled. Only available for transformations that don’t cause errors, such as text type transformations. The three options are:
- Reject Record — The whole record will not be skipped if an error happens.
- Ignore Error — The record remains as-is.
- Replace Value — The value will be replaced with the entered value in case of an error.
- Test Transformation — This links to the Testing Tab, with this transformation already selected.
- Action Menu — Use this menu to access the Delete command for the transformation.
Transformation Types
Numeric
Perform mathematical operations on numeric data columns, tokens, or manually entered numbers. For example, multiply the price
column by 0.08
to create a salesTax
column. Use tokens by typing {{token}}
; numeric tokens are required.
- Addition — Adds values from multiple columns, tokens, or constants.
- Division — Divides values from multiple columns, tokens, or constants.
- Multiplication — Multiplies values from multiple columns, tokens, or constants.
- Round — Rounds values from multiple columns, tokens, or constants to the nearest integer or specified number of decimal places. For multiple columns, updates are applied in-place, and new column creation is unavailable.
- Subtraction — Subtracts values from multiple columns, tokens, or constants. Configure error handling and choose to update an existing column or create a new one.
Text
Perform operations on text data columns for standardization or analysis.
- Concatenate — Combines multiple text columns into a single value, separated by a specified delimiter. Adjust column order and manage inclusion using + and - icons.
- Lowercase — Converts text to lowercase. For multiple columns, updates are applied in-place.
- Text Analysis — Analyzes text columns for language, sentiment, or key phrases. Results are stored in a new object column.
- Text Replace — Replaces specified text with a new value. Optionally case-sensitive. Choose to update an existing column or create a new one.
- Trim — Removes specified characters from the start or end of text. For multiple columns, updates are applied in-place, and new column creation is unavailable.
- Uppercase — Converts text to uppercase. For multiple columns, updates are applied in-place, and new column creation is unavailable.
Date
Perform operations on date columns using tokens, numbers, or other date columns.
- Date Add — Adds or subtracts specified units (e.g., days) to/from a date column.
- Date Difference — Calculates the interval between two date columns in selected units.
Logic
Array Filter — Removes a value from a JSON array that does not match the specified condition.
- Only works with arrays of objects; arrays of values are not supported.
- Use the Array Path dropdown menu to select a specific array property to filter.
- Select Standard for simple "equals" comparisons and Advanced for filtering with custom Javascript.
Array Flatten — Inflates child objects in hierarchical JSON data to merge them into parent records.
- Useful for transforming object arrays into individual parent-child records.
- Mimics a database "join" operation, changing the structure of the dataset.
- Provides options to inflate (merge) or remove arrays.
- Example: if applied to
"cities": [ {"id": "1", "cityName": "New York"}, {...} ]
, thencity.id
andcity.cityName
columns will be added on the dataset.
Array Select Single Item — Extracts a single object or value from an array based on sorting logic.
- For object arrays, select sorting column(s) defines one or more
column.value
items to sort on by min/max criteria. - For value arrays, sorting is applied directly to the array’s values by min/max criteria.
- The extracted result can replace the source array or be stored in a new property.
Note: By default, this transformation replaces the existing source array with the single object/value selected.
- For object arrays, select sorting column(s) defines one or more
Formula — Enables ad-hoc data manipulation using JavaScript.
- Provides an autocomplete feature and a list of supported functions to assist users.
- Allows transformations of object columns and their individual properties.
- Supports creating new columns or updating existing ones with the transformed results.
- Handles complex use cases, such as updating nested properties or transforming object arrays.
- Requires careful attention to syntax and data type handling to ensure successful transformations.
If/Else — Implements branching logic for conditional transformations.
- Executes different transformation streams depending on whether a condition is met.
- Child transformations inherit the enabled/disabled state of their parent condition.
- Ideal for handling conditional data operations, such as applying different transformations based on data values.
Record Copy — Duplicates records for use in separate transformation streams.
- Works alongside other transformations, such as Dataset Write, for advanced data handling.
- Allows conditional copying based on a boolean column.
- The copied records remain separate and do not affect the original transformation stream.
- Cannot be tested independently; its effect is observed when combined with downstream transformations.
External
- Dataset Write — Writes transformed data into an existing dataset during the loading process, allowing you to load or modify/update another dataset any time you load the dataset that you are working in.
- Use Target Dataset to select an available dataset for data transfer.
- Under Record Replacement, choose to replace existing records (effectively a reload) or merge new values into existing records (modify/update).
- Allows mapping of columns from the source data, token value, or constants into a target dataset.
- Often combined with the Record Copy transformation to send transformed records to an external dataset.
- Cannot be tested independently and leaves no trace in the original dataset.
Test a Transformation
Transformations can be tested individually or collectively before applying them. Always test transformations to ensure they work as expected.
- Navigate to the Testing tab. There are two options:
- Click Test Transformation in the transformation panel, which pre-selects the transformation.
- Open the Testing tab in the Transformations dialog and manually select a transformation.
- Configure your test.
- Input: The system suggests test data as input, but you may adjust as needed.
- Run Test: Click Test to view results or errors in the Output panel.
- Console Output: View additional details like execution time and error specifics in the Console Output panel.
- Click Test and then check the example output.
Multiple Data Sources
When working with joined data sources, transformations occur in distinct workspaces:
Data Source Transformations
- Applied before the join to cleanse, augment, or improve individual data sources.
- Enhances performance by running on smaller datasets.
Join Output Transformations
- Applied after the join, allowing access to columns from all data sources involved.
- Update existing columns or create new ones to add to the dataset.
- Columns are grouped by their original data sources for easy identification.
By separating these workspaces, transformations are more efficient and easier to manage.
Geolocations
To display address information on map charts, create a Geolocation Group using the relevant address fields in your dataset.
- Click the Transformations button in the Design Page. A modal will appear.
- Map your address data columns to the corresponding address type fields in the dialog.
- Example: Map the
Zip
column to the Postal/Zip Code field.
- Example: Map the
- While any single address field is sufficient to create a Geolocation Group, mapping additional fields (e.g., country) improves accuracy.
- Assign a unique name to your Geolocation Group for easy identification.
- Creating or modifying a Geolocation Group does not require a data reload.
- Mapping unique fields like
country
enhances precision in map visualizations.
Data Sync
The Data Sync feature keeps datasets up-to-date with their data sources. You can run synchronization manually or on a schedule. Cascade synchronization is also supported when datasets are based on other datasets.
Configure a Data Sync
- Navigate to Data > Design > Advanced Settings > Data Sync.
The dataset must have been loaded at least once before setting up a sync schedule. - Click the Data Sync toggle to enable the feature.
- Configure Data Source Settings:
- Off: No synchronization for this data source. Use for static data, like geographical or employee data.
- Full Reload: Reloads all data. Use when the source lacks unique identifiers or a timestamp column, or to reflect deletions in the dataset.
- Append and Update: Updates only new or modified records since the last load.
This method is efficient but requires specific conditions. Requires:- A unique identifier. Open the Design tab, click Columns, and toggle Unique ID for the desired column.
- A timestamp column indicating when records were added/updated.
Only processes records with timestamps between the last load and "now minus one minute." - Does not delete records removed from the source.
Note: If performance degrades over time, consider optimizing the join lake. See Join Lake Optimization.
- Click Sync Now for a manual sync.
- Under Schedule, configure sync scheduling options.
- Click Apply Changes to save the configuration.
- Repeat these steps for all data sources in the dataset.
- Data remains accessible during synchronization as the process runs in the background.
- Refreshed data is only available after the sync finishes.
- Multiple synchronizations for the same dataset cannot run simultaneously.
- You can also use Qrvey's Automated Workflows or APIs to synchronize data.
Schedule a Data Sync
You can schedule syncs using two methods:
- Basic: Set schedules from every minute to every month using drop-down menus.
- CRON: Use AWS CRON expressions for advanced scheduling.
Test the expression to ensure proper syntax (see AWS CRON Documentation).
Scheduling Tips:
- Schedule syncs during low-traffic hours to reduce resource strain.
- Stagger multiple sync schedules to avoid conflicts.
- Ensure syncs are spaced far enough apart to prevent overlapping.
Note: Times are in GMT.
Configure a Cascade Synchronization
Cascade synchronization triggers a sync automatically when underlying datasets are updated. No schedule is required.
- Under Schedule, select When data sources are updated.
- Ensure the dataset contains at least one other dataset as a data source.
Run a Manual Sync
To run a manual synchronization:
- Open the dataset and go to Data Sync.
- Toggle Data Sync to activate it.
- Configure the Sync Type under Data Source Settings.
- Click Sync Now. The sync runs immediately, and scheduled syncs will continue as configured.
Determine Sync Logic
- Does the Dataset Need Scheduled Refreshing?
- If not, consider manual updates or automation triggers.
- If yes, proceed to step 2.
- Determine the Best Sync Type for Each Data Source.
- Does this Data Source Need Refreshing?
- If static, set Sync Type to Off.
- Does the Data Source Have a Unique Identifier and Timestamp?
- If yes: Do Deletions Need to Be Reflected?
- If no, use Append and Update.
- If yes, use Full Reload.
- If missing identifiers or timestamps, use Full Reload.
- If yes: Do Deletions Need to Be Reflected?
- Adjust the Query Start Time (if needed)
- Usually, defaults are appropriate, but adjust for:
- Historical data added before the last load.
- Setting the Query Start Time to the earliest record's timestamp ensures inclusion.
- Choose the Sync Frequency
- Keep syncs as infrequent as possible to reduce resource consumption.
- Schedule during off-hours or low-traffic periods.
- Enable Data Sync
- Use the toggle to activate synchronization.
- (Optional) Run a Manual Sync
- Use Sync Now to initiate immediate synchronization.
- Save Changes
- Click Apply Changes to persist your configuration.
- Monitor Performance
- If performance degrades over time, optimize the join lake (see Join Lake Optimization).
A Note on Sync Logic
Starting with version 7.2, Qrvey uses the previous data load’s timestamp to sync new or updated records. If your use case requires the older logic (using the newest value stored), set the environment variable DATASYNC_START_TIME_MODE
to FromIndex
in the AWS Lambda deployment.
Lambda: deploymentId_dataload_drInit
Environment Variable: DATASYNC_START_TIME_MODE
Value: FromIndex
Advanced Settings
- Share Data — Toggle whether the dataset available to the rest of the organization. If enabled, users can create new datasets from this dataset.
- Pivoting for Analysis — Configure data for pivot and export.
- Click Configure Pivoting. A modal will appear.
- Configure your pivot table as desired:
- Pivot Column — Choose the field whose unique values get transformed into columns.
For example, if you have a dataset with a column called "Month," using it as a pivot column will create separate columns for each month (e.g., January, February, March). - Pivot Value — Define the column who's value will be displayed. The first row value for each unique Pivot Column and Row Key Column combination will be displayed. Note: Aggregation of pivoted data is not supported. If there are multiple rows that have the same Pivot Column and Row Key Column combination, the first row's value will be shown, the others will not be included.
- Row Key Columns — Choose fields whose unique values become the rows of the pivot table. This is what you categorize or group the data by in rows.
For example, if you have a column called "Region" and use it as the row key column, each region (e.g., North, South, East, West) will become a row in the pivot table, with 4 rows in total. - Pivot Label — Optional. Assigns labels for the columns generated from the Pivot Column.
- Pivot Column — Choose the field whose unique values get transformed into columns.
- Click Save to confirm. The modal will close.
- Under "Pivoting for Analysis", click the Enable radio button.
- Click Apply Changes in the top right tool bar area.
- Label Internationalization — Specifies whether to enable internationalization in the dataset. If enabled, the dataset name and column names can be translated to multiple languages. For more information, see Configuring Dynamic Labels for Datasets.
- Discovered Columns — When a new column is discovered during a Data Source query, specifies whether to automatically add it to the dataset.
- Estimated Dataset Size — Specifies the estimated size of the dataset based on its number of records. This option determines the number of shards allocated to the dataset’s ES index, which helps the system optimize the performance of loading and displaying data. Defaults to Small:
- Small — Up to 10M records (1 shard).
- Medium — 10M and 250M records (3 shards).
- Large — Over 250M records (5 or more shards).
If the node count is higher than 5, the number of shards equals the number of nodes.
Data Joins
Create a new dataset by joining multiple data sources from multiple connection types. Datasets must have at least one column that can be matched in order to join them.
- Click the three-dot menu on the Data Source Pill and select Join to Another Data Source.
- On the Select the Data Source dialog, select the data source that you wish to join. Click the Connection, My Data, or Shared Data tabs to display existing data sources.
- Click Next. The Join Data dialog displays.
- Under Join Type, select the join type that you would like to use.
- Left Join - Keeps all rows of data from the left table and the matching rows from the right table.
- Right Join - Same as the left join, it uses rows of data from the right table and the matching rows from the left table.
- Inner Join - Uses only the rows of data that are common in the joined data.
- Full Join - Uses all records in all joined tables.
- Click the dropdown box for each dataset and choose the common column(s).
- Click Save. The Join Output pill displays on the Design page.
- Click Join Output to display all the combined columns from the joined datasets. This is the view of the output columns that will be loaded to the dataset. Click on any data source pill to filter the view to show only columns from the selected data source.
- Repeat the join process as needed to achieve the desired structure for your dataset.
- Click Apply Changes to make the joined datasets available to create charts, pages, and reports.
A joined dataset may include a large number of columns that can be difficult to navigate. To filter the list of columns, click the data source pill. The list updates to display only the columns in that data source.
Data Unions
Unions combine data from multiple sources of similar structure into a single dataset.
- Click the three-dot menu on the Data Source Pill and select Union with Another Data Source.
- Choose the data source you want to union with the base source.
The data sources will appear in a group visually representing the union and all combined columns will display under Columns.
Data Source Rules
- Data sources must be added to the union one at a time.
- Data sources in a union can be of different types, such as CSV files, datasets, or SQL queries.
However, Push API data sources are not supported. - The unionized dataset acts as a single data source and can be joined with any number of other data sources.
Column Handling
- Columns are matched by name between data sources, and the matching is case-sensitive.
- Transformations can be applied to participating data sources to rename or modify columns before creating the union.
- Data types can vary across sources. The unionized column type is determined by the first data source with the column.
- Object arrays are excluded from union output. However, if an object array is flattened, its properties can be included.
Feature Stacking
- Transformations applied to individual data sources are reflected in the union output.
- Features such as RLS (Row-Level Security), Unique ID, Column Link, and Geolocation are supported.
However, configurations to a unioned dataset will not apply to its data sources. - Internationalization (i18n) tokens are supported in the union output.
Management
- The View Source Data tab is unavailable for unioned data sources.
- Removing a data source from a union does not require a data reload and will not break existing charts unless the removed source contains a used column used in those charts.
- Replacing a data source in a union does not trigger a data reload. However, the data from the new source will not load unless explicitly reloaded.