Multi-tenant Dataset Architecture
The first step with any self-service embedded analytics implementation is to design, develop and then populate the datasets that will become the sources for all charts, metrics and visualizations created in the platform. SaaS organizations building datasets for multi-tenant deployments of embedded analytics must think about the following topics with respect to the data architecture:
- Platform Specific Design Guidelines
- Commingled vs Segregated Tenant Data Models
- Handling Tenant Custom Data Schemas
There are additional considerations around dataset architecture such as data pipeline creation, hard vs soft record deletions, live connect datasets and data sync configuration. This article focuses mainly on dataset architecture as it pertains to a multi-tenancy deployment.
Platform Specific Design Guidelines
The very first design principle to understand is that a single chart, metric, visualization or interactive filter control can only be sourced from a single dataset. While a single dataset can be sourced from multiple different data connections, and even other existing datasets joined together, these joins must be configured prior to building out content. For Managed Datasets, the data is always preloaded, indexed and cached after the join logic is executed. For Live Connect datasets, the pointer to an existing source table or view must be defined at dataset creation time, and that source table/view is used as the reference data entity for all queries generated by the back-end.
There are some additional guidelines to keep in mind when designing the schema for a Qrvey dataset:
- Remember that Qrvey is your data analytics engine, and that means that your Qrvey dataset architecture is not going to be a replica of the schema you already have in your existing OLTP data source.
- All of the columns you will use to create a single chart must come from a single dataset. For this reason, most datasets in Qrvey should be composite views across multiple data source entities.
- Avoid using source queries for datasets that perform any type of grouping with aggregations. The majority of the types of charts that your users create will perform these grouping and aggregate computations in real-time.
- Qrvey dataset transformations operate at the record level, which means you cannot maintain a cursor to reference values that exist within other records.
- Qrvey dataset transformations are excellent for reformatting data for analytic purposes (e.g. standardizing strings with minor variations in char length that are associated with one another), as well as performing computations on values from multiple existing fields in the same row to produce a new value for a new field.
- There is no need to ever extract datepart components from a DateTime field, as the data engine does this automatically for you.
- Qrvey dataset transformations can handle JSON documents with complex property/object hierarchies, allowing you to filter, reformat and flatten the documents into usable columns for analytics.
- Qrvey dataset transformations support the ability to apply specific transforms only when certain boolean conditions are satisfied, and you can even map the output of a set of transforms to an entirely new dataset.
Managed datasets are not visible within the Dashboard Builder and Chart Builder UIs until at least one data load operation has completed successfully. After data has been loaded into a managed dataset, you can still make changes to the dataset configuration settings; however, some changes may require a full data reload to take effect. The following dataset configuration changes will require a full data reload:
- Changing the data type of a column (i.e. Text, DateTime or Numeric).
- Adding a new dataset transformation that needs to be executed for all existing records.
- Performing a join or a union to another data source or existing dataset.
- Assigning or unassigning a column as a unique identifier.
- Changing the underlying dataset source SQL query.
All other changes to the dataset’s configuration will take effect as soon as you apply the changes and do not require a full reload.
Commingled vs Segregated Tenant Dataset Models
The most common multi-tenant dataset architectures are the commingled model and the segregated model. In a commingled dataset model, a single data source entity with a unified schema contains all of the data for all tenants. Individual tenants have access only to the subset of data that is associated with their organization. In a segregated dataset model, each tenant has their own copy of each dataset, and the datasets only contain data associated with their organization. The decision to use the commingled vs the segregated model should be based entirely on your organization’s requirements around data security, established processes for keeping data secure, as well as any contractual obligations to customers.
Commingled Dataset Model
For the commingled dataset model, you will have a set of common datasets with common schemas across all tenants. Data for all tenants will be loaded into these common datasets. Since all users across all tenant organizations access the same set of common datasets, it becomes absolutely necessary to configure record-level security (RLS) on at least one column within each common dataset. You can learn more about how RLS security is configured and implemented for Qrvey widgets by reviewing the Record Level Security article.
The biggest advantage to the commingled dataset model is the simplicity of its implementation, deployment process and maintenance when in production. Specifically, the commingled dataset model provides the following technical benefits:
- Vastly reduced index count in your Elasticsearch cluster
- Individual tenant application workspaces are entirely optional
- Content Deployment jobs between environments can be much faster with less objects to deploy and maintain
Each individual managed dataset in Qrvey is tied to exactly one index in your Elasticsearch cluster. In a commingled dataset architecture, you would only have a few Elasticsearch indexes to maintain; however, those few indexes would become very large over time. Depending on the volume of data that is expected for each tenant organization, multiplied by the number of tenants you expect to onboard, it may become necessary to configure sharding for specific datasets. Qrvey makes Elasticsearch sharding easy by asking you at dataset creation time what the expected record count could be. Large datasets (i.e. record counts of 250M+) will have a sharding configuration, whereas small datasets (i.e. record counts of 10M or less) will not. It is important to choose the right sizing up front, as any changes to an index’s sharding configuration will require a full and complete data reload.
Dataset Size | Number of Records | Number of Shards |
---|---|---|
Small | Up to 10M records | 1 shard |
Medium | Between 10M and 250M records | 3 shards |
Large | Over 250M records | 5 or more shards. If the node count is higher than 5, the number of shards equals the number of nodes. |
The commingled dataset architecture also allows you to maintain a single application workspace across all tenants, provided that you restrict self-service capabilities to the built-in End User Personalization (EUP) functionality. A single deployed Qrvey application works for this architecture because end users will always have a pared down view across all of the common datasets in the application by passing RLS filter values in the widget configuration. Additionally, any customizations that are made with EUP are not saved as additional objects within the application workspace. Rather, all EUP configurations are directly written to the DynamoDB platform database. Having a single application workspace comes with the additional added benefit of shorter content deployment job execution times, since you can deploy a common set of datasets and dashboards from one environment to another.
If you need to support tenant specific custom data schemas, then you can configure dedicated application workspaces for each tenant and still take advantage of the commingled dataset model for common data schemas. For this solution architecture to work, the commingled datasets will reside in a single separate application workspace. Each commingled dataset must be configured as a shared dataset so that they are accessible via dataset views that reside within each of the tenant application workspaces. You can then create and populate tenant specific custom datasets within each tenant’s application workspace.
The architecture diagrams below illustrate this concept:
This application will only contain datasets with the following configuration:
- Tenant data will be commingled
- RLS will be configured on one or more columns
- Fixed number of custom fields will be labeled as “custom1”, “custom2”...”customN”
- Data sync schedules will be configured on these datasets
- Datasets will be “shared” so that dataset views can be created
This application will contain all of the baseline assets. A dataset view should be created for each shared dataset in the master data app. This application should only be accessed by your organization's Composer users and will not have any security filters applied.
Each tenant application will have its own set of dataset views that point back to the shared datasets in the master data app. RLS filters need to be passed in order to restrict data rows for each specific tenant. When tenant users create their own charts, those charts will be bound to the dataset views that reside in their tenant application.
For content deployment purposes, the dataset views that reside in the master content app are linked to the dataset views within each tenant app. The purpose of having the master content app is to facilitate content deployment updates.
In the diagrams above, dataset views A, B and C within the master content application would all have a schema that is common/unified amongst all tenants. The tenant specific dataset views An, Bn and Cn would have identical schemas, though each column could potentially have a slightly different name for each tenant. Additionally, you could also have entire datasets that only exist within a single tenant application and nowhere else. For tenant specific datasets, the associated data source connection can simply be defined within that specific tenant application.
Segregated Dataset Model
For the segregated dataset model, the column schemas for each dataset are typically common across all tenants but each tenant will have their own copy of every dataset. For this type of dataset architecture, it is not necessary to implement RLS because datasets are only populated with data for a single tenant. However, you may wish to implement RLS to further restrict data based on the end user’s set of roles or group membership.
The segregated dataset architecture offers a few technical advantages for your solution:
Data security is not entirely dependent on RLS because every tenant has a physically separate set of indexes in your Elasticsearch cluster Datasets will be much smaller in size and will not require any special index sharding configuration Separate application workspaces for every tenant is a requirement and this allows for tenant end users to create their own dashboards, reports and workflows Maintaining shared datasets and dataset views is no longer required
The segregated dataset architecture does not require that any datasets in the solution be configured as shared datasets. Consequently, dataset views are not necessary to maintain, as each tenant application workspace will contain only managed datasets populated with their organization’s data. The following diagram illustrates this concept:
This application will contain all of the datasets and baseline/reference content:
- Data in these datasets can either be dummy data or data from one or more tenants
- Baseline/reference content must be created here and then get pushed out to tenant apps
- No tenants will ever access content in this app Tenant users will create content within their designated apps. All charts, dashboards and reports are sourced from the dataset contained within the tenant app. Datasets contained within the tenant app are managed datasets, so there is no need for dataset views.
Baseline content must exist within each tenant app to be accessible to users. Tenant datasets can contain custom fields. Custom datasets specific to a single tenant can be created within a tenant app.
In the diagrams above, datasets A, B and C within the master application would all have a schema that is common/unified amongst all tenants. The tenant specific datasets An, Bn and Cn would have the same columns as their respective source datasets, but could also contain tenant specific columns that are not present in any other tenant application. Additionally, you could also have entire datasets that only exist within a single tenant application and nowhere else. For tenant specific datasets, the associated data source connection can simply be defined within that tenant’s application workspace. Using a segregated dataset architecture may mean that your content deployment jobs take longer to execute, and there will ultimately be many more system objects to maintain, but the benefits of the extensibility that this architecture provides can outweigh the extra development effort.
NOTE: As of version 8.2, Qrvey does not support multiple, distributed Elasticsearch clusters from within a single instance of the platform. If your data security requirements mandate that tenant data not be commingled within the same Elasticsearch cluster, then you will need separate deployments of the Qrvey platform.
Handling Tenant Custom Data Schemas
Some tenants may have special data gathering requirements and metrics that they need to capture, and it’s quite common for SaaS organizations to accommodate these types of custom data capture requests. Most custom data capture requests fall under the following areas:
- An extra column or two that is directly associated with a single common dataset schema
- Some variation on the exposed friendly name of a column
- A completely custom dataset with a custom schema
You can easily support all of the scenarios described above using either the commingled or the segregated dataset models. This section will describe in detail how to support each of the tenant custom data scenarios for both architectures.
Commingled Dataset Model
In the commingled dataset model, you will have a Qrvey application that contains the commingled datasets for all tenants as well as a Qrvey application that contains dataset views that reference the shared datasets. You can support custom tenant columns by modifying the schema of the shared dataset and adding these custom columns to the schema. You can then control the visibility of these custom data columns in one of two ways:
- Create a special role in Qrvey for each tenant and utilize Column Level Security (CLS) to control visibility of each custom column in a shared dataset
- Hide all custom columns not associated with a specific tenant when deploying their workspace and dataset views
Both methods would be effective; however, simply relying on the dataset view’s configuration for which columns are visible to end users could be problematic if the dataset view were to be overwritten during a content deployment job. You would need a programmatic routine to reapply all of the column visibility settings after the content deployment job finishes executing. Alternatively, the CLS settings would apply to the shared dataset and would not be affected by content deployment jobs.
Supporting the scenario where only a custom friendly name is required for a shared column in a shared dataset is very straightforward with dataset views. You can easily change the name of a shared column in a dataset view, and that custom friendly column name will only be visible to that specific tenant’s end users. Since every tenant will have their own application workspace, creating a completely custom dataset for a tenant is also very straightforward. You can simply create the custom dataset and custom connection directly in the tenant’s application workspace.
Segregated Dataset Model
In the segregated dataset model, tenants have their own copies of each dataset that are only populated with their own data. This architecture makes it very easy to completely customize the entire schema of every dataset deployed to their application workspace. With the segregated architecture, there is still a common baseline application that contains all of the common datasets and dashboards. This baseline application becomes the release package source when onboarding a new tenant and creating their application workspace.
If you need to add one or more custom columns to a common dataset that is deployed from a baseline application, the best way to accomplish this is through a set of API calls to modify the deployed target dataset’s schema during the content deployment process. The general approach to accomplishing this is as follows:
- Create a deployment definition that deploys all assets from the baseline application but does not populate the datasets. This is important because the custom columns should be added to the schema before the dataset is populated with data.
- Execute the deployment job to create the tenant’s application workspace.
- Call the API endpoint(s) to add the tenant custom columns to the dataset schema and save the changes.
- Call an API endpoint to kick off the data load process for each dataset in the tenant application workspace.
When using the segregated dataset model, you do not need CLS since tenants never share a common dataset. As long as you ensure that the app_id
is configured correctly when tenant end users access your application, you can be sure that they will be restricted to the datasets present in their own application workspace.