Architecting Qrvey Data Integrations with Redshift, Snowflake, Databricks & Elasticsearch
This guide explores optimization and configuration of the Qrvey platform, covering integrations with Amazon Redshift, Snowflake, Databricks, and Elasticsearch. It provides best practices for each, addressing both Live Connect datasets (real-time queries against your source) and Managed datasets (data ingested into Qrvey’s internal Elasticsearch engine) to help architect optimal solutions. The focus is on conceptual and architectural considerations, performance, and cost trade-offs.
Performance Defined
Performance in analytics platforms like Qrvey is multi-dimensional, defined by three interrelated and interdependent dimensions: Data Volume, User Concurrency, and Time. Consider the questions these dimensions raise:
- What is the chart rendering time for a single user querying a high volume of data?
- What is the chart rendering time for a high number of concurrent end-users on a low record volume?
- What happens with a high volume of users and a high volume of records on a single dataset?
- What about modest user and record counts spread across 300 datasets?
Achieving a "2-second chart render time" depends on factors such as cluster size, data volume, query complexity, and the number of concurrent users. It encompasses not only how quickly a chart renders (query performance), but also how efficiently data can be loaded or synchronized into the system (data write performance). As your data volumes and user concurrency grow, understanding and optimizing these aspects becomes critical.
The essential question isn’t just “How much data can I load?” or “How fast can I query?” Instead, focus on your business requirements:
- How quickly does data need to be available for analysis after loading?
- As concurrency increases, what realistic chart rendering performance do you need to maintain, based on any performance SLAs?
Core Performance Metrics
- Chart Rendering (Query Performance): The time it takes for a visualization to appear after a user interacts with a dashboard.
- Data Loads (Write Performance): How quickly datasets can be loaded or synchronized into Qrvey’s Elasticsearch engine.
Scaling Considerations
When you introduce scale, either in terms of data volume or user concurrency, performance challenges evolve:
- High-Volume Parallel Data Write Performance: Measures how fast the cluster can ingest large datasets, especially when multiple data loads happen simultaneously.
- High-Volume Parallel Query Performance: Assesses how quickly the system can serve queries when many users are interacting with dashboards at the same time.
Business Scenarios
Loading Large Volumes of Data Quickly:
- Challenge: Ingesting millions or billions of records into a dataset as fast as possible.
- Solution: This requires vertical scaling.
- Primarily a CPU and RAM challenge for the Elasticsearch engine.
- Increase the instance type (CPU/memory) of your cluster to speed up data ingestion.
- Scale up during heavy loads, then scale down to control costs. Consider adjusting resources on a weekly or monthly basis, as needed.
- Key Point: Costs increase with processing power. Identify the optimal trade-off point based on your business needs and configure accordingly.
Fast Chart Rendering for a Single User:
- Challenge: Ensuring a single user sees chart results instantly, even with large datasets.
- Solution: Again, vertical scaling.
- Similar to the data loading example above, this improves single-session performance.
- Shard Configuration:
- Properly configure Elasticsearch shards for the dataset powering the chart.
- Optimal shard size is typically around 25 GB.
- Shard sizing is driven by the index's size on disk, not by its record count.
- Key Point:
- Think in terms of disk space consumed by the index, not just record count.
- For example, 10 million rows with 3 columns is much smaller (and faster to query) than 10 million rows with 300 columns.
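The shard guidance above can be turned into simple arithmetic. The sketch below (an illustration, not a Qrvey API; the 25 GB target comes from the guidance in this section) suggests a primary shard count from an index's size on disk:

```python
import math

TARGET_SHARD_GB = 25  # target shard size from the guidance above

def recommended_primary_shards(index_size_gb: float) -> int:
    """Suggest a primary shard count so each shard lands near 25 GB on disk."""
    return max(1, math.ceil(index_size_gb / TARGET_SHARD_GB))

# A 200 GB index would get 8 primary shards (~25 GB each),
# while a 10 GB index stays at a single primary shard.
print(recommended_primary_shards(200))  # 8
print(recommended_primary_shards(10))   # 1
```

Note that the input is disk footprint, not row count: the same 10 million rows can produce a very different `index_size_gb` depending on column count and field types.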
Maintaining Performance with Increased Users and Datasets
- Challenge: Keeping chart rendering fast:
- As more users interact with dashboards.
- As the number of managed datasets grows.
- Why Datasets Matter: Each managed dataset adds shards to Elasticsearch. Hundreds of datasets mean hundreds of shards, which consume memory and disk, and can require more nodes even with low user concurrency.
- Solution: This requires horizontal scaling.
- Add more data nodes to the Elasticsearch cluster to distribute query load and shard storage.
- Key Point: Monitor both user concurrency and total shard count (datasets) to anticipate scaling needs.
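To see why dataset count drives horizontal scaling, it helps to estimate node needs from shard totals. The sketch below is a rough capacity-planning aid, assuming the common Elasticsearch rule of thumb of at most about 20 shards per GB of JVM heap per node; tune that figure for your own cluster:

```python
import math

def estimated_data_nodes(total_shards: int,
                         heap_gb_per_node: float,
                         shards_per_gb_heap: int = 20) -> int:
    """Rough data-node count needed to hold a given shard total.

    Assumes ~20 shards per GB of JVM heap per node (a common
    rule of thumb, not a hard limit).
    """
    capacity_per_node = heap_gb_per_node * shards_per_gb_heap
    return max(1, math.ceil(total_shards / capacity_per_node))

# 300 managed datasets at 2 shards each (1 primary + 1 replica) = 600 shards.
# With 16 GB of heap per node (~320-shard capacity), that needs 2 data nodes,
# regardless of how few users are querying.
print(estimated_data_nodes(600, heap_gb_per_node=16))  # 2
```

This is why hundreds of managed datasets can force horizontal scaling even at low user concurrency: shard overhead alone consumes node capacity.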
Cost Considerations for Elasticsearch Clusters
- Compute Costs: CPU and memory resources for each node/server.
- Storage Costs: Disk space used by each node/server.
- Shard Overhead Costs: Large shard counts (often from many managed datasets) increase memory usage and cluster coordination load, potentially forcing horizontal scaling.
- Scaling Costs: Additional nodes increase costs; balance performance needs with budget.
- Pricing Models: Compare on-demand vs. reserved instance pricing to optimize long-term spend.
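The on-demand vs. reserved comparison reduces to a break-even calculation. The sketch below uses made-up prices purely for illustration; substitute real figures from your cloud provider's pricing pages:

```python
def breakeven_months(on_demand_monthly: float,
                     reserved_monthly: float,
                     reserved_upfront: float = 0.0) -> float:
    """Months of continuous use after which a reserved instance
    becomes cheaper than on-demand (illustrative math only)."""
    monthly_savings = on_demand_monthly - reserved_monthly
    if monthly_savings <= 0:
        return float("inf")  # reserved never pays off at these prices
    return reserved_upfront / monthly_savings

# Hypothetical prices: $700/mo on-demand vs. $450/mo reserved with a
# $1,000 upfront fee pays for itself after 4 months of continuous use.
print(breakeven_months(700, 450, 1000))  # 4.0
```

If your cluster runs continuously past the break-even point, reserved pricing wins; if you routinely scale the cluster down or pause it, on-demand may remain cheaper.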
Plan for Future Costs and Performance
- Establish Baselines: Measure performance for a single user and your largest dataset to set expectations.
- Load Testing: Simulate incremental increases in user concurrency to identify scaling thresholds and plan for future growth.
- Proactive Scaling: Monitor cluster health and usage metrics; scale resources before bottlenecks impact users.
- For AWS, see the guide on CloudWatch Alarms.
By understanding these performance dimensions and scaling strategies, you can architect your Qrvey deployment to meet both current and future analytics needs, delivering fast and reliable insights regardless of data size or user load.
Live Connect vs. Managed Datasets in Qrvey
Live Connect: Qrvey queries your external database in real time, ensuring dashboards always reflect the latest data. There’s no ETL or data duplication. Qrvey acts as a live query layer on your data warehouse or lake, ideal for real-time analytics. No data is stored in Qrvey’s Elasticsearch; all queries are executed on the source.
Managed Datasets: Data is imported into Qrvey’s internal Elasticsearch engine and synced periodically. Each managed dataset corresponds to an Elasticsearch index, creating a materialized view for fast queries and enabling Qrvey’s full transformation capabilities (joins, unions, formulas, etc.). This approach introduces data latency (due to sync intervals) and duplicates storage.
Key Differences and Trade-offs
- Data Freshness:
- Live Connect: Up-to-the-minute data, ideal for real-time reporting requirements.
- Managed: Data freshness is bounded by the sync interval, so it is never real-time. Ideal for high-volume, historical data.
- Transformations and Joins:
- Managed: Supports joins/unions and on-the-fly calculations within Qrvey.
- Live Connect: No joins or multi-source unions within Qrvey; all blending must be done in the source (e.g., SQL views).
- Performance Load:
- Live Connect: Query load is on the external database. Ensure your source is designed for analytics workloads.
- Managed: Query load is offloaded to Qrvey’s Elasticsearch cluster.
- Connectivity:
- Managed datasets work with virtually any database or cloud storage system.
- Live Connect is supported for Redshift, Snowflake, PostgreSQL, and Databricks (as of v9.0).
When to use which?
- Use Live Connect when you need real-time, up-to-the-second reporting.
- Use Managed for in-platform data modeling, combining sources, offloading query workloads, and performing historical analysis.
- Many deployments use a hybrid approach, mixing both dataset types as needed.
Qrvey’s Internal Elasticsearch Engine (Managed Datasets)
Qrvey’s built-in Elasticsearch engine powers managed datasets, enabling fast, scalable analytics on ingested data. This approach is ideal for frequently queried datasets, multi-source joins, and scenarios requiring advanced in-platform transformations. Data limits on Elasticsearch are in the petabyte range, so most users are unlikely to reach its storage limits.
Performance, Cost, and Limitations (Managed Datasets)
Performance
- Monitor JVM heap usage (keep below 80%), CPU, and disk utilization (maintain at least 20% free space).
- Each dataset defaults to 1 primary shard plus 1 replica. For large datasets (greater than 25 GB on disk), increase the shard count to improve parallelism and avoid bottlenecks.
- Use Qrvey’s admin dashboard or cloud-native monitoring tools to track cluster health and resource usage.
If heap usage exceeds 80%, refer to the Business Scenarios listed above to decide whether to scale horizontally or vertically.
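The thresholds above (80% heap, 20% free disk) can be expressed as a simple decision rule. The sketch below is a simplified mapping and an assumption on our part: it pairs heap pressure with vertical scaling and storage pressure with adding capacity, per the Business Scenarios section, but real scaling decisions should also weigh shard counts and user concurrency:

```python
def scaling_action(heap_used_pct: float, disk_free_pct: float,
                   heap_limit: float = 80.0, disk_floor: float = 20.0) -> str:
    """Map the monitoring thresholds above to a suggested next step.

    Simplified sketch: heap pressure suggests a larger instance type,
    storage pressure suggests more disk or nodes, and both together
    suggest adding data nodes.
    """
    if heap_used_pct > heap_limit and disk_free_pct < disk_floor:
        return "scale horizontally: add data nodes (heap and disk pressure)"
    if heap_used_pct > heap_limit:
        return "scale vertically: larger instance type (heap pressure)"
    if disk_free_pct < disk_floor:
        return "add disk or nodes (storage pressure)"
    return "healthy: no action"

print(scaling_action(85, 35))  # heap pressure only
print(scaling_action(60, 10))  # storage pressure only
```

Feeding these checks with live metrics (from CloudWatch on AWS, or your Kubernetes monitoring stack) is what turns reactive firefighting into proactive scaling.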
For monitoring, use the tools available for your deployment type:
- AWS OpenSearch Service: Use the Monitoring tab in the AWS Console.
- Kubernetes OpenSearch Cluster: Use Qrvey’s Admin Center monitoring tab or integrate with an external monitoring tool.
- Any deployment: Qrvey Elasticsearch Management and the Qrvey Health Dashboard in the Admin Center.
For maintenance, schedule regular backups (snapshots) and clean up unused datasets to optimize storage.
Data Sync and Freshness
- Choose between scheduled pulls (batch updates) and API-triggered pushes (near real-time syncs).
- Ensure your dataset includes enough uniquely identifying columns to guarantee record uniqueness. Configure at least one timestamp column to support incremental data syncs.
- Managed datasets introduce data latency based on sync intervals; not suitable for scenarios requiring up-to-the-second data.
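The timestamp-column requirement above exists because incremental syncs filter on it. The sketch below illustrates the kind of query an incremental pull performs; the table and column names are placeholders, and production code should use parameterized queries rather than string interpolation:

```python
from datetime import datetime, timezone

def incremental_sync_query(table: str, timestamp_col: str,
                           last_sync: datetime) -> str:
    """Build an illustrative incremental-pull query: only rows
    whose timestamp column changed since the last successful sync.

    (Placeholder SQL for illustration; use parameterized queries
    in real code to avoid injection.)
    """
    return (
        f"SELECT * FROM {table} "
        f"WHERE {timestamp_col} > '{last_sync.isoformat()}' "
        f"ORDER BY {timestamp_col}"
    )

last = datetime(2024, 1, 15, 6, 0, tzinfo=timezone.utc)
print(incremental_sync_query("orders", "updated_at", last))
```

This is also why record uniqueness matters: without uniquely identifying columns, re-pulled rows cannot be matched to existing documents and would be duplicated rather than updated.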
Joins and Multi-Source Analytics
- Managed datasets support joins and unions across multiple data sources, using an intermediate “join lake” before indexing results in Elasticsearch.
- Enables advanced in-platform transformations, column selection, and multi-tenant column and record level security.
Cost Considerations
- Running the Elasticsearch cluster incurs fixed infrastructure costs, but queries are not metered—cost-effective for high query volumes on moderate data sizes.
Limitations
- Not ideal when join relationships within a dataset change frequently, since each change requires a full reload (a complete rebuild of the index).
- Data syncs are not a "true sync": records deleted at the source are not physically removed from the index.
Best Practices
- Right-size your cluster and shard configuration up front based on expected data volume and query load.
- Monitor health metrics and scale proactively as usage grows.
- Clean up unused datasets and manage dataset lifecycle (such as removal of irrelevant or expired data) to control costs.
- Test at scale before full rollout to ensure performance meets requirements.
Conclusion
In summary, Managed Datasets and Qrvey's Elasticsearch engine are optimized to support advanced features like joins, dataset transformations, and vast amounts of historical data, at the cost of some latency in data freshness.
Live Connect Datasets can provide up-to-the-second analytics, but may take a performance hit if the data is not organized and optimized for analytics.
Optimal platform configuration depends on your data sources and business needs. Qrvey is flexible—you can have multiple connections and datasets, some live, some managed. Many deployments benefit from a hybrid strategy. Remember, the architecture you put in place is not rigid; it can evolve. As data volumes increase or usage patterns change, you might switch strategies over time.