Azure Synapse, the bare bones

Carlos Cruz
6 July 2020

The background

Data warehousing has been around for many years and arose from a need to provide businesses with accurate reporting as well as to help make the best decisions for the business (Business Intelligence).

Data warehouses typically store historical data by integrating copies of transaction data from disparate sources. Data warehouses can also use real-time data feeds for reports that use the most current, integrated information. Over the last decade, business data consumption has grown many fold and the traditional data warehouse has struggled to keep up with the high demand for data analytics from business.

Microsoft introduced Parallel data warehouses which were appliances installed on-premises that leveraged the use of many server nodes to parallel process large volumes of data. Parallel data warehouses evolved to the cloud and Azure SQL Data Warehouse was born.

Azure Synapse has evolved from Azure SQL Data Warehouse with added features that essentially makes it data warehousing on steroids.

What is Azure Synapse?

Azure Synapse Analytics is a cloud-based enterprise data warehouse that leverages Massively Parallel Processing (MPP) to run complex queries across large quantities of data. It is a limitless analytics service that brings together enterprise data warehousing and Big Data analytics.

Synapse SQL uses a scale-out architecture to distribute computational processing of data across multiple nodes (MPP), the number of compute nodes ranges from 1 to 60. The unit of compute power is called a Data Warehouse Unit (DWU). Compute is separate from storage, which enables organisations to scale compute independently of the data in their system/s.

Control and Compute Nodes

Applications connect and issue T-SQL commands to a Control node, which is the single point of entry for Synapse SQL. The Control node runs the MPP engine, which optimizes queries for parallel processing, and then passes operations to Compute nodes to do their work in parallel.

The Compute nodes store all user data in Azure Storage and run the parallel queries. The Data Movement Service (DMS) is a system-level internal service that moves data across the nodes as necessary to run queries in parallel and return accurate results.

Azure Synapse Diagram

Azure Storage

Synapse SQL leverages Azure Storage to store the data. Since data is stored and managed by Azure Storage, there is a separate charge for storage consumption. The data is sharded into distributions to optimise the performance of the system. The sharding pattern to distribute the data can be chosen when defining the table. The following sharding patterns are supported:

  • Hash        
    A hash distributed table distributes rows based on the value in the distribution column. A hash distributed table is designed to achieve high performance for queries on large tables. There are several factors to consider when choosing a distribution column. Hash-distribution is used with clustered column indexes in Fact tables. There is an increased performance when two hash tables are joined on the same distribution column. Hash distributed tables offer the highest query performance for joins and aggregations on large tables and are therefore well suited for use with Fact tables.
  • Round Robin
    A round-robin table distributes table rows evenly across all distributions. The rows are distributed randomly. Loading data into a round-robin table is fast though queries can require more data movement than the other distribution methods and will be slower. Round Robin tables are the simplest to create and provide fast loading performance in
    Staging tables.
  • Replicated
    A replicated table has a full copy of the table available on every Compute node. Queries run fast on replicated tables because joins on replicated tables don’t require data movement. Replication requires extra storage, though, and isn’t practical for large tables. Small Dimension tables can be stored using replicated distribution but if the dimension table is too large hash-distribution should be used. Replicated tables provide the fastest query times on small tables.

How is Integrove using it?

Integrove is using Azure synapse within a mining processing project where the data has all the aspects of the four V’s of big data namely,

  • Volume: large amounts of data,
  • Variety: structured and unstructured data( voice recordings. Pdf files etc.)
  • Velocity: High frequency of incoming data,
  • Veracity: Trustworthiness of the data.

With Azure Synapse the project at completion will be able to meet the demands placed by a big data environment by using MPP as well as other features not discussed here such as PolyBase and External tables.

Conclusion

Azure Synapse combines the benefits of data warehousing and data modeling with the speed of MPP and the ease of Azure. From a user perspective, the data looks like it would in any data warehouse with dimensions and facts but to use an analogy – it has a supercharged engine running under the hood. The scalability and the ability to quickly get started means that data warehouses will be with us for a few more years at least.

Find out how Integrove can help you with your data analytics using the latest technology

Integrove
Cloud & Data Analytics