The Good and the Bad of Snowflake Data Warehouse

Reading time: 15 minutes

Not long ago, setting up a data warehouse — a central information repository enabling business intelligence and analytics — meant purchasing expensive, purpose-built hardware appliances and running a local data center. With the consistent rise in data volume, variety, and velocity, organizations started seeking dedicated solutions to store and process the information tsunami. This demand gave birth to cloud data warehouses that offer flexibility, scalability, and high performance. These days, Snowflake is one of the most popular options that meets these and many other important business requirements.

For everyone who is considering Snowflake as a part of their technology stack, this article is a great place to start the journey. We’ll dive deeper into Snowflake’s pros and cons, its unique architecture, and its features to help you decide whether this data warehouse is the right choice for your company.

Data warehousing in a nutshell

Before we get into Snowflake technology, let’s deal with the key concepts of data warehousing for a common understanding.

The main idea of any data warehouse (DW) is to integrate data from multiple disjointed sources (e.g., CRMs, OLAP/OLTP databases, enterprise applications, etc.) within a single, centralized location for analytics and reporting. Traditionally, it is a relational database that stores all data in tables and allows users to run SQL (Structured Query Language) queries on it.

By the type of deployment, data warehouses can be categorized as

  • on-premises — hardware and software are installed locally;
  • cloud-based — resources are deployed either in public or private cloud environments; and
  • hybrid cloud — the aforementioned capabilities are available under one roof.

Depending on the type and capacities of a warehouse, it can become home to structured, semi-structured, or unstructured data.

  • Structured data is highly organized and commonly exists in a tabular format like Excel files.
  • Unstructured data comes in all forms and shapes from audio files to PDF documents and doesn’t have a pre-defined structure.
  • Semi-structured data is somewhere in the middle, meaning it is partially structured but doesn’t fit the tabular models of relational databases. Examples are JSON, XML, and Avro files.

The data journey from different source systems to a warehouse commonly happens in two ways — ETL and ELT. The former extracts and transforms information before loading it into centralized storage while the latter allows for loading data prior to transformation.
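The difference between the two approaches boils down to the order of operations. Here is a toy Python sketch of that ordering; the records, the transform, and the "warehouse" lists are all invented for illustration:

```python
# ETL vs ELT in miniature. The data and transform are made up for this sketch;
# real pipelines would use dedicated tooling, not plain lists.
records = [{"name": " Alice ", "amount": "10"}, {"name": "Bob", "amount": "25"}]

def transform(row):
    # Normalize the name and cast the amount to an integer.
    return {"name": row["name"].strip(), "amount": int(row["amount"])}

# ETL: transform first, then load the cleaned rows into the warehouse.
etl_warehouse = [transform(r) for r in records]

# ELT: load the raw rows as-is; transform later, using the warehouse's own compute.
elt_warehouse = list(records)                          # load step
elt_warehouse = [transform(r) for r in elt_warehouse]  # transform step, done "inside"

# Both orderings end up with the same clean data.
assert etl_warehouse == elt_warehouse
print(etl_warehouse[0])  # {'name': 'Alice', 'amount': 10}
```

ELT trades a messier landing zone for the ability to lean on the warehouse's scalable compute for transformations, which is why it pairs naturally with platforms like Snowflake.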

These are the basics needed to explore the world of Snowflake and how it works.

What is Snowflake?

Founded in 2012 and officially launched in 2014, Snowflake is a cloud-based data platform provided as a SaaS (Software-as-a-Service) solution with a completely new SQL query engine. As opposed to traditional offerings, Snowflake is a tool natively designed for the public cloud, meaning it can’t be run on-premises. The platform provides fast, flexible, and easy-to-use options for data storage, processing, and analysis. Initially built on top of Amazon Web Services (AWS), Snowflake is also available on Google Cloud and Microsoft Azure. As such, it is considered cloud-agnostic.


Modern data pipeline with Snowflake technology as part of it. Source: Snowflake

With Snowflake, multiple data workloads can scale independently from one another, serving well for data warehousing, data lakes, data science, data sharing, and data engineering.


Snowflake is a near-serverless offering, meaning that as a user you don’t have to select, install, configure, or manage any software or hardware (virtual or physical) beyond choosing the number and size of compute clusters (more on this later). Also, Snowflake has a unique architecture that can scale up and down based on requirements and workloads. For example, when the number of queries increases, the system instantly mobilizes more computing resources.

Snowflake architecture overview

In most cases, if you want to build a scalable data warehouse, you need massively parallel processing (MPP) to handle multiple concurrent workloads. So, you either use shared-disk or shared-nothing MPP architecture for that.

Shared-disk vs shared-nothing architecture


  • The shared-disk architecture uses multiple cluster nodes (processors) that have access to all data stored on a shared memory disk. Nodes have CPU and memory but no disk storage, so they communicate with a central storage layer to get data.
  • The shared-nothing architecture stores and processes portions of data on different cluster nodes in parallel and independently. Each node has its own disk storage.

Snowflake combines the benefits of both architectures in its new, unique hybrid design.

Similar to the shared-nothing architecture, Snowflake uses MPP-based computing to process queries concurrently, with each node locally storing a portion of the entire data. As for the similarity with the shared-disk architecture, there is a centralized data repository holding a single copy of data that can be accessed from all independent compute nodes. As such, data management is as simple as in a shared-disk architecture, with the performance and scale-out benefits of the shared-nothing architecture.

Three layers of Snowflake architecture


Snowflake has a multi-cluster, shared-data architecture that consists of three separate tiers, namely

  • data storage layer,
  • query processing (compute) layer, and
  • cloud services (client) layer.

Physically separated but logically integrated, each layer can scale up and down independently, enabling Snowflake to be more elastic and responsive.

To understand how Snowflake works, we’ll walk you through all layers and explain their features.

Database storage layer

The database storage layer, as the name suggests, handles tasks related to secure, reliable, and elastic storage of data that comes from disparate sources. Snowflake supports both ETL and ELT processes to insert data in scheduled bulks or batches. Also, it enables continuous data ingestion from source files in micro-batches, making information available to users almost instantaneously. This feature is available through a separate Snowflake service called Snowpipe.

During loading, data is optimized, compressed, and reorganized into an internal columnar format and broken down into so-called micro-partitions. For example, if a table contains transactions, the micro-partitions may align with the days of transactions, with each day stored as a separate file.
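To make the day-per-partition example concrete, here is a toy sketch of rows being grouped into per-day "files". In reality Snowflake decides partition boundaries automatically; the transactions and the grouping below are invented purely for illustration:

```python
from collections import defaultdict

# Hypothetical transactions; per the article's example, each day of
# transactions lands in its own micro-partition (a separate file).
transactions = [
    {"date": "2024-01-01", "amount": 40},
    {"date": "2024-01-01", "amount": 15},
    {"date": "2024-01-02", "amount": 99},
]

partitions = defaultdict(list)
for row in transactions:
    partitions[row["date"]].append(row)  # one "file" per day

print(sorted(partitions))                # ['2024-01-01', '2024-01-02']
print(len(partitions["2024-01-01"]))     # 2
```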

This optimized data is stored in a cloud object storage such as S3 by AWS, Google Cloud Storage, or Microsoft Azure Blob Storage. Customers can neither see nor access these data objects. They use Snowflake to run SQL query operations.

Query processing layer

Query processing or compute layer provides the means for executing various SQL statements. It consists of multiple independent compute clusters with nodes processing queries in parallel. Snowflake calls these clusters virtual warehouses. Each warehouse is packed with compute resources, such as CPU, memory, and temporary storage required to perform SQL and DML (Data Manipulation Language) operations. Users can:

  • retrieve rows from tables,
  • load data into tables,
  • unload data from tables, and
  • delete, update, or insert separate rows in tables, etc.
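The operations above are all ordinary SQL and DML statements. As a rough illustration, the sketch below runs equivalent statements against an in-memory SQLite database rather than Snowflake itself; the table and data are invented, but the DML verbs are the same standard SQL that Snowflake's dialect supports:

```python
import sqlite3

# Standard SQL/DML run against in-memory SQLite purely for illustration;
# the same verbs (INSERT, UPDATE, DELETE, SELECT) apply in a Snowflake warehouse.
con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE orders (id INTEGER, status TEXT)")

# Load data into a table.
cur.executemany("INSERT INTO orders VALUES (?, ?)", [(1, "new"), (2, "new")])

# Update and delete separate rows.
cur.execute("UPDATE orders SET status = 'shipped' WHERE id = 1")
cur.execute("DELETE FROM orders WHERE id = 2")

# Retrieve rows from the table.
rows = cur.execute("SELECT id, status FROM orders").fetchall()
print(rows)  # [(1, 'shipped')]
```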

The warehouse size chart with the number of virtual nodes. Source: Snowflake


Virtual warehouses come in ten sizes, from X-Small to 6X-Large; each step up to the next size doubles the computing power.
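The doubling rule can be sketched in a few lines. The credit figures here are assumptions for illustration (taking X-Small as 1 credit per hour and doubling at each step, which matches the doubling of compute power described above):

```python
# Sketch of the size-doubling rule. Credit rates are assumptions for
# illustration: X-Small taken as 1 credit/hour, doubling with each size.
sizes = ["X-Small", "Small", "Medium", "Large", "X-Large",
         "2X-Large", "3X-Large", "4X-Large", "5X-Large", "6X-Large"]

credits_per_hour = {size: 2 ** i for i, size in enumerate(sizes)}

print(credits_per_hour["X-Small"])   # 1
print(credits_per_hour["X-Large"])   # 16
print(credits_per_hour["6X-Large"])  # 512
```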

Since warehouses don’t share compute resources with one another, there’s no impact on the performance of other machines if one goes down. Besides, nodes don’t store any persistent data, so losing one isn’t critical. If a failure occurs, Snowflake will recreate a new instance in minutes.

Cloud services layer

The cloud services, or client, layer hosts the services that coordinate activities across Snowflake. Like the other layers, it runs on compute instances that Snowflake provisions from the cloud provider. These services tie all Snowflake components into a cohesive whole.

Services managed in this layer include

  • authentication and access control (authenticating users and connections, managing encryption and keys);
  • infrastructure management (managing virtual warehouses and storage);
  • metadata management (storing metadata and handling queries that can be executed from it); and
  • query parsing and optimization.

The pros of Snowflake data warehouse

In this section, you will find things that make Snowflake a real deal and may serve as reasons to consider this cloud data warehouse as a solution.

Snowflake pros and cons


Adequate security and data protection

With Snowflake, data is highly secure. Users can set regions for data storage to comply with regulatory guidelines such as HIPAA, PCI DSS, and SOC 1 and SOC 2. Security levels can be adjusted based on requirements. The solution has built-in features to encrypt all data at rest and in transit, regulate access levels, and control things like IP allowlists and blocklists.

To achieve better data protection, Snowflake offers two advanced features — Time Travel and Fail-safe. Time Travel gives you an opportunity to restore tables, schemas, and databases from a specific point in the past. By default, the Time Travel retention period is one day, but Enterprise users can extend it to up to 90 days. The Fail-safe feature allows for protecting and recovering historical data. Its 7-day period starts right after the Time Travel retention period ends.
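The interplay of the two windows can be sketched as follows. The `recovery_state` helper and its timestamps are invented for illustration; it only models the article's description (a Time Travel window of up to 90 days, followed by a fixed 7-day Fail-safe window):

```python
from datetime import datetime, timedelta

# Toy model of the two retention windows described above; recovery_state
# is a made-up helper, not a Snowflake API.
def recovery_state(changed_at, now, travel_days=1):
    travel_end = changed_at + timedelta(days=travel_days)
    failsafe_end = travel_end + timedelta(days=7)
    if now <= travel_end:
        return "time-travel"   # users can restore the data themselves
    if now <= failsafe_end:
        return "fail-safe"     # recoverable, but only via Snowflake
    return "gone"

t0 = datetime(2024, 1, 1)
print(recovery_state(t0, t0 + timedelta(hours=12)))  # time-travel
print(recovery_state(t0, t0 + timedelta(days=3)))    # fail-safe
print(recovery_state(t0, t0 + timedelta(days=30)))   # gone
```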

Great performance and scalability

Thanks to the separated storage and compute, Snowflake has the ability to run a virtually unlimited number of concurrent workloads against the same, single copy of data. This means that multiple users can execute multiple queries simultaneously.

While benchmarks can be configured to perform in a certain way and fit particular use cases, most show great results for Snowflake performance. One such benchmark shows Snowflake processing 6 to 60 million rows of data in 2 to 10 seconds, which is pretty impressive. Out of the box, Snowflake has what it takes to outperform other cloud warehouse solutions with no prior fine-tuning.

When it comes to scalability, Snowflake has unique auto-scaling and auto-suspend features to start and stop warehouses depending on whether they are active or inactive. For comparison, autoscaling in Redshift is quite limited. Apart from scaling being handled by Snowflake, it can be both vertical and horizontal. Vertical scaling means the platform adds more compute power to existing virtual warehouses, e.g., upgrading CPUs. With horizontal scaling, more cluster nodes are added.
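A minimal sketch of the auto-suspend/auto-resume behavior: a warehouse suspends after sitting idle past a timeout and wakes up when the next query arrives. The `Warehouse` class and its timings are invented for this illustration:

```python
# Toy model of auto-suspend/auto-resume; the Warehouse class and its
# idle_timeout are made up for the sketch, times are in seconds.
class Warehouse:
    def __init__(self, idle_timeout=60):
        self.idle_timeout = idle_timeout
        self.running = False
        self.last_query_at = None

    def tick(self, now):
        # Suspend if idle for at least the timeout (no billing while suspended).
        if self.running and now - self.last_query_at >= self.idle_timeout:
            self.running = False

    def query(self, now):
        self.running = True          # auto-resume on demand
        self.last_query_at = now

wh = Warehouse(idle_timeout=60)
wh.query(now=0)
wh.tick(now=30);  print(wh.running)   # True  (recently used, stays up)
wh.tick(now=90);  print(wh.running)   # False (suspended after 60s idle)
wh.query(now=100); print(wh.running)  # True  (woken by a new query)
```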

Data caching

The virtual warehouse memory is used for caching. When a query is executed, data from different tables in storage is cached by different compute clusters. Then all subsequent queries can use this cache to generate results. With data in the cache, queries run up to 10 times faster.
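The core idea behind result caching is simple memoization: compute once, serve repeats from memory. The sketch below is a generic cache in Python, not Snowflake's actual mechanism; the `run_query` function and its "results" are invented:

```python
# Minimal result-cache sketch: the first execution "computes" the result,
# repeated queries are served from cache. run_query is a stand-in, not Snowflake.
cache = {}
executions = []

def run_query(sql):
    if sql not in cache:
        executions.append(sql)            # simulate the expensive scan
        cache[sql] = f"result of {sql}"
    return cache[sql]

run_query("SELECT 1")
run_query("SELECT 1")   # served from cache, no second execution
print(len(executions))  # 1
```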

Micro partitions

A really powerful element of the tool is that data stored in Snowflake comes in the form of micro-partitions. These are contiguous units of storage that hold data physically. They are called “micro” because their size ranges from 50 to 500 MB before compression. Micro-partitioning happens automatically as data is loaded, so users don’t need to define or maintain partitions themselves.

How Snowflake stores data in micro-partitions. Source: Snowflake

Within each micro-partition, data is stored in a columnar data structure, allowing better compression and efficient access only to those columns required by a query. As shown in the picture above, the 24 rows from the table are stored and sorted across 4 micro-partitions by column. Repeated values are stored only once. Let’s imagine that you need data from two different tables for your SQL query to be executed. Instead of copying both tables fully to the compute cluster, Snowflake retrieves only the relevant micro-partitions. As a result, the query takes less time to complete.
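The retrieval of only relevant micro-partitions is often called pruning: per-partition metadata (such as min/max values per column) lets the engine skip partitions that cannot match the query. The sketch below invents a metadata layout to show the idea; it is not Snowflake's internal format:

```python
# Pruning sketch: per-partition min/max metadata lets a query skip
# partitions that cannot contain matching values. Layout is invented.
partitions = [
    {"min": 1,  "max": 10, "rows": [3, 7, 9]},
    {"min": 11, "max": 20, "rows": [12, 18]},
    {"min": 21, "max": 30, "rows": [25]},
]

def query_equals(value):
    scanned, hits = 0, []
    for p in partitions:
        if p["min"] <= value <= p["max"]:   # metadata says: maybe relevant
            scanned += 1
            hits += [r for r in p["rows"] if r == value]
        # otherwise the partition is pruned without reading its rows
    return scanned, hits

print(query_equals(18))  # (1, [18]) -- two of three partitions skipped
```

The payoff grows with table size: the more partitions whose value ranges exclude the predicate, the less data ever leaves storage.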

Light learning curve

Many people think that setting up and using a data warehouse properly is a difficult task requiring solid knowledge of different tech stacks. That holds true for technologies like Hadoop or Spark, which have a steep learning curve due to completely new syntax. Things are different with Snowflake since it is fully SQL-based. Chances are, you already have some experience with BI or data analysis tools that work on SQL, and most of what you know can be applied to Snowflake. Not to mention that SQL is an easy-to-learn language, which is a significant benefit for general users. Add to that an intuitive UI that fits the needs of users both with and without coding experience.

Zero management

One of the real strengths of Snowflake is the serverless experience it provides. Well, almost serverless, to be exact. As mentioned above, as a user you don’t have to go behind the scenes of Snowflake’s work. The platform handles all the management, maintenance, upgrades, and tuning tasks. It also takes care of all aspects of software installation and updates. And it goes for all types of users — general end-users, business analysts, or data scientists.

With almost zero management, you can be up and running in Snowflake in minutes, start loading your data, and derive insights from it. Why “almost” then? You still need to set the number of warehouses and configure the sizes of compute clusters per warehouse. These tasks require knowledge of SQL as well as an understanding of how data warehouse architecture works. Hence, Snowflake is almost, but not completely, serverless.

Connectors, tools, and integrations

Connectivity is one of the strengths of the platform. Users can access data in a variety of ways such as the Snowflake Web UI, the Snowflake Client command-line interface, and a set of connectors and drivers like ODBC and JDBC.

Web UI. A web-based user interface is leveraged to interact with cloud services. Users can manage their account and other general settings, monitor usage of resources, and query data as well.

Command-line interface. Snowflake provides a Python-based CLI client called SnowSQL to connect to the data warehouse. It is a separate downloadable and installable utility for executing all queries, including both data definition and data manipulation types.

Connectors. There’s a rich set of connectors and drivers for users to connect to cloud data. Some of them include the Connector for Python for writing Python apps that connect to Snowflake, the ODBC driver for C and C++ development, and the JDBC driver for Java programming.
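As a sketch of what the Connector for Python looks like in practice: the account name, credentials, and warehouse below are placeholders, and actually running the query requires `pip install snowflake-connector-python` plus a live Snowflake account, so the connection attempt is guarded:

```python
# Shape of a Snowflake Connector for Python call. All connection values
# are placeholders; without a live account the except branch runs instead.
params = {
    "account": "my_account",   # placeholder account identifier
    "user": "my_user",         # placeholder credentials
    "password": "***",
    "warehouse": "COMPUTE_WH", # which virtual warehouse runs the query
}
query = "SELECT CURRENT_VERSION()"

try:
    import snowflake.connector  # third-party package; may not be installed
    with snowflake.connector.connect(**params) as con:
        print(con.cursor().execute(query).fetchone())
except Exception as exc:
    print(f"no live connection available: {type(exc).__name__}")
```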

There’s an extensive ecosystem of various tools, extensions, and modules that provide native connectivity to Snowflake.

For data integration purposes, a few popular tools and technologies to natively use with Snowflake include:

  • Hevo Data is an official Snowflake ETL Partner that provides a no-code data pipeline to bring information from various sources to Snowflake in real-time;
  • Apache Kafka software uses a publish/subscribe model to write and read streams of records and is available through the Snowflake Connector for Kafka to load data from its topics; and
  • Informatica Cloud and Informatica PowerCenter are cloud data management tools that work collaboratively with Snowflake.

For machine learning and data science purposes, consider the following platforms:

  • Amazon SageMaker — a cloud machine-learning platform to build, train, and deploy machine learning (ML) models — connects to Snowflake with no additional requirements and
  • Qlik AutoML — a no-code automated machine learning platform — offers a readily available integration, also with no additional requirements.

For BI and analytics purposes, you can choose from a variety of tools and integrations provided by Snowflake, namely:

  • Looker — business intelligence software and big data analytics platform powered by Google Cloud — is validated by the Snowflake Ready Technology Validation Program;
  • Power BI — Microsoft business intelligence platform — can be connected to Snowflake via ODBC driver; and
  • Tableau — one of the leading analytics platforms — is also among Snowflake partner integrations.

The toolset isn’t exhaustive; there are far more technologies that extend Snowflake’s capabilities.

Awesome documentation

Snowflake’s documentation is truly a gem. Neatly organized and well-written, it explains all the aspects of the technology from general concepts of the architecture to detailed guides on data management and more. Whether you are a business user with no tech background or an experienced solution architect, Snowflake has resources for everyone.

Convenient pricing

Unlike traditional data warehouses, Snowflake gives you the flexibility to pay only for what you use. On-demand pricing means you pay based on the amount of data you store and the compute hours you use. Compute resources are charged per second, with a 60-second minimum. It’s worth noting that if a warehouse is idle for some time, it can be auto-suspended so that idle time isn’t billed. The warehouse wakes up only when a query is submitted to it. If you move from a Large to an X-Large warehouse, you double both your compute power and the number of credits billed per hour.
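The billing rule above can be put into a small calculation. The credit rates are assumptions for illustration (taking X-Small as 1 credit per hour, doubling per size); only the per-second metering and the 60-second minimum come from the description above:

```python
# Per-second billing with a 60-second minimum. Credit rates are assumed
# for illustration: X-Small as 1 credit/hour, X-Large as 16 credits/hour.
def credits_billed(seconds_running, credits_per_hour):
    billable = max(seconds_running, 60)        # 60-second minimum applies
    return credits_per_hour * billable / 3600

print(round(credits_billed(10, 1), 4))   # 0.0167 -- 10s billed as 60s on X-Small
print(credits_billed(3600, 16))          # 16.0   -- one full hour on X-Large
```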

The cons of Snowflake data warehouse

The sky’s the limit, right? No matter how great a solution is, there are always some weaknesses that may be critical to customers who are considering certain technology as a part of their stack. Snowflake is no exception.

Challenging data streaming

As we’ve said, Snowflake allows data loading in different formats and from various sources. You can easily load all your relational information as well as semi-structured data such as JSON, Avro, ORC, Parquet, and XML files directly into Snowflake tables. Storage and governance of unstructured data are also rather simple with the newly arrived Data Cloud capabilities. At the same time, building continuous data pipelines requires additional tools such as the abovementioned Snowpipe. So, while data streaming is possible, it doesn’t come as naturally as in Apache Kafka, for example.

No opportunity to run on-premises

Snowflake is designed in the cloud and for the cloud only. All components of Snowflake’s service for both compute needs and for persistent storage of data run in public cloud infrastructures. This means that there are no opportunities for users to deploy Snowflake on private cloud infrastructures (on-premises or hosted).

Hidden costs

While pay-as-you-go pricing is definitely a strong side of Snowflake, the solution may be more expensive than its competitors such as Amazon Redshift. This is due to the fact that Snowflake pricing depends heavily on your usage pattern. According to one Redshift comparison, Redshift is 1.3 times cheaper for on-demand pricing, and even cheaper when purchasing a 1 or 3-year reserved instance in advance.

Difficult bulk data migration

Because of its popularity, Snowflake limits the number of transactions in its system, which means bulk data migration can be a problem. Unless you get special permission, there’s a chance you will be limited to transferring only 10 GB of data a day. And if you need to migrate a 2 TB system, that’s 200 days. So, to avoid prolonging an already long data migration process, it’s in your best interest to process data, get rid of the garbage, and validate each transfer as it happens.
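The 200-day figure is a straightforward back-of-the-envelope estimate, assuming the hypothetical 10 GB/day cap and 1 TB = 1000 GB:

```python
# Back-of-the-envelope check of the migration estimate above,
# assuming a 10 GB/day transfer cap and 1 TB = 1000 GB.
total_gb = 2 * 1000      # a 2 TB system
daily_cap_gb = 10
print(total_gb // daily_cap_gb)  # 200 days
```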

Relatively small community

Compared to its main competitors such as Amazon Redshift and Google BigQuery, Snowflake has a relatively small community of 30,000 members, with only 3,400 users on an unofficial subreddit (the BigQuery subreddit has around 15,000 users). On StackOverflow, the [google-bigquery] tag has nearly 21,000 questions and its own Google Cloud Collective, while Snowflake’s presence is more modest, with about 7,000 tagged questions and no Collective. At the same time, the community is remarkably active and keeps growing. You can become a member of this group of like-minded people by emailing the Snowflake team, and if you have questions, you can fill out a form on the website to be contacted via phone or email.

Cloud-agnostic approach: a weakness?

The cloud-agnostic nature of Snowflake is both an advantage and a disadvantage. On the one hand, there’s no vendor lock-in, and you are free to run Snowflake in the Amazon, Google, or Microsoft public cloud of your choice. On the other hand, each of these public clouds offers its own cloud data warehouse tool: Amazon Redshift, Google BigQuery, and Microsoft Azure SQL DW, respectively. In some cases, choosing a more tightly coupled cloud ecosystem can be more beneficial than going the Snowflake path.

Snowflake alternatives

Having looked at Snowflake architecture and toolset, you probably have a general understanding of whether this solution meets your needs. Overall, Snowflake is a good fit for companies in search of an easily deployed data warehouse with nearly unlimited, automatic scaling and top-notch performance.

In case you think Snowflake doesn’t have exactly what your project requires, you can look at the alternative vendors on the market. Here are a few possible options.

Apache Hadoop is an open-source framework for the distributed processing of large data sets across clusters of computers. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. However, for the sake of scalability to handle big data, systems like Hadoop may compromise on security and the convenience of SQL. Besides, the Hadoop programming environment is quite complex.

Created by the original developers of Apache Spark, Databricks is a cloud platform with a hybrid architecture of a data lake and a data warehouse, known as a data lakehouse. It includes Delta Lake storage and a SQL engine called Databricks SQL Analytics. Just like Snowflake, the platform is equipped with services for all your data and analytics needs, from loading to storage to query processing to machine learning, and so on. It’s not serverless, though. Databricks requires a specialized set of skills to manage and maintain its infrastructure, e.g., experience in configuring Spark, which in turn calls for expertise in Scala and Python.

Amazon Redshift is another alternative to consider. It is also a cloud-based data warehouse designed to tackle Business Intelligence use cases among other things. Redshift is a less serverless offering with a shared-nothing MPP architecture that enables users to run multiple complex analytical queries at the same time. Redshift is a native AWS service built to work in unison with other AWS technologies. So, if you are already using AWS technology, Redshift might be a better option for you. At the same time, the solution comes with more baggage compared to Snowflake because users have to optimize the platform in order to get the most out of it.

For a more detailed comparison of Snowflake’s main competitors, please read our dedicated article comparing major cloud data warehouses including our hero.

How to get started with Snowflake

If you want to implement or migrate to Snowflake, we have prepared a few useful links with information on how to make that happen and what to start with.

Snowflake Documentation. This is a general reference for all the services provided by Snowflake, ranging from the Getting Started info pack explaining how to create an account and start working with it to detailed step-by-step guides on how to use the REST API to access unstructured data.

Snowflake ecosystem of partner integrations. Chances are you are already using some sort of software to work with data. You may follow this link to check integration options. There’s a wide array of third-party partners and technologies that provide native connectivity to Snowflake. They range from data integration solutions to Business Intelligence tools to machine learning and data science platforms, and more.

Pricing page. This link will explain the details of Snowflake pricing plans. At the moment, there are four plans — Standard, Enterprise, Business Critical, and Virtual Private Snowflake. Also, there you will find an informative pricing guide and contacts for Snowflake consultants.

Community forums. Even with a not-so-big community, it is easy to find answers to all general and technical questions related to Snowflake including information about its SQL language, connectors, administration, user interface, and ecosystem. There are 13 Community Groups clustered under major topics on the website. You can visit Snowflake Lab on GitHub or go to the aforementioned StackOverflow or Reddit forums where data experts and enthusiasts share their experiences.

Snowflake University and Hands-on Lab. In its online University, Snowflake provides a variety of courses for people with different levels of expertise: for those who are new to Snowflake and for advanced learners preparing for SnowPro Certification exams, e.g., Zero to Snowflake in 90 Minutes.

YouTube channel. Snowflake has a lot of explanatory, educative, and customer success videos on the official YouTube Channel.
