The Good and the Bad of Snowflake Data Warehouse
Not long ago, setting up a data warehouse — a central information repository enabling business intelligence and analytics — meant purchasing expensive, purpose-built hardware appliances and running a local data center. With the consistent rise in data volume, variety, and velocity, organizations started seeking dedicated solutions to store and process the information tsunami. This demand gave birth to cloud data warehouses that offer flexibility, scalability, and high performance. These days, Snowflake is one of the most popular options that meets these and many other important business requirements.
For everyone who is considering Snowflake as a part of their technology stack, this article is a great place to start the journey. We’ll dive deeper into Snowflake’s pros and cons, its unique architecture, and its features to help you decide whether this data warehouse is the right choice for your company.
Data warehousing in a nutshell
Before we get into Snowflake technology, let’s deal with the key concepts of data warehousing for a common understanding.
The main idea of any data warehouse (DW) is to integrate data from multiple disjointed sources (e.g., CRMs, OLAP/OLTP databases, enterprise applications, etc.) within a single, centralized location for analytics and reporting. Traditionally, it is a relational database that stores all data in tables and allows users to run SQL (Structured Query Language) queries on it.
By the type of deployment, data warehouses can be categorized as
- on-premise — hardware and software are installed locally;
- cloud-based — resources are deployed either in public or private cloud environments; and
- hybrid cloud — the aforementioned capabilities are available under one roof.
Depending on the type and capacities of a warehouse, it can become home to structured, semi-structured, or unstructured data.
- Structured data is highly organized and commonly exists in a tabular format like Excel files.
- Unstructured data comes in all forms and shapes from audio files to PDF documents and doesn’t have a pre-defined structure.
- Semi-structured data is somewhere in the middle, meaning it is partially structured but doesn’t fit the tabular models of relational databases. Examples are JSON, XML, and Avro files.
The data journey from different source systems to a warehouse commonly happens in two ways — ETL and ELT. The former extracts and transforms information before loading it into centralized storage while the latter allows for loading data prior to transformation.
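To make the distinction concrete, here is a minimal Python sketch, with toy in-memory data standing in for real pipelines and warehouse tables:

```python
# Toy source records, as they might arrive from an operational system.
raw_rows = [
    {"name": " Alice ", "amount": "100"},
    {"name": "Bob", "amount": "250"},
]

def transform(row):
    """Clean up a raw record: trim strings, cast numbers."""
    return {"name": row["name"].strip(), "amount": int(row["amount"])}

# ETL: transform first, then load the cleaned rows into the warehouse.
etl_warehouse = [transform(row) for row in raw_rows]

# ELT: load the raw rows first, then transform them inside the warehouse.
elt_warehouse = [dict(row) for row in raw_rows]        # load as-is
elt_warehouse = [transform(row) for row in elt_warehouse]  # transform later

assert etl_warehouse == elt_warehouse  # same end result, different ordering
```

Both paths arrive at the same cleaned data; the difference is where (and when) the transformation work happens.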
These are the basics needed to explore the world of Snowflake and how it works.
What is Snowflake?
Developed in 2012 and officially launched in 2014, Snowflake is a cloud-based data platform provided as a SaaS (Software-as-a-Service) solution with a completely new SQL query engine. As opposed to traditional offerings, Snowflake is a tool natively designed for the public cloud, meaning it can’t be run on-premises. The platform provides fast, flexible, and easy-to-use options for data storage, processing, and analysis. Initially built on top of Amazon Web Services (AWS), Snowflake is also available on Google Cloud and Microsoft Azure. As such, it is considered cloud-agnostic.
Modern data pipeline with Snowflake technology as part of it. Source: Snowflake
Snowflake is close to a serverless offering, meaning as a user you don’t have to select, install, configure, or manage any software or hardware (virtual or physical) except for the number and size of compute clusters (more on this later). Also, Snowflake has a unique architecture that can scale up and down based on requirements and workloads. For example, when the number of queries increases, the system instantly mobilizes more computing resources.
Snowflake use cases
A cloud-based data warehousing and analytics platform, Snowflake can be used for a variety of purposes. Here are some of the key use cases of Snowflake.
Data ingestion. Snowflake handles this with its continuous data ingestion service, Snowpipe. The service enables enterprises to load data as soon as it becomes available in external storage locations like S3 and Azure Blob Storage. With features like auto-ingest and cloud provider notifications, Snowpipe enables seamless and uninterrupted loading of data into tables.
Business intelligence and analytics. Snowflake enables organizations to gain insights from their data through interactive reporting and advanced analytics. The solution’s compatibility with popular business intelligence tools such as QuickSight, Looker, Power BI, and Tableau enhances its ability to provide valuable insights for organizations.
Data sharing and collaboration. Snowflake offers a seamless and secure way for users to share and collaborate on their data via Snowflake Marketplace. The Marketplace is a centralized platform where users can discover and access data assets, such as datasets and data services, that are published by other organizations. The data assets are verified by Snowflake to ensure that they meet certain standards of quality and security. Users can easily discover data assets that are relevant to their needs, compare different offerings, and quickly obtain access to the data they need.
Machine learning. Snowflake also supports machine learning use cases, enabling data scientists and analysts to build, train, and deploy machine learning models within the Snowflake platform. This includes loading, transforming, and managing large datasets, and integrating with popular machine learning libraries such as TensorFlow and PyTorch. Additionally, Snowflake integrates directly with Apache Spark to streamline data preparation and facilitate the creation of ML models. With support for programming languages like Python, R, Java, and C++, Snowflake empowers users to leverage these tools to develop sophisticated ML solutions.
These are some of the key use cases of Snowflake, which makes it a powerful and versatile data platform for organizations of all sizes and types.
Snowflake architecture overview
In most cases, if you want to build a scalable data warehouse, you need massively parallel processing (MPP) to handle multiple concurrent workloads. So, you either use shared-disk or shared-nothing MPP architecture for that.
Shared-disk vs shared-nothing architecture
- The shared-disk architecture uses multiple cluster nodes (processors) that have access to all data stored on a shared memory disk. Nodes have CPU and memory but no disk storage, so they communicate with a central storage layer to get data.
- The shared-nothing architecture stores and processes portions of data on different cluster nodes in parallel and independently. Each node has its own disk storage.
Snowflake combines the benefits of both architectures in its new, unique hybrid design.
Similar to shared-nothing architecture, Snowflake uses MPP-based computing to process queries concurrently with each node locally storing a portion of the entire data. As for the similarity with shared-disk architecture, there is a centralized data repository for a single copy of data that can be accessed from all independent compute nodes. As such, data management is as simple as in shared-disk architecture with performance and scale-out benefits of the shared-nothing architecture.
Snowflake architecture diagram
Snowflake has a multi-cluster, shared-data architecture that consists of three separate tiers, namely
- data storage layer,
- query processing (compute) layer, and
- cloud services (client) layer.
Physically separated but logically integrated, each layer can scale up and down independently, enabling Snowflake to be more elastic and responsive.
To understand how Snowflake works, we’ll walk you through all layers and explain their features.
Database storage layer
The database storage layer, as the name suggests, handles tasks related to secure, reliable, and elastic storage of data that comes from disparate sources. Snowflake supports both ETL/ELT processes to insert data in scheduled bulks or batches. Also, it enables continuous data ingestion from source files in micro-batches, making information available to users almost instantaneously. This feature is available through a separate Snowflake service called Snowpipe.
During loading, data is optimized, compressed, and reorganized into an internal columnar format. It is broken down into so-called micro-partitions. For example, if a table contains transactions, its micro-partitions may correspond to the days of those transactions. Each day is then a separate micro-partition, stored as a separate file.
This optimized data is stored in a cloud object storage such as S3 by AWS, Google Cloud Storage, or Microsoft Azure Blob Storage. Customers can neither see nor access these data objects. They use Snowflake to run SQL query operations.
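Snowflake performs this partitioning internally and automatically, but the idea can be sketched in Python. Here, a toy transactions table is split into per-day groups standing in for micro-partition files (the column names are made up for illustration):

```python
from collections import defaultdict

# A small "transactions" table as a list of rows.
transactions = [
    {"date": "2023-03-01", "amount": 40},
    {"date": "2023-03-01", "amount": 25},
    {"date": "2023-03-02", "amount": 90},
    {"date": "2023-03-03", "amount": 10},
]

# Group rows into per-day "micro-partitions"; in Snowflake, each group
# would be compressed, stored columnar, and written as a separate file.
partitions = defaultdict(list)
for row in transactions:
    partitions[row["date"]].append(row)

assert len(partitions) == 3  # three days of data -> three micro-partitions
```

A query filtering on a single day would then only need to touch one of these files rather than the whole table.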
Query processing layer
Query processing or compute layer provides the means for executing various SQL statements. It consists of multiple independent compute clusters with nodes processing queries in parallel. Snowflake calls these clusters virtual warehouses. Each warehouse is packed with compute resources, such as CPU, memory, and temporary storage required to perform SQL and DML (Data Manipulation Language) operations. Users can:
- retrieve rows from tables,
- load data into tables,
- unload data from tables, and
- delete, update, or insert separate rows in tables, etc.
The warehouses’ size chart with the number of virtual nodes. Source: Snowflake
Virtual warehouses come in ten sizes, from X-Small to 6X-Large; each step up to the next larger size doubles the computing power.
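Because of the doubling, compute power and credits billed per hour follow powers of two; a quick sketch, assuming the X-Small baseline of 1 credit per hour from Snowflake's published size chart:

```python
# Warehouse sizes in ascending order; each step doubles compute and credits.
sizes = ["X-Small", "Small", "Medium", "Large", "X-Large",
         "2X-Large", "3X-Large", "4X-Large", "5X-Large", "6X-Large"]

# Credits per hour: 1 for X-Small, doubling at every step.
credits_per_hour = {size: 2 ** i for i, size in enumerate(sizes)}

assert credits_per_hour["X-Small"] == 1
assert credits_per_hour["X-Large"] == 2 * credits_per_hour["Large"]
```

So moving two sizes up quadruples both the horsepower and the hourly cost, which is worth keeping in mind when sizing warehouses.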
Since warehouses don’t share compute resources with one another, there’s no impact on the performance of other machines if one goes down. Besides, nodes do not store any data, so losing them isn’t critical. If a failure occurs, Snowflake will recreate a new instance in minutes.
Cloud services layer
Cloud services or client layer hosts a bunch of services that coordinate activities across Snowflake. The layer also runs on compute instances provided by Snowflake from different cloud providers. These services pull together all Snowflake components into a cohesive whole.
Services managed in this layer include
- authentication and access control (authenticating users and connections, managing encryption and keys);
- infrastructure management (managing virtual warehouses and storage);
- metadata management (storing metadata and handling queries that can be executed from it); and
- query parsing and optimization.
The pros of Snowflake data warehouse
In this section, you will find things that make Snowflake a real deal and may serve as reasons to consider this cloud data warehouse as a solution.
Adequate security and data protection
With Snowflake, data is highly secure. Users can set regions for data storage to comply with regulatory guidelines such as HIPAA, PCI DSS, and SOC 1 and SOC 2. Security levels can be adjusted based on requirements. The solution has built-in features to encrypt all data at rest and in transit, regulate access levels, and control things like IP allowlists and blocklists.
To achieve better data protection, Snowflake offers two advanced features — Time Travel and Fail-safe. Time Travel gives you an opportunity to restore tables, schemas, and databases from a specific point in time in the past. By default, the Time Travel retention period is one day, but Enterprise Edition users can extend it up to 90 days. The Fail-safe feature allows for protecting and recovering historical data: its 7-day period starts right after the Time Travel retention period ends.
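The way the two windows chain together can be sketched with simple date arithmetic (the function name here is ours for illustration, not a Snowflake API):

```python
from datetime import datetime, timedelta

def recovery_windows(dropped_at, time_travel_days=1, fail_safe_days=7):
    """Return when Time Travel ends and when Fail-safe ends for a dropped object."""
    time_travel_ends = dropped_at + timedelta(days=time_travel_days)
    # Fail-safe's 7-day window starts right after Time Travel retention ends.
    fail_safe_ends = time_travel_ends + timedelta(days=fail_safe_days)
    return time_travel_ends, fail_safe_ends

dropped = datetime(2023, 5, 1, 12, 0)
tt_end, fs_end = recovery_windows(dropped, time_travel_days=90)  # Enterprise maximum

assert fs_end - tt_end == timedelta(days=7)
```

During Time Travel you can restore objects yourself with SQL; once data slips into Fail-safe, only Snowflake support can recover it.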
Great performance and scalability
Thanks to the separated storage and compute, Snowflake has the ability to run a virtually unlimited number of concurrent workloads against the same, single copy of data. This means that multiple users can execute multiple queries simultaneously.
While benchmarks can be configured to perform in a certain way and fit particular use cases, most show great results for Snowflake performance. One benchmark shows that Snowflake is capable of processing 6 to 60 million rows of data in 2 to 10 seconds, which is pretty impressive. Out of the box, Snowflake has what it takes to outperform other cloud warehouse solutions with no prior fine-tuning.
When it comes to scalability, Snowflake has unique auto-scaling and auto-suspend features to start and stop warehouses depending on whether they are active or inactive. For comparison, autoscaling in Redshift is quite limited. Scaling is handled by Snowflake itself and can be both vertical and horizontal. Vertical scaling means the platform adds more compute power to existing virtual warehouses, e.g., upgrading CPUs. With horizontal scaling, more cluster nodes are added.
The virtual warehouse memory is used for caching. When a query is executed, data from different tables in storage is cached by different compute clusters. Then all subsequent queries can use this cache to generate results. With data in the cache, queries run up to 10 times faster.
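A toy illustration of the principle (Snowflake's actual caching is multi-level and far more sophisticated, so treat this as a sketch only):

```python
# Toy result cache: a repeated query is served from cache instead of storage.
query_cache = {}
storage_reads = 0  # counts how often we hit remote storage

def run_query(sql):
    global storage_reads
    if sql in query_cache:           # cache hit: no storage scan needed
        return query_cache[sql]
    storage_reads += 1               # cache miss: read from remote storage
    result = f"rows for: {sql}"      # stand-in for a real table scan
    query_cache[sql] = result
    return result

run_query("SELECT * FROM sales")
run_query("SELECT * FROM sales")     # the second call is served from cache

assert storage_reads == 1
```

The first query pays the cost of fetching data; subsequent identical queries skip the scan, which is where the up-to-10x speedup comes from.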
A really powerful element of the tool is that data stored in Snowflake comes in the form of micro-partitions. These are contiguous units of storage that hold data physically. They are called “micro” because their size ranges from 50 to 500 MB before compression. Micro-partitioning is performed automatically by Snowflake as data is loaded, although users can influence how rows are grouped by defining clustering keys.
How Snowflake stores data in micro-partitions. Source: Snowflake
Within each micro-partition, data is stored in a columnar data structure, allowing better compression and efficient access only to those columns required by a query. As shown in the picture above, 24 rows from the table are stored and sorted in four micro-partitions by columns. Repeated values are stored only once. Let’s imagine that you need data from two different tables for your SQL query to be executed. Instead of copying both tables fully to the compute cluster, Snowflake retrieves only relevant micro-partitions. As a result, the query needs less time to be completed.
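The pruning logic can be illustrated in Python: each micro-partition keeps min/max metadata per column, and a query with a range filter only scans the partitions whose range overlaps the filter (the field names here are hypothetical):

```python
# Three micro-partitions with per-column min/max metadata for "amount".
partitions = [
    {"min_amount": 5,   "max_amount": 40,  "rows": [5, 18, 40]},
    {"min_amount": 50,  "max_amount": 90,  "rows": [50, 66, 90]},
    {"min_amount": 120, "max_amount": 300, "rows": [120, 300]},
]

def scan(parts, lo, hi):
    """Scan only partitions whose [min, max] range overlaps [lo, hi]."""
    scanned = [p for p in parts
               if p["max_amount"] >= lo and p["min_amount"] <= hi]
    rows = [r for p in scanned for r in p["rows"] if lo <= r <= hi]
    return scanned, rows

# A filter like WHERE amount BETWEEN 60 AND 130:
scanned, rows = scan(partitions, 60, 130)
assert len(scanned) == 2  # the first partition is pruned without being read
```

The first partition's metadata (max 40) already rules it out, so it is never read from storage; that metadata-driven skipping is what keeps queries fast on large tables.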
Light learning curve
Many people think that setting up and properly using a data warehouse is a difficult task requiring solid knowledge of different tech stacks. That holds for platforms like Hadoop or Spark, which have a steep learning curve due to completely new syntax. Things are different with Snowflake since it is fully SQL-based. Chances are, you have some experience using BI or data analysis tools that work on SQL, and most of what you already know can be applied to Snowflake. Not to mention that SQL is an easy-to-learn language, a significant benefit for general users. Add to that an intuitive UI that fits the needs of users both with and without coding experience.
Almost zero management
One of the real strengths of Snowflake is the serverless experience it provides. Well, almost serverless, to be exact. As mentioned above, as a user you don’t have to go behind the scenes of Snowflake’s work. The platform handles all the management, maintenance, upgrade, and tuning tasks. It also takes care of all aspects of software installation and updates. And this goes for all types of users — general end users, business analysts, or data scientists.
With almost zero management, you can be up and running in Snowflake in minutes, start loading your data, and derive insights from it. Why “almost” then? You still need to set the number of warehouses and configure the sizes of compute clusters per the warehouse. These tasks require knowledge of SQL as well as an understanding of how data warehouse architecture works. Hence, we can’t say that Snowflake is completely serverless.
Connectors, tools, and integrations
Connectivity is one of the strengths of the platform. Users can access data in a variety of ways, such as the Snowflake Web UI, the SnowSQL command-line client, and a set of connectors and drivers like ODBC and JDBC.
Web UI. A web-based user interface is leveraged to interact with cloud services. Users can manage their account and other general settings, monitor usage of resources, and query data as well.
Command-line interface. Snowflake provides a Python-based CLI client called SnowSQL to connect to the data warehouse. It is a separate downloadable and installable utility for executing all queries, including both data definition and data manipulation types.
Connectors. There’s a rich set of connectors and drivers for users to connect to cloud data. Some of them include the Connector for Python for writing Python apps that connect to Snowflake, the ODBC driver for C and C++ development, and the JDBC driver for Java programming.
There’s an extensive ecosystem of various tools, extensions, and modules that provide native connectivity to Snowflake.
For data integration purposes, a few popular tools and technologies to natively use with Snowflake include:
- Hevo Data is an official Snowflake ETL Partner that provides a no-code data pipeline to bring information from various sources to Snowflake in real-time;
- Apache Kafka software uses a publish/subscribe model to write and read streams of records and is available through the Snowflake Connector for Kafka to load data from its topics; and
- Informatica Cloud and Informatica PowerCenter are cloud data management tools that work collaboratively with Snowflake.
For machine learning and data science purposes, consider the following platforms:
- Amazon SageMaker — a cloud machine learning platform to build, train, and deploy machine learning (ML) models — connects to Snowflake with no additional requirements and
- Qlik AutoML — a no-code automated machine learning platform — is a readily available integration with no requirements.
For BI and analytics purposes, you can choose from a variety of tools and integrations provided by Snowflake, namely:
- Looker — business intelligence software and big data analytics platform powered by Google Cloud — is validated by the Snowflake Ready Technology Validation Program;
- Power BI — Microsoft business intelligence platform — can be connected to Snowflake via ODBC driver; and
- Tableau — one of the leading analytics platforms — is also among Snowflake partner integrations.
The list isn’t exhaustive; there are far more technologies that extend Snowflake’s capabilities.
Well-organized documentation
Snowflake’s documentation is truly a gem. Neatly organized and well-written, it explains all the aspects of the technology from general concepts of the architecture to detailed guides on data management and more. Whether you are a business user with no tech background or an experienced solution architect, Snowflake has resources for everyone.
Flexible pay-as-you-go pricing
Unlike traditional data warehouses, Snowflake gives you the flexibility to pay only for what you use. On-demand pricing means you pay based on the amount of data you store and the compute hours you use. Compute resources are charged per second, with a 60-second minimum. It’s worth noting that if a warehouse is idle for some time, it can be auto-suspended so idle time isn’t billed; the warehouse wakes up only when a query is submitted to it. If you move from a Large to an X-Large warehouse, you double your compute power and the number of credits billed per hour.
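The per-second model with a 60-second minimum can be sketched as follows (using 8 credits per hour for a Large warehouse, per Snowflake's published size chart):

```python
def billed_credits(runtime_seconds, credits_per_hour):
    """Per-second billing with a 60-second minimum per warehouse start."""
    billable_seconds = max(runtime_seconds, 60)  # the minimum kicks in here
    return credits_per_hour * billable_seconds / 3600

# A Large warehouse bills 8 credits per hour:
assert billed_credits(30, 8) == billed_credits(60, 8)  # minimum applies
assert billed_credits(7200, 8) == 16.0                 # two hours -> 16 credits
```

This is why auto-suspend matters: a warehouse that stops when idle only accumulates billable seconds while queries actually run.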
The cons of Snowflake data warehouse
The sky’s the limit, right? No matter how great a solution is, there are always some weaknesses that may be critical to customers who are considering certain technology as a part of their stack. Snowflake is no exception.
Initially, no on-premises option
Snowflake was initially designed in the cloud and for the cloud only. Up until recently, all components of Snowflake’s service for both compute needs and persistent storage of data have run only in public cloud infrastructures. This means that there have been no opportunities for users to deploy Snowflake on private cloud infrastructures (on-premises or hosted). However, in 2022, Snowflake started expanding to on-premises storage, which is good news for users who don’t choose the cloud due to security concerns.
On-demand pricing can be costly
While pay-as-you-go pricing is definitely a strong side of Snowflake, the solution may be more expensive than its competitors such as Amazon Redshift. This is due to the fact that Snowflake pricing depends heavily on your usage pattern. According to one comparison, Redshift is 1.3 times cheaper on-demand, and even cheaper when a 1- or 3-year reserved instance is purchased in advance. But again, costs heavily depend on usage, and such comparisons don’t reflect the complete picture of Snowflake’s pricing, which is transparent and fully accountable.
Relatively small community
No matter how great a certain technology is, it still may prompt inquiries about implementation and problem-solving. And here’s where a big community of experienced users can be an advantage.
Compared to its main competitors such as Amazon Redshift and Google BigQuery, Snowflake has a relatively small community of 30,000 members, with only 3,400 users on an unofficial subreddit (the BigQuery subreddit has around 15,000 users). On StackOverflow, the [google-bigquery] tag has nearly 21,000 questions and a dedicated Google Cloud Collective, while Snowflake’s popularity is more modest, with about 7,000 snowflake-tagged questions and no Collective.
However, it’s important to note that even though a smaller community may be reflected in these metrics, Snowflake’s community is still active and growing. On top of that, Snowflake’s ease of use compared to other solutions might make it less likely for users to run into problems. In case you have questions, you can fill out a form on the website, and the team will contact you via phone or email. You can also join a community of like-minded people by emailing the Snowflake team.
Cloud-agnostic approach: a weakness?
The cloud-agnostic nature of Snowflake can be both an advantage and a disadvantage, depending on a company’s needs. On the one hand, there’s no vendor lock-in and you are free to run Snowflake in the Amazon, Google, and Microsoft public clouds of your choice. On the other hand, each of these public clouds offers its own cloud data warehouse tool: Amazon Redshift, Google BigQuery, and Microsoft Azure SQL DW respectively. In some cases, choosing a more tightly coupled cloud ecosystem can be more beneficial than going the Snowflake path.
Data streaming: Was a weakness, but not anymore
Snowflake has always allowed for easy data loading from various sources and in different formats, including relational information and semi-structured data like JSON, Avro, ORC, Parquet, and XML files. With its newly arrived Data Cloud capabilities, unstructured data storage and governance is also made simple. However, until recently, building continuous data pipelines with Snowflake required additional tools like Snowpipe. With the introduction of Snowflake Streaming, now in private preview, Snowflake is addressing this challenge and bringing Kafka-like speed to streaming ingestion. Snowflake users will soon be able to work with streaming data effortlessly and handle data ingestion through seamless streaming data pipelines.
Snowflake alternatives and competitors
Having looked at Snowflake architecture and toolset, you probably have a general understanding of whether this solution meets your needs. Overall, this solution is a good fit for companies in search of an easily deployed data warehouse with nearly unlimited, automatic scaling and top-notch performance.
In case you think Snowflake doesn’t have exactly what your project requires, you can look at the alternative vendors on the market. Here are a few possible options.
Snowflake vs Apache Hadoop
Apache Hadoop is an open-source framework for the distributed processing of large data sets across clusters of computers. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. However, for the sake of scalability to handle big data, systems like Hadoop may compromise on security and the convenience of SQL. Besides, the Hadoop programming environment is quite complex.
Snowflake vs Databricks
Initially built as a managed platform for Apache Spark, Databricks is a cloud platform with a hybrid architecture combining a data lake and a data warehouse, known as a data lakehouse. It includes Delta Lake storage and a SQL engine called Databricks SQL Analytics. Just like Snowflake, the platform is equipped with services for all your data and analytics needs, from loading to storage to query processing to machine learning, and so on. It’s not serverless though. Databricks requires a specialized set of skills to manage and maintain its infrastructure, e.g., experience in configuring Spark, which in turn calls for expertise in Scala and Python.
Snowflake vs AWS Amazon Redshift
Amazon Redshift is another alternative to consider. It is also a cloud-based data warehouse designed to tackle business intelligence use cases, among other things. Redshift has a shared-nothing MPP architecture that enables users to run multiple complex analytical queries at the same time. As a native AWS service, it is built to work in unison with other AWS technologies, so if you are already using the AWS stack, Redshift might be a better option for you. At the same time, the solution comes with more baggage than Snowflake because users have to optimize the platform to get the most out of it.
For a more detailed comparison of Snowflake’s main competitors, please read our dedicated article comparing major cloud data warehouses including our hero.
How to get started with Snowflake
If you want to implement or migrate to Snowflake, we have prepared a few useful links with information on how to make that happen and what to start with.
Snowflake Documentation. This is a general reference for all the services provided by Snowflake, ranging from the Getting Started info pack explaining how to create an account to detailed step-by-step guides, for example, on using the REST API to access unstructured data.
Snowflake ecosystem of partner integrations. Chances are you are already using some sort of software to work with data. You may follow this link to check integration options. There’s a wide array of third-party partners and technologies that provide native connectivity to Snowflake. They range from data integration solutions to Business Intelligence tools to machine learning and data science platforms, and more.
Pricing page. This link will explain the details of Snowflake pricing plans. At the moment, there are four plans — Standard, Enterprise, Business Critical, and Virtual Private Snowflake. Also, there you will find an informative pricing guide and contacts for Snowflake consultants.
Community forums. Even with a not-so-big community, it is easy to find answers to all general and technical questions related to Snowflake including information about its SQL language, connectors, administration, user interface, and ecosystem. There are 13 Community Groups clustered under major topics on the website. You can visit Snowflake Lab on GitHub or go to the aforementioned StackOverflow or Reddit forums where data experts and enthusiasts share their experiences.
Snowflake University and Hands-on Lab. In its online University, Snowflake provides a variety of courses for people with different levels of expertise: for those who are new to Snowflake and for advanced learners preparing for SnowPro Certification exams, e.g., Zero to Snowflake in 90 Minutes.
YouTube channel. Snowflake has a lot of explanatory, educative, and customer success videos on the official YouTube Channel.