Choosing the Right Data Store: A Comprehensive Overview of Cloud Platforms
The growth of data necessitates effective storage, organization, and analysis to extract its full value and make informed decisions. But choosing the right data store for a company can be challenging due to the options available – there are numerous platforms to consider for cloud storage, such as Cosmos DB, Azure SQL Database, Postgres, MySQL Services, Databriks, Redshift, RDS, etc.
While opting for cloud storage enhances accessibility, scalability, cost-effectiveness, and data security, the question remains: which platform should you choose?
This comprehensive guide delves beyond a simple overview of cloud platforms to explore the strategic approach of storing different data types in specialized data stores. To optimize the overall data architecture, we draw upon the extensive expertise of Denny Cherry, Owner and Principal Consultant at Denny Cherry & Associates Consulting in San Diego, California.
This overview encapsulates Denny Cherry’s speech at the Big Data Fest 2023, where he shared indispensable advice on selecting the ideal data platform for storing data. We present the key points of his speech, shedding light on the advantages and disadvantages of various options. With concise summaries, you will gain a clear understanding of which platform best suits your specific needs.
So, what exactly is the cloud?
The cloud refers to a network of remote servers accessed via the Internet for storing, managing, and processing data instead of relying on local storage or computing resources. It enables individuals and organizations to access computing resources on-demand, including storage space, processing power, and software applications, without needing physical infrastructure or maintenance.
Various cloud platforms are available, with Microsoft Azure and Amazon being prominent. Each platform offers a multitude of options for data storage. For instance, Microsoft Azure provides Azure SQL Database, Managed Instance, Storage Tables, or Blobstore. On the AWS side, there are choices such as virtual machines, RDS, Redshift, and other AWS-specific data platforms.
Making the correct choice for data storage solutions becomes vital when transitioning from on-premises solutions to the cloud and when developing applications. Let’s explore the different options in detail.
Virtual Machines
Virtual machines (VMs) provide significant advantages when storing data for a database platform. From a SQL Server perspective (which applies to most other database platforms), opting for a VM offers complete control over the operating system and the database instance. Unlike platform-as-a-service (PaaS) offerings, where the vendor handles patching and maintenance on their schedule, VMs allow precise control over specific builds and patch levels.
For example, suppose a vendor’s application requires a particular version of SQL Server or Oracle SQL with no support for newer versions. In that case, VMs empower you to install and control that specific build or patch level. Additionally, VMs provide flexibility in setting up high availability (HA) and disaster recovery configurations, although building these configurations is required. This level of control extends to deploying specific versions of the database and operating system software, enabling compatibility with vendor requirements.
Running VMs allows you to schedule operations based on your needs, unlike PaaS services that adhere to cloud providers’ patching schedules. This feature ensures updates maintain your business operations, particularly when dealing with different time zones.
VMs offer fine-grained control over firewall rules, enabling precise control over communication between different components. Unlike PaaS platforms, which may limit storage capacity, VMs provide ample storage options, ranging from terabytes to hundreds of terabytes, depending on the cloud provider. In a virtual machine, you can allocate storage on specific servers or storage devices, which is impossible with PaaS offerings.
Furthermore, utilizing reserved instances allows you to commit to running a specific VM size for a defined duration, typically one or three years, resulting in discounted pricing—usually around 30% to 40% off the regular rate. This presents a cost-saving opportunity for businesses.
Summary. Virtual machines offer extensive control over the entire stack, including the operating system and database instance, while enabling flexibility, customization, and cost savings, making them a preferred choice for specific scenarios and requirements.
Azure SQL Database
Azure SQL Database offers unique advantages that set it apart from other cloud providers. As the owner of the SQL Server source code, Microsoft can leverage this advantage to provide capabilities other providers cannot match. For example, AWS’s RDS is limited by the underlying SQL Server technology, whereas Azure SQL Database can introduce innovative features specific to their cloud implementation.
Azure SQL Database encompasses four flavors: SQL Server, MySQL, PostgreSQL, and MariaDB. While Postgres and MariaDB have different ownership, they function similarly to managed instances. From a branding perspective, all these options fall under the SQL Database umbrella.
Azure SQL DB is a cloud-hosted and database-hosted service, differing from instance-hosted services like RDS or SQL Managed Instance. It allows the configuration of databases as either single databases or within elastic pools. Elastic Pools enable the bundling of multiple databases, facilitating resource sharing, including computing, memory, and billing. With Azure SQL, thousands of databases can reside on a single namespace, allowing extensive scalability.
Comparatively, Managed Instance and Amazon’s RDS have limitations, accommodating only up to 100 databases per namespace. This restricts scalability options in these platforms. To achieve optimal performance and scale in the cloud, application redesign is recommended rather than merely migrating on-premises databases. For instance, a payroll provider in the US successfully migrated to Azure SQL Database by breaking their system into individual databases per client, resulting in a massive scale. They could scale up individual databases or consolidate smaller databases into elastic pools, reducing costs and improving performance.
Horizontal scaling is a recommended approach for scaling applications, allowing the creation of numerous small databases. This approach facilitates limitless scalability and is particularly beneficial for massive-scale applications. Azure SQL Database eliminates the need to purchase hardware, offloading that responsibility to Microsoft. Additionally, the platform supports serverless mode, allowing automatic CPU scaling without downtime. This feature is helpful for scenarios where resource demands fluctuate, such as generating reports during specific periods.
Another notable feature of Azure SQL Database is hyper-scale, which addresses the limitation of database size. Hyperscale enables databases to scale to much larger sizes. However, it is currently available only in specific Azure regions.
Summary. Azure SQL Database offers distinct advantages through Microsoft’s ownership of the database source code. Its flexibility, scalability, support for elastic pools, serverless mode, and hyper-scale capabilities make it a compelling choice for cloud-based database solutions.
Amazon RDS / Azure Managed Instance
These two services share similarities, as Amazon RDS is comparable to Azure Managed Instance. Both platforms provide SQL Server instances in the cloud, offering various features. However, Azure SQL Database is a distinct database-level feature, lacking the ability to perform cross-database queries.
In contrast, Managed Instance (MI) and RDS allow for cross-database queries, SQL Server Agent functionality, and the establishment of link servers to other database platforms. However, a limitation of 100 databases per RDS instance or managed instance may pose a downside. In Azure’s managed instance environment, sizing limitations can be encountered, although the product team is actively working on addressing these limitations.
Both RDS and Managed Instance excel in supporting online transaction processing (OLTP) databases. They are ideal choices when SQL Database (SQLDB) cannot accommodate specific requirements or lacks certain features. If you already use AWS or need features not supported by SQLDB, RDS or Managed Instance would be suitable alternatives.
While these platforms offer excellent scale-up functionality, the 100 database limit poses a constraint. As you approach this limit, creating additional instances becomes necessary to accommodate more databases. For example, suppose you have an Azure environment with an Azure Managed Instance. You can leverage cross-database queries, SQL Server Agent, and link servers to other platforms. However, if you anticipate reaching the 100 database limit, you can spin up another instance and distribute databases across multiple instances to ensure scalability.
Summary. Amazon RDS and Azure Managed Instance provide powerful options for hosting SQL Server instances in the cloud. They offer cross-database querying and other advanced features, making them suitable for OLTP databases. However, the 100 database limit may require the creation of additional instances to accommodate growing database needs.
Cosmos DB
The next choice to store the data is an Azure-specific option called Cosmos DB. This platform offers auto-scaling capabilities and supports multiple database engines within a unified environment.
A significant feature Cosmos DB offers is multi-master write, which allows writing to any database copy and automatically distributing the changes to other copies. This functionality addresses a long-standing desire for relational database platforms. Moreover, Cosmos DB enables the deployment of multi-master workloads across multiple regions within the Microsoft Azure environment, not just within a single region. To achieve optimal performance within Cosmos DB, data is partitioned.
Cosmos DB is currently limited to running exclusively on Azure as no options are available for running it on-premises or on AWS. However, Azure Arc, a feature provided by Azure, enables the management of on-premises, AWS, PCP, and other cloud provider systems within the Azure portal. This suggests a future scenario where Cosmos DB could be deployed on an Azure Arc-managed virtual machine in AWS by installing Azure Arc software on that virtual machine.
One of the remarkable aspects of Cosmos DB is its ability to scale infinitely. It can handle an ever-increasing volume of data, dynamically scaling storage and computing resources as needed. For instance, if you leverage Cosmos DB in your Azure environment, you can benefit from its multi-master capabilities, distributed workload across regions, and seamless scalability to accommodate growing data demands. However, Cosmos DB is currently limited to running exclusively on the Azure platform.
Summary. Azure’s Cosmos DB is a fascinating platform that offers auto-scaling and multi-master write capabilities. It supports multiple database engines and provides a wealth of functionality within a single environment. While Cosmos DB is currently restricted to Azure, the potential for future integration with Azure Arc may allow for deployment in other environments, such as AWS. The scalability of Cosmos DB is virtually limitless, making it an attractive choice for applications with expanding data requirements.
Synapse
Azure Synapse is another powerful feature offered by Azure for storing data. It is Microsoft’s counterpart to AWS Redshift, a data warehousing solution. If you’re familiar with previous names like SQLBW or APS, they are essentially the same platform that has evolved. Initially acquired by Microsoft in 2008 from a company based in Southern California, Azure Synapse was transformed from a physical appliance that could be deployed in on-premises data centres to a cloud-based service within Azure.
Azure Synapse is particularly well-suited for data warehousing purposes, but it is not recommended for OLTP (Online Transaction Processing) databases. To fully harness the benefits of Synapse, a database redesign is often necessary. Simply migrating an existing data warehouse into Synapse without modifying it can result in poor performance. For example, you have a large-scale data warehousing project with complex analytics requirements. In that case, Azure Synapse can provide an excellent solution to consolidate and analyze data efficiently. However, if you attempt to load your existing data warehouse into Synapse without optimizing it for the platform, you may experience significant performance issues.
Summary. Azure Synapse is a powerful tool for data warehousing purposes, offering a competitive alternative to AWS Redshift. While it may not suit OLTP databases, a well-planned redesign can unlock its full potential for data warehousing and analytics projects.
Nataliia Rekunenko