Big Data in the Cloud
So you have decided that you’re ready to implement a big data solution and you plan to do it in the cloud. How will you choose what’s best for your organization? With all of the hoopla around big data, NoSQL and the like, this post provides some background and reviews the options offered by three leading cloud providers: Amazon, Microsoft and Google.
Database Types
Many are familiar with relational databases (RDBMS) and NoSQL databases, but NoSQL databases actually come in several flavors. Amazon’s information sheet on NoSQL covers the following types[1] (descriptions also taken from that page):
- RDBMS: The relational model normalizes data into tabular structures known as tables, which consist of rows and columns. A schema strictly defines the tables, columns, indexes, relationships between tables, and other database elements.
- Columnar Databases: Columnar databases are optimized for reading and writing columns of data as opposed to rows of data.
- Document Databases: Document databases are designed to store semi-structured data as documents, typically in JSON or XML format.
- Graph Databases: Graph databases store vertices and directed links called edges. Graphs can be built on relational (SQL) and non-relational (NoSQL) databases.
- Key-Value Stores: In-memory key-value stores are NoSQL databases optimized for read-heavy application workloads (such as social networking, gaming, media sharing and Q&A portals) or compute-intensive workloads (such as a recommendation engine).
There are also hybrid databases, such as Google’s Cloud Spanner, which provide the structure and consistency of relational databases with the horizontal scalability of NoSQL databases[2].
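To make the database types listed above more concrete, here is a minimal Python sketch that lays out the same small dataset in each shape. The sample orders, the field names and the `build_layouts` helper are all hypothetical and chosen only for illustration; real products add indexing, storage engines and query languages on top of these basic shapes.

```python
import json

# Hypothetical sample data, used only to compare layouts.
SAMPLE_ORDERS = [
    {"order_id": 1, "customer": "alice", "total": 30.0, "items": ["book", "pen"]},
    {"order_id": 2, "customer": "bob", "total": 12.5, "items": ["mug"]},
]


def build_layouts(orders):
    """Return the same orders arranged as five different data models."""
    flat_columns = ("order_id", "customer", "total")

    # Relational (row-oriented): each record is one tuple in a table with
    # a fixed schema. The nested item list would be normalized into a
    # separate table, so it is left out of this row.
    row_store = [tuple(o[c] for c in flat_columns) for o in orders]

    # Columnar: the values of each column are kept together, so an
    # aggregate over one column does not need to read the others.
    column_store = {c: [o[c] for o in orders] for c in flat_columns}

    # Document: each record is a self-contained, semi-structured JSON
    # document that can carry nested fields such as the item list.
    documents = [json.dumps(o) for o in orders]

    # Key-value: an opaque value looked up by a single key.
    key_value = {f"order:{o['order_id']}": json.dumps(o) for o in orders}

    # Graph: vertices plus directed, labeled edges between them.
    vertices = set()
    edges = []
    for o in orders:
        order_vertex = f"order:{o['order_id']}"
        vertices.update({o["customer"], order_vertex})
        edges.append((o["customer"], "PLACED", order_vertex))

    return {
        "row_store": row_store,
        "column_store": column_store,
        "documents": documents,
        "key_value": key_value,
        "graph": {"vertices": vertices, "edges": edges},
    }


layouts = build_layouts(SAMPLE_ORDERS)
# A columnar aggregate touches only the "total" column.
total_revenue = sum(layouts["column_store"]["total"])
```

The point of the sketch is the access pattern each shape favors: whole records for the row store, single columns for the columnar store, nested records for documents, lookups by key for the key-value store, and relationship traversal for the graph.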
Cloud Provider Offerings
* Using columnstore index
** Able to use 3rd party graph databases with providers’ storage/DB offerings
Choosing a Solution
There are numerous criteria that one can use in choosing a cloud-based big data solution. It is important to choose a solution that aligns with your most important criteria. Here are a few to consider:
- OLTP vs OLAP: Database structure, layout, indexes and other optimizations are usually focused on supporting either large numbers of inserts and updates (OLTP) or large and complex queries (OLAP).
- ACID Transactions: Traditional relational databases guarantee ACID (atomicity, consistency, isolation, durability) transactions. Many NoSQL databases offer higher performance and scalability in exchange for relaxing ACID constraints, usually consistency: a read may briefly lag behind the most recent write.
- Scalability: Scalability comes in many forms, including the number of data elements, the number of kinds of data, concurrent updates and concurrent queries.
- Integration Support: What tools and applications will you use to interact with the database? Are you fitting existing applications and queries to a new database? Are there 3rd party tools and applications that need direct access to the database?
- Lock-In: It’s important to look into the future and prepare for the possibility that you will change cloud providers or that your application will morph to support new use cases (especially if you’re building the database for a startup). You may choose to optimize portability over some of the other criteria if it’s likely that your solution will change over time.
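Two of the criteria above, ACID transactions and the OLTP/OLAP split, are easier to see in a small example. The sketch below uses Python's built-in `sqlite3` module as a local stand-in for a relational database; it is not any of the cloud services discussed here. The `accounts` and `sales` tables, their contents and the helper names are hypothetical.

```python
import sqlite3


def make_demo_db():
    """Create an in-memory database with two small, hypothetical tables."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE accounts (id TEXT PRIMARY KEY, balance INTEGER)")
    conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
    conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("a", 100), ("b", 50)])
    conn.executemany(
        "INSERT INTO sales VALUES (?, ?)",
        [("east", 10), ("east", 20), ("west", 12)],
    )
    conn.commit()
    return conn


def transfer(conn, src, dst, amount):
    """OLTP-style write: move `amount` from src to dst atomically.

    Returns True if the transfer committed, False if it was rolled back.
    """
    try:
        # Used as a context manager, the connection commits if the block
        # succeeds and rolls back if it raises, so both updates are
        # applied together or not at all (atomicity).
        with conn:
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE id = ?",
                (amount, src),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE id = ?",
                (amount, dst),
            )
            (balance,) = conn.execute(
                "SELECT balance FROM accounts WHERE id = ?", (src,)
            ).fetchone()
            if balance < 0:
                raise ValueError("insufficient funds")
        return True
    except ValueError:
        return False


def balances(conn):
    """Return the current balance of every account as a dict."""
    return dict(conn.execute("SELECT id, balance FROM accounts"))


def revenue_by_region(conn):
    """OLAP-style read: scan and aggregate many rows in one query."""
    return conn.execute(
        "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
    ).fetchall()
```

Calling `transfer(conn, "a", "b", 500)` on the demo data returns `False` and leaves both balances unchanged, because the debit and the credit are rolled back together. A database that relaxes these guarantees may not offer that behavior across multiple records, which is worth checking against your workload before choosing one.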
Have you already selected a cloud big data solution? Please share your experience with the group!
References
[1] https://aws.amazon.com/nosql/
[2] https://cloud.google.com/spanner/
Amazon RDS: https://aws.amazon.com/rds/
Amazon Redshift: https://aws.amazon.com/redshift/
Amazon DynamoDB: https://aws.amazon.com/dynamodb/
Azure SQL Database: https://azure.microsoft.com/en-us/services/sql-database/
Azure CosmosDB: https://azure.microsoft.com/en-us/services/cosmos-db/
Azure TableStorage: https://azure.microsoft.com/en-us/services/storage/tables/
Google Cloud SQL: https://cloud.google.com/sql/
Google Cloud BigQuery: https://cloud.google.com/bigquery/
Google Cloud Datastore: https://cloud.google.com/datastore/
Google Cloud BigTable: https://cloud.google.com/bigtable/
Google Cloud Spanner: https://cloud.google.com/spanner/
2 comments on “Big Data in the Cloud”
Thanks for the very informative blog. It sheds light on the big data services available from three major cloud providers in today’s market. I personally found BigQuery, Google’s fully managed, low-cost analytics data warehouse, particularly interesting. Since it is serverless, there is no infrastructure to manage, no need to guess the needed capacity or over-provision, and you don’t need a database administrator.
The interesting part is that Google offers a serverless analytics platform that is a one-stop solution covering ingestion, data preparation, storage and analysis. This lets us focus on analyzing data to find meaningful insights using familiar SQL while taking advantage of the pay-as-you-go model. This platform integrates:
1. Google BigQuery
2. Google Cloud Dataproc (a managed Spark and Hadoop service for easily processing big datasets using the powerful and open tools in the Apache big data ecosystem)
3. Google Cloud Pub/Sub (large-scale, reliable, real-time asynchronous messaging)
4. Google Cloud Dataflow (a unified programming model and a managed service for executing a wide range of batch and stream data processing patterns, including streaming analytics, ETL and batch computation)
5. Cloud Storage (to store any of our files)
6. Cloud Bigtable (a NoSQL database)
Source: https://cloud.google.com/solutions/big-data/
Hi Ramdev, thanks for the update and for pointing out additional benefits and features of BigQuery and its supporting technologies. I agree with your points that it removes some of the ‘headaches’ of other approaches and is easier to manage than a traditional database. Have you used any of these technologies before? We’re actually considering migrating some of our existing infrastructure and I’m interested in learning about others’ experiences.