A Study On Distributed Database Systems Information Technology Essay

Cloud computing is a technology that involves delivering hosted services which uses internet and remote server machines to maintain applications and data. It has gained much popularity in the recent years and many companies offer a variety of services like Amazon EC2, Google “Gov Cloud” and so on. Many startups are adopting the cloud as their sole viable solution to achieve scale. While predictions regarding cloud computing vary, most of the community agrees that public clouds will continue to grow in the number and importance of their tenants [4]. Having said that, there is an opportunity for rich data-sharing among independent web services that are co-located within the same cloud. We also expect that a small number of giant-scale shared clouds (such as Amazon AWS) will result in an unprecedented environment where thousands of independent and mutually distrustful web services share the same runtime environment, storage system, and cloud infrastructure. With the development of distributed system and cloud computing, more and more applications might be migrated to the cloud to exploit its computing power and scalability. However most cloud platforms don’t support relational data storage for the consideration of scalability. To meet the reliability and scaling needs, several storage technologies have been developed to replace relational database for structured and semi-structured data, e.g. BigTable [3] for Google App Engine, SimpleDB [10] for Amazon Web Service (AWS) and NOSQL solutions such as Cassandra [8] and HBase [9]. As a result, relational database of existing applications have started transforming to cloud-based databases so that they can efficiently scale on such platforms. In this paper, we will look in detail how this alternative database architecture can truly scale in the cloud while giving a single logical view across all data.

1 Introduction

In the recent days, a lot of new non-relational databases have cropped up in public and private clouds. We could infer one key message from this trend: “Go for a non-relational database if you want on-demand scalability”. Is this a sign that relational databases have had their day and will decline over time? Relational databases have been around for over thirty years. During this time, several so-called revolutions flared up all of which were supposed to spell the end of the relational database. All of those revolutions fizzled out and none even made a dent in the dominance of relational databases. In the coming sections, we will see how this situation changed, the current trend of moving away from relational databases and what this means for the future of relational databases. We will also look at a couple of distributed database models that sets the trend.

Before we can take a look at the way this paper is structured, let me give a brief overview on the list of references used in this paper in the order it is listed in the reference section. “Distributed Database management Systems” [1], a text book written by Saeed and Frank addresses the distributed database theory in general. The chapters in this book explore various issues and aspects of a distributed database system and further discusses various techniques and mechanisms that are available to address these issues. Reference [2] is a URL reference that talks about the next generation distributed databases. It mostly addresses the points of the distributed databases as being non-relational, distributed, open-source and horizontal scalable and so on. The web page also has links (URLs) to each and every distributed database model it has discussed. “Bigtable: a distributed storage system for structured data” [3] is a paper reference that discusses the most popular distributed storage system ‘BigTable’ on which many distributed databases like HBase and Cassandra have based their data models on. Reference 4 is an URL reference that predicts ten cloud computing developments expected to see next year for both cloud service providers and enterprise users. Reference [5] is an URL reference that talks about the history of SQL and the recent rise of alternatives to relational databases, the so called “NoSQL” data stores. “Automatic Configuration of a Distributed Storage System” [6] is a paper reference that presents an approach of automatically configuring a distributed database by proposing a design that supports an administrator to correctly configure such a system in order to improve application performance. They intend to design a software control loop that automatically decides how to configure the storage system to optimize the performance. “The Hadoop distributed File system (HDFS)” [7] is a paper reference that presents a distributed file system with a framework for analysis and transformation of very large data sets using its MapReduce paradigm. HBase (a distributed database model which we would be discussing later in this paper) uses HDFS as its data storage engine. “Cassandra-A Decentralized Structured Storage system” [8] is a paper reference that presents the Cassandra distributed database model. “HBase and Hypertable for large scale distributed storage systems” [9] is a paper reference that provides a view on the capabilities of each of these implementations of BigTable, which helps those trying to understand their technical similarities, differences, and capabilities. “SimpleDB: a simple Java-based multiuser system for teaching database internals” [10] is a paper reference that presents the architecture of SimpleDB. References [11] and [12] are URL references that point to the definitions of CAP theorem and MapReduce. “MapReduce: Simplified Data Processing on Large Clusters” [13] is a paper reference that presents a programming model for processing and generating large datasets. It uses the ‘map’ and the ‘reduce’ functions to partition a job into sub-jobs and later combine their results in some way to get the output. Many real world tasks are expressible in this model. Let us now take a look at the paper structure.

This paper is structured as follows. Section 2 talks about various features of relational databases and its drawbacks that led to a distributed model. Section 3 presents the distributed database model, its types, with benefits and drawbacks. In section 4, we present the importance of big data in the future of Information technology. Section 5 compares the data model and features of Hadoop HBase and Cassandra in detail. Finally, section 6 concludes with my opinion on distributed databases and the future of cloud computing.

2 What is a Relational Database?

A relational database is essentially a group of tables or entities that are made up of rows (also called as record or tuple) and columns. Those tables have constraints and one can define relationships between them. Relational databases are queried using Structured Query Language (SQL), and result sets are produced from the queries that access data from one or more tables. Multiple tables being accessed in a single query are “joined” together, typically by a criterion defined in the table relationship columns. One of the data-structuring models used with relational databases that removes data duplication and ensures data consistency is ‘Normalization’.

Relational databases are facilitated through Relational Database Management Systems (RDBMS) which is a DBMS that is based on the relational model. Almost all database management systems we use today are relational, including those of SQL Server, SQLite, MySQL, Oracle, Sybase, TeraData, DB2 and so on. The reasons for the dominance of relational databases are not trivial. They have continually offered the best mix of simplicity, flexibility, robustness, performance, compatibility and scalability in managing generic data. Relational databases have to be incredibly complex internally to offer all of this. For example, a relatively simple SELECT statement could have hundreds of potential query execution paths, which the optimizer would evaluate at run time. All of this is hidden to us as users, but under the cover, RDBMS determines the “execution plan” that best answers our requests by using things like cost-based algorithms.

2.1 Drawbacks of Relational Databases

Though Relational Database Management Systems (RDBMS) have provided users with the best mix of simplicity, flexibility, robustness, performance, compatibility and scalability, their performance and rate of acceptance in each of these areas is not necessarily better than that of an alternate solution pursuing one of these benefits in isolation. It has not been much of a problem so far because the universal dominance of the Relational DBMS has outweighed the need to push any of these boundaries. Nonetheless, if we really had a need that couldn’t be answered by any of these generic relational databases, alternatives have always been around to fill those niches.

We are in a slightly different situation today. One of these benefits mentioned above is becoming more and more critical for an increasing number of applications. While still considered a niche, it is rapidly becoming a mainstream. That benefit is ‘Scalability’. As more and more DB applications are launched in environments that have massive workloads, (such as web services) their scalability requirements can ‘change’ very quickly and ‘grow’ very large. The former scenario can be difficult to manage if you have a relational database sitting on a single in-house data server. For example, if your server load doubles or triples overnight, how quickly do you think you can upgrade your hardware? The later scenario in general can be too difficult to manage with a relational database.

When run on a single server node, relational databases scale well. But when the capacity of that single node is reached, the system must be able to scale out and distribute that load across multiple server nodes. This is when the complexity of relational databases starts to rub against their potential to scale. And the complexities become overwhelming when we try scaling to hundreds or thousands of nodes. The characteristics that make relational database systems so appealing drastically reduce their viability as platforms for large distributed systems.

2.2 Towards Distributed Database

Addressing this limitation in private and public clouds became necessary as cloud platforms demand high levels of scalability. A cloud platform without a scalable data store is not much of a platform at all. So, to provide real-time customers with a scalable place to store data, cloud providers had only one real option, which was to implement a new type of database system that focuses on scalability, at the expense of the other benefits that come with relational databases. These efforts have led to the rise of a new breed of database management system termed as ‘Distributed’ DBMS. In the upcoming sections, we will take a deep look at these databases.

3 Distributed Databases – The New Breed

In the book ‘Distributed Database Systems’ [1] authors Saeed and Frank define Distributed database as follows – “A Distributed DBMS maintains and manages data across multiple computers. A DDBMS can be thought of as a collection of multiple, separate DBMSs, each running on a separate computer, and utilizing some communication facility to coordinate their activities in providing shared access to the enterprise data. The fact that data is dispersed across different computers and controlled by different DBMS products is completely hidden from the users. Users of such a system will access and use data as if the data were locally available and controlled by a single DBMS.”

As opposed to Distributed DBMS, a centralized DBMS is a software that allows an enterprise to access and control its data on a single machine or server. It has become common for the organizations to own more than one DBMS which did not come from the same database vendor. For example, an enterprise could own a mainframe database that is controlled by IBM DB2 and a few other databases controlled by Oracle, SQL Server or other vendors. Most of the time, users access their own workgroup database. Yet, sometimes, users access data in some other division’s database or even in the larger enterprise-level database. As we can see now, the need to share data that is scattered across an enterprise cannot be satisfied by centralized Database management system software leading to a distributed management system.

3.1 Homogeneous Vs Heterogeneous databases

In recent years, the distributed database management system (DDBMS) has been emerging as an important area in data storage and its popularity is increasing day by day. A distributed database is a database that is under the control of a central DBMS in which not all hardware storage devices are attached to a single server with common CPU. It may be stored on multiple machines located in the same

physical location, or may be dispersed over a network of interconnected machines. Collections of data (e.g., database table rows/columns) can be distributed across multiple machines in different physical locations. In short, a distributed database is a logically interrelated collection of shared data, and a description of this data is physically distributed over the network (or cloud).

A distributed database management system (DDBMS) may be classified as homogeneous or heterogeneous. In an ideal DDBMS, the sites would share a common global schema (although some relations may be stored only at some local sites), all sites would run the same database management software or model and each site is aware of the existence of other sites in that network. In a distributed

system, if all sites use the same DBMS model, it is called a homogenous distributed database system. However, in reality, a distributed database system has to be constructed by linking multiple machines running already-existing database systems together, each with its own schema and possibly running different database management software. Such systems are called heterogeneous distributed database

systems. In a heterogeneous distributed database system, sites may run different DBMS software that need not be based on the same underlying data model, and thus, the system may be composed of relational, hierarchical, network and object-oriented DBMSs (these are various types of database models).

Homogeneous distributed DBMS provides several advantages such as ease of designing, simplicity and incremental growth. It is much easier to design and manage than a heterogeneous system. In a homogeneous distributed DBMS, making the addition of a new site to the system is much easier, thereby providing incremental growth in the site. On the other hand, heterogeneous DDBMS are flexible

in the way that it allows sites to run different DBMS softwares with its own model. The communications between different DBMSs are required for translations which might add an additional overhead to the system.

3.2 Relational Vs Non-relational databases

3.2.1 Scalability

Relational databases usually reside on one server or a machine. Scalability here can be achieved by adding more processors (maybe with advanced capabilities), adding hardware memory and storage. Relational databases do reside on multiple servers and synchronization in this case is achieved by using replication techniques. On the other hand, ‘NoSQL’ databases (term used to refer distributed/non-relational databases) also can reside on a single server or a machine but more often are designed to work across a cloud or a network of servers.

3.2.2 Data formats

Relational databases are comprised of a collection of rows and columns in a table and/or view which includes a fixed schema and join operations. ‘NoSQL’ databases often store data in a combination of key and value pairs also called as Tuples. They are schema free and have an ordered list of elements.

3.2.3 Dataset Storage

Relational databases almost always reside on a secondary disk drive or a storage area. When required, a collection of database rows or records are brought into main memory with stored procedure operations and SQL select commands. In contrast, most of the NoSQL databases are designed to exist in main memory for speed and can be persisted to disk.

3.3 Components of Distributed Database

A Distributed DBMS controls the storage and efficient retrieval of logically interrelated data that are physically distributed among several sites. Therefore, a distributed DBMS includes the following components.

Workstation machines (nodes in a cluster) – A distributed DBMS consists of a number of computer workstation machines that form the network system. The distributed database system must be independent of the computer system hardware.

Network components (hardware/software) – Each workstation in a distributed system contains a number of network hardware/software components. These components allow each site to interact and exchange data with each other site.

Communication media – In a distributed system, any type of communication or information

exchange among nodes is carried out through the communication media. This is a very important component of a distributed DBMS. It is desirable that a distributed DBMS be communication media independent. (i.e.) it must be able to support several types of communication media.

Transaction processor – A Transaction processor is a software component that resides in each workstation (computer) connected with the distributed database system and is responsible for receiving and processing local/remote applications’ data requests. This component is also known as the transaction manager or the application manager.

Data processor – A data processor is also a software component that resides in each workstation (computer) connected with the distributed system and stores/retrieves data located at that site. This is also known as the data manager. In a distributed DBMS, a centralized DBMS may act as a Data Processor.

Now, a number of questions might arise while we research how distributed DBMS could be deployed in a company. What disadvantages and advantages will come with this new kind of architecture? Can we reuse the existing hardware component efficiently with this new setup? What vendors can one choose to go with? What kind of code changes will this introduce to the database developers in the company and what kind of cultural shocks will they see? Let’s take a look at its drawbacks and benefits.

3.4 Drawbacks of Distributed DBMS

The biggest disadvantage in deploying distributed DBMS in most of the scenarios would be the code and the culture change. Developers are used to writing complex code in SQL that connects to a database and executes a stored procedure that lives in the database. Introducing this new distributed architecture would completely change their routine and environment. The stored procedures are usually written in Java or another JIT language. This essentially means a rewrite of much of the Java code infrastructure. Building on that, they also introduce additional complexities into the existing architecture. This potentially means more cost in work labor.

Companies have also invested a lot of time and money into the hardware components and architecture they have today. Introducing the concept of commodity hardware into the existing system could cause both political and economical backlashes. There are also solution-specific problems that are introduced on top of this. The concept of Distributed DBMS itself is rather new and the lack of options and experience among the developers and customers raises concern. This also means there’s a lack of standards to follow. I would also be concerned with how the vendor treats concurrency control among the nodes. This is one of the major considerations in distributed environments.

3.5 Benefits of Distributed DBMS

The first benefit that I see in this approach is that it is extremely scalable. With distributed database expansion will be easier while not sacrificing reliability or availability. If one needs to add more hardware, they should be able to pause a node’s operations, copy it over and bring up the new hardware at a very high-level without worrying much about the rest of the system.

Secondly, this setup doesn’t require high-end massive machines that are currently being used in a non-distributed environment or datacenters. One can use a cheap commodity hardware providing a decent RAM and good CPUs. The hardware one would consider for this would have 4 to 8 GB of RAM and dual Core i7s. One can provide as many cores as they can to each node in a cluster for higher throughput or performance. Now, the complexity of the network becomes one of the bottlenecks. These nodes need to be able to communicate more often in the cluster. To attain this, the network connection speed needs to be increased.

3.6 Automatic configuration of Distributed DBMS

Although a network-based storage system has several benefits, managing the data across several nodes in a network is complex. Moreover, it brings several issues (few of which have been discussed in section 3.2) that are inexistent in centralized solutions. For example, the administrator has to decide whether data needs to be replicated and if yes, then at what rate should it be replicated, how to speed up several concurrent distributed accesses in the network (kept in memory – good for temporal short files; verified for similarity – good to save storage space and bandwidth when there is high similarity between write operations), etc. To avoid such complexities, the database systems typically fix these decisions making it easier to manage and yet keep the system simple.

Few versatile storage systems were proposed to solve these problems. The versatility is the ability of providing a predefined set of configurable optimizations that can be configured by the administrator or the running application based on their needs thus improving application performance and/or the reliability. Although this kind of versatility allows applications to obtain improved performance from the storage system, it adds more responsibilities to the administrator who now has to manually tune the storage system after learning the network and applications. Such a manual configuration of the storage system might not be a desired task to the administrator for the following reasons: lack of complete knowledge about the application using this network, the network workload, the workload change, and performance tuning which could be time-consuming.

In the paper ‘Automatic Configuration of a Distributed Storage System’ [6], the authors had proposed a design to support the administrator to correctly configure such a distributed file system in order to improve the application performance. They intended to design an architecture that allows automatic configuration of the parameters in such a versatile storage system. Their intended solution had the following requirements:

• Easy to configure: The main goal is to make the administrator’s job easier. Thus, they intended to design a solution that requires minimal human intervention to turn the optimizations on. The ideal case is no human intervention.

• Choose a satisfactory configuration: The automatic configuration should detect a set of parameters that can make the performance of the application closer to the administrator’s intention.

• Reasonable configuration cost: The cost required to detect the optimizations and set the new configuration should be paid off according to the utility to the system. For example, considering as a main goal having the smallest possible application execution time, if the cost to learn the correct parameters (the time in this case) is greater than the time saved by the optimizations, automatic configuration solution

is useless.

Figure 1 – Software Control loop

Courtesy: ‘Automatic Configuration of a Distributed Storage System’ [6]

To configure the system, they intended to design a software control loop (Figure 1) that automatically decides how to configure the storage system to optimize the performance. Thus, the system would have a closed loop consisting of a monitor which constantly verifies the behavior of the distributed storage system operations and extracts a set of measurements at given moment. The actual measurements may change depending on the target metric and on the configuration. Examples of measurements include: amount of data sent through network, amount of data stored, operation response time, system processing time, and communication latency. The monitor informs the system’s behavior to the controller by sending the measurements. The controller, then, analyzes the metrics, the impacts of its last decisions and may dispatch new actions to change the configuration of the storage system. Now, let us chip into the Big Data!

4 Big Dataset

While when it comes to cloud computing, not many of us know exactly how it will be used by the enterprise. But, something that is becoming increasingly clear is that Big Data is the future of Information Technology. So, tackling Big Data is the biggest challenge we would face in the next wave of cloud computing innovation. Data is growing exponentially everywhere from users, applications, machines and no vertical or industry is being spared. And the result is that organizations everywhere are being forced to grapple with storing, managing and extracting value from every piece of it as cheaply as possible. Yes, the race has begun!

IT architectures have been revisited/reinvented so many times in the past in order to remain competitive. The shift from the mainframes to distributed microprocessing environments, their subsequent shifts to web services through Internet are few of those to mention. While cloud computing will leverage these technological advantages in computing and networking to tackle Big Data, it will also embrace deep innovations in the areas of storage/data management.

4.1 Big Data stack

Before cloud computing will be broadly embraced by the customers, a new protocol stack will need to emerge. Infrastructure capabilities such as virtualization, security, systems management, provisioning, access control, availability, etc. will have to be standard before IT organizations are able to deploy the cloud infrastructure completely. In particular, the new framework needs the ability to process data real-time and in greater orders of magnitude. They could leverage commodity servers for computing and storage.

The cloud stack discussed above has already been implemented in many forms at large-scale data center networks. At the same pace, as the volume of data exploded, these networks also quickly encountered the scaling limitations of traditional SQL databases. To solve this, distributed and object-orientated data stores are being developed and implemented at scale. Many solved this problem through techniques like MySQL instance sharding, in essence using SQL more as data stores than true relational databases (i.e. no table joins, etc.). However, as internet data centers scaled, this approach didn’t help much.

4.2 The rise of Distributed Non-relational DBMS

Large web properties have been building their own so-called ‘NoSQL’ databases. But while it can seem like a different version of DDBMS sprouts up every day [2], they can largely be categorized into two flavors: One, distributed key value/tuple stores, such as Azure (Windows), CouchDB, MongoDB and Voldemort (LinkedIn); and two, wide column stores/column families such as HBase (Yahoo/Hadoop), Cassandra (Facebook) and Hypertable(Zvents).

These projects are in various stages of deployment and adoption but promise to deliver a scalable data layer on which applications can be built elastically. One facet that is common across these myriad of ‘NoSQL’ databases is a data cache layer, which is essentially a high-performance, distributed memory caching system that can accelerate applications in web by avoiding continual database hits.

5 Hadoop HBase Vs Cassandra

Managing distributed, non-transactional data has become more challenging. Internet data center networks are collecting massive volumes of data every second that somehow need to be processed cheaply. One solution that was been deployed by some of the largest web giants like Facebook, Yahoo, LinkedIn, etc. for distributed file systems and computing in a cloud is Hadoop [7]. In many cases, Hadoop essentially provides a primary compute and storage layer for the NoSQL databases. Although the framework has roots in data centers, Hadoop is quickly penetrating broader enterprise use cases.

Another popular distributed database that gathered the most attention is Cassandra [8]. Cassandra purports to being a hybrid of BigTable and Dynamo, whereas HBase is a near-clone of Google’s BigTable [3]. The split between Hbase and Cassandra can be categorized into the features that it supports and the architecture itself.

As far as I am concerned, customers seem to consider HBase as being more suitable for data warehousing, large scale data analysis (such as that involved when indexing the Web) and processing and Cassandra as being more suitable for real-time transaction processing and serving of interactive data. Before moving on to the comparisons, let us take a look at a powerful theorem.

5.1 Consistency, Availability and Partition tolerance (CAP)

CAP is a powerful theorem that applies to the development and deployment of distributed systems (in our case distributed databases). This theorem was developed by Prof. Brewer, Co-founder of Inktomi. According to Wikipedia CAP theorem is stated as follows: “The CAP theorem, also known as Brewer’s theorem, states that it is impossible for a distributed computer system to simultaneously provide all three of the following guarantees:

* Consistency (all nodes see the same data at the same time)

* Availability (node failures do not prevent survivors from continuing to operate)

* Partition tolerance (the system continues to operate despite arbitrary message loss)” [11].

The theorem states that a distributed system (“shared data” in our context) design, can offer at most two out of three desirable properties such as Consistency, Availability and Partition tolerance to network. Very basically from database context, “Consistency” means that if one writes a data value to a database, other users will immediately be able to read the same data value back. “Availability” means that if some number of nodes fail in a cluster (means network of nodes) the distributed DBMS can remain operational. And, “Partition tolerance” means that if the nodes or machines in a cluster are divided into two or more groups that can no longer communicate (maybe because of a network failure), the system again remains functional.

Many developers (and in fact customers), including many in the HBase community have taken it to heart that their database systems can only support two of these three CAP properties and have accordingly worked to this design principle. Indeed, after reading many posts in online forums, I regularly find the HBase community explaining that they have chosen Consistency and Partition (CP), while Cassandra has chosen Availability and Partition tolerance (AP). It is a fact that most developers would need consistency (C) in their network at some level.

However, the CAP theorem explained above only applies to a single distributed algorithm. And there is no reason why one cannot design a single database system where for any given operation or input, an underlying algorithm is selectable. Thus while it is true that a distributed database system may only offer two of these properties per operation, what has been widely missed is that a system