Adoption of new technologies is inevitable; the key factor is the DBA mindset, which evolves with experience gained across many areas of data management. Technology advances on demand, and what it ultimately consumes, processes and manages is data. So technicians with data management skills will share a similar mindset. A DBA in the Hadoop ecosystem can leverage clustering, resource management, partitioning, parallelism, ETL, SQL and scripting skills, since these are heavily used in Hadoop and the high-level concepts differ little from what a DBA already knows.
For example, fencing is one of the important mechanisms used in clusters to provide high availability by avoiding split-brain syndrome. In Oracle RAC we use the STONITH (Shoot The Other Node In The Head) concept with an odd number of voting disks, so that a node counts as a member of the cluster only while it can access more than half of the voting disks. Similarly, for NameNode High Availability in Hadoop we configure an odd number of ZooKeeper servers, so that a majority quorum, together with fencing of the previously active NameNode, avoids split-brain syndrome.
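The majority rule behind both setups can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not code from Oracle Clusterware or ZooKeeper; the function names `has_quorum` and `tolerated_failures` are made up for this example.

```python
def has_quorum(reachable_votes: int, total_votes: int) -> bool:
    """Return True if a node can reach a strict majority of the voting members."""
    return reachable_votes > total_votes // 2


def tolerated_failures(total_votes: int) -> int:
    """Number of voting members that can be lost while a majority still exists."""
    return (total_votes - 1) // 2


if __name__ == "__main__":
    # With 3 voting members, a node that sees 2 stays in; a node that sees 1 is out.
    print(has_quorum(2, 3), has_quorum(1, 3))
    # Going from 3 to 4 members does not buy any extra fault tolerance.
    for n in (3, 4, 5):
        print(n, "members tolerate", tolerated_failures(n), "failure(s)")
```

An even number of members adds cost without adding tolerance, and a 2-2 split of 4 members leaves neither side with a majority, which is why odd counts are the usual recommendation.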
DBAs with data warehousing experience can excel in the Hadoop world, because Hadoop workloads resemble data warehousing workloads, and controlling their resource consumption so that clusters are not choked is an art. In the Hadoop ecosystem, YARN is the next-generation compute and resource management framework: it governs how applications use Hadoop system resources, allowing interactive querying and streaming data applications to run simultaneously with MapReduce batch jobs.
Administration of Hadoop requires operational expertise and strong, methodical troubleshooting skills in areas like memory, CPU, I/O, network and storage. We can assume every DBA has these, and if not, there is no reason a DBA can't acquire them. Linux skills are also important because Hadoop is deployed on Linux; again, most DBAs have hands-on experience with Linux.
If a DBA works in a DevOps culture, building and maintaining automated database deployments, collaborating closely with development and taking responsibility for everything from capacity planning to application scalability, then the same skill sets can be used in the Hadoop world, because deployment and configuration management tools like Chef and Puppet are universal.
DBAs are blessed with extensive diagnostic information about RDBMS systems, which are precisely designed and where careful thought goes into structuring the schema for data placement. Hadoop is a contrast: it is not ACID compliant, the schema is defined at runtime, and NoSQL is not relational. It therefore demands deep technical expertise to manage data using the variety of tools, analytical engines and flexibility available in the Hadoop ecosystem. Hadoop requires intimate knowledge of the various components of the stack, some of which are still immature and need even more fine-tuning and understanding to become enterprise ready.
It's certain that the roles and responsibilities of today's DBAs are shifting dramatically. Can the DBA survive this paradigm shift?
November's topic will be:
What will be the role of a DBA in a NoSQL world? Is the role of a DBA diminishing?
Firstly, a NoSQL database does not imply "no SQL" at all, which is why "NoSQL" is also read as "not only SQL". Many NoSQL databases also provide a SQL-like interface. What is different about NoSQL databases is that they are based on a database model that is non-relational and schema-free. In contrast, relational databases such as Oracle database store data in tables with a fixed schema (rows/columns), which have relations between them, and use SQL (Structured Query Language) to access and query the tables. A relational database table has a fixed schema with predefined columns and column types. Schema-less (schema-free) implies that each row of data in a NoSQL database could have a different (or the same) set of columns and that the column data type is not fixed. Some of the data models supported by NoSQL databases are document store (for example MongoDB and Couchbase), wide column store (for example Apache Cassandra), and key-value store (for example Oracle NoSQL database).
Relational databases have been used for decades. What is the need for a new type of database, the NoSQL database?
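As a rough illustration of fixed schema versus schema-free, the sketch below uses Python's built-in sqlite3 module as a stand-in for a relational database and plain dictionaries as a stand-in for documents in a document store. The table, column and field names are invented for the example; no particular NoSQL product's API is shown.

```python
import sqlite3

# Relational side: the columns are fixed when the table is created.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, email TEXT)")
conn.execute("INSERT INTO users (id, name, email) VALUES (1, 'Ann', 'ann@example.com')")


def try_insert_with_extra_column(connection: sqlite3.Connection) -> bool:
    """Try to store a field the schema does not know about; return True on success."""
    try:
        connection.execute(
            "INSERT INTO users (id, name, twitter) VALUES (2, 'Bob', '@bob')"
        )
        return True
    except sqlite3.OperationalError:
        # SQLite rejects the statement because 'twitter' is not a column of 'users'.
        return False


# Schema-free side: each "document" carries its own set of fields and types.
documents = [
    {"id": 1, "name": "Ann", "email": "ann@example.com"},
    {"id": 2, "name": "Bob", "twitter": "@bob", "tags": ["dba", "hadoop"]},
]

if __name__ == "__main__":
    print("relational insert accepted:", try_insert_with_extra_column(conn))
    for doc in documents:
        print(sorted(doc))
```

In the relational case the new field needs a schema change first; in the document case the second record simply carries different fields.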
NoSQL databases were developed as a solution to the following requirements of modern web scale applications:
A subset of the aforementioned reasons is often referred to as the 3 Vs or 4 Vs of big data: Volume, Variety, Velocity and Veracity. SQL-based relational databases were not designed to handle the scalability, agility, and performance requirements of modern applications that access and process big data in real time. While most RDBMS products provide scalability and high availability as features, NoSQL databases are designed to provide higher levels of both. Big data is growing exponentially, and concurrent users of web applications have grown from a few hundred or a few thousand to several million. Neither is static: new data keeps being added after big data has been stored, and an application accessed by millions of users today is not guaranteed that load for any predictable period; the number of users could drop to a few thousand within a day or a few days. A relational database is traditionally based on a single-server architecture, which makes that single database a single point of failure (SPOF). For a highly available database, data must be distributed across a cluster of servers instead of relying on a single database. NoSQL databases provide the distributed, scalable architecture required for big data. "Distributed" implies that data in a NoSQL database is distributed across a cluster of servers; if one server becomes unavailable, another server is used. The "distributed" feature is a provision and not a requirement for a NoSQL database: a small-scale NoSQL database may consist of only one server.
The fixed schema data model used by relational databases makes it necessary to break data into small sets and store them in separate tables using table schemas. The process of decomposing large tables into smaller tables with relationships between tables is called database normalization. Normalized databases require table joins and complex queries to aggregate the required data. In contrast, the data models provided by NoSQL databases favor denormalized data. In a document store, each document is typically complete in itself and does not need external references to other documents. Self-contained documents are easier to store, transfer, and query.
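The contrast between a normalized layout and a self-contained document can be sketched as follows. Again sqlite3 stands in for the relational database and a Python dictionary for a document; the customers/orders data is invented for illustration.

```python
import sqlite3

# Normalized: customers and orders live in separate tables linked by a key.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customers(id),
        item TEXT
    );
    INSERT INTO customers VALUES (1, 'Ann');
    INSERT INTO orders VALUES (10, 1, 'disk');
    INSERT INTO orders VALUES (11, 1, 'cable');
    """
)

# Reassembling the customer's orders requires a join.
joined = conn.execute(
    "SELECT c.name, o.item FROM customers c "
    "JOIN orders o ON o.customer_id = c.id ORDER BY o.id"
).fetchall()

# Denormalized: one document embeds everything needed to answer the same question.
customer_doc = {
    "_id": 1,
    "name": "Ann",
    "orders": [{"id": 10, "item": "disk"}, {"id": 11, "item": "cable"}],
}
embedded = [(customer_doc["name"], order["item"]) for order in customer_doc["orders"]]

if __name__ == "__main__":
    print(joined)
    print(embedded)
```

Both approaches yield the same name/item pairs; the document version reads a single record, at the cost of duplicating customer data if it is embedded in several places.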
In this section we shall cover the advantages of NoSQL databases.
NoSQL databases are easily scalable, which provides an elastic data model. Why is scalability important? Suppose you are running a database with a fixed capacity and the web site traffic fluctuates, sometimes rising much in excess of the capacity, sometimes falling below the capacity. A fixed capacity database won't be able to serve the requests of the load in excess of the capacity, and if the load is less than the capacity the capacity is not being utilized fully. Scalability is the ability to scale the capacity to the workload. Two kinds of scalability options are available: horizontal scalability and vertical scalability. With horizontal scalability or scaling-out, new servers/machines are added to the database cluster. With vertical scalability or scaling-up, the capacity of the same server or machine is increased. Vertical scalability has several limitations.
While relational databases support vertical scalability, NoSQL databases support horizontal scalability. Horizontal scalability does not have the limitations that vertical scalability does. Additional server nodes may be added to a cluster without a dependency on the other nodes in the cluster. The capacity of a NoSQL database is designed to scale close to linearly, which implies that if you add 4 additional servers to a single server, the total capacity becomes roughly five times the original rather than a fraction of that due to performance loss. The NoSQL cluster does not have to be shut down to add new servers. Ease of scalability is provided by the shared-nothing architecture of NoSQL databases. The monolithic architecture of traditional SQL databases is not suitable for the flexible requirements of storing and processing big data. Traditional databases support a scale-up architecture (vertical scaling), in which additional resources may be added to a single machine. In contrast, NoSQL databases provide a scale-out (horizontal scaling), shared-nothing architecture, in which additional machines may be added to the cluster. In a shared-nothing architecture, the different nodes in a cluster do not share any resources, and all data is distributed (partitioned) evenly (load balancing) across the cluster by a process called sharding.
Why is high availability important? Because interactive real-time applications serving several users need to be available all the time. An application cannot be taken offline for maintenance, software or hardware upgrades, or capacity increases. NoSQL databases are designed to minimize downtime, though different NoSQL databases provide different levels of support for online maintenance and upgrades.
NoSQL databases are designed to be installed on commodity hardware instead of high-end hardware. Commodity hardware is easier to scale out: simply add another machine, and the new machine does not even have to be of similar specification and configuration as the machines already in the NoSQL database cluster.
While relational databases store data in a fixed tabular format for which the schema must be defined before adding data, NoSQL databases either do not require a schema to be defined or provide a flexible, dynamic schema. Some NoSQL databases, such as Oracle NoSQL database and Apache Cassandra, have a provision for a flexible schema definition; others, such as Couchbase, are schema-less in that the schema is not defined at all. The support for flexible schemas or no schemas makes NoSQL databases suitable for structured, semi-structured, and unstructured data. In an agile development setting the schema definition for data stored in a database may need to change, which makes NoSQL databases suitable for such an environment. Dissimilar data may be stored together.
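Sharding can be pictured as a function from a record's key to a node. The following is a minimal hash-and-modulo sketch, assuming string keys and a fixed shard count; the function name `shard_for` and the key format are invented for this example, and real NoSQL databases generally use more elaborate schemes (such as consistent hashing) so that adding a node moves only part of the data.

```python
import hashlib
from collections import Counter


def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard number by hashing it (stable across runs and machines)."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


def distribution(num_keys: int, num_shards: int) -> Counter:
    """Count how many of the sample keys 'user-0', 'user-1', ... land on each shard."""
    return Counter(shard_for(f"user-{i}", num_shards) for i in range(num_keys))


if __name__ == "__main__":
    # The same key always goes to the same shard.
    print(shard_for("user-42", 4) == shard_for("user-42", 4))
    # Keys spread roughly evenly over the shards.
    print(sorted(distribution(10_000, 4).items()))
```

With plain modulo, changing the shard count remaps most keys, which is one reason production systems choose other partitioning schemes.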
Flexible schemas make development faster, code integration uninterrupted by modifications to the schema, and database administration almost redundant.
NoSQL databases are designed for big data. Big data is on the order of tens or even hundreds of petabytes (PB). Big data is usually associated with a large number of users and a large number of transactions.
The data models provided by NoSQL databases support object-oriented programming, which is both easy to use and flexible. Most NoSQL databases are supported by APIs in object-oriented programming languages such as Java, PHP, and Ruby. All client APIs support simple put and get operations to add and get data.
Why is performance important? Because interactive real-time applications require low latency for read and write operations for all types and sizes of workloads. Applications need to serve millions of users concurrently at different workloads. The shared-nothing architecture of NoSQL databases provides low latency, high availability, reduced susceptibility to failure of critical sections, and reduced bandwidth requirements. The performance of a NoSQL database cluster does not degrade with the addition of new nodes.
NoSQL databases typically handle server failure automatically by failing over to another server. Why is auto-failover important? Because if one of the nodes in a cluster were to fail while handling a workload, the application would fail and become unavailable. NoSQL databases typically consist of a cluster of servers and are designed with the failure of some nodes as expected and unavoidable. With a large number of nodes in a cluster the database does not have a single point of failure, and failure of a single node is handled transparently, with the load of the failed server being transferred to another server.
NoSQL databases are easier to install and administer without the need for specialized DBAs. A developer is able to handle the administration of a NoSQL database, but a specialized NoSQL DBA should still be used. Schemas are flexible and do not need to be modified periodically. Failure detection and failover are automatic and do not require user intervention. Data distribution in the cluster is automatic using auto-sharding. Data replication to the nodes in a cluster is also automatic. When a new server node is added to a cluster, data gets distributed and replicated to the new node automatically as required.
Cloud computing has made unprecedented capacity and flexibility in choice of infrastructure available. Cloud service providers such as Amazon Web Services (AWS) provide fully managed NoSQL database services and also the option to develop custom NoSQL database services.
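The idea of transparent failover can be sketched with a toy router that sends each request to the first healthy replica. The `Cluster` class, its methods and the node names are all hypothetical; real databases add failure detection, replication and rebalancing on top of this basic idea.

```python
class Cluster:
    """Toy model of a cluster that routes requests around failed nodes."""

    def __init__(self, node_names):
        # Every node starts out healthy.
        self.healthy = {name: True for name in node_names}

    def mark_down(self, name: str) -> None:
        """Record that a node has failed (a real system would detect this itself)."""
        self.healthy[name] = False

    def route(self, replicas):
        """Return the first healthy node among the replicas that hold the data."""
        for name in replicas:
            if self.healthy.get(name, False):
                return name
        raise RuntimeError("no healthy replica available")


if __name__ == "__main__":
    cluster = Cluster(["node1", "node2", "node3"])
    replicas = ["node1", "node2"]      # nodes holding copies of some record
    print(cluster.route(replicas))     # served by node1
    cluster.mark_down("node1")
    print(cluster.route(replicas))     # the same request is now served by node2
```

The application keeps calling `route` the same way before and after the failure, which is what makes the failover transparent to it.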
While much has been discussed about their merits, NoSQL databases are not without drawbacks. Some of the aspects in which relational databases have advantages are as follows.
NoSQL databases typically do not provide the Atomicity, Consistency, Isolation, and Durability (ACID) properties in transactions that relational databases do.
Instead, NoSQL databases provide Basically Available, Soft state, and Eventually consistent (BASE) transactional properties.
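What "eventually consistent" means in practice can be shown with a toy store in which a write lands on one replica first and reaches the others only when a replication step runs. The class and method names are invented for this sketch and do not correspond to any real database's API.

```python
class EventuallyConsistentStore:
    """Toy key-value store with one primary replica and lagging secondaries."""

    def __init__(self, num_replicas: int):
        self.replicas = [{} for _ in range(num_replicas)]
        self.pending = []  # writes not yet copied to the secondaries

    def put(self, key, value) -> None:
        """Apply the write to the primary (replica 0) and queue it for the rest."""
        self.replicas[0][key] = value
        self.pending.append((key, value))

    def get(self, key, replica_index: int):
        """Read from one specific replica; it may return stale or missing data."""
        return self.replicas[replica_index].get(key)

    def sync(self) -> None:
        """Replication step: copy all queued writes, in order, to the secondaries."""
        for key, value in self.pending:
            for replica in self.replicas[1:]:
                replica[key] = value
        self.pending.clear()


if __name__ == "__main__":
    store = EventuallyConsistentStore(3)
    store.put("balance", 100)
    print(store.get("balance", 0))  # the primary already has the value
    print(store.get("balance", 2))  # a secondary has not seen it yet
    store.sync()
    print(store.get("balance", 2))  # after replication, all replicas agree
```

Between `put` and `sync`, two readers can see different answers for the same key; an ACID transaction in a relational database is designed to rule that out.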
The NoSQL databases are still new to the field of databases and not as functionally stable and reliable as the established relational databases.
Most NoSQL databases such as MongoDB and Apache Cassandra are open source projects and lack the official support provided by established databases such as Oracle database or IBM DB2 database.
NoSQL/NonSQL databases such as MongoDB (or Apache Cassandra or Couchbase) shall never completely replace relational databases such as Oracle database because the NoSQL databases are designed for a different use case, which is big, unstructured data, for example the web scale data used by search engines. Small scale enterprises (and even some larger ones) would continue to use relational databases for their superior transactional properties, stability & reliability, and established support base.