Dr. Edgar Codd introduced the relational model in his 1970 paper “A Relational Model of Data for Large Shared Data Banks”—there was an earlier 1969 version but let’s not digress.  The 1970 paper was reprinted in the 25th anniversary edition of the NoCOUG Journal. It’s terribly mathematical and consequently not very comprehensible to mere mortals. But his 1981 ACM Turing Award lecture is a study in contrast so I recommend it highly. You can read it in the 30th anniversary edition of the NoCOUG Journal. Here are my favorite quotes from that paper:

“The adverse impact on development productivity of requiring programmers to navigate along access paths to reach the target data (in some cases having to deal directly with the layout of data in storage and in others having to follow pointer chains) was enormous.”

In other words, the relational model was to be about productivity, productivity, productivity. In highlighting productivity in the ACM Turing Award lecture, Codd de-emphasized the ability to assert arbitrarily complex consistency constraints that must be satisfied by the data within the database. Not surprisingly, nobody talks about that capability much, though it is indubitably the most important capability of the relational model. There is finally an effort to implement it in Oracle Database.

“In the definition of the relational model there are several prohibitions. To cite two examples: user-visible navigation links between tables are ruled out, and database information must not be represented (or hidden) in the ordering of tuples within base relations.”

What is not so obvious is that user-invisible navigation links between tables are not ruled out. In other words, we shouldn’t need an index range scan to navigate from a parent record to a child record. Every relational database vendor to date has missed this.

“Instead of welcoming a theoretical foundation as providing soundness, the attitude seems to be: if it’s theoretical, it cannot be practical. The absence of a theoretical foundation for almost all nonrelational DBMS is the prime cause of their ‘ungepotchket’ (patched up) quality.”

Prior to August 2012, MongoDB only supported instance-level locking for writes (one MongoDB instance can control multiple databases). Between August 2012 and November 2014, MongoDB supported database-level locking for writes. Finally, in November 2014, MongoDB began supporting document-level locking for writes when using the optional WiredTiger storage engine and collection-level locking when using the default MMAPv1 storage engine (one MongoDB database can contain multiple document collections). #Nuffsaid

 “Can we retain the advantages of the relational approach while handling heterogeneous data also? Such data may include images, text, and miscellaneous facts. An affirmative answer is expected, and some research is in progress on this subject, but more is needed.”

The NoSQL camp touts “schemaless design.” In database management systems of the NoSQL kind, data is stored in “blobs” or documents; the database management system does not police their structure. In mainstream database management systems, on the other hand, doctrinal purity requires that the schema be designed before any data is inserted. But let’s do a thought experiment.

Let’s suppose that we don’t have a schema but the following facts are known:

  • Iggy Fernandez is an employee with EMPLOYEE_ID=1 and SALARY=$1000.
  • Mogens Norgaard is a commissioned employee with EMPLOYEE_ID=2, SALARY=€1000, and COMMISSION_PCT=25.
  • Morten Egan is a commissioned employee with EMPLOYEE_ID=3, SALARY=€1000, and unknown COMMISSION_PCT.

Could we ask the following questions and expect to receive correct answers?

Question: What is the salary of Iggy Fernandez?

Expected answer: $1000.

Question: What is the commission percentage of Iggy Fernandez?

Expected answer: Invalid question.

Question: What is the commission percentage of Mogens Norgaard?

Expected answer: 25%

Question: What is the commission percentage of Morten Egan?

Expected answer: Unknown.

If we humans can process the above data and correctly answer the above questions, then surely we can program computers to do so.
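The distinction between an invalid question and an unknown answer can be sketched in a few lines of Python. This is a toy illustration, not a real database engine; the record layout and the absent-key/None convention are my assumptions:

```python
# Toy illustration: answering the questions above from schemaless records.
# Convention assumed here: an attribute that is absent from a record makes
# the question invalid; an attribute present but set to None is unknown.
employees = [
    {"EMPLOYEE_ID": 1, "NAME": "Iggy Fernandez", "SALARY": 1000},
    {"EMPLOYEE_ID": 2, "NAME": "Mogens Norgaard", "SALARY": 1000,
     "COMMISSION_PCT": 25},
    {"EMPLOYEE_ID": 3, "NAME": "Morten Egan", "SALARY": 1000,
     "COMMISSION_PCT": None},  # commissioned, but the percentage is unknown
]

def commission_pct(name):
    record = next(e for e in employees if e["NAME"] == name)
    if "COMMISSION_PCT" not in record:
        return "Invalid question"  # not a commissioned employee
    if record["COMMISSION_PCT"] is None:
        return "Unknown"           # commissioned, value not yet known
    return record["COMMISSION_PCT"]
```

With these records, `commission_pct("Iggy Fernandez")` yields “Invalid question,” `commission_pct("Mogens Norgaard")` yields 25, and `commission_pct("Morten Egan")` yields “Unknown”—exactly the expected answers.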

The above data could be modeled with the following three relations. It is certainly disruptive to suggest that this be done on the fly by the database management system, but it is not outside the realm of possibility.

EMPLOYEES

  • EMPLOYEE_ID NOT NULL NUMBER(6)
  • EMPLOYEE_NAME VARCHAR2(128)

UNCOMMISSIONED_EMPLOYEES

  • EMPLOYEE_ID NOT NULL NUMBER(6)
  • SALARY NUMBER(8,2)

COMMISSIONED_EMPLOYEES

  • EMPLOYEE_ID NOT NULL NUMBER(6)
  • SALARY NUMBER(8,2)
  • COMMISSION_PCT NUMBER(2,2)
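Loaded into those three relations, the questions above answer themselves: an absent row makes a question invalid, while a NULL makes the answer unknown. Here is a minimal sketch using SQLite as a stand-in for any relational DBMS (the table and column names follow the relations above; the data types are simplified):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE employees (employee_id INTEGER NOT NULL, employee_name TEXT);
CREATE TABLE uncommissioned_employees (employee_id INTEGER NOT NULL,
                                       salary NUMERIC);
CREATE TABLE commissioned_employees (employee_id INTEGER NOT NULL,
                                     salary NUMERIC, commission_pct NUMERIC);
INSERT INTO employees VALUES
  (1, 'Iggy Fernandez'), (2, 'Mogens Norgaard'), (3, 'Morten Egan');
INSERT INTO uncommissioned_employees VALUES (1, 1000);
INSERT INTO commissioned_employees VALUES (2, 1000, 25), (3, 1000, NULL);
""")

# No row at all: the question is invalid for this employee.
invalid = con.execute("SELECT commission_pct FROM commissioned_employees"
                      " WHERE employee_id = 1").fetchall()  # []
# A row with NULL: the question is valid but the answer is unknown.
unknown = con.execute("SELECT commission_pct FROM commissioned_employees"
                      " WHERE employee_id = 3").fetchall()  # [(None,)]
```

The design choice worth noting is that “invalid” is encoded by which relation a tuple belongs to, not by a sentinel value inside a column.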

A NoSQL company called Hadapt—eventually acquired by Teradata—was the first to offer this capability: “While it is true that SQL requires a schema, it is entirely untrue that the user has to define this schema in advance before query processing. There are many data sets out there, including JSON, XML, and generic key-value data sets that are self-describing — each value is associated with some key that describes what entity attribute this value is associated with [emphasis added]. If these data sets are stored in Hadoop, there is no reason why Hadoop cannot automatically generate a virtual schema against which SQL queries can be issued. And if this is true, users should not be forced to define a schema before using a SQL-on-Hadoop solution — they should be able to effortlessly issue SQL against a schema that was automatically generated for them when data was loaded into Hadoop.” (http://hadapt.com/blog/2013/10/28/all-sql-on-hadoop-solutions-are-missing-the-point-of-hadoop/)
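Hadapt’s claim is easy to prototype: take the union of the keys observed in a batch of self-describing JSON documents and treat it as a virtual schema. The sketch below is my own illustration of that idea, not Hadapt’s implementation; note that this simple key-union strategy surfaces a missing attribute as NULL rather than distinguishing it as an invalid question:

```python
import json

# Two self-describing documents; each key names the attribute it carries.
docs = [
    json.loads('{"employee_id": 1, "name": "Iggy Fernandez",'
               ' "salary": 1000}'),
    json.loads('{"employee_id": 2, "name": "Mogens Norgaard",'
               ' "salary": 1000, "commission_pct": 25}'),
]

# Virtual schema: the union of all keys seen across the collection.
columns = sorted({key for doc in docs for key in doc})

# Present every document as a row against the inferred columns;
# attributes a document lacks surface as NULLs (None).
rows = [tuple(doc.get(col) for col in columns) for doc in docs]
```

SQL queries could then be issued against `columns`/`rows` as if a table had been declared up front, even though no schema was defined before the data arrived.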

This is not really new ground. Oracle Database provides the ability to convert XML documents into relational tables (http://docs.oracle.com/cd/E11882_01/appdev.112/e23094/xdb01int.htm#ADXDB0120), though it ought to be possible to view XML data as tables while physically storing it in XML format in order to benefit certain use cases. It should also be possible to redundantly store data in both XML and relational formats in order to benefit other use cases.

In “Extending the Database Relational Model to Capture More Meaning,” Dr. Codd explains how a “formatted database” is created from a collection of facts: “Suppose we think of a database initially as a set of formulas in first-order predicate logic. Further, each formula has no free variables and is in as atomic a form as possible (e.g., A & B would be replaced by the component formulas A, B). Now suppose that most of the formulas are simple assertions of the form Pab…z (where P is a predicate and a, b, … , z are constants), and that the number of distinct predicates in the database is few compared with the number of simple assertions. Such a database is usually called formatted, because the major part of it lends itself to rather regular structuring. One obvious way is to factor out the predicate common to a set of simple assertions and then treat the set as an instance of an n-ary relation and the predicate as the name of the relation.”
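Codd’s construction can be sketched directly: group simple assertions by their predicate, and each group becomes an n-ary relation named after the predicate. A toy illustration in Python (the assertions themselves are invented for this example):

```python
from collections import defaultdict

# Simple assertions of the form P(a, b, ..., z): a predicate followed
# by constants. Few distinct predicates, many assertions.
assertions = [
    ("EMPLOYEE", 1, "Iggy Fernandez"),
    ("EMPLOYEE", 2, "Mogens Norgaard"),
    ("SALARY", 1, 1000),
    ("SALARY", 2, 1000),
    ("COMMISSION_PCT", 2, 25),
]

# Factor out the predicate: each predicate names a relation, and the
# remaining constants of each assertion form one tuple of that relation.
relations = defaultdict(list)
for predicate, *constants in assertions:
    relations[predicate].append(tuple(constants))
```

Running this yields three relations—EMPLOYEE, SALARY, and COMMISSION_PCT—each holding the tuples of constants from its assertions, which is precisely the “rather regular structuring” Codd describes.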

In other words, a collection of facts can always be organized into a collection of relations—on the fly if necessary.