Database schema-enforcement for the great good

With the rise of the NoSQL hype, developers rushed to the schema-free databases ditching years of SQL in search for a true scalability promised by the vendors for the new web-scale era.

Schema-enforced database model

 CREATE TABLE users(userid VARCHAR, planet VARCHAR, age INTEGER);

Schema-free database model

// Schema-free
{ /* just put any stuff here */ }

It's kinda frustrating when I see developers advertise schema-less databases as a feature to be celebrated and that it's the future of a web-scale applications!

After a few years of using document-based schema-less databases I came to the conclusion that the schema-less is just a myth. IMHO, it promotes lazier development and less thoughtful designs of software applications, resulting in lots of crappy code that some possibly innocent developer will have to decipher and maintain.

It reminds me with the dynamically-typed language movement that many web developers embraced including me for a few years and still many developers are still fond of dynamically-typed languages because they don't have to deal with compilers, type-checking, and etc.

So, Why do I think that schema-free is a myth?

When you store data in a database, you know what you are storing, when you fetch it back, you expect certain things to be there defined in certain types. Even if you are using a dynamically-typed language you still cannot add an integer with a string and you expect the result to be an integer. That's why you probably will be doing certain type-checks after you retrieve the data or at least you expect that an exception can be thrown trying to do those kinds of operations in runtime, which is even scarier!

So, in fact you are basically moving your compile-time type-enforcement to your runtime. This pattern gives you flexibility that you can have multiple types of documents in the same collection in your database, however, you still need to identify those different documents so you might want to add some kind of identifier there, for example.

{
  "_objType": "User"
  "userid": "AhmedSoliman",
  "planet": "Earth"
}

The field _objType was added by you to ensure that you can classify the documents upon retrieval. Another thing is evolutions, an advertised feature by database vendors is that you don't have to make offline data migrations anymore if you decided to tinkle with your now-imaginary data model. Now what you need to do is to check on every document upon manipulation or retrieval to see if it's the latest updated version or not and based on that you will update the document to the latest data model. So, imagine that we want to change the key userid to username. We will be doing probably something like this on all of our CRUD operations

  // check if the document is old-style and migrate on-demand
  if (doc.hasKey("userid")) {
    doc.set("username", doc.get("userid"));
    doc.delete("userid");
    doc.save();
  }
  //let's do the normal work here

The validation you've added is fragile and expensive, if you later decided to bring back the userid you will be in-trouble, plus you are actually repeating your migration-validation-thing almost everywhere, it's easy to forget! And expensive because you will have to do this validation on every document even if it's actually migrated already!

Model Versioning, does it help?

One common solution to manage changing your imaginary data schema is to put a version number along every document. Once you upgrade your imaginary schema you add one to the model version so every new document will get that and for old documents you again run a version-check code on almost every CRUD operation.

if (obj.modelVersion <= 3) {  
  doc = upgradeDocument(obj.modelVersion, 3);
  doc.save();
}

Where upgradeDocument takes the current document version and the latest model version and does its magic. What about querying your data? what happens if you query documents that still holds userid instead of your query key username? Oops, no match!

You will then have to think about running background job to upgrade all your old documents to ensure that your query indexes are updated for your queries.

See? there is no silver-bullet, it's just your problem now to handle all of that, this makes building the NoSQL databases relatively simpler but makes your enterprise-ready-code much harder to get right.

How did we arrive here in the first place?

We arrived here because SQL databases didn't perform very well in a web-scale era, poor clustering options, no out-of-the-box data partitioning support, and most commonly what's known as the CAP theorem.
The answer was to give up one of the C.A.P letters and along the way many database vendors dropped the schema-support too! We didn't have to do that, did we? no.

Solving the database scalability problems is absolutely irrelevant here, as long as your schema and data validation doesn't depend on strong consistency of your entire dataset. So, it's technically possible to have a data backend that it's truly horizontally scalable and schema-enforced at the same time.

Conclusion

I really like the idea of document-oriented databases, it offers a unique data-model which most of the time is close to your native JSON data representation. It's particularly awesome when it fits the rest of your data structures, but this doesn't mean that we are giving away the need of having a schema-level checking on the data, right?

Basically, database vendors are implicitly saying that validation and data migration is your problem now, you deal with it. I'm not saying to stop using schema-free databases but your decision be thoughtful of the consequences.

Most developers I saw, think in types and schemas, this is basically how their brain work, you need types and schemas even if they are implicit but then you need to be sharper and more focused, and definitely you need a better battery of tests and much more luck!

Side Note

If you think that schema-enforcement is awesome, it might be a good idea and you know your way through Scala, check out really.io a realtime open source schema-enforced backend-as-a-service solution to help developers build near-real-time applications for the Web and Mobile with a truly scalable modern backend under their palm. If you are interested about the development of this, check out our github repo

comments powered by Disqus