I’m sure you’ve heard the saying, “Never discuss politics or religion in polite company”. That’s good advice. This blog is not intended to spark impassioned debate on a topic that for many DBAs is their religion: the database they prefer. Rather I simply want to show that maybe automatically picking a database without testing might not always yield the best results. One database cannot really be the single best solution for all use cases. There are considerations for cost, performance, and applications, among others. Therefore some experimentation might be very worthwhile to assess the options. In this post, I will focus on database performance for a data warehouse.
Let me start with the least contentious issue, “Does the database version really matter?” Of course, if you know that you need a certain feature introduced in a specific release, then you simply need to adopt that version. But what if for example any one of three major database versions would work? I know that most database vendors don’t generally support more than the current and previous major versions, but for the sake of demonstration let’s assume that they do. So in this example let’s compare Oracle 11g R2, 12c R2, and 18c performance for a data warehouse. I ran a TPC-H benchmark and measured two key metrics that are fairly easy to compare: how long it takes to:
- Create, load, index, and collect statistics for 300GB
- Run 22 highly complex queries that scan lots of data
My goal was to measure two key aspects: how fast data can go in and how fast it can come out. While this may seem like an overly simple way to measure any differences, it works well enough for this blog’s purposes. Here are the results (note that I've excluded the scale units from the Y-axis as I only want to show the relative difference):
Oracle 12c was a major update adding numerous new features; hence the observed 26% improvement was not a surprise. Clearly, that upgrade is a no-brainer. Oracle 18c has been explained by some as being more like an Oracle 12c R2 update (i.e. 12c R3) than a whole new major version. I am not in a position to confirm or deny that, but I do see that 18c yields only a 9% improvement over 12c. Still worth doing, but clearly I could remain on 12c if I preferred. However given all the new cloud-specific features and support in Oracle 18c, I’d probably pull that trigger anyhow.
Now let’s try the same test across the four major relational databases: SQL Server, Oracle, PostgreSQL, and MySQL. All databases were running together on the same Windows VM with the same CPU, memory, and disks. Again I am not trying to say that one database is better than another. I merely contend that testing like this will help you to pick the right database for the job at hand. So once again I’m running a 300GB TPC-H for each of these four databases. Here are the results:
I was shocked by these results. My initial belief was that MySQL 8.0 would be a great database for a data warehouse along the lines of a TPC-H type benchmark, and thus probably a good data warehouse option in general. I tried running the tests with both the INNODB and MYISAM storage engines just to be sure. I had to adjust my thinking. For the queries, while still much higher it was still in the ballpark. But the data load, indexing, and statistics collection were orders of magnitude worse. The other databases looked very close, so I had to remove MySQL in order to better see how the remaining three really compared to each other. Here is the comparison with just SQL Server, Oracle, and PostgreSQL.
OK, now we can see that SQL Server and Oracle were the top two choices. Furthermore if data loads were more important, then Oracle might be the better choice; however, if query run times were most important, then SQL Server might be the better choice. My next logical test would be to test SQL Server 2016 and 2017 to see how they stacked up against Oracle 18c before making any final decisions. I leave that up to the reader.
I hope you saw the value in this exercise. My premise was that no one single database is always the best choice for any job. You need to compare your alternatives versus the workload you intend to deploy and support. Maybe running industry standard benchmarks will suffice, or maybe you’ll need to do a workload capture and replay instead of your database application. Regardless, please don’t just assume that you automatically know for a fact which database and version are best. In today’s world, the database has become a commodity. You should pick the one that best suits your current project’s needs and budget rather than just always picking the one you have the most history with. With an open-minded approach like this, you’ll master multiple databases and thus extend your career. Knowledge is earning power.