Last week the BBBT was visited
by startup database company Calpont, although 'startup' hardly fits the
bill considering the fact that the company was founded in 2000. I won't
go into the company nor the BBBT session details; you can find the last
one here.
What I do want to blog about is the product itself. Infinidb comes in
both a Community Edition (CE) and an Enterprise Edition (EE), where the
latter contains MPP capabilities, monitoring tools and different support
options. For more details, visit the comparison
page. Since I don't have an MPP cluster available I took the CE for
a spin and thought I'd share the results.
The Good
As
you might have noticed on the edition comparison page, the core server
features are the same for both CE and EE. That's good news compared to
direct competitor Infobright. Infobright has stripped out DML
(insert/update/delete) capabilities from the Community Edition. Some
people would call this 'cripple ware' as the only way to update a
datawarehouse table is to drop, recreate and reload it. 1-0
for InfiniDB already.
More on the good side: the installation process.
The product installs in minutes and is only a 12.7 MB download. I
downloaded the 64bit RPM version which requires two steps: extract the
zip file and run the command rpm -i Calpont*.rpm as root. This
will install the software in the default location /usr/local/Calpont.
Then invoke the script /usr/local/Calpont/bin/install-infinidb.sh,
and configure the InfiniDB Aliases with .
/usr/local/Calpont/bin/calpontAlias and you're good to go. In my
case: almost good to go since the data directories are now under
/user/local/Calpont which is not my 12 SDD disk raid set. Simply
mounting the data device to /usr/local/Calpont/data1 solves that problem
too. Alternatively, you can of course move the data1 directory to a
different location and create a symlink in the original spot. The
command service infinidb start fires up the database engine and
invoking idbmysql gives you command line access to the database.
Remember, it's all MySQL so if you're familiar with that, working with
InfiniDB is a breeze.
The MySQL thing is another piece of the goodies:
all front end tools, including the MySQL workbench (query browser,
admin console) can be used with InfiniDB as well. The same JDBC drivers
you already have for MySQL can be used with InfiniDB as well. The only
difference when creating a new table is the fact that you should specify
InfiniDB as the engine, but that's about it.
The last item I'd like to mention here is the
bulk loader. The TPC-H benchmark consists of 22 queries and the database
contains 8 tables. The data files can be created using the dbgen
utility and will generate pipe delimited text files with a .tbl
extension. The default settings InfiniDB uses for bulk loading text
files are also the pipe delimiter and a .tbl extension, what a
convenience! Other than that, the file names have to be named exactly
as the tables you want to load (so customer.tbl is data for the table
customer) and placed in the Calpont data import directory. Invoking the
command colxml will create a bulk
loader import job based on these files and the meta data from the tables
in the database. To start the import, simply run the cpimport
command and InfiniDB starts loading. This may sound more complex than it
actually is, so trust me, it couldn't be simpler. Or faster, for that
matter: loading the 100GB data set took only about 25 minutes, a new
record on my machine!
The Bad
There's actually only one 'bad' thing about the
current version of InfiniDB: it's not finished yet. Yes, it works, it's
fast (more about that later), but there are still a couple of serious
limitations. The most notable of these, at least when running a TPC-H
benchmark, is the support for subqueries. Version 1.1.0 alpha didn't
support any form of subquery, so even a select * from table where column
in (select othercolumn from othertable) couldn't be run. Version 1.1.1
alpha, released on April 23, solved this last one, but more complex
subquery constructs or correlated subqueries are not yet supported. The
upgrade from 1.1.0 to 1.1.1 enabled InfiniDB to complete 10 of the 22
TPC-H queries, instead of the only 5 it could run a week ago. But, as
I've said, this should only be a temporary problem. The roadmap
shows that in a month or so, phase two of the subquery support should
be available in the next alpha release, with GA
(General Availability) for version 1.1. set at early July. By then we
can have a look at the complete run and see how it behaves, also when
multiple threads are running in parallel.
The Ugly
Calpont uses the 'no indexes needed' as one of
the key benefits of the product; I tend to disagree on that one. It's
nice that you don't need to explicitely specify indexes, but when a DBMS
doesn't support any constraints AT ALL, well, that's plain ugly. Want
to enforce a NOT NULL contraint? Bad luck. Primary/foreign key
relationships? Ditto. You could argue that these features are not really
mandatory in a data-warehouse, but without constraint and index support
all the constraint enforcement must be built into the ETL process. Another issue that will hopefully be solved in the final release is the insert/update speed. Bulkloading is indeed fast, but only 40 rows per second using a regular insert or update won't cut it in the real world.
The $64,000 question...
There is actually only one single reason why
anyone would want to use a column store like InfiniDB in the first
place: performance! So the main question is: does it deliver? Yes, it
does. Compared to MySQL the performance improvement is no less than
spectacular. In fact, to date nobody has been brave (or patient) enough
to try an SF100 TPC-H on MySQL so a direct comparison is not even
available. There are however plenty of other comparisons that can be
made. The 10 queries that do run already all outperform Greenplum
single node edition (except for query11) for instance. Some queries
are somewhat faster (Q10, Q12, Q18), some are 3-4 times faster (Q1, Q3,
Q4, Q14, Q16), and query 6 is more than 20 times faster. For a disk
based analytical database (InfiniDB doesn't seem to take as much
advantage of memory as other products I evaluated) it's really, really
fast. Query 1 is always a good indicator since it forces a full table
scan on the largest fact table (600 million rows in this case). If you
can do this in under a minute on my moderate hardware, you do have a
potential winner.
Conclusion
My initial thoughts about InfiniDB when I first
tested it weren't very positive, to say the least. But, given the fact
that they are moving in the right direction and have kept their promised
delivery dates so far, combined with the ease of installation, ease of
use (it's all MySQL) and of course the already great performance, a
second look is certainly warranted. Given the limitations of the direct
competitors (Kickfire with its proprietary hardware, Infobright with its
crippled community edition and lack of MPP/scale out capabilities),
InfiniDB should be on the top of your shortlist when looking for a MySQL
based data-warehouse solution. When it's finished, of course.
Jos van Dongen,Tholis Consulting