In Part 1 of this article, I took you through some of the reasons why a column-oriented database based on MySQL can be compelling. We also examined the architecture of InfiniDB, which is an open source analytic, reporting, and read-intensive database, and how its modular design helps you scale both up and out. Now let's kick the tires of the database so you can see these things in action.
Loading Data
First, let's load some data into a basic database to work with. The load and query tests below were run on a small 32-bit Fedora Core 6 machine with 2GB of RAM and four CPUs (2.40GHz with 512KB cache).
With InfiniDB, you have a high-speed loader to work with called cpimport that does a decent job of taking flat files and pushing their data into database tables. There are two primary steps to using the cpimport utility:
- Create a job file that is used to load data from one or more flat files into a database
- Run the cpimport utility to perform the data import
Note that bulk loads are an append operation on a table, so existing data can still be read and remains unaffected while the load runs. Upon completion of the load operation, a high water mark in each column file is moved in an atomic operation, which allows any subsequent queries to read the newly loaded data.
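To make that concrete, here is a hypothetical two-session sketch (the lineitem table is from the example that follows; the behavior shown is inferred from the append/high-water-mark semantics just described, not a captured session):

-- Session 1: a bulk load is appending rows to lineitem
-- Session 2, while the load runs, reads only below the previous
-- high water mark:
select count(*) from lineitem;  -- returns the pre-load row count
-- Session 2, after the load completes and the high water mark moves:
select count(*) from lineitem;  -- now includes the newly appended rows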
You can create a load job to load one table at a time or load multiple tables from different flat files. Let's go ahead and load up an entire TPC-H database:
mysql> show tables;
+-----------------+
| Tables_in_tpch2 |
+-----------------+
| customer |
| lineitem |
| nation |
| orders |
| part |
| partsupp |
| region |
| supplier |
+-----------------+
8 rows in set (0.00 sec)
First, you put all your load files in InfiniDB's import directory:
[rschumacher@srvsn import]$ pwd
/usr/local/Calpont/data/bulk/data/import
[rschumacher@srvsn import]$ ls -l
total 1076072
-rw-rw-r-- 1 root calpont 24346144 Sep 25 11:39 customer.tbl
-rw-rw-r-- 1 root calpont 759863287 Sep 25 11:39 lineitem.tbl
-rw-rw-r-- 1 root calpont 2224 Sep 25 11:39 nation.tbl
-rw-rw-r-- 1 root calpont 171952161 Sep 25 11:39 orders.tbl
-rw-rw-r-- 1 root calpont 118984616 Sep 25 11:39 partsupp.tbl
-rw-rw-r-- 1 root calpont 24207240 Sep 25 11:39 part.tbl
-rw-rw-r-- 1 root calpont 389 Sep 25 11:39 region.tbl
-rw-rw-r-- 1 root calpont 1409184 Sep 25 11:39 supplier.tbl
Next, you create a job file that's used by the import operation. The job file is created with the colxml utility. The reason for this separate step is that loads are often run repeatedly (especially in dev/test environments), and a job file can be reused many times to re-load data into the same objects. So, let's create our job file to import all our data into a database called 'tpch2', where the flat files use the default pipe ('|') character as the field delimiter, and give our job a number of 500:
[rschumacher@srvsn bin]$ ./colxml tpch2 -j500
Running colxml with the following parameters:
2009-10-07 15:14:20 (9481) INFO :
Schema: tpch2
Tables:
Load Files:
-b 0
-c 1048576
-d |
-e 10
-f CSV
-j 500
-m 50
-n
-p /usr/local/Calpont/data/bulk/job/
-r 5
-s
-u
-w 10485760
-x tbl
File completed for tables:
tpch2.customer
tpch2.lineitem
tpch2.nation
tpch2.orders
tpch2.part
tpch2.partsupp
tpch2.region
tpch2.supplier
Normal exit.
Now, let's perform the actual import for all eight tables using the cpimport utility:
[rschumacher@srvsnp bin]$ ./cpimport -j 500
Bulkload root directory : /usr/local/Calpont/data/bulk
job description file : Job_500.xml
2009-10-07 15:14:59 (9952) INFO : successfully load job file /usr/local/Calpont/data/bulk/job/Job_500.xml
2009-10-07 15:14:59 (9952) INFO : PreProcessing check starts
2009-10-07 15:15:04 (9952) INFO : PreProcessing check completed
2009-10-07 15:15:04 (9952) INFO : preProcess completed, total run time : 5 seconds
2009-10-07 15:15:04 (9952) INFO : No of Read Threads Spawned = 1
2009-10-07 15:15:04 (9952) INFO : No of Parse Threads Spawned = 3
2009-10-07 15:15:06 (9952) INFO : For table tpch2.customer: 150000 rows processed and 150000 rows inserted.
2009-10-07 15:16:12 (9952) INFO : For table tpch2.nation: 25 rows processed and 25 rows inserted.
2009-10-07 15:16:12 (9952) INFO : For table tpch2.lineitem: 6001215 rows processed and 6001215 rows inserted.
2009-10-07 15:16:31 (9952) INFO : For table tpch2.orders: 1500000 rows processed and 1500000 rows inserted.
2009-10-07 15:16:33 (9952) INFO : For table tpch2.part: 200000 rows processed and 200000 rows inserted.
2009-10-07 15:16:44 (9952) INFO : For table tpch2.partsupp: 800000 rows processed and 800000 rows inserted.
2009-10-07 15:16:44 (9952) INFO : For table tpch2.region: 5 rows processed and 5 rows inserted.
2009-10-07 15:16:45 (9952) INFO : For table tpch2.supplier: 10000 rows processed and 10000 rows inserted.
2009-10-07 15:16:45 (9952) INFO : Bulk load completed, total run time : 106 seconds
The cpimport utility pushed 8.6 million rows into our database in a little over 100 seconds, which isn't bad for a small 32-bit machine: the per-table counts in the log sum to 8,661,245 rows, and 8,661,245 rows / 106 seconds works out to roughly 82,000 rows per second. Now, let's run some query tests.
Querying Data in InfiniDB
To run queries against InfiniDB, you'll be using a mysql client interface named calpontmysql.
As nearly everyone likes to test query speed by issuing a count(*) against a table, let me start off by saying that InfiniDB does not keep a count of a table's rows immediately on hand, so it will always do an actual count of the data:
mysql> select count(*) from lineitem;
+----------+
| count(*) |
+----------+
| 6001215 |
+----------+
1 row in set (0.81 sec)
With that caveat out of the way, let's now run a few queries and see what we see. Again, the database has no indexes or manual partitioning of data, so it's just pure column-oriented tables we're working with. Let's start with a 3-way table join between customer, orders, and lineitem:
mysql> select c_name, sum(l_quantity)
-> from customer, orders, lineitem
-> where c_custkey = o_custkey and
-> l_orderkey = o_orderkey and
-> l_shipdate = '1992-01-02'
-> group by c_name;
+--------------------+-----------------+
| c_name | sum(l_quantity) |
+--------------------+-----------------+
| Customer#000094199 | 35.00 |
| Customer#000146399 | 21.00 |
| Customer#000104149 | 20.00 |
| Customer#000081157 | 12.00 |
| Customer#000071485 | 27.00 |
.
.
.
17 rows in set (1.12 sec)
InfiniDB brings back the result set in a little more than a second, which isn't bad. Now, let's switch to another TPC-H database where I have 20 times the data in the lineitem table, which equates to 120 million rows instead of just 6 million, and run the same query:
mysql> select c_name, sum(l_quantity)
-> from customer, orders, lineitem
-> where c_custkey = o_custkey and
-> l_orderkey = o_orderkey and
-> l_shipdate = '1992-01-02'
-> group by c_name;
+--------------------+-----------------+
| c_name | sum(l_quantity) |
+--------------------+-----------------+
| Customer#000094199 | 700.00 |
| Customer#000009263 | 380.00 |
| Customer#000146399 | 420.00 |
| Customer#000072862 | 100.00 |
.
.
.
17 rows in set (10.48 sec)
If we added CPUs to the box or increased the data cache size, we could reduce the response time further. Alternatively, if we used the Enterprise Edition option and went MPP, we could expect the response time to be cut in half by adding one new node, and in half again by adding two more nodes after that.
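To put rough numbers on that scaling claim (these are projections from the claim above, not measured results): starting from the 10.48-second run, a second node would bring the query to roughly 5.2 seconds, and two more nodes after that to roughly 2.6 seconds.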
Now, let's try a full 8-way join with the smaller TPC-H database:
mysql> select c_name, p_name, ps_availqty, s_name,
-> o_custkey, r_name, n_name, sum(l_quantity)
-> from orders, lineitem, customer, part,
-> partsupp, supplier, nation, region
-> where o_orderkey = l_orderkey and
-> c_custkey = o_custkey and
-> p_partkey = l_partkey and
-> ps_partkey = p_partkey and
-> s_suppkey = ps_suppkey and
-> r_regionkey = n_regionkey and
-> s_nationkey = n_nationkey and
-> o_orderkey = 4161537
-> group by c_name, p_name, ps_availqty, s_name, o_custkey, r_name, n_name;
+--------------------+----------------------------------------+-------------+--------------------+-----------+-------------+---------------+-----------------+
| c_name | p_name | ps_availqty | s_name | o_custkey | r_name | n_name | sum(l_quantity) |
+--------------------+----------------------------------------+-------------+--------------------+-----------+-------------+---------------+-----------------+
| Customer#000122437 | slate light gainsboro dark seashell | 8343 | Supplier#000000597 | 122437 | AMERICA | UNITED STATES | 47.00 |
.
.
28 rows in set (5.68 sec)
And with the bigger TPC-H lineitem table (120 million rows):
mysql> select c_name, p_name, ps_availqty, s_name,
-> o_custkey, r_name, n_name, sum(l_quantity)
-> from orders, lineitem, customer, part,
-> partsupp, supplier, nation, region
-> where o_orderkey = l_orderkey and
-> c_custkey = o_custkey and
-> p_partkey = l_partkey and
-> ps_partkey = p_partkey and
-> s_suppkey = ps_suppkey and
-> r_regionkey = n_regionkey and
-> s_nationkey = n_nationkey and
-> o_orderkey = 4161537
-> group by c_name, p_name, ps_availqty, s_name, o_custkey, r_name, n_name;
+--------------------+----------------------------------------+-------------+--------------------+-----------+-------------+---------------+-----------------+
| c_name | p_name | ps_availqty | s_name | o_custkey | r_name | n_name | sum(l_quantity) |
+--------------------+----------------------------------------+-------------+--------------------+-----------+-------------+---------------+-----------------+
| Customer#000122437 | slate light gainsboro dark seashell | 8343 | Supplier#000000597 | 122437 | AMERICA | UNITED STATES | 47.00 |
.
.
28 rows in set (22.98 sec)
With this join, we see pretty good performance on the smaller database, and only about a 4x increase in response time (22.98 seconds versus 5.68) for 20 times the lineitem data in the same 8-way operation, which again is not bad for our little 32-bit system.
One interesting item of note is that InfiniDB supports hash joins, which, at the time of this writing, MySQL does not. A hash join builds an in-memory hash table on the smaller table's join keys and probes it with rows from the larger table, avoiding row-by-row nested-loop lookups on large tables. Moreover, InfiniDB's hash joins can be distributed and processed in parallel when a multi-node, MPP configuration is used.
Let's now test InfiniDB's Extent Map to see whether queries that use a value outside a column's minimum/maximum range can be satisfied from the Extent Map alone, without reading any actual data blocks:
mysql> select min(l_orderkey), max(l_orderkey) from lineitem;
+-----------------+-----------------+
| min(l_orderkey) | max(l_orderkey) |
+-----------------+-----------------+
| 1 | 6000000 |
+-----------------+-----------------+
1 row in set (1.25 sec)
mysql> select count(*) from lineitem where l_orderkey < 1;
+----------+
| count(*) |
+----------+
| 0 |
+----------+
1 row in set (0.04 sec)
mysql> select count(*) from lineitem where l_orderkey > 6000000;
+----------+
| count(*) |
+----------+
| 0 |
+----------+
1 row in set (0.04 sec)
The Extent Map seems to be doing its job here, which is good. There are few things more frustrating in my book than waiting a long time for a result set that turns out to contain zero rows.
I could run through other example queries, but the above should give you a decent first impression of how InfiniDB will handle data loads and queries.
When To and Not To Use InfiniDB
So what are some rules of thumb as to when you should and shouldn't use InfiniDB? In general, InfiniDB may be a help to you when you have:
- Read-only databases or read-only parts of an application where response times are important and row subsets (i.e. not all columns in a row) are being requested (see the sketch after this list)
- Data marts/warehouses/analytic applications where query patterns change and cannot be predicted. In other words, it is difficult to continually know which columns will be requested and will need indexing
- Very large data volumes that are not served well by a general purpose database's optimizer
- Scale out scenarios where you want to do parallel processing of queries across multiple machines (note: today, this is only possible with the Enterprise Edition of InfiniDB)
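As a quick, hypothetical sketch of the first point (the sales_fact table and its columns are made up for illustration), a query that touches only two columns of a wide table lets a column store scan just those two column files and skip the rest:

-- only the region and amount column files are read; the table's
-- other columns are never touched
select region, sum(amount)
from sales_fact
group by region;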
Places where you don't want to use InfiniDB include:
- Transactional applications where you want InfiniDB to serve as a transactional back-end database. Even though InfiniDB is transactional, its column-oriented nature means it will not perform singleton/selective inserts and deletes as fast as a general-purpose RDBMS will
- Query systems where all the columns in a table are requested and a primary key lookup is performed (see the example after this list)
- Databases that are very small and static/not growing
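To illustrate that second anti-pattern with a hypothetical lookup (the key value here is made up), a full-row fetch by primary key forces a column store to visit every column file just to assemble one row; a row store with a primary key index will almost always win this race:

-- full-row fetch by primary key: a poor fit for a column store
select *
from customer
where c_custkey = 12345;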