OneStopTesting.com - Testing EBooks, Tutorials, Articles, Jobs, Training Institutes etc.
OneStopGate.com - Gate EBooks, Tutorials, Articles, FAQs, Jobs, Training Institutes etc.
OneStopMBA.com - MBA EBooks, Tutorials, Articles, FAQs, Jobs, Training Institutes etc.
OneStopIAS.com - IAS EBooks, Tutorials, Articles, FAQs, Jobs, Training Institutes etc.
OneStopSAP.com - SAP EBooks, Tutorials, Articles, FAQs, Jobs, Training Institutes etc.
OneStopGRE.com - of GRE EBooks, Tutorials, Articles, FAQs, Jobs, Training Institutes etc.

MySQL Database Analytics with InfiniDB from Calpont � Part 1 | Articles | Recent Articles | News Article | Interesting Articles | Technology Articles | Articles On Education | Articles On Corporate | Company Articles | College Articles | Articles on Recession

Sponsored Ads

Hot Jobs

Fresher Jobs
Experienced Jobs
Government Jobs
Walkin Jobs

Placement Section

Company Profiles
Interview Questions
Placement Papers

Resources @ VYOMS

Companies In India
Consultants In India
Colleges In India
Exams In India
Latest Results

Notifications In India
Call Centers In India
Training Institutes In India
Job Communities In India
Courses In India
Jobs by Keyskills
Jobs by Functional Areas

Learn @ VYOMS

GATE Preparation
GRE Preparation
GMAT Preparation
IAS Preparation
SAP Preparation
Testing Preparation
MBA Preparation

News @ VYOMS

Freshers News
Job Articles
Latest News

India News Network

Interview Ebook

Get 30,000+ Interview Questions & Answers in an eBook.
Interview Success Kit - Get Success in Job Interviews

Interview Success Kit - Get Success in Job Interviews

30,000+ Interview Questions
Most Questions Answered
5 FREE Bonuses
Free Upgrades

VYOMS TOP EMPLOYERS

Home » Articles » MySQL Database Analytics with InfiniDB from Calpont � Part 1

MySQL Database Analytics with InfiniDB from Calpont � Part 1

CBSE 12th Result 2016 @ cbseresults.nic.in at 12 : CBSE 12th Class Results | CBSE Board Class 10th X Result 2016: Results likely to be declared on May 24 @ Cbse.nic.in & Cbseresults.nic.in | Gujarat Secondary and Higher Secondary Education Board‬, ‪Gujarat‬‬ | GSHSEB HSC (Science) results tomorrow, SSC on May 24; check gseb.results-nic.in | RBSE 12th Result 2016: Check rajeduboard.nic.in, rajresults.nic.in exam Rajasthan 12th Board Result 2016 to be announced on May 16 at 5 PM | AIPMT 2015: Admit cards released by the CBSE | State Bank of India Probationary Officer (SBI PO) Main Exam: Admit cards to be released on 13th July 2015 | Main Result | Bank Of India Recruitment 2014 | SBI PO Result 2014 -15

Article Posted On Date : Friday, January 15, 2010

MySQL Database Analytics with InfiniDB from Calpont � Part 1

Advertisements

Let's be honest: working with big databases is a lot of fun. There's something cool about dealing with tables that have hundreds of millions or billions of rows in them, loading huge amounts of data, building star and snowflake schemas for data warehouses/marts, optimizing query performance, and all that jazz. Yes, working with big databases is a lot of fun.

On the other hand, let's be honest: working with big databases is not a lot of fun. There's a lot of pain in dealing with tables that have hundreds of millions or billions of rows in them, waiting for huge amounts of data to be loaded only to have the load job toss its cookies and fail when it's 99% done, building special schemas that you wonder whether make any difference at all, and trying to figure out why just a simple two-way join query has been hanging for over an hour. Yes, working with big databases is not a lot of fun.

But here's the thing: big databases aren't going away; in fact, there's more of them than ever before and they just keep multiplying and getting bigger. I could quote a lot of industry statistics and TDWI reports on this fact, but that would just bore you because you likely know all this stuff already. You also know that there are a number of database vendors out there who can take on big data, but they also come with big price tags.

Surveys done by MySQL show that data warehousing is the fifth most common use case for MySQL users, so it's understandable that many try and get around the sticker shock of proprietary data warehouse DB's with MySQL. One of the challenges is scaling MySQL for very large data volume sizes or for complex/analytic queries with fast response times. Moreover, one of the drawbacks that most generic RDBMS's have is a row-based architecture, whereas most data warehouses and read-intensive query situations are selective column-based use cases.

Lately, the idea of column-oriented databases has been catching new wind in its sails. I say 'new wind' because Sybase IQ did the column database thing when columns weren't that cool. But now there are a number of column-oriented databases that are really taking off, however most still are proprietary and cost quite a bit. But what happens when you marry a column-oriented database that's modular in design, can scale both up and out, with MySQL and open source? You get Calpont's InfiniDB.

Why Columns?

To be frank, I'm as skeptical as they come when claims are made that say how this or that different 'type' of database is better than a general purpose RDBMS. When object-oriented databases came out, I scoffed. When OLAP/cube databases were hogging all the tech press, I smirked. I believed that neither could really challenge the good ol' relational model, which isn't perfect by any means, but hey: it still has enough punch to relegate those two other challengers to the very boutique marketplace. That being the case, why should we entertain a relational database that is vertical vs. horizontally architected?

Because, for querying and processing medium to huge amounts of data, they actually work. And, many times, they work better at finishing the SQL query horse race than a general purpose RDBMS. That's why.

Before column databases came out, experienced data modelers creating data warehouses, marts, and reporting databases knew that by designing vertically partitioned tables that were subsets of an overall table, they could sometimes buy better physical query performance because the rows were shorter and thus I/O could somewhat be reduced. Using views, they'd then tie the vertically partitioned tables together to form one logically-based table that could be referenced. And sometimes this practice worked. But oftentimes, it didn't make the grade because (1) even the vertically partitioned tables were still row-based and I/O wasn't always reduced by much; (2) it was really hard to predict what columns would be consistently queried together; (3) the practice didn't reduce the need for indexes (and in fact made the matter worse sometimes because you needed indexes to smartly do joins between the vertically partitioned tables in the view definitions); (4) like with the difficulty in predicting what columns would be queried together, it was difficult to do index design because user's query patterns could change at the drop of a hat and suddenly queries would do full scans instead of index searches. And on and on it went.

Column-oriented databases designed especially for analytics overcome these limitations by storing, managing, and querying data based on columns rather than rows. Because only the necessary columns in a query are accessed rather than entire rows, the column can act as the index and I/O activities as well as overall query response times can oftentimes be reduced. The end result is the ability to interrogate and return query results against either moderate or large amounts of data in a pretty darn efficient manner. On average, a row-based system can do up to 5-10x the physical I/O that a column-based database does to retrieve the same information. This is why column bigot Philip Howard (Research Director for Bloor Research) says: "Columns provide better performance at a lower cost with a smaller footprint: it is difficult to understand why any company seriously interested in query performance would not consider a column-based solution."

Nice quote, but should you really check out a column database? Can they really help you? Read on�

What's it Have?

Column databases can be a help for sure in the right situation, but there's more that's needed to crunch through large data volumes than just having a vertical data design. In addition to being column-oriented, the open source version of InfiniDB provides the following core feature set, which adds a number of needed items to the mix:

Column-oriented architecture: mentioned already...
Multi-threaded design: the engine is multi-threaded and uses today's modern multi-CPU/core architectures to split a single query up across the available processing units.
Automatic vertical and horizontal partitioning: Being column-oriented, the engine uses vertical partitioning to store data but it also uses a form of logical horizontal range partitioning that does not require special storage placements or design. Using both vertical and logical horizontal range partitioning allows it to reduce I/O in both directions (column and row). Both vertical and horizontal partitioning are automatically handled so you don't have to manually do anything.
High user concurrency: There are no real limits as far as concurrency is concerned.
High-speed data loader: To effectively load lots of data, a high-speed load utility is made available. I'll show you how this works in just a second.
DML support: In addition to supporting the high-speed bulk loading of data, the engine supports full DML (insert, update, delete) operations as well.
Transactional support: ACID-compliant transactional support is provided in the engine. Commit, rollback, deadlock detection, all that good stuff.
Crash recovery: the engine has full crash recovery capabilities.
MVCC design: who doesn't like 'snapshot read'? I do, and the engine has it.
No need for indexing: Because the engine uses both vertical and logical horizontal partitioning, there is no need for indexing. In essence, the data is the index. In addition, the engine automatically maintains a small, important structure called the Extent Map (explained later), which is used to reduce I/O. Besides no indexes, you also don't have to worry about dealing with materialized views, summary tables (of course, in MySQL, you don't' have these anyway�), or manual data partitioning.
Logical Data compression: the engine doesn't have real physical/file compression yet (coming�) so for now, it uses transparent logical data compression to store data that benefits from being compressed.
ALTER TABLE is supported (add, drop).
Performance diagnostics: To help troubleshoot those queries from hell, InfiniDB supplies a nice tracing/diagnostic utility that helps figure out exactly what a query is doing above the normal EXPLAIN output.

InfiniDB also uses MySQL as its front end, so if you're familiar with MySQL, you'll be right at home with InfiniDB. MySQL is used mainly in InfiniDB for security, SQL parsing, and initial query plan output.

In addition to the above open source product feature set, a commercial/Enterprise version of InfiniDB will be offered that has a pay-for scale out option where multiple machines can be used in a massive parallel processing (MPP) configuration. However, for many data warehousing, data mart, and analytic database tasks, the open source version of InfiniDB provides plenty of power to plow through big databases.

InfiniDB Architecture

The architecture of InfiniDB is modular and consists of three basic components:

User Module: The User Module is made up of a small MySQL instance and a number of InfiniDB processes that handle concurrency scaling. The User Module is also responsible for breaking down SQL requests and distributing the various parts to one or more Performance Modules that actually retrieve requested data from either memory caches or disk. Finally, the User Module assembles all the query results from the various participating Performance Modules to form the complete query result set that is returned to the user. Note there can more than one User Module in an InfiniDB Enterprise configuration so you can do concurrency scale out by adding more User Module nodes.
Performance Module: The Performance Module is responsible for retrieving and managing data, and passing it back to the User Module(s) to satisfy query requests. The Performance Module selects data from disk and caches it in a shared nothing data cache that is part of the server on which the Performance Module resides. InfiniDB Enterprise accomplishes its MPP benefits by allowing the user to configure as many Performance Modules as they would like; each additional Performance Module adds more cache to the overall database as well as more processing power.
Storage: InfiniDB can use either local storage or shared storage (e.g. SAN) to store data. A user can have everything configured and running on one server. In the commercial edition of InfiniDB, they can scale out with multiple servers and configure either a shared disk (currently supported) or shared nothing (coming in future release) architecture.

All the above modular components can exist on one server, or in the Enterprise version, be broken up and run on different machines. If you want to compare the architecture and design of InfiniDB to other proprietary analytic databases in the market that do both scale up (use a box's available CPU's) and scale out (MPP), InfiniDB is comparable to Vertica, Sybase IQ, and Paraccel. There are other analytic databases available that scale and do MPP, but all others are row-based (e.g. Greenplum, Aster Data, Teradata, Netezza, etc.) vs. column-oriented.

One interesting feature that allows InfiniDB to handle lots of data is a structure called the Extent Map. It's the Extent Map that removes the need for you to do any indexing, manual partitioning of data, create summary tables, etc.

An "extent" in InfiniDB is a logical block of space that exists within a physical file, with an extent being anywhere from 8-64MB in size depending on a column's datatype. Each extent supports the same number of rows, with smaller data types using less space on disk. The Extent Map catalogs all extents and their corresponding blocks (identified with logical block identifiers or LBID's). The Extent Map also maintains minimum and maximum values for a column's data within an extent.

The Extent Map provides the ability for InfiniDB to only retrieve the blocks needed to satisfy a query, but it also provides another benefit � that of logical range partitioning. This is accomplished via the minimum and maximum values for each extent that are contained within the Extent Map. Extent elimination is first accomplished in InfiniDB via the column-oriented architecture (only needed columns are scanned), but the process is accelerated because of this logical horizontal partitioning that is implemented in the Extent Map.

This automatic extent elimination behavior is well suited for time-based data where data is loaded frequently and often referenced by time. Near real-time loads with queries against the leading edge of the data can easily show good extent elimination for all of the date/time columns as well as an ascending key value. Any column with clustered values is a good candidate for extent elimination.

To eliminate an extent when a column scan involves a filter, that filter value is compared to the minimum and maximum values stored in each extent for the column:

In the above figure, if a WHERE column filter of col1 between 220 and 250 is specified, InfiniDB will eliminate extents 1, 2 and 4 from being scanned, saving � of the I/O and many comparison operations. If the extent cannot possibly contain any rows that would match the filter, that extent is ignored entirely. Additionally, since each extent eliminated for a column also eliminates the same extents for all the other columns in a table, impossible filter chains can be quickly identified without any I/O being required. For example, take the following two columns and their Extent Map information:

If a column WHERE filter of col1 between 220 and 250 and col2 < 10000 is specified, InfiniDB will eliminate extents 1, 2 and 4 from the first column filter, then, looking at just the matching extents for col2 (i.e. just extent 3), it will determine that no extents match and return zero rows without doing any I/O at all.

Amazon.in Widgets

Sponsored Ads

Interview Questions

HR Interview Questions
Testing Interview Questions
SAP Interview Questions
Business Intelligence Interview Questions
Call Center Interview Questions

Databases

Clipper Interview Questions
DBA Interview Questions
Firebird Interview Questions
Hierarchical Interview Questions
Informix Interview Questions
Microsoft Access Interview Questions
MS SqlServer Interview Questions
MYSQL Interview Questions
Network Interview Questions
Object Relational Interview Questions
PL/SQL Interview Questions
PostgreSQL Interview Questions
Progress Interview Questions
Relational Interview Questions
SQL Interview Questions
SQL Server Interview Questions
Stored Procedures Interview Questions
Sybase Interview Questions
Teradata Interview Questions

Microsof Technologies

.Net Database Interview Questions
.Net Deployement Interview Questions
ADO.NET Interview Questions
ADO.NET 2.0 Interview Questions
Architecture Interview Questions
ASP Interview Questions
ASP.NET Interview Questions
ASP.NET 2.0 Interview Questions
C# Interview Questions
Csharp Interview Questions
DataGrid Interview Questions
DotNet Interview Questions
Microsoft Basics Interview Questions
Microsoft.NET Interview Questions
Microsoft.NET 2.0 Interview Questions
Share Point Interview Questions
Silverlight Interview Questions
VB.NET Interview Questions
VC++ Interview Questions
Visual Basic Interview Questions

Java / J2EE

Applet Interview Questions
Core Java Interview Questions
Eclipse Interview Questions
EJB Interview Questions
Hibernate Interview Questions
J2ME Interview Questions
J2SE Interview Questions
Java Interview Questions
Java Beans Interview Questions
Java Patterns Interview Questions
Java Security Interview Questions
Java Swing Interview Questions
JBOSS Interview Questions
JDBC Interview Questions
JMS Interview Questions
JSF Interview Questions
JSP Interview Questions
RMI Interview Questions
Servlet Interview Questions
Socket Programming Interview Questions
Springs Interview Questions
Struts Interview Questions
Web Sphere Interview Questions

Programming Languages

C Interview Questions
C++ Interview Questions
CGI Interview Questions
Delphi Interview Questions
Fortran Interview Questions
ILU Interview Questions
LISP Interview Questions
Pascal Interview Questions
Perl Interview Questions
PHP Interview Questions
Ruby Interview Questions
Signature Interview Questions
UML Interview Questions
VBA Interview Questions
Windows Interview Questions
Mainframe Interview Questions

Copyright © 2001-2025 Vyoms.com. All Rights Reserved.

Home | About Us | Advertise With Vyoms.com | Jobs | Contact Us | Feedback | Link to Us | Privacy Policy | Terms & Conditions

Placement Papers | Get Your Free Website | IAS Preparation | C++ Interview Questions | C Interview Questions | Report a Bug | Romantic Shayari | CAT 2025

Fresher Jobs | Experienced Jobs | Government Jobs | Walkin Jobs | Company Profiles | Interview Questions | Placement Papers | Companies In India | Consultants In India | Colleges In India | Exams In India | Latest Results | Notifications In India | Call Centers In India | Training Institutes In India | Job Communities In India | Courses In India | Jobs by Keyskills | Jobs by Functional Areas

Testing Articles | Testing Books | Testing Certifications | Testing FAQs | Testing Downloads | Testing Interview Questions | Testing Jobs | Testing Training Institutes

Gate Articles | Gate Books | Gate Colleges | Gate Downloads | Gate Faqs | Gate Jobs | Gate News | Gate Sample Papers | Gate Training Institutes

MBA Articles | MBA Books | MBA Case Studies | MBA Business Schools | MBA Current Affairs | MBA Downloads | MBA Events | MBA Notifications | MBA FAQs | MBA Jobs
MBA Job Consultants | MBA News | MBA Results | MBA Courses | MBA Sample Papers | MBA Interview Questions | MBA Training Institutes

GRE Articles | GRE Books | GRE Colleges | GRE Downloads | GRE Events | GRE FAQs | GRE News | GRE Training Institutes | GRE Sample Papers

IAS Articles | IAS Books | IAS Current Affairs | IAS Downloads | IAS Events | IAS FAQs | IAS News | IAS Notifications | IAS UPSC Jobs | IAS Previous Question Papers
IAS Results | IAS Sample Papers | IAS Interview Questions | IAS Training Institutes | IAS Toppers Interview

SAP Articles | SAP Books | SAP Certifications | SAP Companies | SAP Study Materials | SAP Events | SAP FAQs | SAP Jobs | SAP Job Consultants
SAP Links | SAP News | SAP Sample Papers | SAP Interview Questions | SAP Training Institutes |

Copyright ©2001-2025 Vyoms.com, All Rights Reserved.
Disclaimer: VYOMS.com has taken all reasonable steps to ensure that information on this site is authentic. Applicants are advised to research bonafides of advertisers independently. VYOMS.com shall not have any responsibility in this regard.