Friday, February 8, 2013

HCatalog - Embrace the independence

Codd's Rule 9: Logical data independence:
Changes to the logical level (tables, columns, rows, and so on) must not require a change to an application based on the structure. Logical data independence is more difficult to achieve than physical data independence. (wiki)
This was a novel idea in 1970 and caused a lot of controversy.  Back in the day, your COBOL program was usually fired from some flavor of JCL, so we had to point the DD statements (think stdin/stdout) at a specific location and dataset name 'manually', along with UNIT and VOL parms.
//TRGD12 JOB FOO,PLUMARIA
//STEP01 EXEC PGM=MYPROGRAM
//INDD DD DSN=TRGD56.DEMO.INPUT,DISP=SHR,UNIT=SYSDA
//OUTDD DD DSN=TRGD56.DEMO.OUTPUT,DISP=(NEW,CATLG),
// UNIT=AFF=INDD
Sorry for posting JCL on a Hadoop blog, don't judge my age.  And to the TSO brethren, I know: Cataloged datasets helped with this later on...
As we evolved our metastores, we began to organize data logically by schemas and abstracted the physical location via TABLESPACES and FileGroups.  After a while, we took this for granted while we pounded out our wonderful SQL.
But as the great philosophers, MATCHBOX 20, pondered:
Let's see how far we've come
Let's see how far we've come
Let's take a typical Pig Latin script that does some ETL work:
A = load '/data/MDM/ds=20130101/bu=pd/property=cust' using PigStorage()
as (custid:chararray, name:chararray, addr:chararray, timestamp:long);
Ouch!  This way of working could become a maintenance nightmare with any of the following situations:
1. Compacting multiple files via HAR
2. Change in compression method
3. Schema changes from data producer
4. Moving data location
If the variety of your schemas is limited, this is probably easy to manage.  However, most enterprise environments will be anything but limited.  This is also compounded by users accessing Hadoop with different data tools and sharing the same data.  Soon your cluster becomes very brittle.  What's a data alchemist to do?
HCatalog to the rescue!  This Apache subproject comes primarily from Yahoo and appears to be a big part of Hortonworks' bag of tricks for organizing data.  Basically, HCatalog extends the Hive metastore and, judging by the contributors, has been a collaborative effort among members of the Apache Pig, Hive, and Hadoop projects.
Per the wiki:
Apache HCatalog is a table and storage management service for data created using Apache Hadoop.
This includes:
Providing a shared schema and data type mechanism.
Providing a table abstraction so that users need not be concerned with where or how their data is stored.
Providing interoperability across data processing tools such as Pig, Map Reduce, and Hive.
The Pig example above would now look like:
A = load 'MDM' using HCatLoader();
B = filter A by ds='20130101' and bu='pd' and property='cust';
...

Much better.  If we relocate the files that store MDM, or switch to a better storage format, the Pig Latin script does not change at all.  The same holds for MapReduce and Hive.  Also notice that the schema is not specified; HCatLoader automagically hands it to Pig and converts the columns to the appropriate Pig data types.
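Writing works the same way.  Here is a rough sketch using HCatStorer; the output table 'mdm_clean' is made up for illustration and assumed to already exist in the metastore with the same partition keys, and I'm assuming the MDM table carries the same columns as the path-based example above:
-- keep just the data columns, then write them into one partition
-- of a (hypothetical) HCatalog-managed output table
C = foreach B generate custid, name, addr, timestamp;
store C into 'mdm_clean' using org.apache.hcatalog.pig.HCatStorer('ds=20130101,bu=pd,property=cust');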
HCatalog also leverages Hive's DDL, so we can create and drop tables, specify table parameters, and so on.  This will give us a great way to manage our metadata.  I'm also curious how we can extend HCatalog from an audit standpoint to keep track of who, what, where, and when datasets are touched.
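As a rough sketch of that DDL (the column names simply mirror the earlier example, and RCFile is just a placeholder storage choice), the MDM table could be declared once, for instance through the hcat command line, and then shared by Pig, Hive, and MapReduce alike:
CREATE TABLE MDM (
  custid STRING,
  name STRING,
  addr STRING,
  `timestamp` BIGINT
)
PARTITIONED BY (ds STRING, bu STRING, property STRING)
STORED AS RCFILE;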
This is another dot connected for me.  I ranted about this a bit in a prior post, Confessions of a data architect, where I argued about the limited capabilities of HDFS from a data organization standpoint.  In a way, this also answers another looming question: where should we store metadata?  This solution depends on Postgres... interesting.
Check out this video from Hortonworks on HCatalog.
