
Wednesday, August 7, 2013

Consuming JSON Strings in SQL Server

This article describes a TSQL JSON parser and its evil twin, a JSON outputter, and provides the source. It is also designed to illustrate a number of string manipulation techniques in TSQL. With it you can extract the data from a JSON document.

Friday, February 8, 2013

HCatalog - Embrace the independence

Codd's Rule 9: Logical data independence:
Changes to the logical level (tables, columns, rows, and so on) must not require a change to an application based on the structure. Logical data independence is more difficult to achieve than physical data independence. (wiki)
This was a novel idea in 1970 and caused a lot of controversy.  Back in the day, your COBOL program was usually fired from some flavor of JCL.  So we had to point the DD (stdin/stdout) to a specific location and dataset name 'manually' along with UNIT and VOL parms.
//TRGD12 JOB FOO,PLUMARIA
//STEP01 EXEC PGM=MYPROGRAM
//INDD DD DSN=TRGD56.DEMO.INPUT,DISP=SHR,UNIT=SYSDA
//OUTDD DD DSN=TRGD56.DEMO.OUTPUT,DISP=(NEW,CATLG),
// UNIT=AFF=INDD
Sorry for posting JCL on a Hadoop blog, don't judge my age.  And to the TSO brethren, I know: Cataloged datasets helped with this later on...
As we evolved our metastores, we began to organize data logically by schemas and abstracted the physical location via TABLESPACES and FileGroups.  After a while, we took this for granted while we pounded out our wonderful SQL.
But as the great philosophers MATCHBOX 20 pondered:
Let's see how far we've come
Let's see how far we've come
Let's take a typical Pig Latin script that would do some ETL work:
A = load '/data/MDM/ds=20130101/bu=pd/property=cust' using PigStorage()
as (custid:chararray, name:chararray, addr:chararray, timestamp:long);
Ouch!  This way of working could become a maintenance nightmare with any of the following situations:
1. Compacting multiple files via HAR
2. Change in compression method
3. Schema changes from data producer
4. Moving data location
If the variety of your schemas is limited, this is probably easy to manage.  However, most enterprise environments will be anything but limited.  This is also compounded by users with different data tools accessing Hadoop and sharing the same data.  Soon your cluster becomes very brittle.  What's a data alchemist to do?
HCatalog to the rescue!  This Apache subproject comes primarily from Yahoo and appears to be a big part of Hortonworks' bag of tricks for organizing data.  Basically, HCatalog extends the Hive metastore and, based on the contributors, has been a collaborative effort between members of the Apache Pig, Hive, and Hadoop projects.
Per the wiki:
Apache HCatalog is a table and storage management service for data created using Apache Hadoop.
This includes:
Providing a shared schema and data type mechanism.
Providing a table abstraction so that users need not be concerned with where or how their data is stored.
Providing interoperability across data processing tools such as Pig, MapReduce, and Hive.
The Pig example above would now look like:
A = load 'MDM' using HCatLoader();
B = filter A by ds='20130101' and bu='pd' and property='cust';
...

Much better.  If we relocate the files that store MDM, or switch to a better storage format, the Pig Latin script does not change at all.  The same is true for MR and Hive.  Also notice that the schema is not specified; HCatLoader automagically tells that to Pig and converts the schema to the appropriate Pig datatypes.
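
To see the MR side of that claim, here is a minimal sketch using HCatInputFormat.  Treat it as an illustration, not a drop-in job: the package names (org.apache.hive.hcatalog vs. the older org.apache.hcatalog) and the exact setInput/setFilter signatures have moved around between HCatalog releases, and the table and field names just follow the Pig example above.  The job asks the metastore for the MDM table and pushes down the partition filter; no paths, delimiters, or storage formats appear anywhere.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hive.hcatalog.data.HCatRecord;
import org.apache.hive.hcatalog.mapreduce.HCatInputFormat;

public class MdmReader {

    // The mapper receives HCatRecords; no file paths or formats are visible here.
    public static class MdmMapper
            extends Mapper<WritableComparable, HCatRecord, Text, NullWritable> {
        @Override
        protected void map(WritableComparable key, HCatRecord record, Context ctx)
                throws IOException, InterruptedException {
            // Field positions follow the table schema (custid is column 0 in the example).
            ctx.write(new Text(record.get(0).toString()), NullWritable.get());
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "mdm-reader");
        job.setJarByClass(MdmReader.class);

        // Ask the metastore for the table; the partition filter replaces the
        // hard-coded /data/MDM/ds=.../bu=.../property=... path in the old script.
        HCatInputFormat.setInput(job, "default", "MDM")
                       .setFilter("ds=\"20130101\" and bu=\"pd\" and property=\"cust\"");
        job.setInputFormatClass(HCatInputFormat.class);

        job.setMapperClass(MdmMapper.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);
        job.setNumReduceTasks(0);
        FileOutputFormat.setOutputPath(job, new Path(args[0]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}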
HCatalog also leverages Hive's DDL, so we can create and drop tables, specify table parameters, etc.  This will give us a great way to manage our metadata.  I'm also curious how we can extend HCatalog from an audit standpoint to keep track of who, what, where, and when datasets are touched.
This is another dot connected for me.  I ranted about this a bit in a prior post, Confessions of a data architect, where I argued that HDFS has limited capabilities from a data organization standpoint.  In a way, this also answers another looming question: where should we store metadata?  This solution depends on Postgres... interesting.
Check out this video from Hortonworks on HCat.

Monday, December 10, 2012

My name is Jeff Graham and I'm a data modeler.  It's been 6 weeks since my last star-schema.
Let's face it.  Codd, Imhoff, Inmon, and Kimball paved the way for almost every data analyst and app-dev professional since the relational model worked its way into corporate data centers.  We cried, learned, and laughed as crazy ideas like data warehousing and dimensional modeling became part of our lexicon.  My kids are well fed and have the latest shoes thanks to these data architects (actually I owe more to the foolish and lazy who didn't employ quality in their database design; thx for the consulting $s!).
As a 14-year veteran of RDBMS performance tuning, I was a staunch defender of the relational model.  A few years ago, I found myself running into hurdles as the 3 V's (Volume, Velocity, Variety) started to shake my belief system.  At first the problem would manifest itself in small ways, and we found ways to overcome by partitioning, faster hardware, etc.  I would start to question my skills as ever-growing informational pressure caused designs to fail.  Based on historical results, the problems I faced pointed to a poor data model.  It took a lot of honest questioning and the death of many sacred cows to accept a new way of thinking.  Any data architect worth their salt should step away from the past and look forward.
But we can't forget ALL of the lessons learned.  While the way we store and manipulate data may change, the ability to control and regulate it in a multitenant datastore will be just as important.  These concepts must persist if Hadoop is to be accepted just like the relational databases of the past.  Therefore I propose the following qualities that must continue in our brave new world.
Organization
Organization is highly subjective and use case specific.  In Hadoop, we can use directory structures to organize data by business unit, by stage in lifecycle (new vs. old, hot versus cold, raw versus derived), or other concerns.  I’ll have to be honest, it took me a little while to absorb this because somehow it seems completely natural to experienced Hadoop developers (aka app-dev people), but it leaves a big hole in the heart of a DBA.  A good DBA hearts metadata – that’s how they carve up access and do stuff.  An old DBA hearts metadata about metadata – and perhaps a few backups of the metadata.
Now all I have is HDFS, which is really just a filesystem with basic Unix security.  So directories ARE my metadata.  Hmmm.  I see a metastore in our future that will allow us to map a directory structure to more metadata.  DBA_TABLES, anyone?
Space Quotas
Storage control, especially in multitenant environments, will be a huge concern.  Just like organization, having data bucketed by directory is also what gives you control over quotas (note to self: write another post about the small-filesize quota snafu).
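
As a sketch of what that control looks like from code, here is a hypothetical quota set-and-report against a tenant directory.  The path is made up, the DistributedFileSystem.setQuota() call needs HDFS superuser rights, and operationally most shops just run hdfs dfsadmin -setQuota / -setSpaceQuota; this only illustrates the idea that the directory is the unit of control.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.ContentSummary;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;

public class QuotaReport {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        Path tenantDir = new Path("/data/clickstream");   // hypothetical tenant directory

        // Cap the tenant at 1M namespace objects and ~10 TB of space
        // (quota semantics count replicated bytes; requires superuser).
        if (fs instanceof DistributedFileSystem) {
            ((DistributedFileSystem) fs).setQuota(tenantDir,
                    1_000_000L,                           // namespace quota (files + dirs)
                    10L * 1024 * 1024 * 1024 * 1024);     // space quota in bytes
        }

        // Report consumption vs. quota -- the directory-level stand-in for
        // the metadata views a DBA would normally lean on.
        ContentSummary cs = fs.getContentSummary(tenantDir);
        System.out.printf("files+dirs: %d / %d, bytes: %d / %d%n",
                cs.getFileCount() + cs.getDirectoryCount(), cs.getQuota(),
                cs.getSpaceConsumed(), cs.getSpaceQuota());
    }
}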
Partitioning
Finally, tools like Hive understand partition pruning during query execution. Each partition is simply a directory with a special naming convention that indicates the range of the table to which the contained data belongs.  Tools other than Hive can do similar partition pruning simply by including only the directories that are known to contain data of interest.
Partitioning will also facilitate data removal.  Although the mantra of Hadoop is to store everything forever because there might be value later, there is the occasional contractual or legal requirement to purge data.
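
Here is a minimal sketch of that do-it-yourself pruning for a plain MapReduce or custom job, assuming the date-partitioned clickstream layout shown below; the path and date range are illustrative only.  The job lists just the day directories it needs and never opens anything else, which is exactly what Hive's pruner does for you.

import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PartitionPruner {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // Only January 2012 partitions are of interest.
        List<Path> inputDirs = new ArrayList<>();
        FileStatus[] matches = fs.globStatus(new Path("/data/clickstream/date=201201*"));
        if (matches != null) {
            for (FileStatus s : matches) {
                if (s.isDirectory()) {
                    inputDirs.add(s.getPath());
                }
            }
        }
        // Hand inputDirs to FileInputFormat.addInputPath(...) or the tool of your choice.
        inputDirs.forEach(p -> System.out.println("include: " + p));
    }
}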

OK.  So what does this all wind up looking like?  Here is a straw dog (just for you, Jeff B):
/data : Contains raw data sets ingested from other systems. Read only to users.
/user/<username> : Home directories / scratch pads for users.
/Dbay : Contains ETL process queue directories.
/tmp : Sticky-bit set scratch for tools and users (no guarantee on longevity).

/data and /Dbay are the interesting ones.

/data/<dataset name>/<optional partitions>
Where <dataset name> is the equivalent of a table name in an RDBMS. Data sets may be partitioned by N columns, but that's optional and use case dependent.
Ex: Partitioned clickstream data by day.
/data/clickstream/date=20120101/{x.dat,y.dat,z.dat}
/data/clickstream/date=20120102/{x.dat,y.dat,z.dat}

/Dbay/<group>/<application>/<process>/{incoming,working,complete,failed}
Where <group> is the line of business / group (research, search quality, fraud analysis), <application> is the name of the application the process supports, and <process> is for applications that have multiple processing stages. Each process "queue" has four state directories:
incoming: Newly arriving files drop off here. A process atomically renames them into a temp directory under working to indicate they're in progress (and so overlapping processes don't steal them); see the sketch after this list.
working: Contains a timestamped directory for each attempt at processing the files. Files in these directories older than X require human intervention. Monitor for this.
complete: After a Dbay process finishes processing a file in working, this is where it lands.
failed: If a Dbay process decides to permanently reject a file (and ask for a human to look at it), it moves it here. If the directory contains > 0 files, it requires human intervention.
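
A minimal sketch of that incoming-to-working claim step follows.  The group/application/process names and the timestamped attempt-directory convention are hypothetical; the load-bearing fact is that an HDFS rename() is atomic, so concurrent workers cannot claim the same file twice.

import java.text.SimpleDateFormat;
import java.util.Date;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class DbayClaimer {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        Path incoming = new Path("/Dbay/research/weblogs/sessionize/incoming");
        Path working  = new Path("/Dbay/research/weblogs/sessionize/working");

        // One timestamped attempt directory per run, as described above.
        String attempt = new SimpleDateFormat("yyyyMMddHHmmss").format(new Date());
        Path attemptDir = new Path(working, attempt);
        fs.mkdirs(attemptDir);

        for (FileStatus f : fs.listStatus(incoming)) {
            Path claimed = new Path(attemptDir, f.getPath().getName());
            // rename() returns false if another worker already moved the file,
            // so losing the race is harmless -- just skip it.
            if (fs.rename(f.getPath(), claimed)) {
                System.out.println("claimed " + claimed);
                // ... process the file, then rename it into .../complete or .../failed
            }
        }
    }
}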
This is a complex issue that often gets overlooked and creates a bit of angst for a recovering DBA who is used to a mature datastore. It's effectively the shared-filesystem version of data modeling.