Monday, September 9, 2013

Hive Metastore ERD

In an effort to understand the underpinnings of Hive and HCat, I started reverse engineering the metastore ERD from MySQL and figured others might find this handy. While the API-only crowd might find this verboten, I find it's an easy way to quickly understand how useful Hive/HCat is beyond just writing HQL (see my blog post HCatalog - Embrace the independence).
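For example, once you have the ERD it's easy to poke at the metastore directly - a quick join of DBS and TBLS lists every table Hive/HCat knows about without writing a line of HQL. Here's a rough sketch, assuming a MySQL-backed metastore database named hive_metastore and a read-only user named hive (adjust the names for your environment):

# List every database/table registered in the metastore (DBS and TBLS are core metastore tables)
mysql -u hive -p hive_metastore -e "
  SELECT d.NAME     AS db_name,
         t.TBL_NAME AS tbl_name,
         t.TBL_TYPE AS tbl_type
  FROM   TBLS t
  JOIN   DBS  d ON t.DB_ID = d.DB_ID
  ORDER  BY d.NAME, t.TBL_NAME;"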

Monday, September 2, 2013

Hadoop Hindsight #2 Keep it simple: more than likely someone else has encountered your problem.

An adventure is only an inconvenience rightly considered. An inconvenience is an adventure wrongly considered.
-G.K. Chesterton
Sometimes our ego gets the best of us.  This seems to occur more often in Hadoop than anywhere else I've worked.  I'm not sure if this relatively new world propels us into thinking we're on an island, or if some developers are inherently poor data analysts.  At any rate, we need to rein in our bloated self-image and realize that someone else has likely encountered our issue and a seasoned committer carried it through the stack to resolution.  Let me give you an example:
Sqooping data with newlines
I wish I had caught this issue earlier. Some of our developers were pulling data from Teradata and DB2 and encountered embedded newline and ctrl-A characters in a few columns.  Claiming the 'bad' data broke their process, they overreacted and jumped to Avro files to solve their problem.  While Avro is well and good for some things, this was major overkill that ended up causing issues within Datameer and added complexity in HCat.  I took some time to 'research' (a la Google-fu) to see what others had done to get around this.  I already had a few simple ideas, like a regex in the source SQL to remove \n, \r, and \01, but I was really looking for a more elegant solution.
It took me 30 minutes or so to work up an example, create a failure, and RTFM for a resolution.  I was hitting walls everywhere, much like our developers; the Sqoop documentation isn't bad, but there are some holes.  A little more searching and I found Cloudera SQOOP-129, Newlines in RDBMS fields break Hive.  Created 11/2010 and resolved 5/2011.  Turns out it was fixed in Sqoop version 1.3.0, and we are on 1.4.2 - looking good so far.  The fix implemented these arguments, which handle elimination or replacement of these characters during the load:
--hive-drop-import-delims: Drops \n, \r, and \01 from string fields when importing to Hive.
--hive-delims-replacement: Replaces \n, \r, and \01 in string fields with a user-defined string when importing to Hive.
It turns out they fixed our problem from a Hive standpoint, but it's actually valid for Pig, etc.  It's much more elegant than a source-SQL/regex solution because I don't need to specify fields - everything is covered.  Now in our case the business users didn't even care about the newlines that were present in 3 of 2 million rows (ugh!), so I just used --hive-drop-import-delims in the sqoop command and everything was fine.
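For reference, here's roughly what that one-line fix looks like in a Sqoop job. The connection string, table, and user below are made up for illustration; the only part that matters is the --hive-drop-import-delims flag:

# --hive-drop-import-delims strips \n, \r, and \01 from string fields during the import
sqoop import \
  --connect jdbc:db2://db2host:50000/SALESDB \
  --username etl_user -P \
  --table CUSTOMER_NOTES \
  --hive-import \
  --hive-table customer_notes \
  --hive-drop-import-delims \
  -m 4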
So by adding a single line to a Sqoop step, I eliminated the need to maintain an additional serialization framework, and downstream processes will likely be easier to maintain.  When dealing with basic business data we need to realize it isn't rocket science - someone else has probably already figured it out.


Wednesday, August 7, 2013

Consuming JSON Strings in SQL Server

This article describes a TSQL JSON parser and its evil twin, a JSON outputter, and provides the source. It is also designed to illustrate a number of string manipulation techniques in TSQL. With it you can extract the data from a JSON document and work with it in a relational way.

Tuesday, July 2, 2013

Hadoop Hindsight #1 Start Small


I thought we would start a series on some lessons we've learned.  Many of these topics I learned the hard way, so I hope they will be helpful for those a few steps behind in the journey.  YMMV, but I wish this ideology had been firmly ensconced when we started.
Identify a business problem that Hadoop is uniquely suited for.
Just because you found this cool new hammer doesn't mean everything is a nail.  Find challenges that your existing tech can't answer easily.  One of our first projects involved moving 300 gigs of EDI transaction files.  A business unit had BAs grepping for customer strings across 26,000 files to find 4 or 5 of them, then FTP'ing those to their desktops for manual parsing and review.  They might spend a few HOURS doing this for each request.  It was a natural and simple use of Hadoop.  We learned a lot about design patterns, scheduling, and data cleanup.
Solve this one business challenge well.
Notice I didn't say nail it perfectly.  There are many aspects of Big Data that will challenge the way you've looked at things the last 20 years.  The solution should be good, but not necessarily perfect.  Accepting this gives time to establish PM strategy and basic design patterns.
Put together a small team that has worked well together in the past.
This is critical to your success! Please, please, please take note!  Inter-team communication is the foundation upon which your Hadoop practice will grow.  In The Mythical Man-Month, my man Frederick Brooks said:
To avoid disaster, all the teams working on a project should remain in contact with each other in as many ways as possible...
Ideally a team should consist of the following:
1 Salesman (aka a VP)
1 Agile-trained PM
1 Architect
2 Former DBAs
1-3 skilled Java developers
1 Cluster Admin
Obviously this is very simplified and some roles can overlap.  My point is you should start out with no more than 10 people!
Support your solution.
This very same team should also live through at least 3 months of supporting the solution they've created.  Valuable insight is gained once you have to fix a few production problems.  Let the solution mature in production a bit to understand support considerations. This gives you time to adjust your design patterns. Trust me, you'll want time to reflect on your work and correct flaws.
Smash your solution and rebuild (Optional - If time permits)
Good luck getting the time, but if you're serious about a sustainable Enterprise Hadoop solution this should be rightly considered.
Go forth and multiply.
By this time your patterns and procedures should form the DNA of your new Hadoop cell. Your team should naturally develop into the evangelists and leaders upon which the mitosis of a new project occurs, carrying with it the newly replicated chromosomes.  As your project cells divide and multiply, you'll be able to take on more formidable challenges.
That's all I have to say about that.

Saturday, June 1, 2013

Hortonworks HDP 1.3 released

HDP 1.3 was released today. This milestone release takes advantage of improved performance in Hive 0.11 along with delivery on a series of enterprise requirements including NFS access to HDFS, improved MTTR for HBase, business continuity through HDFS and HBase snapshots, optimized connectors to Oracle and Netezza and the latest release of Ambari for management and operations. All of this represents the wicked fast pace of community-driven open source.
Core 1.1.2 was sooo last week.

Wednesday, May 22, 2013

Why most Enterprise Hadoop jobs will not require hardcore Java skills in 3-5 years.

OK.  So the controversial heading hopefully piqued your interest, and if you're a hardcore Java developer - just hear me out.  It's nothing personal.
In late 1979, RSI's Oracle Version 2 ran on Digital's VAX minicomputers (32-bit AND virtual memory!). If you were proficient with the first commercial RDBMS, you had to possess mad Macro-11 or PL-11 (the high-level version) skills to actually make many of the functions work that we take for granted now. Many basic tools that DBAs and developers use today simply didn't exist.  You had to roll your own.  Even the data dictionary was a new concept and often in flux.
Hello World, Macro-11 style:
        .TITLE  HELLO WORLD
        .MCALL  .TTYOUT,.EXIT
HELLO:: MOV     #MSG,R1 ;STARTING ADDRESS OF STRING
1$:     MOVB    (R1)+,R0 ;FETCH NEXT CHARACTER
        BEQ     DONE    ;IF ZERO, EXIT LOOP
        .TTYOUT         ;OTHERWISE PRINT IT
        BR      1$      ;REPEAT LOOP
DONE:   .EXIT

MSG:    .ASCIZ /Hello, world!/
        .END    HELLO
Don't forget the RT-11 commands to assemble, link, and run!
.MACRO HELLO
ERRORS DETECTED:  0

.LINK HELLO

.R HELLO
Hello, world!
.
It was an immature but revolutionary way to store and recall information. Bell Labs saw the business benefits of the Oracle RDBMS and thus much hype and exuberance flowed in the land:
"They could take this data out of the database in interesting ways, make it available to nontechnical people, but then look at the data in the database in completely ad hoc ways." - Ed Oates
During these early days you would need a room full of advanced computer science academics just to keep the system functioning - at each and every business.  There were no safety nets and everyone had their own perspective on how to do both a multi-join query WITH an aggregate function (and on the 4th day the RBO was created, and it was good).  Read consistency was still 5 years away!  As time went on, the best brains from the IT collective pioneered the standards and best practices that we all use today.  As the tech matured, the need for low-level Macro-11 developers diminished, replaced by a more mature product that would appeal to large non-tech companies.  As the need for low-level tech skills went away, patterns were established and the need for highly skilled programmers just to keep the data store functioning went away too.  Interestingly, the data and the patterns of its flow remained.  That is why enterprises have DBAs to maintain modern relational databases, not developers.
Inevitably, there are times when advances dictate new low-level programming skills on a large scale.  When RSI released Version 3 in C, there was high demand for developers who could read and speak the prose of Mr. Ritchie.  This was necessary for recompiling and testing a consistent code base across everything from minis and mainframes to PCs.  While C was quite portable, there was much work to be done in the storage subsystems.  Again, as the need for low-level tech skills went away, the data remained.
When we look at the new world of Hadoop, we must understand that this type of tech revolution has occurred before.  Right now there is much work afoot to solve the primitive questions.  This undoubtedly requires a new breed of low-level Java developers... for a while.  We see the results of these efforts in tools like Pig, Hive, Impala, and Stinger, glued together via HCat.  Once the dust settles, I wouldn't stake my professional future on mastering low-level MapReduce, but rather focus on mastering the higher-level tools.  This will give enterprises quicker access to business insight.  As Hadoop's primitive issues are solved into standards and patterns over the next 3-5 years, the need for Java developers will diminish substantially.  Just look at how many PL-11 or C++ programmers your enterprise has on its DBA teams; the low-level tech comes and goes, but the data remains.

Sunday, April 7, 2013

The 3 Pillars of Data Democracy

In order to promote the use of data within the enterprise, we need to provide a collaborative environment which gives people the freedom and incentive to try new things.  This gives everyone the chance to prove great ideas, or at worst to fail quickly.  We may all understand the benefits of democratizing data, yet without an environment to foster that exploration it will remain just a great idea.  It is much like a unicorn.  We all know what it looks like, but no one has actually seen one.
Therefore I propose 3 pillars that are essential to encouraging this environment:
Searchable Metadata
We can provide a powerful framework for mining all sorts of data, but if users cannot inspect data elements easily from a trusted source, we will be perceived as unintuitive and difficult to use.
“Simplicity is the ultimate sophistication.” - Leonardo da Vinci
Gallery
The gallery showcases works that others have done and springboards innovation by providing rapid provisioning of data.  Infographics and supporting worksheets can be linked into the gallery and tagged so that others may search for them.  Did someone already perform margin rate analysis?  Let's go to the gallery first to find out.
If a person finds a relevant workbook, they can easily provision that insight for their own use.  This is a powerful feature that allows others to build upon and extend new insights without reinventing the wheel.
Gamification
We are wired to appreciate feedback loops.  We post on Facebook and Twitter not because we like to type, but because we enjoy the recognition in the form of comments and followers.   As of 2011, the global video game market was valued at $65 billion.  People spend hours in front of a screen because it leverages their natural desire for competition, accomplishment, and status.  Adopting this mindset in our analytic platform not only encourages participation, but gives us yet another set of metrics for self-assessment.  Which infographics are most accessed?  How many times is a worksheet copied for further analysis?  The ‘Like’ and ‘Friend’ buttons allow Facebook to catalog over a billion active user profiles – all to sell you stuff.