Monday, September 2, 2013

Hadoop Hindsight #2 Keep it simple: more than likely someone else has encountered your problem.

An adventure is only an inconvenience rightly considered. An inconvenience is only an adventure wrongly considered.
- G.K. Chesterton
Sometimes our ego gets the best of us.  This seems to occur more often in Hadoop than anywhere else I've worked.  I'm not sure if this relatively new world propels us into thinking we're on an island, or if some developers are inherently poor data analysts.  At any rate, we need to rein in our bloated self-image and realize that someone else has likely encountered our issue and that a seasoned committer has carried it through the stack to resolution.  Let me give you an example:
Sqooping data with newlines
I wish I had caught this issue earlier.  Some of our developers were pulling data from Teradata and DB2 and encountered embedded newline and Ctrl-A characters in a few columns.  Claiming the 'bad' data broke their process, they overreacted and jumped to using Avro files to resolve their problem.  While Avro is well and good for some issues, this was major overkill that ended up causing issues within Datameer and creating additional complexity in HCat.  I took some time to 'research' (à la google-fu) what others had done to get around this.  I already had a few simple ideas, like using a regex in the source SQL to remove \n, \r, and \01, but I was really looking for a more elegant solution.
It took me 30 minutes or so to work up an example, create a failure, and RTFM for a resolution.  Like our developers, I was hitting walls everywhere; the Sqoop documentation isn't bad, but there are some holes.  A little more searching and I found Cloudera Sqoop-129, "Newlines in RDBMS fields break hive".  It was created 11/2010 and resolved 5/2011.  Turns out it was fixed in Sqoop version 1.3.0 and we are on 1.4.2, so looking good so far.  The fix added these arguments, which handle elimination or replacement of these characters during the load.
--hive-drop-import-delims: Drops \n, \r, and \01 from string fields when importing to Hive.
--hive-delims-replacement: Replaces \n, \r, and \01 in string fields with a user-defined string when importing to Hive.
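To make the difference between the two options concrete, here is a small shell sketch that mimics their effect on a single string value using tr. This is not Sqoop's implementation, just an illustration; the sample record and the choice of a space as the replacement string are my own assumptions.

```shell
# Illustration only: mimic the effect of the two Sqoop options with tr.
# The sample value is made up; it embeds a newline and a Ctrl-A (\001).
record=$(printf 'line one\nline two\001end')

# Roughly what --hive-drop-import-delims does: delete \n, \r, and \01.
dropped=$(printf '%s' "$record" | tr -d '\n\r\001')

# Roughly what --hive-delims-replacement ' ' does: swap each one for a space.
replaced=$(printf '%s' "$record" | tr '\n\r\001' '   ')

printf 'dropped:  %s\n' "$dropped"
printf 'replaced: %s\n' "$replaced"
```

Dropping glues the neighboring text together, while replacing keeps a visible separator, so the replacement option is the safer pick if anyone downstream reads those fields as free text.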
It turns out they fixed our problem from a Hive standpoint, but the fix is actually valid for Pig, etc.  It's much more elegant than a source-SQL/regex solution because I don't need to specify fields; everything is covered.  Now in our case the business users didn't even care about the newlines, which were present in 3 of 2 million rows (ugh!), so I just used --hive-drop-import-delims in the sqoop command and everything was fine.
So by adding a single line to a Sqoop step, I eliminated the need to maintain an additional serialization framework, and downstream processes will likely be easier to maintain.  When dealing with basic business data we need to realize it isn't rocket science; someone else has probably already figured it out.
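For reference, a Sqoop step with the option added might look something like the sketch below. Every connection detail here (JDBC URL, user, table names) is a hypothetical placeholder rather than our actual job, and the RUNNER variable is just a dry-run convenience I added so the command is printed instead of executed by default.

```shell
# Sketch only: all connection details are hypothetical placeholders.
JDBC_URL="jdbc:db2://db.example.com:50000/SALES"
DB_USER="etl_user"
SRC_TABLE="CUSTOMER_NOTES"
HIVE_TABLE="staging.customer_notes"

import_notes() {
  # RUNNER is unset by default, so this echoes the command (dry run).
  # Call with RUNNER="" to actually invoke sqoop.
  ${RUNNER-echo} sqoop import \
    --connect "$JDBC_URL" \
    --username "$DB_USER" \
    -P \
    --table "$SRC_TABLE" \
    --hive-import \
    --hive-table "$HIVE_TABLE" \
    --hive-drop-import-delims
}

import_notes
```

If you would rather keep a marker where the characters were, swap the last argument for --hive-delims-replacement followed by the string you want, for example a single space.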

