Systems Engineering and RDBMS

Archive for the ‘Big Data’ Category

Sqoop graduates from Apache Incubator to a Top Level Project

Posted by decipherinfosys on April 5, 2012

Sqoop, the Big Data tool, has graduated from the Apache Incubator to become a Top Level Project (TLP).  In case you are not aware of Sqoop, it is a key tool for transferring bulk data between Hadoop and structured data stores such as relational database management systems (RDBMS).  The project provides connectors for many popular RDBMS, including Oracle, SQL Server, MySQL, DB2 and PostgreSQL.  This is a significant step towards the adoption of Hadoop in enterprise solutions.
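To give a feel for what such a transfer looks like, here is a minimal Python sketch that assembles a `sqoop import` command line.  The flags used (--connect, --username, -P, --table, --target-dir, --num-mappers) are standard Sqoop import options, but the JDBC URL, user name, table and HDFS path are made-up placeholders, and the helper function itself is our own illustration rather than anything shipped with Sqoop.

```python
import shlex


def build_sqoop_import(jdbc_url, username, table, target_dir, num_mappers=4):
    """Return the argument list for a `sqoop import` run (hypothetical helper)."""
    return [
        "sqoop", "import",
        "--connect", jdbc_url,              # JDBC URL of the source database
        "--username", username,
        "-P",                               # prompt for the password at run time
        "--table", table,                   # table to copy into Hadoop
        "--target-dir", target_dir,         # HDFS directory for the imported files
        "--num-mappers", str(num_mappers),  # number of parallel map tasks
    ]


if __name__ == "__main__":
    # All values below are placeholders, not a real environment.
    cmd = build_sqoop_import(
        "jdbc:mysql://db.example.com/sales",
        "etl_user",
        "orders",
        "/data/sales/orders",
    )
    # Print the command instead of running it, so the sketch works without Sqoop.
    print(" ".join(shlex.quote(arg) for arg in cmd))
```

On a machine with Sqoop installed, the same list could be handed to subprocess.run to start the import.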

You can read more on this in the eWeek article here. And here is the link for learning more about Sqoop:

Posted in Big Data, Technology

Excellent comparison of Pig vs Hive

Posted by decipherinfosys on March 22, 2012

We recently blogged about Hadoop and the different sources for learning it and getting up to speed on it.  One thing we left out was any mention of Pig and Hive.  Hive and Pig used to be Hadoop sub-projects but are now open source volunteer projects under the Apache Software Foundation.

Pig is essentially a platform for creating MapReduce programs with Hadoop.  The platform consists of a high-level language for expressing data analysis programs and an infrastructure for evaluating those programs.  Because these programs are amenable to substantial parallelization, they can handle very large data sets.
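For readers new to the model that Pig targets, here is a small single-process Python sketch of the map, shuffle and reduce steps, using the classic word count.  It is only an illustration of the pattern: it is not Pig Latin, it does not use any Hadoop API, and the function names are our own.

```python
from collections import defaultdict


def map_phase(line):
    """Map step: emit a (word, 1) pair for every word in a line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle step: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: collapse the values for one key into a total."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence over an iterable of lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["pig hive pig", "hive hadoop"]))
```

On a real cluster the map and reduce steps run in parallel across many machines, which is what makes the approach suitable for very large data sets.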

Hive is a data warehouse system built for Hadoop that allows easy data aggregation, ad-hoc queries and analysis of large data sets stored in Hadoop-compatible file systems.  HiveQL is a SQL-like language for interacting with the data, and it also lets developers plug in their own custom mappers and reducers.
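One way to plug in a custom mapper is HiveQL's TRANSFORM clause, which streams rows to an external script as tab-separated lines on standard input and reads the script's output back as rows.  Below is a minimal sketch of such a script in Python.  The two-column input (user_id and url) and the domain extraction are assumptions we made up for illustration, not part of any real schema.

```python
import sys


def transform(lines):
    """Turn each 'user_id<TAB>url' row into 'user_id<TAB>domain'.

    The column layout is a hypothetical example.
    """
    for line in lines:
        user_id, url = line.rstrip("\n").split("\t")
        # Drop the scheme (e.g. "http://"), then keep everything before the first "/".
        domain = url.split("//")[-1].split("/")[0]
        yield f"{user_id}\t{domain}"


if __name__ == "__main__":
    # When used as a streaming script, rows arrive on stdin and results go to stdout.
    for row in transform(sys.stdin):
        print(row)
```

Keeping the row logic in a function that takes any iterable of lines makes it easy to try out locally on a few sample rows before wiring the script into a query.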

Here is a link that provides an excellent comparison between Pig and Hive by Lars George:

Be sure to read the comments as well.

And the getting started guides on Hive and Pig:



Posted in Big Data

Learning Hadoop

Posted by decipherinfosys on March 20, 2012

In a recently concluded project, we had the opportunity to work with Hadoop.  There was a learning curve since none of us had worked with Hadoop before.  Here are some URLs to help you get started with your own learning:

Basics of Hadoop:

The article on GigaOM or the series of articles on Cloudera’s site will get you started:

Sign up with Cloudera and you will have access to a lot of very good learning material on Hadoop.  For example, this is a good starter video on MapReduce and HDFS:

Or this one, for understanding the Hadoop ecosystem:

And this whitepaper from Gartner on Hadoop and MapReduce for Big Data Analytics:

If you prefer text to video tutorials for your learning, here is a good chapter on HDFS:

Setting up a Hadoop cluster:

And once you are ready to jump in, there are some excellent tutorials by Michael G. Noll to guide you:

To set up your first Hadoop node:

And then a multi-node cluster:

And here are some additional good tutorial references:

Microsoft and Big Data

Recently, Microsoft also announced its support for Apache Hadoop.  You can read more on Microsoft’s big data solution here:

and on the work done by Hortonworks to extend Apache Hadoop to Windows:

Posted in Big Data, Linux, SQL Server, Unix, Windows

Excellent Article on the Basics of Hadoop

Posted by decipherinfosys on February 10, 2012

We read an excellent article on Hadoop at GigaOM.  It is true that many people in the industry have heard the word and read about Big Data, yet don’t really know what Hadoop is.  This article will give you the basic understanding that you need:

Posted in Big Data, Open Source, Technology