Author Archives: tshanky

Big Data Analytics : What’s Next?

While a majority of Fortune 1000 companies are en route to understanding Hadoop and adopting it in their technology stack, startups in the Bay Area and elsewhere have started asking the important and inevitable question: “What’s next?” Hadoop has, for the first time, allowed us to analyze massive amounts of data without necessarily investing in expensive proprietary hardware or software. However, adopting Hadoop alone isn’t necessarily helping businesses make smarter decisions or unearth completely new facts that could drive immense top-line growth. The power of scalable infrastructure needs to be supplemented with nifty data mining and machine learning tools, better visualization of results, and easier ways to track and analyze findings over time. Besides, there is the entire realm of real-time analytics, which lies beyond the batch-oriented nature of Hadoop.

The “Global Big Data Conference”, scheduled to take place at the Santa Clara Convention Center on January 28, 2013, answers some of these very important questions about what’s happening in the field of “big data” and what’s to come next. It’s a great one-day conference with a lot of interesting topics covered by an awesome line-up of great speakers. To take advantage of some favorable pricing, please register by tomorrow (January 22, 2013) and save a whole $100 compared to the onsite price. In addition, as a reader of my blog, don’t forget to take advantage of the additional 20% discount, which you can get by using the code: “SHAS”. See you all there!

Geolocation in MongoDB at the Silicon Valley MongoDB User Group

Thanks to all of you who were able to join me at the session last evening, and thanks for the kind remarks some of you left on the meetup message board after the session. It’s very rewarding to know that many of you enjoyed the session and found it useful. I loved the many questions that were brought up and discussed in the room. Please feel free to send more questions by emailing them to me at st (at) treasuryofideas (dot) com. Alternatively, you could tweet them to me at @tshanky.

The presentation from last evening is available online at http://www.slideshare.net/tshanky/geolocation-in-mongodb-16021143.

For all those who are excited about MongoDB and would like to learn more, please join me for “MongoDB in an Hour!” on February 1, 2013. The format of that session will be as follows:

  • 1 hour free video session, which will be made available online by or before Feb 1, 2013.
  • 4 half-hour Google+ Hangout sessions for live Q&A.
  • Unlimited number of Q&A opportunities over email (or over a forum if we create one)
  • (optional) 1 evaluation exam. Passing the exam would entitle you to a certificate of honor.

This one-hour session is substantially subsidized; I only ask for a suggested donation of $25 to cover some of the costs.


Christmas Trivia

Google’s Santa Tracker 2012

Santa Claus is possibly more popular among kids than the Lord on his own birthday. When and how did the American version of Santa Claus originate?

The American version of Santa Claus comes from the Dutch legend of Sinterklaas. Sinterklaas is the Dutch version of St. Nicholas, who earned a reputation for secretly giving gifts back in the 4th century AD.

Dutch immigrants brought the concept of Sinterklaas to New Amsterdam (modern-day New York) in the 17th century. Sinterklaas, in his modern incarnation as Santa Claus, initially appeared in the press as “St. A Claus”. In 1809 Washington Irving, possibly one of the most famous New Yorkers ever, wrote a book titled “A History of New York” under the pseudonym Diedrich Knickerbocker. You can read that book online: Knickerbocker’s History of New York, Complete by Washington Irving. In this book Washington Irving described Santa Claus and concocted the legend that he came on horseback (and not on a sleigh!) each year on the eve of St. Nicholas.

In 1823, Clement Clarke Moore, in a poem titled “A Visit from St. Nicholas”, embellished Washington Irving’s Santa Claus with additional details, including Santa’s reindeer and his winks and nods. This poem is popularly known as “The Night Before Christmas”. Interestingly, as far as this poem goes, Santa himself was an elf.

Between 1860 and 1880, Thomas Nast expanded on these original ideas and turned Santa Claus into a plump, round character with reindeer who went around the world giving gifts every Christmas night. His illustrations of Santa Claus were published in Harper’s Weekly. The human size and form of Santa Claus as we know it today was popularized by Coca-Cola in a 1931 advertisement, illustrated by Haddon Sundblom.

Geolocation in MongoDB at the San Francisco MongoDB User Group

Thanks to 10gen for having me over at the last user group meetup in SF to speak on “Geolocation in MongoDB”. I have been leveraging MongoDB’s geolocation features as part of the super awesome, yet-to-be-released app: doaround. I shared some of the essentials of geolocation in MongoDB during this session and plan on covering some more at the Silicon Valley MongoDB Meetup in Palo Alto this coming January.
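
If you could not make it, here is a minimal sketch of the kind of commands the session revolved around, using the mongo shell against a hypothetical places collection with made-up coordinates (longitude first, then latitude):

// create a 2d geospatial index on the loc field
db.places.ensureIndex({ loc: "2d" })

// insert a sample point as [longitude, latitude]
db.places.insert({ name: "Caltrain Station", loc: [ -122.164, 37.443 ] })

// find the five documents nearest to a given point
db.places.find({ loc: { $near: [ -122.16, 37.44 ] } }).limit(5)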

Thanks to all who came for my session. Thanks also for all the great questions and the conversation after the session. Here is the presentation from the session:

Ash Maurya’s Running Lean — go read it now!

I just finished reading Ash Maurya’s Running Lean. It’s one of those rare books that packs lots of great content into only about 200 pages. The book is an easy read and flows like a magazine article or a nicely written blog post. It’s a must-read for anyone trying to start a business or anyone trying to go work for a startup.

Eric Ries codified the lean startup principles in a book titled The Lean Startup. Ash Maurya, in Running Lean, takes the next step and translates that philosophy into a more tangible set of actionable guidelines.

Practice Oriented Self-Help Guidelines

Running Lean is a self-help book for entrepreneurs. Like most self-help books, it provides a few pointers and guidelines on what must be done to move toward success. In this regard, the book is no different in intent from a self-help book about losing weight, making friends, being happy, or becoming rich. Unlike the plethora of self-help concept books though, Ash Maurya focuses on practice and advice that can be applied. He presents a step-by-step process and a formula to enhance your chances. By the time you finish reading this book, you have a list of action items to apply to your own startup.

Many self-help books have the essence in the first 40 pages, and the 200-odd pages that follow are simply continued explanations of that basic idea. Running Lean starts with thought-provoking advice on the very first page and remains fresh until the last one.

Know Your Customers & Their Problems

A large number of products are built with a solution in mind. In geek-led startups, many of these solutions are purely motivated by a techie’s itch. I am one such person, perhaps! For years I have built things simply because they presented a great technology solution. A prime example is a time-tracking application I wrote a couple of years back. It is clean and sophisticated in terms of code, but I wasn’t able to make a business out of it. In hindsight, I did not invest in learning enough about customer needs and was a bit confused about its monetization possibilities by the time I was ready to go to market with it. I wanted to sell it to everybody but didn’t sell to anybody, and abandoned it prematurely.

If I had to rebuild that product today, I would surely use the lean startup methodology and speak to my potential early adopters first. I would meet these prospects in person and ask them questions with the objective of understanding their requirements.

Knowing your customers and their problems is the most important part of being successful. Steve Blank calls this customer development. His book Four Steps to the Epiphany is another must-read for entrepreneurs. Running Lean is a detailed guide to how to reach out to customers and what questions to ask at each phase of your startup lifecycle. Ash Maurya recommends
(a) Problem Interview (to understand the problem and the customer’s needs),
(b) Solution Interview (to validate that your proposed solution solves the customer’s problem), and
(c) MVP Interview (to make sure your minimum viable product addresses the customer’s problem and the customer is willing to pay for it)
as three important times to connect with customers. He generously provides detailed examples from his own startup experience with CloudFire, making the recommendations far more than mere philosophical musings. I wrote a problem-interview meeting request email today to one of my prospects, using a format I saw in the book.

‘Maker’ vs ‘Manager’

A techie founder of a startup can be torn between simultaneously developing the software (the maker role) and meeting customers (the manager role). Running Lean provides some guidance and proposes maintaining a balance between the two roles of ‘maker’ and ‘manager’. You must read Paul Graham’s Maker’s Schedule, Manager’s Schedule for a valuable viewpoint on the conflicts between these roles.

I can completely empathize with Ash Maurya’s schedule of coding during early mornings (when most others are asleep) and talking to customers in the afternoon. However, I am unable to define my own priorities as clearly as he presents his. My current startup is a small setup comprising my co-founder (who has management experience), 5 developers, 1 designer, and me. I am deeply involved in software releases, bug fixes, feature planning, continuous deployment, product definition, and project management. I also spend time talking to potential customers, potential investors, potential hires, and partners. The long list doesn’t end there: I also spend time on miscellaneous other things like accounting, legal compliance, branding, and keeping my team motivated. Outsourcing many of these tasks would be great but is expensive for a bootstrapped startup. Despite reducing waste and staying focused on an MVP, a founder probably needs over 30 hours in a day! Running a startup is a grueling experience, and I don’t believe there is an easy way to maintain balance among the multiple roles a founder needs to play.

Don’t be a Feature Pusher

This segues into the most important part of the lean startup methodology: build a minimum viable product, better known as an MVP. It’s very easy to get sucked into a feature arms race with yourself. A lean MVP means building less, but being a good geek or an awesome visionary is often about doing more in less time. You see the disconnect! In my opinion, founders who are not expert makers or super-efficient managers ironically do well in this regard. They quickly reconcile themselves to building an MVP because it seems more attainable to them. The uber geeks and the smart managers struggle; their MVP is often not minimal enough. Silicon Valley especially loves uber geeks, and if you are one of those, there is a decent possibility you may get funded without even a proper plan in hand. This can often be a curse because you start building a rather bloated SVP (Supposedly Valuable Product).

Conversion Metrics

The book talks in detail about user life cycle management, from acquisition to referral. This is a topic close to Ash Maurya’s heart and relates to his latest startup, UserCycle, which is trying to address the problems in this space.

Actionable metrics should be at the heart of every business. Vanity metrics are useless. Retention and repeat usage matters.

Early Adopters

All startups in their early stages should care a lot about their early adopters. Most startups are founded with a grand dream of addressing a global problem and reaching a diverse set of consumers. In reality though, you should feel very happy if you even get some traction in your neighborhood. The book’s emphasis on building relationships with early adopters is a very valuable piece of advice.

Don’t Agree with Everything

Running Lean is a very inspiring book and I love Ash Maurya’s style of writing, full of honesty and personal examples. However, I must say that I couldn’t agree with every single part of the book. Lean Startup addresses a lot of issues, but it does not consider issues related to
(a) distributed startup teams (they are increasingly common now),
(b) skewed skill sets (many startups are founded by geeks only, who understand little beyond code!),
(c) acquired tastes (despite over 100 million iPads in the market, a majority of users are still pretty unclear on why they need one and what problem it solves),
(d) regulatory influences (VOIP over the years), and
(e) new markets (consumption of local and healthier food).

It’s much harder to define the problem/solution fit in these cases.

My Own Failings & What Next?

I spent most of the last year building a really great product called doaround. It will be launched to the general public soon. We did some things really well and struggled on many counts as well. We did seek consumer feedback from the very beginning but could have been more scientific about it. We built a full-featured VP and not an MVP. It took much longer and cost us a lot more.

What next? The next time I am building a product for What Next Labs (yes, that’s the name of our company), I am running lean.

Getting Started With Node

Node.js, the V8-based (Chrome’s JavaScript runtime) platform for building fast and scalable network applications, is gaining substantial traction among developers and entering the application stack of many Silicon Valley companies. Like every new technology or piece of software, there is enough FUD (Fear, Uncertainty, and Doubt) around Node, so I am going to write a few blog posts to help you learn Node by example.

In this post, I will simply help you get set up so you can start playing with Node. The best and most reliable way to get Node installed on your machine is to build it from source. On Linux, Unix, and Mac OS X, you can start out by getting the Node source code from its repository like so:

git clone https://github.com/joyent/node.git

This assumes that you have a git client installed for your *nix flavor and you are familiar with the essential notions of git, like cloning a repository. If you are completely new to git, then you may want to quickly read and learn about git first. A good freely available resource on git is the book titled: Pro Git — http://git-scm.com/book/.

Now that you have the Node source cloned on your machine, change to the source directory and inspect the available tags in the repository as follows:

git tag -l

A whole lot of tags will be listed in response to this git command. Some of the latest ones are as follows:

...
v0.7.5
v0.7.6
v0.7.7
v0.7.8
v0.7.9
v0.8.0
v0.8.1
works

The current master branch of the Node code is v0.9.x, but unfortunately that version seems to have problems working with NPM (Node Package Manager), a very important companion of Node. Therefore, you should check out v0.8.1 before you build the source. To check out the v0.8.1 tag, use the following command:

git checkout v0.8.1

From here onwards, it’s the usual configure, make, and make install trio. Build Node as follows:

./configure
make
sudo make install

That’s it! Node and NPM are both installed.

To verify, open up a terminal and run

node -v

If you see v0.8.1 in response then you are all set.

Additionally, verify that npm is installed by running

npm -v

You should see 1.1.33.

Now that Node is installed, you are ready to play with it. In my next post, we will get started with a simple example.
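
In the meantime, if you want a quick sanity check beyond the version commands, here is a minimal sketch of an HTTP server (the file name hello.js and the port 8124 are just examples I picked):

// hello.js -- a tiny HTTP server to confirm the install works
var http = require('http');

http.createServer(function (req, res) {
  res.writeHead(200, {'Content-Type': 'text/plain'});
  res.end('Hello from Node\n');
}).listen(8124, '127.0.0.1');

console.log('Server running at http://127.0.0.1:8124/');

Run it with ‘node hello.js’ and point your browser (or curl) at http://127.0.0.1:8124/ to see the response.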

Build Hadoop from Source

Instructions in this write-up were tested and run successfully on Ubuntu 10.04, 10.10, 11.04, and 11.10. The instructions should run, with minor modifications, on most flavors and variants of Linux and Unix. For example, replacing apt-get with yum should get it working on Fedora and CentOS.

If you are starting out with Hadoop, one of the best ways to get it working on your box is to build it from source. Using stable binary distributions is an option, but a rather risky one. You are likely not to stop at Hadoop common but to go on to setting up Pig and Hive for analyzing data, and you may also give HBase a try. The Hadoop suite of tools suffers from a huge version mismatch and version confusion problem. So much so that many start out with Cloudera’s distribution, also known as CDH, simply because it solves this version confusion disorder.

Michael Noll’s well-written blog post titled “Building an Hadoop 0.20.x version for HBase 0.90.2” serves as a great starting point for building the Hadoop stack from source. I would recommend you read it and follow along with the steps stated in that article to build and install Hadoop common. Early on in the article you are told about a critical problem that HBase faces when run on top of a stable release version of Hadoop: HBase may lose data unless it is running on top of an HDFS with durable sync. This important feature is only available in the branch-0.20-append branch of the Hadoop source and not in any of the release versions.

Assuming you have successfully followed Michael’s guidelines, you should have the Hadoop jars built and available in a folder named ‘build’ within the folder that contains the Hadoop source. At this stage, it’s advisable to configure Hadoop and take a test drive.

Configure Hadoop: Pseudo-distributed mode

Running Hadoop in pseudo-distributed mode provides a little taste of a cluster install using a single node. The Hadoop infrastructure includes a few daemon processes, namely

  1. HDFS namenode, secondary namenode, and datanode(s)
  2. MapReduce jobtracker and tasktracker(s)

When run on a single node, you can choose to run all these daemon processes within a single Java process (also known as standalone mode) or can run each daemon in a separate Java process (pseudo-distributed mode).

If you go with the pseudo-distributed setup, you will need to provide some minimal custom configuration to your Hadoop install. In Hadoop, the general philosophy is to bundle a default configuration with the source and allow it to be overridden using a separate configuration file. For example, hdfs-default.xml, which you can find in the ‘src/hdfs’ folder of your Hadoop root folder, contains the default configuration for HDFS properties. The file hdfs-default.xml gets bundled within the compiled and packaged Hadoop jar file, and Hadoop uses the configuration specified in this file for setting up HDFS properties. If you need to override any of the HDFS properties that use the default configuration from hdfs-default.xml, then you need to re-specify the configuration for that property in a file named hdfs-site.xml. This custom configuration definition file, hdfs-site.xml, resides in the ‘conf’ folder within the Hadoop root folder. Custom configuration in core-site.xml, hdfs-site.xml, and mapred-site.xml corresponds to default configuration in core-default.xml, hdfs-default.xml, and mapred-default.xml, respectively.

Contents of conf/core-site.xml after custom configuration:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

The specified configuration makes the HDFS namenode daemon accessible on port 9000 on localhost.

Contents of conf/hdfs-site.xml after custom configuration:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.name.dir</name>
    <value>path/to/data/dfs/name</value>
    <description>Determines where on the local filesystem the DFS name node
      should store the name table(fsimage).  If this is a comma-delimited list
      of directories then the name table is replicated in all of the
      directories, for redundancy. </description>
  </property>
</configuration>

The override specifies a replication factor of 1. On a single node, you can’t have a failover, can you? The custom configuration also sets the namenode directory to a path on your file system. The default is a folder within /tmp, which gets purged on a restart.

Contents of conf/mapred-site.xml after custom configuration:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
</configuration>

After this configuration the MapReduce jobtracker is accessible on port 9001 on localhost.

Finally set JAVA_HOME in conf/hadoop-env.sh. On my Ubuntu 11.10, I have it set as follows:

export JAVA_HOME=/usr/lib/jvm/java-6-openjdk

If you don’t have passphraseless ssh set up on your machine, then you may need to execute the following commands:

ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys

This creates a DSA key pair named id_dsa and adds the public key to the set of authorized_keys for the SSH server running on your localhost. If you aren’t sure whether passphraseless ssh access is set up, simply run ‘ssh localhost’ in a terminal. If you are prompted for a password, then you need to complete the steps to set up passphraseless ssh.

Now start up the Hadoop daemons using:

bin/start-all.sh

Run this command from the root of your Hadoop folder.

As an additional step, set the HADOOP_HOME environment variable to point to the root of your Hadoop folder. HADOOP_HOME is used by other pieces of software, like HBase, Pig, and Hive, that are built on top of Hadoop.
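
For example (the path below is just a placeholder; substitute the actual location of your Hadoop folder, and consider adding the line to your ~/.bashrc so it persists across sessions):

export HADOOP_HOME=/path/to/hadoop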

Running a Simple Example
If you ran the tests after the Hadoop common install and they passed, then you should be ready to use Hadoop. However, for completeness, I would suggest running a simple Hadoop example while the daemons are up and waiting. Run the simple example illustrated in the official documentation, available online at http://hadoop.apache.org/common/docs/r0.20.203.0/single_node_setup.html#PseudoDistributed. The example is available at the end of the sub-section on Pseudo-Distributed Operation.
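
In essence, that example copies the conf folder into HDFS, runs the bundled grep example job over it, and reads the results back. A rough sketch of the commands, run from the Hadoop root folder, looks like the following (the examples jar name and location vary with the version you built, so adjust the wildcard accordingly):

bin/hadoop fs -put conf input
bin/hadoop jar build/hadoop-*examples*.jar grep input output 'dfs[a-z.]+'
bin/hadoop fs -cat output/*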

Build HBase from Source

Once Hadoop is up and running, you are ready to build and run HBase. Start by getting the HBase source as follows:

git clone https://github.com/apache/hbase.git

This clones it from the Apache HBase mirror on GitHub. Alternatively, you can get the source from the HBase svn repository, which is where the official commits are checked in.

HBase can make use of Snappy compression. Snappy is a fast compression/decompression library, which was built by Google and is available as open source software under the ‘New BSD License’. The official site describes Snappy as follows:

Snappy is a compression/decompression library. It does not aim for maximum compression, or compatibility with any other compression library; instead, it aims for very high speeds and reasonable compression. For instance, compared to the fastest mode of zlib, Snappy is an order of magnitude faster for most inputs, but the resulting compressed files are anywhere from 20% to 100% bigger. On a single core of a Core i7 processor in 64-bit mode, Snappy compresses at about 250 MB/sec or more and decompresses at about 500 MB/sec or more.

Snappy is widely used inside Google, in everything from BigTable and MapReduce to our internal RPC systems. (Snappy has previously been referred to as “Zippy” in some presentations and the likes.)

You can learn more about Snappy at http://code.google.com/p/snappy/.

To build HBase with snappy support we need to do the following:

  1. Build and install the snappy library
  2. Build and install hadoop-snappy, the library that bridges snappy and Hadoop
  3. Compile HBase with snappy

Build and Install snappy
Building and installing snappy is easy and quick. Get the latest stable snappy release as follows:

wget http://snappy.googlecode.com/files/snappy-1.0.4.tar.gz

The latest release version at the time of writing is 1.0.4; this version number could vary as newer versions supersede it.
Once you download the snappy zipped tarball, extract it:

tar zxvf snappy-1.0.4.tar.gz

Next, change into the snappy extracted folder and use the common configure, make, make install trio to complete the build and install process.

cd snappy-1.0.4
./configure && make && sudo make install

You may need to run ‘make install’ using the privileges of a superuser, i.e. ‘sudo make install’.

Build and Install hadoop-snappy
Hadoop-snappy is a project for Hadoop that provides access to the snappy compression/decompression library. You can learn details about hadoop-snappy at http://code.google.com/p/hadoop-snappy/. Building and installing hadoop-snappy requires Maven. To get started, check out the hadoop-snappy code from its subversion repository like so:

svn checkout http://hadoop-snappy.googlecode.com/svn/trunk/ hadoop-snappy-read-only

Then, change to the ‘hadoop-snappy-read-only’ folder and make a small modification to maven/build-compilenative.xml:

<!-- add JAVA_HOME as an env var -->
<exec dir="${native.staging.dir}" executable="sh" failonerror="true">
    <env key="OS_NAME" value="${os.name}"/>
    <env key="OS_ARCH" value="${os.arch}"/>
    <env key="JVM_DATA_MODEL" value="${sun.arch.data.model}"/>
    <env key="JAVA_HOME" value="/usr/lib/jvm/java-6-openjdk"/>
    <arg line="configure ${native.configure.options}"/>
</exec> 

Also, install a few required zlibc related libraries:

sudo apt-get install zlibc zlib1g zlib1g-dev

Next, build hadoop-snappy using Maven like so:

sudo mvn package

Once hadoop-snappy is built, install the jar and tar distributions of hadoop-snappy to your local repository:

mvn install:install-file -DgroupId=org.apache.hadoop -DartifactId=hadoop-snappy -Dversion=0.0.1-SNAPSHOT -Dpackaging=jar -Dfile=./target/hadoop-snappy-0.0.1-SNAPSHOT.jar
mvn install:install-file -DgroupId=org.apache.hadoop -DartifactId=hadoop-snappy -Dversion=0.0.1-SNAPSHOT -Dclassifier=Linux-amd64-64 -Dpackaging=tar -Dfile=./target/hadoop-snappy-0.0.1-SNAPSHOT-Linux-amd64-64.tar 

Compile, Configure & Run HBase
Once snappy and hadoop-snappy are compiled and installed, you are ready to compile HBase with snappy support. Change to the folder that contains the HBase repository clone and run the ‘mvn compile’ command to build HBase from source.

cd hbase
mvn compile -Dsnappy

The -Dsnappy option tells maven to compile HBase with snappy support.

Earlier, I set up Hadoop to run in pseudo-distributed mode. Let’s configure HBase to also run in pseudo-distributed mode. Like Hadoop, the default configuration for HBase is available in hbase-default.xml, and custom configuration can be specified to override the defaults. Custom configuration resides in conf/hbase-site.xml. To set up HBase in pseudo-distributed mode, make sure the contents of conf/hbase-site.xml are as follows:

<configuration>
    <property>
        <name>hbase.rootdir</name>
        <value>hdfs://localhost:9000/hbase</value>
        <description>The directory shared by RegionServers.
        </description>
    </property>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
        <description>The replication count for HLog and HFile storage.
        Should not be greater than HDFS datanode count.
        </description>
    </property>
</configuration>

You may recall we configured the HDFS namenode to be accessible on port 9000 on localhost. Therefore, ‘hbase.rootdir’ needs to be specified with respect to that HDFS URL. If you configured the HDFS daemons to run on a different port, then please adjust the configuration for ‘hbase.rootdir’ in line with that. The second custom property definition sets the replication factor to 1. On a single node, that’s the best and only option you have!

Now you can start up HBase using:

bin/start-hbase.sh
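
Once the HBase daemons are up, a quick sanity check is to open the HBase shell and create a throwaway table; the table name ‘test’ and column family ‘cf’ below are just examples:

bin/hbase shell
status
create 'test', 'cf'
put 'test', 'row1', 'cf:a', 'value1'
scan 'test'
disable 'test'
drop 'test'

The status command reports the running servers, and the scan should show the row you just put in.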

Build Pig
Building and installing Pig from source is a simple three-command operation:

svn checkout http://svn.apache.org/repos/asf/pig/trunk/ pig
cd pig
ant

The first command checks out source from the Pig svn repository. It grabs the source from ‘trunk’, which is referred to as ‘master’ in git jargon. The second command changes to the folder that contains the Pig source. The third command compiles the Pig source using Apache Ant. Invoking the default target, i.e. simply calling ‘ant’ without any argument, compiles Pig and packages it as a jar for distribution and consumption. The Pig jar files can be found at the root of the Pig folder. Pig usually generates two jar files:

  1. pig.jar — to be run with Hadoop.
  2. pigwithouthadoop.jar — to be run locally. Pig does not always need to use Hadoop.
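
To take the build for a quick spin without touching Hadoop, you can start Pig in local mode. The snippet below is only an illustrative sketch: it copies /etc/passwd into the current folder, loads it, and dumps the first field of every line.

cp /etc/passwd .
bin/pig -x local
grunt> A = LOAD 'passwd' USING PigStorage(':');
grunt> B = FOREACH A GENERATE $0;
grunt> DUMP B;
grunt> quit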

Build Hive
Building and installing Hive is almost as easy as building and installing Pig. The following set of commands gets the job done:

svn co http://svn.apache.org/repos/asf/hive/trunk hive
cd hive
ant package

You should be able to understand these commands if you have come this far in the article.

There is one little catch in this Hive instruction set though. As you run the ‘ant package’ task, you will see the build fail. With HADOOP_HOME pointing to a hadoop-0.20-append branch build, the Hive ShimLoader does not detect the Hadoop version correctly; it’s the “-” in the name that causes the problem! Apply the simple patch available at https://issues.apache.org/jira/browse/HIVE-2294 and things should work just fine. Apply the patch as follows:

patch -p0 -i HIVE-2294.3.patch

To start using Hive, you will also need to minimally carry out these additional tasks:

  • Set HIVE_HOME environment variable to point to the root of the HIVE directory.
  • Add $HIVE_HOME/bin to $PATH
  • Create /tmp in HDFS and set appropriate permissions
bin/hadoop fs -mkdir /tmp 
bin/hadoop fs -chmod g+w   /tmp
  • Create /user/hive/warehouse and set appropriate permissions
bin/hadoop fs -mkdir /user/hive/warehouse 
bin/hadoop fs -chmod g+w /user/hive/warehouse
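
With those in place, you can fire up the Hive command-line interface for a quick sanity check; the table name ‘pokes’ below is just the example table used in Hive’s own getting-started documentation:

bin/hive
hive> CREATE TABLE pokes (foo INT, bar STRING);
hive> SHOW TABLES;
hive> DESCRIBE pokes;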

Now you are ready with pseudo-distributed Hadoop, pseudo-distributed HBase, Pig, and Hive running on your box. This is of course just the beginning. You need to learn to leverage these tools to analyze data, but that is not covered in this write-up. A following post will possibly address the topic of analyzing data using MapReduce and its abstractions.

Scala syntax highlighting in gedit

Update: A small typo, an unnecessary “<” tag before xmlns in scala-mime.xml has been corrected. Thanks @win for finding the error. See the comments below for additional references.

The default text editor on Ubuntu, or for that matter any Gnome-powered desktop, is gedit. If you are a developer like me who isn’t a huge fan of IDEs, there is a good chance you use gedit for some of your development. Gedit supports syntax highlighting for a number of languages, but if you were hacking some Scala code in the editor, you wouldn’t find any syntax highlighting support out of the box. However, the Scala folks offer gedit syntax highlighting support via the scala-tool-support subproject. To get it working with your gedit installation, do the following:

  1. Download the scala.lang file from http://lampsvn.epfl.ch/trac/scala/browser/scala-tool-support/trunk/src/gedit/scala.lang. You can check out the source using svn or scrape the screen by simply copying the contents and pasting them into a file named scala.lang. On Ubuntu, using Ctrl-Shift and the mouse helps accurately select and copy the content from the screen.
  2. Copy or move scala.lang file to ~/.gnome2/gtksourceview-1.0/language-specs/
  3. Create a file named scala-mime.xml at /usr/share/mime/packages/ using
    sudo touch /usr/share/mime/packages/scala-mime.xml
  4. Add the following contents to scala-mime.xml:
    <?xml version="1.0" encoding="UTF-8"?>
    <mime-info
     xmlns='http://www.freedesktop.org/standards/shared-mime-info'>
    <mime-type type="text/x-scala">
    <comment>Scala programming language</comment>
    <glob pattern="*.scala"/>
    </mime-type>
    </mime-info>
  5. Run
    sudo update-mime-database /usr/share/mime
  6. Start (or restart, if it’s running) gedit, and you now have Scala syntax highlighting in place.

Ubuntu and HP TouchSmart Sound

I upgraded my Ubuntu install on my HP TouchSmart machine to version 11.04 (Natty Narwhal). The Ubuntu 11.04 Unity Desktop experience is so nice and smooth that I started using my HP TouchSmart actively again. It had been sitting gathering dust for many months!

The last version of Ubuntu on this machine was 10.04, which was upgraded to 11.04 via a 10.10 upgrade en route. During the 10.04 days, I had trouble getting Ubuntu to work smoothly on this machine: the internal speakers did not work (only external speakers did), the wifi did not work, and the touch screen lost its touch qualities. After I upgraded to 11.04, I somehow believed many of these past woes would be corrected, but that wasn’t the case. So I actively started making some effort to resolve these issues. Getting the internal speaker sound to work was the first thing I tackled, and surprisingly a few minutes was all I needed to solve the problem.

The fix on the TouchSmart is really a simple, one-line addition to a configuration file. Open the terminal and type the following:

sudo gedit /etc/modprobe.d/alsa-base.conf

This will open alsa-base.conf in gedit, the official text editor on the Gnome desktop. If you like vi instead of gedit then open the file as follows:

sudo vi /etc/modprobe.d/alsa-base.conf

At the very end, add the following line to the alsa-base.conf file:

options snd-hda-intel model=touchsmart

Now, save the file and reload alsa using:

sudo alsa force-reload

and the internal speakers are back in business. That was quick and simple, wasn’t it?

A little peek into why this fix works and how this may apply to systems other than the TouchSmart:

Find out the model of your sound card using:

cat /proc/asound/card0/codec* | grep Codec

On my TouchSmart the output is as follows:

Codec: Analog Devices AD1984A

ALSA (Advanced Linux Sound Architecture) provides audio and MIDI functionality to the Linux OS. Browse the ALSA documentation to see the list of supported audio models for your card. The documentation is available in /usr/share/doc/alsa-base/driver/HD-Audio-Models.txt.gz, which is a compressed file. You can list the contents of this file, without decompressing it, as follows:

gunzip -c /usr/share/doc/alsa-base/driver/HD-Audio-Models.txt.gz

It may be a good idea to page through the file using the more command like so:

gunzip -c /usr/share/doc/alsa-base/driver/HD-Audio-Models.txt.gz | more

On my machine, I see the following entries relevant to AD1984A:

....

AD1884A / AD1883 / AD1984A / AD1984B
====================================
desktop    3-stack desktop (default)
laptop    laptop with HP jack sensing
mobile    mobile devices with HP jack sensing
thinkpad    Lenovo Thinkpad X300
touchsmart    HP Touchsmart

....

(First column is the model and second one is the description)

This explains why the value of the snd-hda-intel model option was set to touchsmart. Hopefully this also gives you a clue for finding your own sound card model and the supported configuration values for it, if you have a problem getting sound to work on your own Ubuntu install.

For additional reference, consider reading https://help.ubuntu.com/community/HdaIntelSoundHowto.

My new book: Professional NoSQL (Wiley, 2011)

My new book, Professional NoSQL (Wiley, 2011) is now available in bookstores.

NoSQL is an emerging topic, and a lot of developers, architects, technology managers, and CIOs are fairly confused trying to understand where it fits in the stack. While these folks are trying to come up to speed and climb the learning curve, many NoSQL enthusiasts and product vendors are presenting the usual jargon-heavy, myth-centric promises and confusing them further. Given this context, I have made an attempt to present an unbiased and objective overview of the topic: explaining the fundamentals, introducing the products, presenting a few of its nuances, and describing the context in which it exists.

Read the first chapter, which is available for download online, and consider buying a copy. If you find errors, please let me know about them.

Hope you enjoy reading the book and find it useful.