Sunday, February 21, 2016

Simple Hack To Keep SparkContext (and hence SparkUI) Alive After Job Completes

This example uses Scala syntax.

While programming in Spark on a local single-node standalone cluster, you typically run your job either via spark-submit or from an IDE such as IntelliJ IDEA, NetBeans, or Eclipse.

 

After a job completes and your program exits, several things happen:

– SparkContext is destroyed

Example log entry: 16/02/21 13:45:23 INFO SparkUI: Stopped Spark web UI at http://192.168.1.165:4040

– DAGScheduler is stopped

– MemoryStore is stopped, releasing your previously used memory

– BlockManager is stopped, so all the on-disk storage blocks used for shuffle data, cached data spilled to disk, etc. are deleted

 

When the SparkContext goes away, your SparkUI dies with it. While coding on a local machine, if you are not running Spark on YARN or Mesos and have not set up the Spark history server with event logging, there is no way to go back and look at a completed job's UI.

In such cases I use a simple hack to keep my SparkContext alive. I add the following line at the end of my program:

Thread.sleep(86400000); //sleep for 86400000 milliseconds, which is 24 hours

It may not be the smartest way but it does the job! My program will not exit for another 24 hours.

 

I've tried this method in the following cases:

1) When I want to observe in detail the steps happening in a specific action before my program proceeds. For example, if I add a sleep call after printing the results of a countByValue() action:

val countOfWords = file.flatMap(line => line.split(" ")).countByValue()

countOfWords.foreach(println);

Thread.sleep(86400000);

my job will print the output of countByValue() and then wait (a complete runnable sketch of this is shown after the list).

2) When I want to kick off a job before going to sleep, but still take a look at the SparkUI in the morning.
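For reference, here is a minimal self-contained sketch of case 1 (the app name, master URL and input path are placeholder values for a local setup):

import org.apache.spark.{SparkConf, SparkContext}

object WordCountWithSleep {
  def main(args: Array[String]): Unit = {
    // Local single-node setup; adjust the master URL and input path for your environment
    val conf = new SparkConf().setAppName("WordCountWithSleep").setMaster("local[*]")
    val sc = new SparkContext(conf)

    val file = sc.textFile("/tmp/abc.txt") // placeholder input path

    // countByValue() is an action: it returns a Map of value -> count to the driver
    val countOfWords = file.flatMap(line => line.split(" ")).countByValue()
    countOfWords.foreach(println)

    // Keep the program, and hence the SparkContext and the SparkUI on port 4040, alive
    Thread.sleep(86400000) // 24 hours

    sc.stop()
  }
}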

 

Downside of this approach – because there is a sleep call in the code, your program will not exit until you kill it by force.

Friday, February 19, 2016

Access HDFS In Spark

When you are developing on Apache Spark, a very common use case is accessing data from HDFS. As you may know, HDFS can be accessed via its URI, which looks like hdfs://<hostname>:<port>/user/…..

When you start your Spark shell, the SparkContext will be available as “sc”.

My file is at the following location in HDFS:

hdfs://localhost:8020/user/spark/abc.txt

 

A handle to the file (an RDD of its lines) can be obtained via:

val file = sc.textFile("hdfs://localhost:8020/user/spark/abc.txt");
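To quickly check that the path is reachable, you can run a simple action on the RDD (this actually reads from HDFS, so a wrong URI will surface as an error here):

println(file.count()) // number of lines in the file
file.take(5).foreach(println) // print the first few lines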

How did I find my HDFS host and port?

Go to your <HADOOP_HOME>/etc/hadoop directory. If your configuration files are stored at a different location, navigate to the directory specified as HADOOP_CONF_DIR in your environment variables. Open the file core-site.xml and look for the configuration property fs.defaultFS.

 

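For reference, the fs.defaultFS entry in core-site.xml typically looks something like this (the host and port below are just example values):

<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:8020</value>
  </property>
</configuration>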

 

Recommendations:

If there is no port specified there, try 8020 or 9000, as they are common defaults. You can also try accessing the file system without specifying a port, but I have sometimes seen errors thrown with that approach. In any case, if your Hadoop installation has a different port explicitly configured, you will need to use that one.

Another good practice (in my view) is to have a global variable somewhere that points to your HDFS root directory. You can then specify all paths relative to it in later parts of your code.

e.g. val hdfsURI="hdfs://localhost:8020/"
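Later parts of the code can then build paths off that variable, for example (the second file name is just a made-up illustration):

val abc = sc.textFile(hdfsURI + "user/spark/abc.txt")
val logs = sc.textFile(hdfsURI + "user/spark/logs.txt") // hypothetical second file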

Saturday, February 13, 2016

Import Hadoop Libraries To IntelliJ Idea


This post assumes that:
– you have downloaded the Hadoop installation binaries and placed them in a target location
– you have an existing Java project in IntelliJ Idea that you want to add Hadoop code to, although the steps remain the same for any other project (e.g. a Scala project) as well as a freshly created empty project
I am using the following technologies:
– Linux Mint 17.3
– Hadoop 2.5.2
– IntelliJ Idea 15.02
My Hadoop binaries are located at /usr/share/hadoop-2.5.2/; this is what I will refer to as HADOOP_HOME.


1) Open your Java project in IntelliJ Idea.

2) Go to File → Project Structure

3) Click on Modules and go to the Dependencies tab on the right

4) Click on the “+” sign on the right and select the option to add “JARs or directories”

5) Go to the location where you placed your Hadoop binaries (your HADOOP_HOME). In my case it is /usr/share/hadoop-2.5.2/. Inside your HADOOP_HOME, go to …/share/hadoop/common; the full path to navigate to in my case is /usr/share/hadoop-2.5.2/share/hadoop/common. Select the common directory and click OK. This will take you back to the Modules window.

6) Click on the “+” sign once more and again select the option to add “JARs or directories”. This time, under your HADOOP_HOME, go to …/share/hadoop/common/lib, which is one level deeper than the previous step. In my case the full path is /usr/share/hadoop-2.5.2/share/hadoop/common/lib.

7) Select the lib directory and click OK. This will take you back to the Modules window. Check the “Export” checkboxes next to both locations you just added, click Apply and then OK.

8) That's it, you're done. If you want to test whether the libraries are actually recognized, create a Java class and try to import Hadoop classes such as the following:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FsUrlStreamHandlerFactory;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
You should not see any errors.