Friday, February 19, 2016

Access HDFS In Spark

When you are developing on Apache Spark, a very common use case is reading data from HDFS. As you probably know, HDFS is accessed via its URI, which looks like hdfs://<hostname>:<port>/user/…

When you start your Spark shell, the SparkContext will be available as “sc”.

My file is at the following location in HDFS:

hdfs://localhost:8020/user/spark/abc.txt

 

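The file can be loaded as an RDD of lines (Spark reads it lazily, so this is a reference to the data rather than an open file handle) via: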

val file = sc.textFile("hdfs://localhost:8020/user/spark/abc.txt")

How did I find my HDFS host and port?

Go to your <HADOOP_HOME>/etc/hadoop directory. If your configuration files are stored at a different location, navigate to the directory specified as HADOOP_CONF_DIR in your environment variables. Open the file core-site.xml and look for the configuration property fs.defaultFS. Its value is the HDFS URI, including the host and (if set) the port.
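
If you would rather not read the XML by eye, here is a minimal sketch that pulls the property out of the file contents using only the JDK's XML parser. The CoreSiteReader object, its property method, and the sample XML (which simply mirrors the hdfs://localhost:8020 address used in this post) are my own illustrative assumptions, not part of Hadoop or Spark.

```scala
import java.io.ByteArrayInputStream
import javax.xml.parsers.DocumentBuilderFactory
import org.w3c.dom.Element

// Hypothetical helper for illustration; not a Hadoop or Spark API.
object CoreSiteReader {
  // Text of the first <tag> element nested under e, if any.
  private def childText(e: Element, tag: String): Option[String] = {
    val nodes = e.getElementsByTagName(tag)
    if (nodes.getLength > 0) Some(nodes.item(0).getTextContent.trim) else None
  }

  // Looks up a property by name in the contents of a core-site.xml file.
  def property(xml: String, name: String): Option[String] = {
    val builder = DocumentBuilderFactory.newInstance().newDocumentBuilder()
    val doc = builder.parse(new ByteArrayInputStream(xml.getBytes("UTF-8")))
    val props = doc.getElementsByTagName("property")
    (0 until props.getLength)
      .map(i => props.item(i))
      .collectFirst {
        case e: Element if childText(e, "name").contains(name) => childText(e, "value")
      }
      .flatten
  }
}

// Sample content (an assumption for illustration); a real file usually has more properties.
val sampleXml =
  """<configuration>
    |  <property>
    |    <name>fs.defaultFS</name>
    |    <value>hdfs://localhost:8020</value>
    |  </property>
    |</configuration>""".stripMargin

val defaultFs = CoreSiteReader.property(sampleXml, "fs.defaultFS")
```

Inside the Spark shell, sc.hadoopConfiguration.get("fs.defaultFS") should report the same value, provided Spark has picked up your Hadoop configuration directory.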

 

Recommendations:

If no port is specified in fs.defaultFS, try 8020 (the usual NameNode default) or 9000 (used in many setup guides). You can also try accessing the file system without specifying the port, but I have sometimes seen errors thrown with that approach. In any case, if your Hadoop installation has a different port explicitly configured, you will need to use it.

Another good practice (in my view) is to keep a single global variable that points to your HDFS root directory. You can then specify all paths relative to it in later parts of your code.

e.g. val hdfsURI = "hdfs://localhost:8020/"
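
As a rough sketch of how that could look in practice, the root can live in one object together with a small helper that builds full paths from relative ones. The HdfsPaths object and its path method are hypothetical names I made up for illustration, and the host and port are the ones assumed throughout this post.

```scala
// Hypothetical holder for the HDFS root; the names are illustrative only.
object HdfsPaths {
  // Assumed root; replace host and port with the value of fs.defaultFS on your cluster.
  val hdfsURI = "hdfs://localhost:8020/"

  // Joins a relative path onto the root, tolerating a leading slash on the argument.
  def path(relative: String): String =
    hdfsURI.stripSuffix("/") + "/" + relative.stripPrefix("/")
}

val abcPath = HdfsPaths.path("user/spark/abc.txt")
// In the Spark shell you could then write: val file = sc.textFile(abcPath)
```

If the cluster address ever changes, only hdfsURI needs to be updated.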
