Jupyter + Spark cluster + HDFS

With Spark (see "How to install Spark?") and HDFS (see "How to install HDFS?") installed, we are moving on to form a cluster. Oh, I hope you are using a virtual machine if you are following this series. It matters because you don’t have to go through every step again: you can just clone a VM!

So make a complete clone of the virtual machine with Spark and HDFS installed. Boot it and run ifconfig to get its IP. Let us call this cloned VM the slave, and the original VM the master.

HDFS Cluster

  1. On the master, reformat the namenode, giving it a cluster name (whatever you want to call it)

    $ hdfs namenode -format <cluster name>

  2. On the slave machine, edit your core-site.xml to replace localhost with a real IP address (you can always use ifconfig to find it out). The address has to be one at which the NameNode on the master can be reached.
  3. On the master, edit etc/hadoop/workers and add one line for the additional worker, i.e. the slave's IP. If you are on hadoop-2.7, the file is etc/hadoop/slaves. (No idea why they changed the filename. Personally, I think slaves is better as it says.)
  4. Start the services

    $ /opt/hadoop/sbin/start-dfs.sh

  5. Check by running ‘jps’ to see whether the DataNode service has been started. Also, on your slave machine, you can put the README.md file into HDFS and list it from the master. Also, try the Spark script using an hdfs:// URL.
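
Steps 2 and 5 above can be sketched as two small shell helpers. This is only a rough sketch: set_defaultfs_host and datanode_running are names I made up (they do not ship with Hadoop), and the /opt/hadoop path in the usage comments assumes the same install location as the start-dfs.sh command above.

```shell
# Hypothetical helpers; neither function ships with Hadoop.

# Replace the localhost in an hdfs:// URL inside core-site.xml with a real address.
# $1 = path to core-site.xml, $2 = address at which the NameNode is reachable
set_defaultfs_host() {
  sed "s#hdfs://localhost#hdfs://$2#g" "$1" > "$1.tmp" && mv "$1.tmp" "$1"
}

# Succeed only if jps lists a DataNode process on this machine.
datanode_running() {
  jps | grep -qw DataNode
}

# Possible usage (paths and the NAMENODE_IP variable are assumptions):
#   set_defaultfs_host /opt/hadoop/etc/hadoop/core-site.xml "$NAMENODE_IP"
#   datanode_running && echo "DataNode is up"
#   hdfs dfs -put README.md /     # on the slave
#   hdfs dfs -ls /                # on the master
```

Only the hdfs://localhost prefix is rewritten, so whatever port the file already uses stays as it is.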

That’s it! Goddamn it! These Hadoop guys really make the basic deployment easy!

Spark Cluster

Spark uses a similar architecture to HDFS, i.e. master-slave mode.

$ $SPARK_HOME/sbin/start-master.sh

Launch Spark monitoring: http://localhost:8080/

Copy $SPARK_HOME/conf/slaves.template to $SPARK_HOME/conf/slaves and add the slave's IP at the end of the file.

$ $SPARK_HOME/sbin/start-slaves.sh
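
The conf/slaves edit can be done by hand, but here is a minimal sketch of doing it from the shell. add_worker is a hypothetical helper of my own, not a Spark script; it assumes the list file has one host per line and ends with a newline. The same idea should work for Hadoop's etc/hadoop/workers file.

```shell
# Hypothetical helper: append a host to a worker list only if it is not already there.
# $1 = path to the list file, $2 = host name or IP of the worker
add_worker() {
  # -x matches whole lines only, -F treats the dots in an IP literally
  grep -qxF -- "$2" "$1" 2>/dev/null || printf '%s\n' "$2" >> "$1"
}

# Possible usage, with the slave IP from this post:
#   cp "$SPARK_HOME/conf/slaves.template" "$SPARK_HOME/conf/slaves"
#   add_worker "$SPARK_HOME/conf/slaves" 192.168.11.138
```

Running it twice with the same host leaves the file unchanged, so it is safe to repeat.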

From the Spark monitor, I can see that the local worker (on the master machine) is alive, but the worker on the slave machine (192.168.11.138) never shows up. Not that easy, hah!

[Troubleshooting]

I log in to the slave and run jps. Surprisingly enough, it shows that the worker is running. I check the log: nothing!!!

I noticed the parameters of the worker process in the output of ‘ps -aux’. There is an option ‘host’, which is specified as a hostname. Since I cloned the virtual machine, my slave has the same hostname as my master. That’s why!!!!
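
To make that check repeatable, here is a rough sketch of pulling the host option out of the process list. spark_host_args is a hypothetical helper name; it assumes the Spark daemons show up in ps with a class name under org.apache.spark.deploy and a separate --host argument, which is what I saw, but your Spark version may launch them differently.

```shell
# Hypothetical helper: print the value passed via --host to any running
# Spark standalone master or worker JVM on this machine.
spark_host_args() {
  ps aux | awk '/org\.apache\.spark\.deploy/ {
    for (i = 1; i < NF; i++) if ($i == "--host") print $(i + 1)
  }'
}

# Possible usage on each VM:
#   hostname            # the machine's own hostname
#   spark_host_args     # the host the Spark daemons were started with
# If both VMs print the same hostname, give the clone a distinct one,
# e.g. with hostnamectl on systemd-based distributions (an assumption about your OS).
```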

But even when I run start-slave.sh on my slave, trying to connect to my master, it fails too. Meanwhile, the worker on the master always shows up in the Spark monitoring page.

$ $SPARK_HOME/sbin/start-slave.sh spark://<master IP>:<port>

Why? Because my Spark master is running on localhost!!! So, I need to start my master on its IP instead of its hostname (in fact, it should be an FQDN). So I went back to shut down the master and restarted it with the following command:

$ SPARK_MASTER_HOST=<master IP> $SPARK_HOME/sbin/start-master.sh

Now, I run start-slave.sh on both machines. Two workers show up. For start-slaves.sh, I need to put the real IP of my master in conf/slaves, instead of localhost, if I want my master to be a worker as well.

There is a script, conf/spark-env.sh, which makes all these problems go away. The idea is that you configure all these environment variables in that script and place it on every node you want to turn into a worker; the Spark start scripts read it. Easy, ah!
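
A minimal sketch of what such a conf/spark-env.sh could contain. The address 192.0.2.10 is only a placeholder from the documentation address range, not my real master IP, so replace it with yours. I believe 7077 is the default port of the standalone master, so the second line is there just to be explicit.

```shell
# conf/spark-env.sh (sketch). The Spark launch scripts source this file.

# Placeholder address: replace 192.0.2.10 with the real IP of your master.
export SPARK_MASTER_HOST=192.0.2.10

# Default port of the standalone master, listed only to be explicit.
export SPARK_MASTER_PORT=7077
```

With this in place on the master, start-master.sh should no longer need the SPARK_MASTER_HOST prefix on the command line.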


Now, let us connect Jupyter to the Spark master so we can parallelize our data processing but still have the power of Jupyter. How?

$ pyspark --master spark://<master IP>:<port>

That is it?! Yes, that’s it!
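
For completeness, a small sketch of how the pieces might fit together. spark_master_url is a hypothetical helper of mine, and MASTER_IP is a variable you would set yourself. The two PYSPARK_DRIVER_* variables are the usual way to make pyspark open a Jupyter notebook as its driver; I am assuming pyspark is on your PATH and Jupyter is already installed.

```shell
# Hypothetical helper: build a standalone master URL.
# $1 = master host or IP, $2 = port (defaults to 7077, Spark's default master port)
spark_master_url() {
  printf 'spark://%s:%s\n' "$1" "${2:-7077}"
}

# Possible usage (MASTER_IP is an assumption; set it to your master's address):
#   export PYSPARK_DRIVER_PYTHON=jupyter
#   export PYSPARK_DRIVER_PYTHON_OPTS=notebook
#   pyspark --master "$(spark_master_url "$MASTER_IP")"
```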



