How to install Spark

It is fairly easy to install Spark. This tutorial is mostly adapted from this guide.

  1. Install Oracle Java 8 if you haven't already

    $ sudo add-apt-repository ppa:webupd8team/java
    $ sudo apt-get update
    $ sudo apt-get install oracle-java8-installer
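
    To confirm the installation, you can check the Java version (the exact build number will vary):

    $ java -version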

  2. Download the latest Spark binary distribution from the Spark website
  3. Extract the tar file

    $ cd Downloads; tar zxvf spark-2.0.1-bin-hadoop2.7.tgz

  4. Copy it to /opt (a personal choice)

    $ sudo cp -r ~/Downloads/spark-2.0.1-bin-hadoop2.7 /opt/spark-2.0.1

  5. Create a soft link for easy maintenance, e.g. when upgrading Spark

    $ sudo ln -s /opt/spark-2.0.1 /opt/spark

  6. Set the environment variables in your .bashrc by adding the two lines below

    export SPARK_HOME=/opt/spark
    export PATH=$SPARK_HOME/bin:$PATH
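
    Reload your shell configuration (or open a new terminal) so the new variables take effect; pyspark should then resolve to the copy under /opt/spark:

    $ source ~/.bashrc
    $ which pyspark
    /opt/spark/bin/pyspark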

  7. Test your Spark installation

    $ pyspark
    Python 2.7.12 (default, Jul 1 2016, 15:12:24)
    [GCC 5.4.0 20160609] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
    Setting default log level to "WARN".
    To adjust logging level use sc.setLogLevel(newLevel).
    16/10/04 13:32:43 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    16/10/04 13:32:43 WARN Utils: Your hostname, [yourhostname] resolves to a loopback address: 127.0.0.1; using [some ip] instead (on interface ens33)
    16/10/04 13:32:43 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
    Welcome to
          ____              __
         / __/__  ___ _____/ /__
        _\ \/ _ \/ _ `/ __/  '_/
       /__ / .__/\_,_/_/ /_/\_\   version 2.0.1
          /_/

    Using Python version 2.7.12 (default, Jul 1 2016 15:12:24)
    SparkSession available as 'spark'.
    >>>

[Troubleshooting] If you run into the error java.net.UnknownHostException, edit the /etc/hosts file to add the following line:

127.0.0.1    [your hostname]

If you don't know what your hostname is, run the following command to find it out:

$ hostname
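
For example, instead of editing the file by hand, you can append the entry in one step (this uses command substitution to pick up your actual hostname):

$ echo "127.0.0.1    $(hostname)" | sudo tee -a /etc/hosts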

Run your first Spark script in Python
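
The example below assumes an input file at /tmp/test.txt; if you do not have one, create a small sample first (the content here is just a made-up example):

$ echo "hello spark hello world" > /tmp/test.txt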

Using Python version 2.7.12 (default, Jul 1 2016 15:12:24)
SparkSession available as 'spark'.
>>> myfile = sc.textFile('/tmp/test.txt')
>>> counts = myfile.flatMap(lambda line: line.split(" ")).map(lambda word: (word, 1)).reduceByKey(lambda v1, v2: v1 + v2)
>>> counts.saveAsTextFile('/tmp/output')

If there is no error, you should find a folder created under /tmp containing several files with the name pattern 'part-xxxxx'.
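
With the sample input above, you can print the saved word counts from another terminal. The output should look something like this (in Python 2 the tuples are written with a u prefix, and the order may differ):

$ cat /tmp/output/part-*
(u'hello', 2)
(u'spark', 1)
(u'world', 1)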

As Spark also works with HDFS, if you have HDFS ready to use, here is another test you can run:

>>> myfile = sc.textFile("hdfs://namenode_host:8020/path/to/input")
>>> counts = myfile.flatMap(lambda line: line.split(" ")).map(lambda word: (word, 1)).reduceByKey(lambda v1, v2: v1 + v2)
>>> counts.saveAsTextFile("hdfs://namenode_host:8020/path/to/output")
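
Assuming the input file is not on HDFS yet, you can upload the local test file and later inspect the result with the hdfs command-line tool (the namenode address and paths are placeholders, as above):

$ hdfs dfs -put /tmp/test.txt hdfs://namenode_host:8020/path/to/input
$ hdfs dfs -cat hdfs://namenode_host:8020/path/to/output/part-*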

The word-count example is copied and modified from that article. You can also refer to my "copied" tutorial, How to set up HDFS.

While the pyspark shell is open, you can run the command 'jps' in another terminal to see that a 'SparkSubmit' process is running as well.
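
For example (the process IDs will differ on your machine):

$ jps
12345 SparkSubmit
23456 Jps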
