How to install Spark?

Installing Spark is fairly easy. This tutorial is mostly adapted from an existing online guide.

  1. Install Oracle Java 8 if you haven’t

    $ sudo add-apt-repository ppa:webupd8team/java
    $ sudo apt-get update
    $ sudo apt-get install oracle-java8-installer

  2. Download the latest Spark binary distribution from the Spark website
  3. Extract the tar file

    $ cd Downloads; tar zxvf spark-2.0.1-bin-hadoop2.7.tgz

  4. Copy it to /opt (personal choice)

    $ sudo cp -r ~/Downloads/spark-2.0.1-bin-hadoop2.7 /opt/spark-2.0.1

  5. Create a symbolic link for easier maintenance, e.g. upgrading Spark later

    $ sudo ln -s /opt/spark-2.0.1 /opt/spark

  6. Set up the environment in your .bashrc by adding the two lines below (see the note after this list about reloading it)

    export SPARK_HOME=/opt/spark
    export PATH=$SPARK_HOME/bin:$PATH

  7. Test your Spark installation

    $ pyspark
    Python 2.7.12 (default, Jul 1 2016, 15:12:24)
    [GCC 5.4.0 20160609] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
    Setting default log level to "WARN".
    To adjust logging level use sc.setLogLevel(newLevel).
    16/10/04 13:32:43 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    16/10/04 13:32:43 WARN Utils: Your hostname, [yourhostname] resolves to a loopback address: 127.0.0.1; using [some ip] instead (on interface ens33)
    16/10/04 13:32:43 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
    Welcome to
          ____              __
         / __/__  ___ _____/ /__
        _\ \/ _ \/ _ `/ __/  '_/
       /__ / .__/\_,_/_/ /_/\_\   version 2.0.1
          /_/

    Using Python version 2.7.12 (default, Jul 1 2016 15:12:24)
    SparkSession available as 'spark'.
    >>>
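
Note: if the pyspark command is not found at this step, reload your .bashrc (or open a new terminal) so the environment variables from step 6 take effect:

    $ source ~/.bashrc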

[Troubleshoot] If you run into the error java.net.UnknownHostException, edit the /etc/hosts file to add the following line

127.0.0.1    [your hostname]

If you don't know what your hostname is, run the following command to find out.

$ hostname

Run your first Spark script in Python

Using Python version 2.7.12 (default, Jul 1 2016 15:12:24)
SparkSession available as 'spark'.
>>> myfile = sc.textFile('/tmp/test.txt')
>>> counts = myfile.flatMap(lambda line: line.split(" ")).map(lambda word: (word, 1)).reduceByKey(lambda v1, v2: v1 + v2)
>>> counts.saveAsTextFile("/tmp/output")

If there is no error, you should find a folder created under /tmp containing several files named with the pattern 'part-xxxxx'.
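
To sanity-check the result from the same PySpark session, you can read the output folder back (a quick check only; collect() brings everything to the driver, which is fine for a tiny file like this):

>>> sc.textFile('/tmp/output').collect()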

As Spark also works with HDFS, here is another test you can run if you have HDFS ready to use:

>>> myfile = sc.textFile("hdfs://namenode_host:8020/path/to/input")
>>> counts = myfile.flatMap(lambda line: line.split(" ")).map(lambda word: (word, 1)).reduceByKey(lambda v1, v2: v1 + v2)
>>> counts.saveAsTextFile("hdfs://namenode_host:8020/path/to/output")

The word count example is copied and modified from the article. You can also refer to my “copied” tutorial How to set up HDFS.

You can also run the command 'jps' to see that a 'SparkSubmit' process is running.

How to Change your Username in Ubuntu?

You may have created a username during installation that you later want to replace with another one. How do you change it? Here is what I found at http://askubuntu.com/questions/34074/how-do-i-change-my-username, and it works for me.

    1. At the start screen press Ctrl+Alt+F1.
    2. Log in using your username and password.
    3. Set a password for the “root” account.
      sudo passwd root
      
    4. Log out.
      exit
      
    5. Log in using the “root” account and the password you have previously set.
    6. Change the username and the home folder to the new name that you want.
      usermod -l <newname> -d /home/<newname> -m <oldname>
      
    7. Change the group name to the new name that you want.
      groupmod -n <newgroup> <oldgroup>
      
    8. Lock the “root” account.
      passwd -l root
      
    9. If you were using ecryptfs (encrypted home directory), mount your encrypted directory using ecryptfs-recover-private and edit <mountpoint>/.ecryptfs/Private.mnt to reflect your new home directory.
    10. Log out.
      exit
      
    11. Press Ctrl+Alt+F7.
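
Afterwards, you can verify that the rename worked (using the <newname> you chose above):

      id <newname>
      ls -ld /home/<newname>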

HDFS

Notes about HDFS Daemons

$ bin/hdfs namenode -format

HDFS isn't a filesystem that runs directly on a disk the way ext3 does. It stores its data on a regular file system such as ext3 and provides an API to access that data, so it behaves more like a database. This command initializes that "database", i.e. the NameNode's metadata storage.

By default, the NameNode stores its data under /tmp/hadoop-<username>/dfs/name.

To change the NameNode and DataNode locations, add the following properties to hdfs-site.xml:

<property>
   <name>dfs.namenode.name.dir</name>
   <value>file:/dfs/namenode</value>
</property>

<property>
   <name>dfs.datanode.data.dir</name>
   <value>file:/dfs/datanode</value>
</property>

Make sure you have the right permissions on the paths specified.
Make sure you get dfs.namenode.name.dir and dfs.datanode.data.dir right (when copy-pasting one property to create the other, it is easy to leave "name" not fully replaced by "data"). 😦

HDFS has a master/slave architecture. An HDFS cluster consists of a single NameNode, a master server that manages the file system namespace and regulates access to files by clients. In addition, there are a number of DataNodes, usually one per node in the cluster, which manage storage attached to the nodes that they run on. HDFS exposes a file system namespace and allows user data to be stored in files.

The NameNode executes file system namespace operations like opening, closing, and renaming files and directories. It also determines the mapping of blocks to DataNodes.

A file is split into one or more blocks and these blocks are stored in a set of DataNodes.

The DataNodes are responsible for serving read and write requests from the file system’s clients. The DataNodes also perform block creation, deletion, and replication upon instruction from the NameNode.

----

# Start the NameNode daemon and DataNode daemon
$ sbin/start-dfs.sh

# Create folders /usr and /usr/abc in HDFS
$ bin/hdfs dfs -mkdir /usr
$ bin/hdfs dfs -mkdir /usr/abc

# Put a local file into an HDFS location
$ bin/hdfs dfs -put <localfile> /usr/abc
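
If you want to see how a file you have put into HDFS is split into blocks and where the replicas are stored, the fsck tool can report it (the path below is just a placeholder for a file you actually uploaded):

# Show files, blocks and block locations for a given path
$ bin/hdfs fsck /usr/abc/<localfile> -files -blocks -locations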

----

HDFS daemons are NameNode, SecondaryNameNode, and DataNode.

- etc/hadoop/core-site.xml
  fs.defaultFS: the NameNode URI, e.g. hdfs://host:port/

- etc/hadoop/hdfs-site.xml
  dfs.namenode.name.dir: path on the local filesystem where the NameNode stores the namespace and transaction logs persistently. If this is a comma-delimited list of directories, then the name table is replicated in all of the directories, for redundancy.

  dfs.datanode.data.dir: comma-separated list of paths on the local filesystem of a DataNode where it should store its blocks. If this is a comma-delimited list of directories, then data will be stored in all named directories, typically on different devices.
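
For reference, a minimal fs.defaultFS entry in core-site.xml might look like the following (hdfs://localhost:9000 is just the value used in the Hadoop single-node tutorial; substitute your own NameNode host and port):

<property>
   <name>fs.defaultFS</name>
   <value>hdfs://localhost:9000</value>
</property>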

---- TROUBLESHOOTING
Connection refused:
- jps  # check whether Hadoop is running: NameNode, DataNode
- /etc/hosts -> remove the "127.0.1.1 ubuntu" line

http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/
http://stackoverflow.com/questions/27143409/what-the-command-hadoop-namenode-format-will-do
https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html
https://hadoop.apache.org/docs/r2.7.2/hadoop-project-dist/hadoop-common/ClusterSetup.html

Hadoop

  1. a clean VM: Ubuntu 16.04 LTS
  2. follow single node cluster tutorial: http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/SingleCluster.html
    • download hadoop-2.7.3-src
    • follow BUILDING.txt
      • install all packages listed
        $ sudo apt-get purge openjdk*
        $ sudo apt-get install software-properties-common
        $ sudo add-apt-repository ppa:webupd8team/java
        $ sudo apt-get update
        $ sudo apt-get install oracle-java7-installer
        $ sudo apt-get -y install maven
        $ sudo apt-get -y install build-essential autoconf automake libtool cmake zlib1g-dev pkg-config libssl-dev
        $ sudo apt-get -y install libprotobuf-dev protobuf-compiler
        $ sudo apt-get install snappy libsnappy-dev
        $ sudo apt-get install bzip2 libbz2-dev
        $ sudo apt-get install libjansson-dev
        $ sudo apt-get install fuse libfuse-dev
    • build the Hadoop binaries
      $ mvn package -Pdist -DskipTests -Dtar

      • ERROR: protoc version is 'libprotoc 2.6.1', expected version is '2.5.0'
      • FIX: http://codetips.coloza.com/compile-hadoop-from-source/
      • ERROR: libprotoc.so.8: cannot open shared object file: No such file or directory
      • FIX: $ sudo ldconfig /usr/local/lib    (Note: libprotoc.so.8 should be in /usr/local/lib)
    • install: $ mvn install
    • edit hadoop-dist/target/etc/hadoop/hadoop-env.sh
    • set export JAVA_HOME=/usr/lib/jvm/java-7-oracle
    • standalone mode test passed
    • pseudo-distributed mode (output check shown after this list):
      • $ bin/hdfs dfs -mkdir -p /user/<username>/input
        $ bin/hdfs dfs -put etc/hadoop/*.xml /user/<username>/input
        $ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar grep input output 'dfs[a-z.]+'
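
After the example job finishes, you can check the result directly in HDFS (the output directory is the relative path passed to the job above; this check also appears in the same single-node tutorial):

    $ bin/hdfs dfs -cat output/*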

Malware: mokes

Kaspersky Lab first discovered a new piece of malware, dubbed Mokes, in January this year. This backdoor has variants across operating systems, including Windows, Linux and Mac OS X, and is written in C++ using Qt, a cross-platform application framework.

This backdoor specializes in capturing audio and video, logging keystrokes, taking screenshots every 30 seconds, and monitoring removable storage such as USB drives on the victim's machine. It can also scan the system for files with the suffixes .docx, .doc, .xlsx and .xls. The backdoor connects to its command-and-control server over an encrypted channel using AES-256 encryption, and it copies itself to a handful of locations, including caches belonging to Skype, Dropbox, Google and Firefox.

The infection vector and how widespread the malware is remain unknown at this point.

Source: thehackernews.com

Dynamic Programming

Dynamic programming (DP) is not a specific algorithm but a technique for designing an efficient algorithm for a particular problem, and DP problems are precisely those problems to which this technique can be applied.

Memoization is a common way to speed up a recursive algorithm. Dynamic programming ≠ recursion + memoization, although a dynamic programming algorithm often arises from applying memoization to a recursive algorithm, for example for the Fibonacci sequence. There are some DP problems which cannot be implemented as a recursive function with memoization, e.g. the Egg Dropping puzzle; see [1].
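
As a minimal sketch of the distinction, here is the Fibonacci example in Python, once as recursion plus memoization (top-down) and once as an iterative bottom-up computation; the function names are just for illustration:

    # Top-down: plain recursion, sped up by caching the overlapping subproblems.
    def fib_memo(n, cache=None):
        if cache is None:
            cache = {}
        if n < 2:
            return n
        if n not in cache:
            cache[n] = fib_memo(n - 1, cache) + fib_memo(n - 2, cache)
        return cache[n]

    # Bottom-up: the same recurrence filled in iteratively, smallest subproblems first.
    def fib_dp(n):
        if n < 2:
            return n
        prev, curr = 0, 1
        for _ in range(2, n + 1):
            prev, curr = curr, prev + curr
        return curr

    print(fib_memo(30))   # 832040
    print(fib_dp(30))     # 832040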

There is a thoughtful discussion on stackexchange.com [2]. However, the comments from Raphael might not be accurate: the conditions he gives are too general and apply equally to the divide-and-conquer technique. The critical condition that is missing is the property of overlapping subproblems, which is the watershed between dynamic programming and divide-and-conquer; it is the overlapping subproblems that make caching intermediate results pay off.

The Wikipedia article [1] states that dynamic programming is applicable to problems exhibiting the properties of overlapping subproblems and optimal substructure, which is not entirely accurate either: not every DP problem is an optimization problem, so overlapping subproblems is the essential property, while optimal substructure is not always required.
[1] https://en.wikipedia.org/wiki/Dynamic_programming

[2] http://cs.stackexchange.com/questions/2057/when-can-i-use-dynamic-programming-to-reduce-the-time-complexity-of-my-recursive

[3] http://people.cs.clemson.edu/~bcdean/dp_practice/

Wireshark on Ubuntu 12.04.3

The latest version of Wireshark has been ported to GTK+ 3.0. However, I could not run configure on my Ubuntu 12.04.3 LTS machine, because it could not find a properly installed GTK+ 3.0. It seems GTK+ 3.0 should be installed by default (not sure though).

Searching around, I found a suggestion to install libgtk-3-dev. However, that failed as well, with the following message:

The following packages have unmet dependencies:
libgtk-3-dev : Depends: libpango1.0-dev (>= 1.30.0) but it is not going to be installed
Depends: libcairo2-dev (>= 1.10.0) but it is not going to be installed
E: Unable to correct problems, you have held broken packages.

This is another problem that I could not fix. Most of the solutions I found online did not work for me; for example, many people suggest using Synaptic via 'Edit -> Fix Broken Packages', and 'apt-get clean && apt-get update' did not fix the problem either.

Back to Wireshark: in the end, I had to give up on GTK+ 3.0 and run configure with --without-gtk3 and --with-qt.
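
For reference, the configure invocation looks like this (the two flags are the ones mentioned above):

$ ./configure --without-gtk3 --with-qt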