How to install HDFS?

I wrote a blog post about HDFS before, which covers more of the background knowledge. In this post, I mainly show the steps to install and configure HDFS on Ubuntu 16.04. This tutorial assumes you have the oracle-java8-installer package installed; if not, please refer to the first step in my earlier post.
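
A quick way to confirm the Java prerequisite (a minimal check, assuming the Oracle JDK 8 package installed Java under /usr/lib/jvm/java-8-oracle, the same path used in step 4 below) is:

$ java -version
$ ls /usr/lib/jvm/java-8-oracle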

I. Install Hadoop

  1. Download the latest Hadoop tarball from Apache Hadoop. I chose the 3.0.0-alpha1 binary.
  2. Extract the tarball to /opt and create a soft link:

    $ sudo tar zxvf  hadoop-3.0.0-alpha1.tar.gz -C /opt
    $ sudo ln -s /opt/hadoop-3.0.0-alpha1 /opt/hadoop

  3. Edit the ~/.bashrc file and add the following lines (then run source ~/.bashrc, or open a new shell, so they take effect):

    export HADOOP_HOME=/opt/hadoop
    export PATH=$HADOOP_HOME/bin:$PATH

  4. Edit /opt/hadoop/etc/hadoop/hadoop-env.sh and set JAVA_HOME:

    export JAVA_HOME=/usr/lib/jvm/java-8-oracle

  5. Now you can check whether Hadoop is available; running the command below should print its usage message (a fuller check follows this list):

    $ hadoop
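
For a slightly fuller check than the bare hadoop command, print the version; this also confirms that the PATH change from step 3 works and that Hadoop can find Java:

$ hadoop version
$ which hdfs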

II. Configure the pseudo-distributed mode

The core-site.xml configuration (the property goes inside the <configuration> element of /opt/hadoop/etc/hadoop/core-site.xml):

<property>
  <name>fs.defaultFS</name>
  <value>hdfs://localhost:9000</value>
</property>

The hdfs-site.xml configuration (again inside the <configuration> element, this time in /opt/hadoop/etc/hadoop/hdfs-site.xml):

<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>
<property>
  <name>dfs.namenode.name.dir</name>
  <value>file:/tmp/dfs/namenode</value>
</property>
<property>
  <name>dfs.datanode.data.dir</name>
  <value>file:/tmp/dfs/datanode</value>
</property>
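
To double-check that HDFS actually sees these settings (assuming you kept the default configuration directory /opt/hadoop/etc/hadoop), you can query them directly; the printed values should match the XML above:

$ hdfs getconf -confKey fs.defaultFS
$ hdfs getconf -confKey dfs.replication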

III. Start the NameNode and DataNode services

  1. Initialize the HDFS database. (NOTICE: with the configuration above the database lives under /tmp, as specified in hdfs-site.xml, so it will be gone after you restart the machine, and the NameNode service will fail the next time you run start-dfs.sh. To avoid this issue, simply point the two properties dfs.namenode.name.dir and dfs.datanode.data.dir at a permanent path, e.g. somewhere under your home directory.)

    $ hdfs namenode -format

  2. Now the HDFS “database” is ready, but the HDFS service isn’t running yet. If you run ‘$ hdfs dfs -ls’, you will get a ConnectionRefused error. To start the HDFS service (the API for operating this database), run the following script:

    $ /opt/hadoop/sbin/start-dfs.sh

  3. But you may get a port 22 connection refused error. Create a passphraseless SSH key and add it to the authorized keys. To check whether this works, try ‘ssh localhost’ and see whether it still prompts for a password.

      $ ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
      $ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
      $ chmod 0600 ~/.ssh/authorized_keys

    [Troubleshooting] Oddly enough, ssh localhost still complains ‘permission denied’. Searching around, I found that the following command shows what is going on during the SSH connection, i.e. debug mode.

    $ ssh -v localhost

    What I see is the following. You are smart enough to figure out what is going wrong here (notice which key types are tried); a fix is sketched right after this list.

    debug1: kex_input_ext_info: server-sig-algs=<rsa-sha2-256,rsa-sha2-512>
    debug1: SSH2_MSG_SERVICE_ACCEPT received
    debug1: Authentications that can continue: publickey,password
    debug1: Next authentication method: publickey
    debug1: Trying private key: /home/xxx/.ssh/id_rsa
    debug1: Trying private key: /home/xxx/.ssh/id_ecdsa
    debug1: Trying private key: /home/xxx/.ssh/id_ed25519
    debug1: Next authentication method: password

  4. Rerun start-dfs.sh and you will see the NameNode and DataNode start. You can always use the jps command to check whether they are running, and you can use netstat to check whether HDFS is listening on port 9000.

    $ netstat -tulp
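
Back to the SSH troubleshooting in step 3: the debug output shows that ssh tries id_rsa, id_ecdsa and id_ed25519, but never the id_dsa key we just generated, because recent OpenSSH versions (including the one shipped with Ubuntu 16.04) disable DSA keys by default. A simple fix, sketched here under the assumption that you do not already have an RSA key you want to keep, is to generate an RSA key instead:

$ ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
$ chmod 0600 ~/.ssh/authorized_keys
$ ssh localhost   # should now log in without asking for a password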

IV. Sanity Check: you should be able to put files into HDFS and list them back

$ hdfs dfs -put /opt/hadoop/README.txt /
$ hdfs dfs -ls /
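
If the put succeeded, you can also read the file back with the standard cat subcommand of the hdfs CLI:

$ hdfs dfs -cat /README.txt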


If you have Spark installed, you can also test HDFS with Spark, which is extremely straightforward: simply use an HDFS path instead of a local path. Use ‘$ hdfs dfs -cat /output/part-00000’ to see the results. This shows that HDFS is just another way of storing files, and Spark is compatible with HDFS.

>>> myfile = sc.textFile('hdfs://localhost:9000/README.txt')
>>> counts = myfile.flatMap(lambda line: line.split(' ')).map(lambda word: (word, 1)).reduceByKey(lambda v1, v2: v1 + v2)
>>> counts.saveAsTextFile('hdfs://localhost:9000/output')

To stop HDFS … (really, I’m not going to tell you anything more; look up the manual for handy commands 🙂)

$ /opt/hadoop/sbin/stop-dfs.sh
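
Two handy commands to start that lookup with (both part of the standard HDFS CLI) are dfsadmin -report, which summarizes the state of the cluster, and dfsadmin -safemode get, which tells you whether the NameNode is still in safe mode:

$ hdfs dfsadmin -report
$ hdfs dfsadmin -safemode get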

V. Last Word as a Gift

If anything goes wrong, check /opt/hadoop/logs/*.log; by reading the logs you can usually solve whatever problems you unfortunately run into.
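
For example, to see what the NameNode complained about (the exact file names depend on your user name and host name, hence the wildcards):

$ ls /opt/hadoop/logs/
$ grep -i error /opt/hadoop/logs/hadoop-*-namenode-*.log | tail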

Reference: https://hadoop.apache.org/docs/r2.7.2/hadoop-project-dist/hadoop-common/SingleCluster.html
