Pyspark Practice (2) – SPARK_LOCAL_DIRS


You may have run into the error that there is no space left on disk for shuffle/RDD data, even though you seem to have more than enough disk space.

It happens because we usually allocate a not-so-large partition for the system directory /tmp, while Spark by default uses /tmp for shuffle and RDD data, which can be quite large. (There are some posts questioning whether Spark ever cleans up this temporary data – which could be a severe problem, but I have not confirmed it personally.) Anyway, as you can guess by now, SPARK_LOCAL_DIRS is designed for exactly this purpose: it specifies the location for temporary data.

You can configure this variable in conf/spark-env.sh, e.g. to use HDFS:

SPARK_LOCAL_DIRS=hdfs://server:50090

There is also spark.local.dir in conf/spark-defaults.conf for the same purpose, which however will be overridden by SPARK_LOCAL_DIRS.
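
As a side note, the same property can also be set programmatically before the SparkContext is created. Below is a minimal sketch (the application name and paths are hypothetical); on a cluster, SPARK_LOCAL_DIRS set on the workers still takes precedence over this property.

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("local-dir-demo")
        # comma-separated list; spread shuffle spills over disks with enough space
        .set("spark.local.dir", "/data1/spark-tmp,/data2/spark-tmp"))
sc = SparkContext(conf=conf)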

Pyspark Practice (1) – PYTHONHASHSEED

Here is a small chunk of code for testing the Spark RDD join function.


a = [(1, 'a'), (2, 'b')]
b = [(1, 'c'), (4, 'd')]
ardd = sc.parallelize(a)
brdd = sc.parallelize(b)

def merge(a, b):
    if a is None:
        return b
    if b is None:
        return a
    return a + b

ardd.fullOuterJoin(brdd).map(lambda x: (x[0], merge(x[1][0], x[1][1]))).collect()
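
For reference, the collect() above should return [(1, 'ac'), (2, 'b'), (4, 'd')], though the ordering of the pairs may differ.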

This code works fine. But when I applied it to my real data (reading from HDFS, joining, and writing the result back), I ran into the PYTHONHASHSEED problem again! YES, AGAIN. I did not get a chance to fix this problem before.

This problem happens for Python 3.3+. The line of code responsible for this trouble is python/pyspark/rdd.py, line 74:

    if sys.version >= '3.3' and 'PYTHONHASHSEED' not in os.environ:
        raise Exception("Randomness of hash of string should be disabled via PYTHONHASHSEED")

After searching around and trying many different proposals, I got really frustrated. It seems the community knows this issue well, and it appears to have been fixed on Spark's GitHub (2015), while my version (2016) still does not work.

A few options I found:

  1. Put export PYTHONHASHSEED=0 in .bashrc.
    • Failed. In a notebook, I could print out os.environ['PYTHONHASHSEED'] and it was correctly set. This is the correct way for a standalone Python program, but not for a Spark cluster.
    • A possible reason is that pyspark has its own set of environment variables. It is not about propagating this variable across workers either, because even if all workers have this variable exported in .bashrc, it still complains.
  2. SPARK_YARN_USER_ENV=PYTHONHASHSEED=0
    • Doesn't work. Some suggested passing this to pyspark when starting the notebook. Unfortunately, nothing fortunate happened, and I don't think I am even using YARN.

Anyway, in the end, I found the solution in this link. Most of the pssh part can be ignored; the only line that matters is placing export PYTHONHASHSEED=0 into conf/spark-env.sh on each worker, which confirms the statement that PYTHONHASHSEED=0 has to be placed into the Spark runtime environment somehow.
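
To double-check that the value actually reaches the Python workers, a quick probe like the following can be run from the driver (just a sketch, assuming an existing SparkContext sc):

def read_seed(_):
    import os  # imported inside the function, so we read the worker's environment
    return os.environ.get('PYTHONHASHSEED')

# expect ['0', '0', '0', '0'] if conf/spark-env.sh was updated on every worker
print(sc.parallelize(range(4), 4).map(read_seed).collect())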

Thanks to this post, which saved my ass: http://comments.gmane.org/gmane.comp.lang.scala.spark.user/24459

module ‘urllib’ has no attribute ‘request’

Ran into an error: "module 'urllib' has no attribute 'request'".

The script ran well before I threw it into parallel mode by calling sc.parallelize(data, 8); the Spark log then showed the above error. So far, I could not find any solution by googling. I printed the Python version used, which is 3.5. I have no clue where it goes wrong.

Update.

After a bit of exploration, I finally found the solution, i.e. put the statement import urllib.request right before I use urllib.request.urlopen(...). Is this caused by the fact that I am using Jupyter, in which the import statement was in another cell?
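
Here is a minimal sketch of the fix (the URL list and partition count are only illustrative, and an existing SparkContext sc is assumed): importing urllib.request inside the function that the workers execute guarantees the submodule is loaded wherever the closure actually runs.

def fetch_status(url):
    import urllib.request  # local import, so it also happens on the workers
    with urllib.request.urlopen(url) as resp:
        return (url, resp.status)

urls = ['http://example.com/'] * 8
print(sc.parallelize(urls, 8).map(fetch_status).collect())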

How to Change your Username in Ubuntu?

You may have created a username during the installation and later decide you would rather use another one. How do you change it? Here is what I found at http://askubuntu.com/questions/34074/how-do-i-change-my-username. It works for me.

    1. At the start screen press Ctrl+Alt+F1.
    2. Log in using your username and password.
    3. Set a password for the “root” account.
      sudo passwd root
      
    4. Log out.
      exit
      
    5. Log in using the “root” account and the password you have previously set.
    6. Change the username and the home folder to the new name that you want.
      usermod -l <newname> -d /home/<newname> -m <oldname>
      
    7. Change the group name to the new name that you want.
      groupmod -n <newgroup> <oldgroup>
      
    8. Lock the “root” account.
      passwd -l root
      
    9. If you were using ecryptfs (encrypted home directory), mount your encrypted directory using ecryptfs-recover-private and edit <mountpoint>/.ecryptfs/Private.mnt to reflect your new home directory.
    10. Log out.
      exit
      
    11. Press Ctrl+Alt+F7.

Dynamic Programming

Dynamic programming (DP) is not a specific algorithm but a technique for designing an efficient algorithm for a specific problem, and DP problems are simply those problems to which this technique can be applied.

Memoization is a common way to speed up a recursive algorithm. Dynamic programming ≠ recursion + memoization, although a dynamic programming algorithm often looks like the memoization technique applied to a recursive algorithm, the Fibonacci sequence being one example. There are some DP problems which cannot be implemented as a recursive function with memoization, e.g. the egg dropping puzzle; see [1].
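
As a small illustration of the two styles on the Fibonacci example (a sketch, not taken from any of the references): top-down recursion with memoization versus a bottom-up table.

from functools import lru_cache

@lru_cache(maxsize=None)
def fib_memo(n):
    # top-down: plain recursion, with results cached so each subproblem is solved once
    return n if n < 2 else fib_memo(n - 1) + fib_memo(n - 2)

def fib_dp(n):
    # bottom-up: fill the "table" (here just two cells) in dependency order
    prev, cur = 0, 1
    for _ in range(n):
        prev, cur = cur, prev + cur
    return prev

assert fib_memo(30) == fib_dp(30) == 832040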

There is a thoughtful discussion on stackexchange.com [2]. However, the comments from Raphael might not be accurate, for the conditions given by the author are too general and apply to the divide-and-conquer technique as well. The critical condition that is missing is the property of overlapping subproblems; that is the watershed between dynamic programming and divide-and-conquer, since overlapping subproblems are what make caching the subproblem results pay off.

Wikipedia [1] states that dynamic programming is applicable to problems exhibiting the properties of overlapping subproblems and optimal substructure, which is not accurate either. Not all DP problems are optimization problems, so optimal substructure does not always apply.
[1] https://en.wikipedia.org/wiki/Dynamic_programming

[2] http://cs.stackexchange.com/questions/2057/when-can-i-use-dynamic-programming-to-reduce-the-time-complexity-of-my-recursive

[3] http://people.cs.clemson.edu/~bcdean/dp_practice/

Wireshark on Ubuntu 12.04.3

The latest version of Wireshark has been ported to GTK+ 3.0. However, I could not run configure on my Ubuntu 12.04.3 LTS, because it could not find a properly installed GTK+ 3.0. It seems GTK+ 3.0 should be installed by default (not sure though).

Searching around, I found a suggestion to install libgtk-3-dev. However, it failed again with the following message:

The following packages have unmet dependencies:
libgtk-3-dev : Depends: libpango1.0-dev (>= 1.30.0) but it is not going to be installed
Depends: libcairo2-dev (>= 1.10.0) but it is not going to be installed
E: Unable to correct problems, you have held broken packages.

This is another problem which I could not fix. Most of the solutions I could find online did not work for me. For example, many people suggest using Synaptic via 'Edit -> Fix Broken Packages'. Running apt-get clean followed by apt-get update does not fix the problem either.

Back to Wireshark: in the end, I had to give up on GTK+ 3.0 and run configure with --without-gtk3 and --with-qt.