AWS EMR + Redshift

It is painful to play with Redshift.


Pyspark Practice (2) – SPARK_LOCAL_DIRS

 

You may have run into an error saying there is no space left on the disk for shuffle RDD data, even though you seem to have more than enough disk space.

It happens because the system directory /tmp is usually allocated a not-so-large partition, while Spark by default uses /tmp for shuffle RDD data, which can be quite large. (There are posts questioning whether Spark ever cleans up this temporary data; that could be a severe problem, but I have not confirmed it personally.) As you can guess by now, SPARK_LOCAL_DIRS exists exactly for this purpose: it specifies where temporary data is written.

You can configure this variable in conf/spark-env.sh. It must point to local directories (shuffle files are written to local disk, so an HDFS URL such as hdfs://server:50090 will not work); a comma-separated list spreads I/O across several disks, e.g.

SPARK_LOCAL_DIRS=/mnt/spark-tmp,/mnt2/spark-tmp

There is also spark.local.dir in conf/spark-defaults.conf for the same purpose; it is, however, overridden by SPARK_LOCAL_DIRS.
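For completeness, here is a minimal sketch of setting the same option programmatically from PySpark (the paths are hypothetical; pick large local disks on your machines). Since SPARK_LOCAL_DIRS and cluster-manager settings take precedence over this property, spark-env.sh remains the more reliable place.

from pyspark import SparkConf, SparkContext

# Hypothetical mount points; a comma-separated list spreads shuffle I/O
# across several local disks.
conf = (SparkConf()
        .setAppName("shuffle-dir-demo")
        .set("spark.local.dir", "/mnt/spark-tmp,/mnt2/spark-tmp"))
sc = SparkContext(conf=conf)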

Pyspark Practice (1) – PYTHONHASHSEED

Here is a small chunk of code for testing Spark RDD join function.


a = [(1, 'a'), (2, 'b')]
b = [(1, 'c'), (4, 'd')]
ardd = sc.parallelize(a)
brdd = sc.parallelize(b)

def merge(a, b):
    # fullOuterJoin yields None on the side that has no match for a key
    if a is None:
        return b
    if b is None:
        return a
    return a + b

ardd.fullOuterJoin(brdd).map(lambda x: (x[0], merge(x[1][0], x[1][1]))).collect()
# [(1, 'ac'), (2, 'b'), (4, 'd')]  (order may vary)

This code works fine. But when I applied it to my real data (reading from HDFS, joining, and writing it back), I ran into the PYTHONHASHSEED problem again! YES, AGAIN. I never got a chance to fix this problem before.

This problem happens on Python 3.3+. The line of code responsible for the trouble is python/pyspark/rdd.py, line 74:

if sys.version >= '3.3' and 'PYTHONHASHSEED' not in os.environ:
    raise Exception("Randomness of hash of string should be disabled via PYTHONHASHSEED")
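For context, the check exists because Python 3.3+ randomizes string hashing per interpreter process, so two workers could send the same key to different partitions. A small standalone sketch (plain Python, not Spark) that demonstrates the effect:

import os
import subprocess
import sys

# Each fresh interpreter draws its own random hash seed (assuming the
# parent environment does not already pin PYTHONHASHSEED).
cmd = [sys.executable, "-c", "print(hash('spark'))"]
print(subprocess.check_output(cmd))  # some run-specific integer
print(subprocess.check_output(cmd))  # almost certainly a different one

# Pinning PYTHONHASHSEED makes every process agree, which is what
# pyspark needs when it hash-partitions keys across workers.
env = dict(os.environ, PYTHONHASHSEED="0")
print(subprocess.check_output(cmd, env=env))
print(subprocess.check_output(cmd, env=env))  # identical to the line above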

After searching around and trying many different proposals, I got really frustrated. The community seems to know this issue well, and the Spark GitHub repository appears to have fixed it back in 2015, yet my version (2016) still does not work.

A few options I found:

  1. Put export PYTHONHASHSEED=0 in .bashrc
    • Failed. In a notebook, I could read os.environ['PYTHONHASHSEED'] and it was correctly set. This is the correct way for a standalone Python program, but not for a Spark cluster.
    • A possible reason is that pyspark has a different set of environment variables. It is not about propagating this variable across workers either, because even if every worker has it exported in .bashrc, it still complains.
  2. SPARK_YARN_USER_ENV=PYTHONHASHSEED=0
    • Doesn't work. Some suggested passing this to pyspark when starting the notebook. Unfortunately, nothing fortunate happened, and I don't think I am even using YARN.

Anyway, in the end, I found the solution in this link. Most of the pssh part can be ignored. The only line that matters is placing export PYTHONHASHSEED=0 into conf/spark-env.sh on each worker, which confirms that PYTHONHASHSEED=0 must somehow end up in the Spark runtime environment.
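If editing conf/spark-env.sh on every worker is inconvenient, Spark also documents spark.executorEnv.<NAME> for forwarding an environment variable to executor processes; in principle the following sketch should achieve the same thing, though I have not verified it on the 2016 build discussed above:

from pyspark import SparkConf, SparkContext

# spark.executorEnv.PYTHONHASHSEED asks Spark to set the variable in
# each executor's environment before the Python workers start.
conf = (SparkConf()
        .setAppName("hashseed-demo")
        .set("spark.executorEnv.PYTHONHASHSEED", "0"))
sc = SparkContext(conf=conf)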

Thanks to this post, which saved my ass: http://comments.gmane.org/gmane.comp.lang.scala.spark.user/24459

module ‘urllib’ has no attribute ‘request’

Ran into an error: "module 'urllib' has no attribute 'request'".

The script ran well before I threw it into parallel mode by calling sc.parallelize(data, 8). The Spark log shows the above error. So far, I could not find any solution by googling. I printed the Python version used, which is 3.5, and had no clue where things went wrong.

Update.

After a bit of exploration, I finally found the solution: put an import urllib.request statement right before the call to urllib.request.urlopen(...). Is this caused by the fact that I am using Jupyter, where the import statement was in another cell? If so, the import would only have run on the driver, never inside the worker processes.
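A minimal sketch of the fix (the URLs are placeholders): doing the import inside the function guarantees it runs in every worker process, not just in the notebook cell on the driver.

def fetch_status(url):
    # Import here so it executes inside each worker process; an import
    # in another notebook cell only runs on the driver.
    import urllib.request
    with urllib.request.urlopen(url) as resp:
        return resp.status

urls = ['http://example.com'] * 8   # placeholder data
print(sc.parallelize(urls, 8).map(fetch_status).collect())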

How to Change your Username in Ubuntu?

You may have created a username during installation and later would rather use another one. How do you change it? Here is what I found at http://askubuntu.com/questions/34074/how-do-i-change-my-username, and it works for me.

    1. At the start screen press Ctrl+Alt+F1.
    2. Log in using your username and password.
    3. Set a password for the “root” account.
      sudo passwd root
      
    4. Log out.
      exit
      
    5. Log in using the “root” account and the password you have previously set.
    6. Change the username and the home folder to the new name that you want.
      usermod -l <newname> -d /home/<newname> -m <oldname>
      
    7. Change the group name to the new name that you want.
      groupmod -n <newgroup> <oldgroup>
      
    8. Lock the “root” account.
      passwd -l root
      
    9. If you were using ecryptfs (encrypted home directory), mount your encrypted directory using ecryptfs-recover-private and edit <mountpoint>/.ecryptfs/Private.mnt to reflect your new home directory.
    10. Log out.
      exit
      
    11. Press Ctrl+Alt+F7.