Here is a small chunk of code for testing Spark's RDD join function.
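a = [(1, 'a'), (2, 'b')]
b = [(1, 'c'), (4, 'd')]
ardd = sc.parallelize(a)
brdd = sc.parallelize(b)

def merge(a, b):
    # After a full outer join, the side with no matching key is None.
    if a is None:
        return b
    if b is None:
        return a
    return a + b

# Each joined record is (key, (left, right)); merge the two sides per key.
ardd.fullOuterJoin(brdd).map(lambda x: (x[0], merge(x[1][0], x[1][1]))).collect()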
This code works fine. But when I applied it to my real data (read from HDFS, join, and write the result back), I ran into the PYTHONHASHSEED problem again! YES, AGAIN. I never got the chance to fix this problem before.
This problem happens with Python 3.3+. The line of code responsible for the trouble is in python/pyspark/rdd.py, line 74:
if sys.version >= '3.3' and 'PYTHONHASHSEED' not in os.environ:
    raise Exception("Randomness of hash of string should be disabled via PYTHONHASHSEED")
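To see why the check exists: since Python 3.3, string hashes are salted per interpreter process unless PYTHONHASHSEED is fixed, so two worker processes could send the same key to different partitions. Here is a minimal sketch that uses only the standard library; the helper name `string_hash_in_fresh_process` is made up for illustration and is not part of Spark.

```python
import os
import subprocess
import sys


def string_hash_in_fresh_process(text, hashseed=None):
    """Start a new Python interpreter and return hash(text) as computed there."""
    env = dict(os.environ)
    env.pop("PYTHONHASHSEED", None)  # ignore whatever the current shell has set
    if hashseed is not None:
        env["PYTHONHASHSEED"] = str(hashseed)
    out = subprocess.check_output(
        [sys.executable, "-c", "import sys; print(hash(sys.argv[1]))", text],
        env=env,
    )
    return int(out.decode().strip())


if __name__ == "__main__":
    # No fixed seed: the two values will almost certainly differ.
    print(string_hash_in_fresh_process("a"), string_hash_in_fresh_process("a"))
    # PYTHONHASHSEED=0 disables the randomization: the two values match.
    print(string_hash_in_fresh_process("a", 0), string_hash_in_fresh_process("a", 0))
```

Each call plays the role of one Python worker. With the seed fixed, every process agrees on the hash of a key, which is what a hash-partitioned join needs.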
After searching around and trying many different proposals, I got really frustrated. The community seems to know this issue well, and the Spark GitHub repository seems to have fixed it (2015), yet my version (2016) still does not work.
A few options I found:
- Put export PYTHONHASHSEED=0 in .bashrc.
  - Failed. In a notebook, I could read os.environ['PYTHONHASHSEED'] and it was correctly set. This is the right way for a standalone Python program, but not for a Spark cluster.
  - A possible reason is that pyspark has a different set of environment variables. It is not about propagating the variable across workers either, because even if all workers have it exported in .bashrc, Spark still complains.
- Some suggested passing this variable to pyspark when starting the notebook. Doesn't work either. Unfortunately, nothing fortunate happened, and I don't think I am even using YARN.
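One way to check what the Python workers actually see, rather than what the notebook (driver) sees, is to read the variable inside a task. This is only a sketch: it assumes `sc` is an already running SparkContext, and `worker_hashseeds` is a hypothetical helper name. A plain map does not shuffle, so it should not trip the PYTHONHASHSEED check itself.

```python
import os


def worker_hashseeds(sc, num_tasks=8):
    """Return the distinct PYTHONHASHSEED values seen inside Spark tasks.

    `sc` is assumed to be an existing SparkContext. None in the result
    means at least one task ran in a process without the variable set.
    """
    def read_seed(_):
        # Runs in the Python worker process, not in the notebook.
        return os.environ.get("PYTHONHASHSEED")

    seeds = sc.parallelize(range(num_tasks), num_tasks).map(read_seed).collect()
    return sorted(set(seeds), key=str)
```

If this returns something other than a single seed value while os.environ in the notebook looks fine, the variable is set for the driver only, which matches what I observed with .bashrc.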
Anyway, in the end, I found the solution from this link. Most of the pssh part can be ignored. The only line that matters is the one that places 'export PYTHONHASHSEED=0' into conf/spark-env.sh on each worker, which confirms that PYTHONHASHSEED=0 has to be placed into the Spark runtime environment somehow.
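If you would rather script that edit than do it by hand on every machine, something like the following could work. It is a sketch under assumptions: `ensure_hashseed_in_spark_env` is a hypothetical helper, `spark_home` is wherever Spark is installed on that worker, and the workers probably need a restart afterwards to pick up the change.

```python
import os


def ensure_hashseed_in_spark_env(spark_home, line="export PYTHONHASHSEED=0"):
    """Append the export line to conf/spark-env.sh unless it is already there.

    Returns True if the file was changed, False if the line was present.
    """
    path = os.path.join(spark_home, "conf", "spark-env.sh")
    existing = ""
    if os.path.exists(path):
        with open(path) as f:
            existing = f.read()
    if line in (l.strip() for l in existing.splitlines()):
        return False
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "a") as f:
        # Keep the new line separate from an unterminated last line.
        if existing and not existing.endswith("\n"):
            f.write("\n")
        f.write(line + "\n")
    return True
```

Running it twice is harmless: the second call finds the line and leaves the file alone.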
Thanks to this post, which saved my ass: http://comments.gmane.org/gmane.comp.lang.scala.spark.user/24459