Some articles describe Python's argument-passing strategy as pass-by-value, while others call it pass-by-reference. In some cases Python appears to behave like pass-by-value, but it does not. In fact, Python always passes references to objects. This article tries to give an explanation that illustrates the matter.
- Everything is an object in Python
  - What is an object?
  - What is a reference?
- Mutable vs. immutable
  - Immutable: int, str, tuple
  - Mutable: list
- Assignment or modification
  - The difference between x = x + y (assignment, which rebinds the name) and x += y (in-place modification when x is mutable, such as a list)
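The outline above can be illustrated with a minimal sketch. The function names rebind, mutate, and bump are made up for this example; the point is only that the caller sees a change when the shared object is modified in place, and sees nothing when the local name is rebound.

```python
def rebind(items, extra):
    # x = x + y builds a NEW list and rebinds the local name only.
    items = items + extra
    return items


def mutate(items, extra):
    # x += y on a list extends the SAME object the caller holds.
    items += extra
    return items


def bump(n):
    # ints are immutable, so += rebinds n to a new int object.
    n += 1
    return n


original = [1, 2]
rebind(original, [3])
after_rebind = list(original)   # caller's list is unchanged

mutate(original, [3])
after_mutate = list(original)   # caller's list now has the extra item

number = 10
bumped = bump(number)           # number itself is still 10
```

In both list cases the function receives a reference to the same list object; what differs is whether the body modifies that object or merely points its local name at a new one.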
It is painful to play with Redshift.
You may have run into the error that there is no space left on the disk for shuffle RDD data, even though you seem to have more than enough disk space.
This happens because we usually allocate a fairly small partition for the system directory /tmp, while Spark by default uses /tmp for shuffle RDD data, which can be quite large. (Some posts question whether Spark ever cleans up this temporary data, which would be a severe problem, but I have not confirmed it personally.) As you can guess, SPARK_LOCAL_DIRS is designed for exactly this purpose: it specifies the location for temporary data.
You could configure this variable in conf/spark-env.sh, e.g. use hdfs
The property spark.local.dir in conf/spark-defaults.conf serves the same purpose, but it is overridden by SPARK_LOCAL_DIRS.
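Before pointing SPARK_LOCAL_DIRS at a directory, it can help to check how much room the candidates actually have. The following is a small sketch using only the standard library; free_gib and pick_roomiest are hypothetical helper names, not part of Spark.

```python
import shutil
import tempfile


def free_gib(path):
    """Return the free space on the filesystem holding `path`, in GiB."""
    return shutil.disk_usage(path).free / 2**30


def pick_roomiest(candidates):
    """Return the candidate directory with the most free space."""
    return max(candidates, key=free_gib)


# The system temp dir is usually /tmp on Linux; add your own candidates here.
default_tmp = tempfile.gettempdir()
default_tmp_free = free_gib(default_tmp)
```

The chosen directory would then be the value to export as SPARK_LOCAL_DIRS in conf/spark-env.sh.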
Here is a small chunk of code for testing the Spark RDD join function.
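    a = [(1, 'a'), (2, 'b')]
    b = [(1, 'c'), (4, 'd')]
    ardd = sc.parallelize(a)  # sc is the existing SparkContext
    brdd = sc.parallelize(b)

    def merge(a, b):
        # The branch bodies were lost from the original post; returning the
        # side that is present is an assumption.
        if a is None:
            return b
        if b is None:
            return a
        return a + b  # assumed: combine the values when both sides exist

    # fullOuterJoin yields (key, (left, right)); a missing side is None.
    ardd.fullOuterJoin(brdd).map(lambda x: (x[0], merge(x[1][0], x[1][1]))).collect()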
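To see what shape of data a full outer join hands to merge without starting Spark, here is a plain-Python emulation. It is only a sketch of the semantics, not how Spark implements the join; it assumes each key appears at most once per side and sorts the keys, whereas Spark makes no ordering promise.

```python
def full_outer_join(left, right):
    """Emulate a full outer join on two lists of (key, value) pairs."""
    left_map, right_map = dict(left), dict(right)  # assumes unique keys
    keys = sorted(set(left_map) | set(right_map))
    # A key missing on one side gets None for that side.
    return [(k, (left_map.get(k), right_map.get(k))) for k in keys]


a = [(1, 'a'), (2, 'b')]
b = [(1, 'c'), (4, 'd')]
joined = full_outer_join(a, b)
# joined == [(1, ('a', 'c')), (2, ('b', None)), (4, (None, 'd'))]
```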
This code works fine. But when I applied it to my real data (reading from HDFS, joining, and writing back), I ran into the PYTHONHASHSEED problem again! YES, AGAIN. I had not had the chance to fix this problem before.
This problem happens with Python 3.3+. The line of code responsible for the trouble is in python/pyspark/rdd.py, line 74.
    if sys.version >= '3.3' and 'PYTHONHASHSEED' not in os.environ:
        raise Exception("Randomness of hash of string should be disabled ...")
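Why Spark insists on this can be shown with a short experiment. Since Python 3.3, string hashes are randomized per interpreter process unless PYTHONHASHSEED is fixed, so two worker processes could disagree on which partition a string key belongs to. The sketch below launches fresh interpreters with a chosen seed; the helper name string_hash_in_fresh_interpreter is made up for this example.

```python
import os
import subprocess
import sys


def string_hash_in_fresh_interpreter(seed, text="spark"):
    """Return hash(text) as computed by a new Python process.

    seed=None leaves PYTHONHASHSEED unset (randomized hashing);
    any other value is exported to the child process.
    """
    env = dict(os.environ)
    if seed is None:
        env.pop("PYTHONHASHSEED", None)
    else:
        env["PYTHONHASHSEED"] = str(seed)
    out = subprocess.check_output(
        [sys.executable, "-c", "print(hash({!r}))".format(text)], env=env
    )
    return int(out.decode().strip())


# With a fixed seed, separate processes agree on the hash.
first = string_hash_in_fresh_interpreter(0)
second = string_hash_in_fresh_interpreter(0)
```

Calling the helper with seed=None several times will typically return a different value on each run, which is exactly the situation the check in rdd.py guards against.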
After searching around and trying many different proposals, I got really frustrated. The community seems to know this issue well, and the Spark GitHub repository seems to have fixed it (2015), yet my version (2016) still does not work.
A few options I found:
- Put export PYTHONHASHSEED=0 in .bashrc
- Failed. In a notebook, I could read os.environ['PYTHONHASHSEED'] and it was correctly set. This is the correct way for a standalone Python program, but not for a Spark cluster.
- A possible reason is that pyspark has a different set of environment variables. It is not about propagating this variable across workers either, because even if all workers have this variable exported in .bashrc, it will still complain.
- Doesn't work. Some suggested passing this to pyspark when starting the notebook. Unfortunately, nothing fortunate happened, and I don't think I am even using YARN.
Anyway, in the end, I found the solution from this link. Most of the pssh part can be ignored. The only line that matters is to place 'export PYTHONHASHSEED=0' into conf/spark-env.sh on each worker, which confirms that PYTHONHASHSEED=0 has to be placed into the Spark runtime environment.
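One way to check whether the setting actually reaches the Python worker processes is to ask the workers themselves. The sketch below assumes sc is an existing SparkContext; report_hash_seed and seeds_on_workers are hypothetical helper names. It uses only parallelize, map, and collect, so it does not trigger the hash check itself.

```python
import os


def report_hash_seed(_):
    """Return the PYTHONHASHSEED visible to whichever process runs this."""
    return os.environ.get("PYTHONHASHSEED")


def seeds_on_workers(sc, num_partitions=8):
    """Collect the distinct PYTHONHASHSEED values seen by Spark workers.

    `sc` is assumed to be a live SparkContext; a result of {'0'} means
    every task saw the expected setting, while None in the set means
    at least one worker did not have the variable.
    """
    rdd = sc.parallelize(range(num_partitions), num_partitions)
    return set(rdd.map(report_hash_seed).collect())
```

Running report_hash_seed(None) in the notebook only tells you about the driver process, which is why the .bashrc check above looked fine while the job still failed.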
Thanks to this post, which saved my day: http://comments.gmane.org/gmane.comp.lang.scala.spark.user/24459
Most of the time, I get the impression that machine learning is statistics in a beautiful costume. Here is an article that gives some idea of the difference between the two subjects.
Ran into an error: "module 'urllib' has no attribute 'request'"
The script ran well before I threw it into parallel mode by calling sc.parallelize(data, 8). The Spark log shows the above error. At first, I could not find any solution by googling. I printed the Python version in use, which is 3.5, and had no clue what went wrong.
After some exploration, I finally found the solution: put the statement import urllib.request right before the call to urllib.request.urlopen(…). Is this caused by the fact that I am using Jupyter, where the import statement was in another cell? If that cell only had import urllib, a plausible explanation is that a plain import urllib does not load the urllib.request submodule, so the attribute exists only in processes where something else has already imported it.
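A defensive pattern for functions that will run on workers is to import what they need inside the function body, so the import happens in whichever process executes it. This is a minimal sketch; fetch_length is a made-up name, and the Spark call in the trailing comment assumes an existing SparkContext sc and a list urls.

```python
def fetch_length(url):
    """Download `url` and return the number of bytes received."""
    # Imported here so the submodule is loaded in the process that runs
    # this function, not only in the notebook cell that defined it.
    import urllib.request

    with urllib.request.urlopen(url) as response:
        return len(response.read())


# Hypothetical Spark usage:
# lengths = sc.parallelize(urls, 8).map(fetch_length).collect()
```

Repeated imports are cheap because Python caches loaded modules, so doing this once per call costs little.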