Python: pass-by-reference

Some articles call Python's parameter-passing strategy pass-by-value, while others call it pass-by-reference. The fact is that Python always passes by reference. So where does the confusion come from? Why does Python sometimes act like pass-by-value?

STATEMENT: if one passes the reference of an object, then the callee shares the same object as the caller. [This is TRUE] Any changes made to that object [the troublemaker] in the callee _should_ be reflected in the caller. We will show that this statement is a little tricky. In the end, this article stands by the assertion that "Python passes parameters by reference".

Let's start with an example:
>>> def ref_demo(x):
...     x = 7
>>> x = 6
>>> ref_demo(x)
>>> print(x)
6
Should the output of the program above be 7? The result, however, is 6. That looks just like pass-by-value: instead of passing the reference, the value of x, i.e. 6, is passed to the function, which is why the change did not propagate back to the caller. So is that true? No! It is just an illusion: Python always passes by reference. Two reasons explain this conflict. The statement we laid out at the beginning turns out to be inaccurate in the case of Python.

Mutable vs. Immutable

In Python, there are two kinds of objects: mutable and immutable, which basically tells whether the content of an object is subject to change or not. Immutable types include integers, strings, and tuples, while mutable types include lists, sets, and dicts. Since an immutable object cannot be changed, any attempt to change it implicitly instructs the program to create a new object (bound to the same name). We can use the id() function to justify this claim.
First, we create an immutable object, an integer:
>>> x = 6
>>> id(x)
This is the identity of the variable x, which is unique per object. Now we try to change the value and look up its identity again:
>>> x = 7
>>> id(x)
That's it! This object x (storing 7) is not the one (storing 6) we used to know!

Now return to the function call. We can print the identities to test our theory:
>>> def ref_demo(x):
...     print(id(x))
...     x = 7
...     print(id(x))
>>> x = 6
>>> id(x)
>>> ref_demo(x)
The first identity printed inside the function is indeed the same as that of the variable in the caller, i.e. pass-by-reference. However, once we try to change the value, the identity changes.

Thus, Python follows pass-by-reference. It looks different from C++ and other languages because the distinction between mutable and immutable objects is what makes the difference.

Assignment vs. Modification 

There is another case that also seems to falsify our statement. See the example below.
>>> def ref_demo(l1):
...     l1 = l1 + ['b']
>>> ls = ['a']
>>> ref_demo(ls)
>>> ls
['a']
We claimed that Python always passes by reference and that the pass-by-value illusion is caused by immutable objects. But in this example the conflict still seems legitimate even for a mutable object: if Python passes by reference, why did we fail to change the content of the list?

Here is the explanation. Saying that an object is mutable means it is possible to modify its content, which is not the same as an assignment. The int 6 bound to x is immutable, but we can still perform an assignment x = 7, which creates a new object. Here, l1 = l1 + ['b'] consists of two steps: (1) the addition of two lists, and (2) an assignment. The first step creates a new object, a new list, and the second step rebinds the name to it, so the change never reaches the original list. We can use the same technique as above to justify this claim, i.e. the id() function.
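We can sketch that check with id() (a minimal sketch; the names mirror the earlier example):

```python
def ref_demo(l1):
    ids = [id(l1)]       # same object the caller passed in
    l1 = l1 + ['b']      # (1) build a new list, then (2) rebind the local name
    ids.append(id(l1))   # now a different object
    return ids

ls = ['a']
caller_id = id(ls)
inside_before, inside_after = ref_demo(ls)
print(caller_id == inside_before)  # True: the reference was passed
print(caller_id == inside_after)   # False: the assignment created a new list
print(ls)                          # ['a'] -- the original is untouched
```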

So how do we perform a modification that does change the original object, which is, after all, the purpose of pass-by-reference?

AFAIK, here are two ways:
(1) l1 += ['b']
(2) l1.append('b') or l1.extend(['b'])

The first one looks a little confusing, as we usually think of it as an operation equivalent to l1 = l1 + ['b']. However, it's not: += on a list modifies the list in place.
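A quick way to see the difference is to keep a second name bound to the same list and watch what each form does to it:

```python
ls = ['a']
alias = ls        # a second reference to the same list object

ls = ls + ['b']   # builds a new list and rebinds ls; alias keeps the old one
print(alias)      # ['a']

ls = ['a']
alias = ls
ls += ['b']       # in-place extension (list.__iadd__); same object as alias
print(alias)      # ['a', 'b']
```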

As a rule of thumb, both cases come down to the same devil: 'assignment'. A reference is an alias of an object, and any attempt to change the reference itself, i.e. an assignment, rebinds the name to a new object rather than modifying the original.

To conclude, in some cases it is easy to confuse pass-by-value and pass-by-reference because of the extra rules imposed by Python, but deep in the core, Python follows pass-by-reference without a doubt.



Pyspark Practice (2) – SPARK_LOCAL_DIRS


You may have run into the error that there is no space left on the disk for shuffle RDD data, although you seem to have more than enough disk space.

It happens because we usually allocate a not-so-large partition for the system dir /tmp, while Spark by default uses /tmp for shuffle RDD data, which can be quite large. (There are some posts questioning whether Spark ever cleans up this temporary data – which could be a severe problem, but one I personally did not confirm.) Anyway, as you can guess by now, SPARK_LOCAL_DIRS is designed for this purpose: it specifies the location for temporary data.

You could configure this variable in conf/, e.g. pointing it at a larger local volume.
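For instance, a line like the following (the spark-env.sh filename and the example path are my assumptions; the post only says conf/):

```shell
# conf/spark-env.sh -- example only: point Spark's scratch space
# at a roomy local disk instead of the small /tmp partition
export SPARK_LOCAL_DIRS=/data/spark-tmp
```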


There is also spark.local.dir in conf/spark-defaults.conf for the same purpose, which however will be overridden by SPARK_LOCAL_DIRS.

Pyspark Practice (1) – PYTHONHASHSEED

Here is a small chunk of code for testing the Spark RDD join function.

a = [(1, 'a'), (2, 'b')]
b = [(1, 'c'), (4, 'd')]
ardd = sc.parallelize(a)
brdd = sc.parallelize(b)

def merge(a, b):
    if a is None:
        return b
    if b is None:
        return a
    return a + b

ardd.fullOuterJoin(brdd).map(lambda x: (x[0], merge(x[1][0], x[1][1]))).collect()

This code works fine. But when I applied it to my real data (reading from HDFS, joining, and writing back), I ran into the PYTHONHASHSEED problem again! YES, AGAIN. I never got the chance to fix this problem before.

This problem happens on Python 3.3+. The line of code responsible for this trouble is in python/pyspark/, line 74.

 if sys.version >= '3.3' and 'PYTHONHASHSEED' not in os.environ:
        raise Exception("Randomness of hash of string should be disabled via PYTHONHASHSEED")

After searching around and trying many different proposals, I got really frustrated. It seems the community knows this issue well, and Spark on GitHub seems to have fixed it (2015), while my version (2016) still does not work.

A few options I found:

  1. Put export PYTHONHASHSEED=0 in .bashrc
    • Failed. In a notebook, I could read os.environ['PYTHONHASHSEED'] and it was correctly set. This is the correct way for a standalone Python program, but not for a Spark cluster.
    • A possible reason is that PySpark has its own set of environment variables. It is not about propagating this variable across workers either, because even if every worker has this variable exported in .bashrc, Spark still complains.
  2. Pass the variable to pyspark when starting the notebook
    • Doesn't work either. Some suggested this, but unfortunately nothing fortunate happened, and I don't think I am even using YARN.

Anyway, in the end, I found the solution from this link. Most of the pssh part can be ignored. The only line that matters places export PYTHONHASHSEED=0 into conf/ for each worker, which confirms that PYTHONHASHSEED=0 must somehow be placed into the Spark runtime environment.
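Concretely, the working setup amounts to something like this on every worker (the spark-env.sh filename is my assumption; the post only says conf/):

```shell
# conf/spark-env.sh on each worker -- make the Spark runtime see the seed
# so string hashing is consistent across Python 3.3+ worker processes
export PYTHONHASHSEED=0
```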

Thanks to this post, which saved my day.

module ‘urllib’ has no attribute ‘request’

Ran into an error: "module 'urllib' has no attribute 'request'".

The script ran well before I moved it into parallel mode by calling sc.parallelize(data, 8). The Spark log shows the above error. So far, I could not find any solution by googling. I printed the Python version used, which is 3.5, and had no clue where it went wrong.


After some exploration, I finally found the solution: put the statement import urllib.request right before using urllib.request.urlopen(...). Is this caused by the fact that I am using Jupyter, in which the import statement was in another cell?
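The fix can be sketched as a worker function like this (the fetch function and its use are illustrative, not the original script):

```python
import urllib  # on Python 3 this alone may leave urllib.request unbound

def fetch(url):
    # Import the submodule where the code actually runs, so every Spark
    # worker process resolves it -- not just the driver's notebook cell.
    import urllib.request
    with urllib.request.urlopen(url) as resp:
        return resp.read()

# e.g. sc.parallelize(urls, 8).map(fetch).collect()
```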