HMAC provides digital signatures using symmetric keys instead of PKI. Without the complexities of public/private keys, roots of trust, and certificate chains, you can still have a reliable digital signature with HMAC.

HMAC relies on symmetric key cryptography with pre-shared secrets instead of private/public pairs.

The downside is the same as with symmetric key cryptography in general – key distribution and protection of your secret keys.

SHA256(key||data), which uses the Merkle–Damgård construction, is vulnerable to a length extension attack:

given H(x), it’s very simple to find H(x||y), even if you only know the length of x, because of how the construction works.

Essentially, the construction works like this: you have a state variable that starts at a fixed value specified by the algorithm. You split the input to the hash function into blocks of a size specified by the algorithm (padding the last block if it is too short), and for each block you use the current block and the current state to compute the new state via a compression function specified by the algorithm. The state after processing the last block is the hash value.

With any function using this construction, if you know the length of x, you can compute the padding p used. Then if you have H(x), you have the state after processing every block of x||p, which means you can proceed from there to compute H(x||p||y).
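The attack can be demonstrated with a toy Merkle–Damgård hash. To be clear, everything below (block size, padding rule, compression function) is invented for illustration and is nothing like real SHA-256; the point is only that exposing the final state lets anyone resume hashing.

```python
# A toy Merkle-Damgard hash to illustrate length extension.
# NOT a real hash: block size, padding, and compression are made up.
BLOCK = 8

def pad(length):
    # Simplified padding: zero-fill to a block boundary (real SHA-256
    # padding also encodes the message length).
    rem = length % BLOCK
    return b'' if rem == 0 else b'\x00' * (BLOCK - rem)

def compress(state, block):
    # Toy compression function mixing one block into the state.
    for byte in block:
        state = (state * 31 + byte) % (1 << 32)
    return state

def md_hash(data, state=0x12345678):
    # Iterate the compression function over padded blocks; the final
    # state is the hash. The optional `state` parameter models exactly
    # what makes length extension possible: resuming mid-stream.
    data = data + pad(len(data))
    for i in range(0, len(data), BLOCK):
        state = compress(state, data[i:i + BLOCK])
    return state

# The attacker knows H(key || data) and the input length (14 bytes
# here), but NOT the input itself.
mac = md_hash(b'secretkey-data')

# Length extension: resume from the published state to hash
# key || data || padding || suffix without knowing the key.
suffix = b'evil'
forged = md_hash(suffix, state=mac)
legit = md_hash(b'secretkey-data' + pad(14) + suffix)
assert forged == legit
```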

That means that an attacker who knows the length of your MAC key and knows a particular value of SHA256(key||data) can easily compute SHA256(key||data||otherdata) for some given otherdata. They can choose most of the other data, but even if they couldn’t, it’s a fatal flaw in a MAC scheme if an attacker without the key can forge any MAC-data pair from other legitimate MAC-data pairs.

Incidentally, SHA256(data||key), while not vulnerable to length extension, is vulnerable to collisions in SHA256, which can also produce collisions in the proposed MAC due to the same iterated construction. HMAC’s nesting prevents these and various other attacks. With non-Merkle–Damgård hashes, you don’t necessarily need the HMAC construction, though.
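In practice you should not roll your own construction at all: Python’s standard-library hmac module implements HMAC correctly. A minimal sketch (the key and message values are placeholders):

```python
import hashlib
import hmac

key = b'shared-secret'       # pre-shared symmetric key (example value)
msg = b'important message'

# HMAC-SHA256 tag using the nested construction
tag = hmac.new(key, msg, hashlib.sha256).hexdigest()

def verify(key, msg, tag):
    # Recompute the tag and compare in constant time to avoid
    # timing side channels.
    expected = hmac.new(key, msg, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, tag)

assert verify(key, msg, tag)
assert not verify(key, b'tampered message', tag)
```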


Python: pass-by-reference

Some articles call Python’s argument-passing strategy pass-by-value, while others call it pass-by-reference. The fact is that Python always passes by reference. Then where does this confusion come from? Why does Python sometimes act like pass-by-value?

STATEMENT: if one passes the reference of an object, then the callee shares the same object as the caller. Any changes made to that object in the callee _should_ be reflected at its origin in the caller.
>>> def ref_demo(x):
...     x = 7
...
>>> x = 6
>>> ref_demo(x)
>>> print(x)
6
In that case, should the output of the program above be 7? The result, however, is 6. That looks just like pass-by-value: instead of passing the reference, the value of x, i.e. 6, seems to be passed to the function, which is why the change did not reflect back to the caller. But that is just an illusion; Python is always pass-by-reference. There are two reasons that explain this conflict, and the statement we laid out at the beginning turns out to be inaccurate in the case of Python.

Mutable vs. Immutable

In Python, there are two kinds of objects: mutable and immutable, which basically tells whether the content of an object is subject to change or not. Immutable types include integers, strings, and tuples, while mutable types include lists, sets, and dicts. Since immutable objects cannot be changed, any attempt to change one implicitly instructs the program to create a new object (under the same name). We can use the built-in id() function to justify this claim.
First, we create an immutable object, an integer:
>>> x = 6
>>> id(x)
This is the identity of variable x, which is unique. Now we try to change the value and look up its identity again:
>>> x = 7
>>> id(x)
There it is! This object x (storing 7) is not the one (storing 6) we used to know!

Now return to the function call. We can print the identities to test our theory:
>>> def ref_demo(x):
...     print(id(x))
...     x = 7
...     print(id(x))
...
>>> x = 6
>>> id(x)
>>> ref_demo(x)
The first identity printed is indeed the same as that of the variable in the caller, i.e. pass-by-reference. However, once we try to change the value, the identity changes.

Thus, Python follows pass-by-reference. It differs from C++ and other languages because it introduces mutable and immutable objects, which is what makes the difference.
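Both behaviors can be summarized in a few lines (a sketch; the id() comparisons reflect CPython’s behavior):

```python
x = 6
before = id(x)
x = 7                # rebinding: x now names a new int object
assert id(x) != before

l = ['a']
before = id(l)
l.append('b')        # mutation: still the very same list object
assert id(l) == before
```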

Assignment vs. Modification 

There is another case that also falsifies our statement. See the example below.
>>> def ref_demo(l1):
...     l1 = l1 + ['b']
...
>>> ls = ['a']
>>> ref_demo(ls)
>>> ls
['a']
We claimed that Python always passes by reference and that the pass-by-value illusion is caused by immutable objects. But in this example, the conflict still seems legitimate for a mutable object. If it is pass-by-reference, why did we fail to change the content of the list?

Here is the explanation. Saying that an object is mutable means it is possible to modify its content, which is not the same as an assignment. We know an int x = 6 is immutable, but we can still perform an assignment x = 7, which creates a new object. Here, l1 = l1 + ['b'] consists of two steps: (1) an addition operation on two lists, and (2) an assignment. The second step creates a new object, a new list, so the change was not reflected in the original list. We can use the same technique as above to justify this claim, i.e. the id() function.

So how do we perform a modification such that we can make changes to the original object, since that is the purpose of pass-by-reference?

AFAIK, here are two ways:
(1) l1 += ['b']
(2) l1.append('b') or l1.extend(['b'])

The first one may seem a little confusing, as we usually think of it as an operation equivalent to l1 = l1 + ['b']. However, it is not: for lists, += calls __iadd__, which modifies the list in place.
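A small sketch contrasting the three spellings (the function names are mine, chosen just for this demo):

```python
def assign(l1):
    l1 = l1 + ['b']    # builds a NEW list, rebinds the local name only

def augment(l1):
    l1 += ['b']        # list.__iadd__ extends the SAME list in place

def append(l1):
    l1.append('b')     # in-place method on the shared object

a, b, c = ['a'], ['a'], ['a']
assign(a)
augment(b)
append(c)
print(a, b, c)         # ['a'] ['a', 'b'] ['a', 'b']
```

Only the caller’s first list is left unchanged, because only the first function performs an assignment rather than a modification.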

Using this rule of thumb, we can trace both cases back to the same devil: assignment. A reference is an alias of a variable/object, and any attempt to rebind a reference, i.e. an assignment, causes the creation of a new object.

To conclude, in some cases it is easy to confuse pass-by-value and pass-by-reference because of the extra rules imposed by Python, but deep in the core, Python follows pass-by-reference without a doubt.


Pyspark Practice (2) – SPARK_LOCAL_DIRS


You may have run into the error that there is no space left on the disk for shuffle RDD data, although you seem to have more than enough disk space.

It happens because we usually allocate a not-so-large space for the system dir /tmp, while Spark by default uses /tmp for shuffle RDD data, which can be quite large. (There are some posts questioning whether Spark ever cleans up temporary data – which could be a severe problem that I personally have not confirmed.) Anyway, as you can guess by now, SPARK_LOCAL_DIRS is designed for this purpose: it specifies the location for temporary data.

You could configure this variable in conf/, e.g. to use HDFS.
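A minimal sketch of the setting, assuming the usual conf/spark-env.sh location; the directory paths below are placeholders for whatever volumes have enough free space:

```shell
# conf/spark-env.sh
# Point shuffle/spill data at large local volumes (example paths).
# Multiple directories may be given as a comma-separated list.
export SPARK_LOCAL_DIRS=/data1/spark-tmp,/data2/spark-tmp
```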


There is also spark.local.dir in conf/spark-defaults.conf for the same purpose, which however will be overridden by SPARK_LOCAL_DIRS.

Pyspark Practice (1) – PYTHONHASHSEED

Here is a small chunk of code for testing Spark RDD join function.

a = [(1, 'a'), (2, 'b')]
b = [(1, 'c'), (4, 'd')]
ardd = sc.parallelize(a)
brdd = sc.parallelize(b)

def merge(a, b):
    if a is None:
        return b
    if b is None:
        return a
    return a + b

ardd.fullOuterJoin(brdd).map(lambda x: (x[0], merge(x[1][0], x[1][1]))).collect()

This code works fine. But when I applied it to my real data (reading from HDFS, joining, and writing it back), I ran into the PYTHONHASHSEED problem again! YES, AGAIN. I did not get a chance to fix this problem before.

This problem happens for Python 3.3+. The code responsible for this trouble is in python/pyspark/, line 74:

if sys.version >= '3.3' and 'PYTHONHASHSEED' not in os.environ:
    raise Exception("Randomness of hash of string should be disabled via PYTHONHASHSEED")
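Why Spark insists on this: since Python 3.3, string hashing is randomized per interpreter process, so two workers could partition the same keys differently. The effect can be sketched by hashing the same string in fresh interpreter processes (the string 'spark' is an arbitrary example):

```python
import os
import subprocess
import sys

# With PYTHONHASHSEED=0, hash('spark') is the same in every fresh
# interpreter -- what Spark needs so all workers agree on partitions.
env = dict(os.environ, PYTHONHASHSEED='0')
fixed = {
    subprocess.check_output([sys.executable, '-c', "print(hash('spark'))"],
                            env=env).strip()
    for _ in range(3)
}
assert len(fixed) == 1    # deterministic across runs

# Without the variable (the Python 3.3+ default), each interpreter
# start picks a random seed, so separate runs usually disagree.
env.pop('PYTHONHASHSEED')
randomized = {
    subprocess.check_output([sys.executable, '-c', "print(hash('spark'))"],
                            env=env).strip()
    for _ in range(3)
}
# `randomized` will typically contain several distinct values.
```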

After searching around and trying many different proposals, I got really frustrated. It seems the community knows this issue well, and Spark on GitHub seems to have fixed it (2015), while my version (2016) still does not work.

A few options I found:

  1. Put export PYTHONHASHSEED=0 in .bashrc
    • Failed. In a notebook, I could print os.environ['PYTHONHASHSEED'] and it was correctly set. This is the correct way for a standalone Python program, but not for a Spark cluster.
    • A possible reason is that PySpark has its own set of environment variables. It is not about propagating this variable across workers either, because even if every worker has this variable exported in .bashrc, it still complains.
  2. Pass the variable to pyspark when starting the notebook.
    • Doesn’t work either. Unfortunately, nothing fortunate happened, and I don’t think I am even using YARN.

Anyway, in the end I found the solution in this link. Most of the pssh part can be ignored. The only line that matters is placing export PYTHONHASHSEED=0 into conf/ for each worker, which confirms that PYTHONHASHSEED=0 must somehow be placed into the Spark runtime environment.

Thanks to this post, which saved my day: