Jupyter is the Swiss Army knife for data scientists; I have been addicted to it since the day I discovered it. Given Python’s limitations, especially when dealing with high-volume data, we may naturally wonder how to bring in Spark, another fantastic tool, but one built for parallel data processing. This tutorial binds them together into a powerful cannon.
If you don’t have Jupyter yet, install it.
$ sudo apt-get install python-pip python3-pip
$ sudo pip install jupyter
$ sudo pip3 install jupyter
$ jupyter notebook
Although you installed Jupyter for both Python 2 and Python 3, if you check the ‘New’ dropdown you may see only one kernel, either ‘Python 2’ or ‘Python 3’, but not both. To enable both kernels, simply run the following:
$ python3 -m ipykernel install --user
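If you are curious whether both kernels are registered now, ‘jupyter kernelspec list’ prints every kernel Jupyter knows about (the paths in its output vary from system to system):
$ jupyter kernelspec list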
Create a file for testing:
$ echo "abcdabcef" >> README.md
Now, run ‘$ jupyter notebook’ anywhere you want. Once the page is up, create a new notebook using Python 3 (or Python 2 if you want) and try this example:
from pyspark import SparkContext
logFile = "README.md"
logData = sc.textFile(logFile).cache()
numAs = logData.filter(lambda s: 'a' in s).count()
numBs = logData.filter(lambda s: 'b' in s).count()
print("Lines with a: %i, lines with b: %i" % (numAs, numBs))
Of course, you’ll get an “ImportError: No module named ‘pyspark’”, because Jupyter has not been integrated with Spark yet.
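The root cause is simply that pyspark is not on the kernel’s sys.path. For the curious, here is a manual workaround sketch, assuming a hypothetical Spark installation under /usr/local/spark; the integration below makes this unnecessary:
import glob
import os
import sys

# Hypothetical location: point this at your own Spark installation.
SPARK_HOME = "/usr/local/spark"
os.environ.setdefault("SPARK_HOME", SPARK_HOME)
# pyspark itself lives under SPARK_HOME/python ...
sys.path.insert(0, os.path.join(SPARK_HOME, "python"))
# ... and it needs the bundled py4j bridge (the version suffix varies by release).
sys.path.insert(0, glob.glob(os.path.join(SPARK_HOME, "python/lib/py4j-*-src.zip"))[0])

from pyspark import SparkContext
sc = SparkContext("local")  # the pyspark shell normally creates this for you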
Yet if you run ‘$ pyspark’ in a terminal, it should greet you with the Spark welcome doodle:
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.0.1
      /_/

Using Python version 2.7.12 (default, Jul  1 2016 15:12:24)
SparkSession available as 'spark'.
>>>
Ok. Here comes the magic. Copy and paste the following lines into your ~/.bashrc. (This assumes SPARK_HOME already points at your Spark installation; if it does not, add an ‘export SPARK_HOME=…’ line for it first.)
export PATH=$SPARK_HOME/bin:$PATH
export PYSPARK_DRIVER_PYTHON=/usr/local/bin/jupyter
export PYSPARK_DRIVER_PYTHON_OPTS="notebook --NotebookApp.port=8880"
export PYSPARK_PYTHON=/usr/bin/python3.5
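A quick decoding of the magic: PYSPARK_DRIVER_PYTHON swaps the driver’s Python for the jupyter launcher, PYSPARK_DRIVER_PYTHON_OPTS hands it the ‘notebook’ subcommand plus the port to serve on, and PYSPARK_PYTHON chooses the interpreter the workers run. The two interpreter paths above are typical for Ubuntu, but do verify yours with ‘$ which jupyter’ and ‘$ which python3.5’ before copying them blindly.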
Reload the configuration with ‘$ source ~/.bashrc’, run ‘$ pyspark’ again, and you will see a different interface, like the one below:
[W 09:24:01.833 NotebookApp] WARNING: The notebook server is listening on all IP addresses and not using encryption. This is not recommended.
[W 09:24:01.833 NotebookApp] WARNING: The notebook server is listening on all IP addresses and not using authentication. This is highly insecure and not recommended.
[I 09:24:01.839 NotebookApp] Serving notebooks from local directory: /home/dvlabs
[I 09:24:01.839 NotebookApp] 0 active kernels
[I 09:24:01.839 NotebookApp] The Jupyter Notebook is running at: http://[all ip addresses on your system]:8880/
[I 09:24:01.840 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
Now, open the browser, go to ‘http://localhost:8880’, create a notebook with Python 3, and rerun our example. And?! It works. Just like that.
# 'sc' is the SparkContext the PySpark driver has already created for this kernel
from pyspark import SparkContext
logFile = "README.md"
logData = sc.textFile(logFile).cache()
numAs = logData.filter(lambda s: 'a' in s).count()
numBs = logData.filter(lambda s: 'b' in s).count()
print("Lines with a: %i, lines with b: %i" % (numAs, numBs))
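To convince yourself that the kernel really is wired up to Spark, here is a quick sanity-check sketch; both ‘sc’ and ‘spark’ are injected into the kernel by the PySpark driver, as the banner promised:
# Objects created for us when PySpark launched this kernel:
print(sc.version)    # the Spark version, e.g. 2.0.1
print(sc.pythonVer)  # the Python version the workers run
print(spark)         # the SparkSession advertised in the banner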
The reason I keep mentioning Python 3 is purely syntax. If you are more comfortable with Python 2, go for it! Just modify the code accordingly.