I installed Spark, ran sbt assembly, and can open bin/pyspark with no problem. However, I’m running into problems loading the pyspark module into IPython. I get the following error:
In [1]: import pyspark
---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
<ipython-input-1-c15ae3402d12> in <module>()
----> 1 import pyspark

/usr/local/spark/python/pyspark/__init__.py in <module>()
     61
     62 from pyspark.conf import SparkConf
---> 63 from pyspark.context import SparkContext
     64 from pyspark.sql import SQLContext
     65 from pyspark.rdd import RDD

/usr/local/spark/python/pyspark/context.py in <module>()
     28 from pyspark.conf import SparkConf
     29 from pyspark.files import SparkFiles
---> 30 from pyspark.java_gateway import launch_gateway
     31 from pyspark.serializers import PickleSerializer, BatchedSerializer, UTF8Deserializer, \
     32     PairDeserializer, CompressedSerializer

/usr/local/spark/python/pyspark/java_gateway.py in <module>()
     24 from subprocess import Popen, PIPE
     25 from threading import Thread
---> 26 from py4j.java_gateway import java_import, JavaGateway, GatewayClient
     27
     28

ImportError: No module named py4j.java_gateway
In my environment (using Docker and the image sequenceiq/spark:1.1.0-ubuntu), I ran into this. If you look at the pyspark shell script, you’ll see that you need a few things added to your PYTHONPATH:
export PYTHONPATH=$SPARK_HOME/python/:$PYTHONPATH
export PYTHONPATH=$SPARK_HOME/python/lib/py4j-0.8.2.1-src.zip:$PYTHONPATH
That worked in IPython for me.
Update: as noted in the comments, the name of the py4j zip file changes with each Spark release, so look around for the right name.
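If you’d rather not touch the shell config at all, the same paths can be added from inside Python before importing pyspark. This is a minimal sketch, assuming SPARK_HOME is set and the py4j zip sits in the usual python/lib directory:

import glob
import os
import sys

# Assumption: SPARK_HOME points at your Spark installation.
spark_home = os.environ["SPARK_HOME"]
sys.path.append(os.path.join(spark_home, "python"))

# The py4j zip name changes per release (see the update above),
# so glob for it instead of hard-coding the version.
py4j_zip = glob.glob(os.path.join(spark_home, "python", "lib", "py4j-*-src.zip"))[0]
sys.path.append(py4j_zip)

import pyspark  # should now import cleanly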
I solved this problem by adding some paths in .bashrc
export SPARK_HOME=/home/a141890/apps/spark
export PYTHONPATH=$SPARK_HOME/python/:$PYTHONPATH
export PYTHONPATH=$SPARK_HOME/python/lib/py4j-0.8.2.1-src.zip:$PYTHONPATH
After this, I never hit ImportError: No module named py4j.java_gateway again.
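To confirm the .bashrc change took effect, open a fresh shell and run a quick check from plain Python (the printed paths are whatever your exports set):

import os

# These should reflect the exports from .bashrc.
print(os.environ.get("SPARK_HOME"))
print(os.environ.get("PYTHONPATH"))

# The import that used to fail, plus where pyspark was actually found.
import py4j.java_gateway
import pyspark
print(pyspark.__file__)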
Install the py4j module with pip:
pip install py4j
I got this problem with Spark 2.1.1 and Python 2.7.x. I’m not sure whether Spark stopped bundling this package in its latest distributions, but installing the py4j module solved the issue for me.
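To see whether you are in this situation, check two things: is py4j importable at all, and does your Spark distribution bundle its own copy? A sketch, assuming SPARK_HOME is set (the /usr/local/spark fallback is just a guess):

import glob
import os

# Is py4j importable from the current environment?
try:
    import py4j
    print("py4j found at:", py4j.__file__)
except ImportError:
    print("py4j missing; run: pip install py4j")

# What does the Spark distribution itself bundle?
spark_home = os.environ.get("SPARK_HOME", "/usr/local/spark")
print(glob.glob(os.path.join(spark_home, "python", "lib", "py4j-*.zip")))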
Before running the above script, ensure that you have unzipped the py4j*.zip file and added a reference to it in the script:

sys.path.append("path to spark*/python/lib")
It worked for me.
# /home/shubham/spark-1.6.2
import os
import sys

# Set the path for the Spark installation
# (this is the path where you built Spark using sbt/sbt assembly)
os.environ['SPARK_HOME'] = "/home/shubham/spark-1.6.2"
# os.environ['SPARK_HOME'] = "/home/jie/d2/spark-0.9.1"

# Append to PYTHONPATH so that pyspark can be found
sys.path.append("/home/shubham/spark-1.6.2/python")
sys.path.append("/home/shubham/spark-1.6.2/python/lib")
# sys.path.append("/home/jie/d2/spark-0.9.1/python")

# Now we are ready to import Spark modules
try:
    from pyspark import SparkContext
    from pyspark import SparkConf
    print "Hey nice"
except ImportError as e:
    print "Error importing Spark Modules", e
    sys.exit(1)
To set up PySpark with Python 3.8, add the paths below to your bash profile (Mac):
export SPARK_HOME=/Users/<username>/spark-3.0.1-bin-hadoop2.7
export PATH=$PATH:/Users/<username>/spark-3.0.1-bin-hadoop2.7/bin
export PYSPARK_PYTHON=python3
export PYTHONPATH=$SPARK_HOME/python/:$PYTHONPATH
export PYTHONPATH=$SPARK_HOME/python/lib/py4j-0.10.9-src.zip:$PYTHONPATH
NOTE: Use the py4j path present in your downloaded Spark package.
Save the updated bash profile (Ctrl + X if you’re editing in nano).
Reload it in the current shell: source ~/.bash_profile
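After sourcing the profile, a short smoke test confirms that pyspark and py4j are wired up end to end. This is a sketch: local[1] and the app name are arbitrary choices, and it assumes the Spark 3.0.1 download from above:

from pyspark.sql import SparkSession

# Starting a local session exercises the py4j gateway end to end.
spark = (SparkSession.builder
         .master("local[1]")
         .appName("py4j-smoke-test")
         .getOrCreate())

print(spark.version)  # should match the downloaded release, e.g. 3.0.1
spark.stop()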