https://cwiki.apache.org/confluence/display/Hive/HiveClient#HiveClient-Python appears to be outdated.
When I add this to /etc/profile:
I can then do the imports as listed in the link, with the exception of from hive import ThriftHive, which actually needs to be:

    from hive_service import ThriftHive
Next the port in the example was 10000, which when I tried caused the program to hang. The default Hive Thrift port is 9083, which stopped the hanging.
So I set it up like so:

    from thrift import Thrift
    from thrift.transport import TSocket
    from thrift.transport import TTransport
    from thrift.protocol import TBinaryProtocol

    try:
        transport = TSocket.TSocket('<node-with-metastore>', 9083)
        transport = TTransport.TBufferedTransport(transport)
        protocol = TBinaryProtocol.TBinaryProtocol(transport)
        client = ThriftHive.Client(protocol)
        transport.open()
        client.execute("CREATE TABLE test(c1 int)")
        transport.close()
    except Thrift.TException, tx:
        print '%s' % (tx.message)

I received the following error:

    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/usr/lib/hive/lib/py/hive_service/ThriftHive.py", line 68, in execute
        self.recv_execute()
      File "/usr/lib/hive/lib/py/hive_service/ThriftHive.py", line 84, in recv_execute
        raise x
    thrift.Thrift.TApplicationException: Invalid method name: 'execute'

But inspecting the ThriftHive.py file reveals that the Client class does have an execute method.
How may I use Python to access Hive?
I believe the easiest way is to use PyHive.
To install it you'll need these libraries:

    pip install sasl
    pip install thrift
    pip install thrift-sasl
    pip install PyHive

Please note that although you install the library as PyHive, you import the module as pyhive, all lower-case.

If you're on Linux, you may need to install SASL separately before running the above. Install the package libsasl2-dev using apt-get, yum, or whatever package manager your distribution uses. For Windows there are some options on GNU.org, where you can download a binary installer. On a Mac, SASL should be available if you've installed the Xcode developer tools (xcode-select --install in Terminal).
After installation, you can connect to Hive like this:

    from pyhive import hive
    conn = hive.Connection(host="YOUR_HIVE_HOST", port=PORT, username="YOU")

Now that you have the Hive connection, you have options for how to use it. You can just straight-up query:

    cursor = conn.cursor()
    cursor.execute("SELECT cool_stuff FROM hive_table")
    for result in cursor.fetchall():
        use_result(result)

…or use the connection to make a Pandas dataframe:

    import pandas as pd
    df = pd.read_sql("SELECT cool_stuff FROM hive_table", conn)

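If you would rather have plain dictionaries than tuples or a dataframe, the cursor's description attribute (part of the Python DB-API, PEP 249, which PyHive follows) carries the column names. The following is only a minimal sketch: rows_as_dicts is a hypothetical helper name, and the standard-library sqlite3 module stands in for Hive so that the example runs without a server. With PyHive you would pass the cursor obtained from conn.cursor() after calling execute.

```python
import sqlite3  # stand-in DB-API driver so the sketch runs without a Hive server


def rows_as_dicts(cursor):
    """Return all remaining rows of a DB-API cursor as a list of dicts."""
    # cursor.description is a sequence of 7-item tuples; item 0 is the column name.
    columns = [col[0] for col in cursor.description]
    return [dict(zip(columns, row)) for row in cursor.fetchall()]


# Demo against an in-memory SQLite table (made-up data, named after the answer's table).
demo_conn = sqlite3.connect(":memory:")
demo_cursor = demo_conn.cursor()
demo_cursor.execute("CREATE TABLE hive_table (cool_stuff TEXT, n INTEGER)")
demo_cursor.executemany("INSERT INTO hive_table VALUES (?, ?)", [("a", 1), ("b", 2)])
demo_cursor.execute("SELECT cool_stuff, n FROM hive_table ORDER BY n")
demo_rows = rows_as_dicts(demo_cursor)
print(demo_rows)
demo_conn.close()
```

The helper only relies on description and fetchall, so it should work with any DB-API cursor, not just the SQLite one used for the demo.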
I believe you are using HiveServer2, which is why the code doesn't work.

You can use pyhs2 to access Hive correctly. Example code looks like this:

    import pyhs2

    with pyhs2.connect(host='localhost',
                       port=10000,
                       authMechanism="PLAIN",
                       user='root',
                       password='test',
                       database='default') as conn:
        with conn.cursor() as cur:
            # Show databases
            print cur.getDatabases()

            # Execute query
            cur.execute("select * from table")

            # Return column info from query
            print cur.getSchema()

            # Fetch table results
            for i in cur.fetch():
                print i

Note that you may need to install python-devel.x86_64 and cyrus-sasl-devel.x86_64 before installing pyhs2 with pip.

Hope this helps.
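Since the question involves two different ports (10000, conventionally HiveServer/HiveServer2, and 9083, conventionally the metastore), it can help to check what is actually listening before picking a client library. The sketch below only tests whether a TCP connection can be opened; it cannot tell which Thrift service answers on the port. port_is_open is a hypothetical helper name, and the demo probes a throwaway local socket rather than a real Hive node.

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts, and timeouts.
        return False


# Demo: listen on an ephemeral local port, probe it, then close it and probe again.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))
listener.listen(1)
demo_port = listener.getsockname()[1]
open_result = port_is_open("127.0.0.1", demo_port)
listener.close()
closed_result = port_is_open("127.0.0.1", demo_port, timeout=1.0)
print(open_result, closed_result)
```

Against a cluster you would call something like port_is_open('<node-with-metastore>', 10000). A port that accepts the connection but then hangs usually points to a protocol or authentication mismatch rather than a wrong port number.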
The Python program below should work to access Hive tables from Python:

    import commands

    cmd = "hive -S -e 'SELECT * FROM db_name.table_name LIMIT 1;' "
    status, output = commands.getstatusoutput(cmd)

    if status == 0:
        print output
    else:
        print "error"

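The output captured this way is one string, so it usually needs splitting before use. Here is a minimal sketch, assuming the Hive CLI prints one row per line with tab-separated columns (check this against your own output). parse_hive_cli_output is a hypothetical helper name and the sample text is made up.

```python
def parse_hive_cli_output(output):
    """Split Hive CLI text output into rows of column strings."""
    rows = []
    for line in output.splitlines():
        if not line.strip():
            continue  # skip blank lines
        rows.append(line.split("\t"))
    return rows


# Made-up sample resembling the output of a three-column query.
sample_output = "1\talice\t3.5\n2\tbob\tNULL\n"
parsed = parse_hive_cli_output(sample_output)
print(parsed)
```

All values come back as strings, so any numeric conversion or NULL handling is left to the caller.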
You can use the hive library. For that you need to import the ThriftHive class with from hive import ThriftHive.

Try this example:

    import sys

    from hive import ThriftHive
    from hive.ttypes import HiveServerException
    from thrift import Thrift
    from thrift.transport import TSocket
    from thrift.transport import TTransport
    from thrift.protocol import TBinaryProtocol

    try:
        transport = TSocket.TSocket('localhost', 10000)
        transport = TTransport.TBufferedTransport(transport)
        protocol = TBinaryProtocol.TBinaryProtocol(transport)

        client = ThriftHive.Client(protocol)
        transport.open()

        client.execute("CREATE TABLE r(a STRING, b INT, c DOUBLE)")
        # HiveQL loads files with LOAD DATA, not LOAD TABLE
        client.execute("LOAD DATA LOCAL INPATH '/path' INTO TABLE r")
        client.execute("SELECT * FROM r")
        while (1):
            row = client.fetchOne()
            if (row == None):
                break
            print row

        client.execute("SELECT * FROM r")
        print client.fetchAll()

        transport.close()
    except Thrift.TException, tx:
        print '%s' % (tx.message)

To connect using a username/password and specifying ports, the code looks like this:

    from pyhive import presto

    cursor = presto.connect(host='host.example.com',
                            port=8081,
                            username='USERNAME:PASSWORD').cursor()

    sql = 'select * from table limit 10'

    cursor.execute(sql)

    print(cursor.fetchone())
    print(cursor.fetchall())

Here's a generic approach that works well for me, because I keep connecting to several servers (SQL, Teradata, Hive, etc.) from Python, so I use the pyodbc connector. Here are some basic steps to get going with pyodbc (in case you have never used it):
- Prerequisite: You should have the relevant ODBC connection configured in your Windows setup before you follow the steps below. If you don't have one, you need to set it up first.
STEP 1. Install with pip:

    pip install pyodbc

(The relevant driver can be downloaded from Microsoft's website.)
STEP 2. Now import it in your Python script:

    import pyodbc

STEP 3. Finally, give the connection details as follows:

    # ODBC connection strings separate KEY=value pairs with semicolons
    conn_hive = pyodbc.connect('DSN=YOUR_DSN_NAME;SERVER=YOUR_SERVER_NAME;UID=USER_ID;PWD=PSWD')

The best part of using pyodbc is that I have to import just one package to connect to almost any data source.
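Because the same pattern is reused for several data sources, it can be convenient to assemble the connection string from keyword arguments. This is only a sketch: build_odbc_connection_string is a hypothetical helper name, the attribute values are the placeholders from the step above, and values that themselves contain semicolons or braces are not handled.

```python
def build_odbc_connection_string(**attributes):
    """Join keyword/value pairs into 'KEY=value;KEY=value' form."""
    return ";".join("{}={}".format(key, value) for key, value in attributes.items())


# Placeholder values; replace them with your own DSN and credentials.
conn_str = build_odbc_connection_string(DSN="YOUR_DSN_NAME", UID="USER_ID", PWD="PSWD")
print(conn_str)

# Passing it on would look like this (requires pyodbc and a configured DSN):
# conn_hive = pyodbc.connect(conn_str)
```

Keyword arguments keep their order in current Python versions, so the string lists the attributes in the order you wrote them.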
The examples above are a bit out of date.
Here is a newer example:

    import pyhs2 as hive
    import getpass

    DEFAULT_DB = 'default'
    DEFAULT_SERVER = '10.37.40.1'
    DEFAULT_PORT = 10000
    DEFAULT_DOMAIN = 'PAM01-PRD01.IBM.COM'

    u = raw_input('Enter PAM username: ')
    s = getpass.getpass()
    connection = hive.connect(host=DEFAULT_SERVER, port=DEFAULT_PORT,
                              authMechanism='LDAP',
                              user=u + '@' + DEFAULT_DOMAIN, password=s)
    statement = "select * from user_yuti.Temp_CredCard where pir_post_dt = '2014-05-01' limit 100"
    cur = connection.cursor()

    cur.execute(statement)
    df = cur.fetchall()

In addition to a standard Python installation, a few libraries need to be installed to allow Python to build the connection to the Hadoop database.
1. Pyhs2, Python Hive Server 2 Client Driver
2. Sasl, Cyrus-SASL bindings for Python
3. Thrift, Python bindings for the Apache Thrift RPC system
4. PyHive, Python interface to Hive
Remember to change the permissions of the script to make it executable:

    chmod +x test_hive2.py

Hope it helps.
It is common practice to prohibit users from downloading and installing packages and libraries on cluster nodes. In that case the solutions of @python-starter and @goks work perfectly, provided hive runs on the same node. Otherwise, one can use beeline instead of the hive command-line tool.

    # python 2
    import commands

    cmd = 'beeline -u "jdbc:hive2://node07.foo.bar:10000/...<your connect string>" -e "SELECT * FROM db_name.table_name LIMIT 1;"'

    status, output = commands.getstatusoutput(cmd)

    if status == 0:
        print output
    else:
        print "error"

    # python 3
    import subprocess

    cmd = 'beeline -u "jdbc:hive2://node07.foo.bar:10000/...<your connect string>" -e "SELECT * FROM db_name.table_name LIMIT 1;"'

    status, output = subprocess.getstatusoutput(cmd)

    if status == 0:
        print(output)
    else:
        print("error")

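When the query or the connect string contains quotes, building one shell string gets fragile. A possible variation for Python 3, sketched under a few assumptions, passes beeline an argument list so that no shell quoting is involved. build_beeline_command and run_beeline are hypothetical helper names; --silent=true and --outputformat=csv2 are beeline options chosen here to get cleaner, comma-separated output, and you may prefer different ones. The connect string is the same placeholder as above.

```python
import subprocess


def build_beeline_command(jdbc_url, query):
    """Return the beeline invocation as an argument list (no shell involved)."""
    return ["beeline", "-u", jdbc_url, "--silent=true", "--outputformat=csv2", "-e", query]


def run_beeline(jdbc_url, query):
    """Run beeline and return its stdout; raises CalledProcessError on failure."""
    completed = subprocess.run(
        build_beeline_command(jdbc_url, query),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,  # decode output to str
        check=True,               # non-zero exit status raises an exception
    )
    return completed.stdout


# Only the command is built here; run_beeline needs beeline on the PATH.
demo_command = build_beeline_command(
    "jdbc:hive2://node07.foo.bar:10000/...<your connect string>",
    "SELECT * FROM db_name.table_name LIMIT 1;",
)
print(demo_command)
```

Calling run_beeline with the same two arguments would execute the query on a node where beeline is installed and return the raw text for further parsing.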