2023.06.12 (MON) Study notes
#Airflow #Hive #HiveServer2Hook #HiveCliHook
In the previous post I accessed Hive by simply passing a query through HiveOperator;
this time I want to go through HiveServer2 and use HiveServer2Hook and HiveCliHook.
1. HiveServer2
HiveServer2 is a service that lets clients run queries against Hive tables. Before using HiveServer2Hook or HiveCliHook, some basic configuration is needed so that HiveServer2 can be reached.
- Check the OS account name (username)
$ hdfs dfs -ls -R /user/hive
drwxr-xr-x - username supergroup 0 2023-06-12 12:31 /user/hive/warehouse
...
- core-site.xml
Add the following to Hadoop's core-site.xml, replacing username with your own account name
# core-site.xml
<property>
<name>hadoop.proxyuser.username.hosts</name>
<value>*</value>
</property>
<property>
<name>hadoop.proxyuser.username.groups</name>
<value>*</value>
</property>
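Since only the account name changes from machine to machine, the two entries above can be generated rather than edited by hand. A minimal sketch using only the standard library; the function name `proxyuser_properties` and the sample account `alice` are made up for illustration, and the output still has to be pasted inside the `<configuration>` element of core-site.xml manually.

```python
from xml.sax.saxutils import escape


def proxyuser_properties(username):
    """Build the hadoop.proxyuser.<username>.hosts/groups property blocks."""
    name = escape(username)  # guard against XML special characters
    blocks = []
    for suffix in ("hosts", "groups"):
        blocks.append(
            "<property>\n"
            f"<name>hadoop.proxyuser.{name}.{suffix}</name>\n"
            "<value>*</value>\n"
            "</property>"
        )
    return "\n".join(blocks)


if __name__ == "__main__":
    # "alice" is a placeholder; use the owner shown by `hdfs dfs -ls -R /user/hive`
    print(proxyuser_properties("alice"))
```

The `*` values allow impersonation from any host and for any group, which matches the settings in this post; a shared cluster would probably want something narrower.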
- hive-site.xml
Add the following to Hive's hive-site.xml
# hive-site.xml
<property>
<name>hive.server2.enable.doAs</name>
<value>true</value>
</property>
<property>
<name>hive.server2.authentication</name>
<value>NONE</value>
</property>
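To confirm that both settings actually ended up in the file, the properties can be read back and compared. This is only a sketch with the standard library: `read_properties`, `missing_settings`, and the inline `SAMPLE` text are illustrative names, and a real check would read the contents of your own hive-site.xml instead of the sample string.

```python
import xml.etree.ElementTree as ET

# The two settings this post adds to hive-site.xml
EXPECTED = {
    "hive.server2.enable.doAs": "true",
    "hive.server2.authentication": "NONE",
}

# Stand-in for the real file contents (assumption for the example)
SAMPLE = """<configuration>
  <property>
    <name>hive.server2.enable.doAs</name>
    <value>true</value>
  </property>
  <property>
    <name>hive.server2.authentication</name>
    <value>NONE</value>
  </property>
</configuration>"""


def read_properties(xml_text):
    """Map every <property>'s <name> to its <value>."""
    root = ET.fromstring(xml_text)
    return {
        prop.findtext("name"): prop.findtext("value")
        for prop in root.iter("property")
    }


def missing_settings(xml_text, expected=EXPECTED):
    """Return the expected settings that are absent or have another value."""
    actual = read_properties(xml_text)
    return {key: value for key, value in expected.items() if actual.get(key) != value}


if __name__ == "__main__":
    print(missing_settings(SAMPLE))  # an empty dict means nothing is missing
```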
- Start hiveserver2
$ hiveserver2
2023-06-12 16:59:58: Starting HiveServer2
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in ...
SLF4J: Found binding in ...
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
Hive Session ID = ...
2. HiveServer2Hook
2-1. Connection setup
- connection setup
Under Admin - Connections, create a new Connection, fill in Connection Type, Host, Login (username), and Port, then enter the following in Extra
{
"authMechanism": "LDAP"
}
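The same connection could presumably also be supplied without the UI, through an environment variable. The sketch below assumes Airflow 2.3 or later, where an `AIRFLOW_CONN_<CONN_ID>` variable may hold a JSON document, and assumes `hiveserver2` is the connection type string for this hook. The helper `connection_env` is a made-up name; the host and port are the ones that appear in the error message later in this post, and `username` is a placeholder.

```python
import json


def connection_env(conn_id, conn_type, host, port, login, extra):
    """Return (variable name, JSON value) describing one Airflow connection."""
    payload = {
        "conn_type": conn_type,
        "host": host,
        "port": port,
        "login": login,
        "extra": extra,
    }
    # Assumption: Airflow looks up AIRFLOW_CONN_ + the upper-cased conn id
    return f"AIRFLOW_CONN_{conn_id.upper()}", json.dumps(payload)


if __name__ == "__main__":
    name, value = connection_env(
        conn_id="Hiveserver2_test",
        conn_type="hiveserver2",
        host="127.0.0.1",
        port=10000,
        login="username",  # placeholder for the OS account name
        extra={"authMechanism": "LDAP"},
    )
    print(f"export {name}='{value}'")
```

A connection defined this way would not be listed under Admin - Connections, so the Test button is only available for connections created in the UI.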

2-2. Connection Test
Click the Test button at the bottom of the page to check whether the connection works
- Error
Could not connect to any of [('127.0.0.1', 10000)]
If this error appears, hiveserver2 is not running; start it
errorMessage='Failed to open new session: java.lang.RuntimeException: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.authorize.AuthorizationException): User: username is not allowed to impersonate hive'), serverProtocolVersion=9, sessionHandle=None, configuration=None)
If this error appears, check the core-site.xml settings described above
- Success
Connection successfully tested
If this message appears at the top of the page after the test, the connection succeeded
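The first error above simply means nothing is listening on the HiveServer2 port, which can be checked before opening the Airflow UI. A minimal sketch with the standard library; `port_is_open` is an illustrative helper, and the host and port are taken from the error message (adjust them if your HiveServer2 listens elsewhere).

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, unreachable, or timed out: treat all as "not reachable"
        return False


if __name__ == "__main__":
    if port_is_open("127.0.0.1", 10000):
        print("HiveServer2 port is reachable")
    else:
        print("Nothing is listening on 127.0.0.1:10000; start hiveserver2 first")
```

An open port only shows that something is listening; the impersonation error can still occur afterwards, so the Test button remains the real check.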
2-3. Usage example
For usage details, see the official documentation
A simple example using HiveServer2Hook
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.apache.hive.hooks.hive import HiveServer2Hook

default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2023, 6, 1),
    'retries': 0,
}

test_dag = DAG(
    'hive_test',
    default_args=default_args,
    schedule_interval="* */10 * * *",  # every minute during hours 0, 10, and 20
)

def hive_test():
    hql = "SELECT * FROM raw_seoul LIMIT 10"
    hm = HiveServer2Hook(hiveserver2_conn_id='Hiveserver2_test')  # connection id
    result = hm.get_records(hql)  # fetch the rows returned by the query
    print(result)  # print them to the task log

t1 = PythonOperator(
    task_id='HiveServer2Hook_test',
    python_callable=hive_test,  # must be the function defined above
    dag=test_dag,
)
3. HiveCliHook
3-1. Connection setup
- connection setup
Under Admin - Connections, create a new Connection, fill in Connection Type, Host, Login (username), and Port, then enter the following in Extra
{
"use_beeline": true,
"auth": ""
}

3-2. Usage example
For usage details, see the official documentation
from datetime import datetime

import pandas as pd
from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.apache.hive.hooks.hive import HiveCliHook

default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2023, 6, 1),
    'retries': 0,
}

test_dag = DAG(
    'hive_test',
    default_args=default_args,
    schedule_interval="* */10 * * *",  # every minute during hours 0, 10, and 20
)

def hive_test():
    arr = [[1, 2], [3, 4]]
    df = pd.DataFrame(arr, columns=['a', 'b'])
    print(df)
    hh = HiveCliHook(hive_cli_conn_id='hive_cli_connect')  # connection id
    # Create a Hive table from the DataFrame
    hh.load_df(
        df=df,
        table='test',
        field_dict={
            'a': 'INT',
            'b': 'INT',
        },
    )

t1 = PythonOperator(
    task_id='HiveCliHook_test',
    python_callable=hive_test,
    dag=test_dag,
)