2023.06.09 (FRI) Study Notes
#Airflow #Hive
1. Airflow Operator
So far I have only used basic operators like BashOperator and EmptyOperator, so to build something a bit more advanced I want to try some of the others 💪
2. Python Operator
PythonOperator can be used right after importing it; no extra package installation is needed.
📌 A simple example using PythonOperator
from datetime import datetime

import pendulum
from airflow import DAG
from airflow.operators.python import PythonOperator

# local_tz was undefined in the original snippet; assuming KST here
local_tz = pendulum.timezone("Asia/Seoul")

default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2023, 6, 1, tzinfo=local_tz),
    'retries': 0,
}

test_dag = DAG(
    'python_test',
    default_args=default_args,
    schedule_interval="*/10 * * * *"  # every 10 minutes
)
# Pass positional arguments as a list (op_args)
def get_sum(*num):
    total = 0  # avoid shadowing the built-in sum()
    for n in num:
        total += n  # the original `sum += num` added the whole tuple, a TypeError
    return total

t1 = PythonOperator(task_id='python_test1',
                    python_callable=get_sum,
                    op_args=[1, 2, 3],
                    dag=test_dag)
# Pass keyword arguments as a dictionary (op_kwargs)
def get_sum2(**num):
    return num['first'] + num['second']

t2 = PythonOperator(task_id='python_test2',
                    python_callable=get_sum2,
                    op_kwargs={'first': 1, 'second': 2},  # keyword args go in op_kwargs, not op_args
                    dag=test_dag)
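Under the hood, PythonOperator invokes the callable roughly as `python_callable(*op_args, **op_kwargs)`. A minimal plain-Python sketch of that calling convention (no Airflow needed, and `get_total` is just an illustrative function, not part of the DAG above):

```python
# Sketch of how PythonOperator forwards arguments to the callable:
# positional arguments come from op_args, keyword arguments from op_kwargs.
def get_total(*args, **kwargs):
    return sum(args) + sum(kwargs.values())

op_args = [1, 2, 3]
op_kwargs = {'first': 1, 'second': 2}

# Equivalent to what Airflow does when the task runs:
print(get_total(*op_args, **op_kwargs))  # 9
```

This also shows why the two examples above differ only in which parameter carries the arguments.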
3. Hive Operator
Three types of Hive connection are commonly used, listed below. Of these I used the Hive Server2 Connection, with LDAP as the authentication method - 📌 reference docs
- Hive CLI Connection
- Hive Metastore Connection
- Hive Server2 Connection
3-1. Package installation
The Hive operator requires installing the apache-airflow-providers-apache-hive package, one of the provider packages (required).
Install it by following the 📌 official docs.
- Install requirements
Use the command below to check whether any of the following packages are missing, then install them:
pip list  # check installed packages
- apache-airflow
- apache-airflow-providers-common-sql
- hmsclient
- pandas
- pyhive[hive]
- sasl
- thrift
- Note) sasl installation error - error: command '/usr/bin/gcc' failed with exit code 1
....
In file included from sasl/saslwrapper.cpp:629:
sasl/saslwrapper.h:22:10: fatal error: sasl/sasl.h: No such file or directory
22 | #include <sasl/sasl.h>
| ^~~~~~~~~~~~~
compilation terminated.
error: command '/usr/bin/gcc' failed with exit code 1
[end of output]
note: This error originates from a subprocess, and is likely not a problem with pip.
ERROR: Failed building wheel for sasl
Running setup.py clean for sasl
Failed to build sasl
ERROR: Could not build wheels for sasl, which is required to install pyproject.toml-based projects
If the error above occurs, install the package below first and then retry installing sasl:
sudo apt-get install libsasl2-dev
- Install apache-airflow-providers-apache-hive
pip install apache-airflow-providers-apache-hive
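A quick way to confirm the provider package is importable after installation (a small illustrative helper, not from the docs; `find_spec` only checks whether Python can locate the module):

```python
import importlib.util


def provider_installed(module: str) -> bool:
    """Return True if Python can locate the given module."""
    try:
        return importlib.util.find_spec(module) is not None
    except ModuleNotFoundError:
        # Raised when a parent package (e.g. airflow) is not installed at all
        return False


# After `pip install apache-airflow-providers-apache-hive` this should print True
print(provider_installed("airflow.providers.apache.hive"))
```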
3-2. Connection setup
- hiveserver2 configuration
Before creating the connection, configure hiveserver2 by referring to the post below:
📌 [Airflow] Using HiveServer2Hook to access Hive
- Create the connection
Go to Menu - Admin - Connections and create a connection.

As shown below, set Connection Type to Hive Client Wrapper, fill in Host, Login (username), Password, and Port, then put the following in the Extra field:
{
"use_beeline": true,
"auth": ""
}
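The Extra field must be valid JSON (note the lowercase, unquoted `true`). A small sketch of how it gets parsed, useful for checking the value before pasting it into the UI:

```python
import json

# The exact string entered in the connection's Extra field
extra = '{"use_beeline": true, "auth": ""}'

cfg = json.loads(extra)  # raises json.JSONDecodeError if Extra is not valid JSON
print(cfg["use_beeline"])  # True
```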

📌 A simple example using HiveOperator
from datetime import datetime

import pendulum
from airflow import DAG
from airflow.providers.apache.hive.operators.hive import HiveOperator

# local_tz was undefined in the original snippet; assuming KST here
local_tz = pendulum.timezone("Asia/Seoul")

default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2023, 6, 1, tzinfo=local_tz),
    'retries': 0,
}

test_dag = DAG(
    'hive_test',
    default_args=default_args,
    schedule_interval="*/10 * * * *"  # every 10 minutes
)

hql = '''
CREATE TABLE IF NOT EXISTS test (id INT);
'''  # a bare `CREATE TABLE test;` is not valid HiveQL; a column list is required

h1 = HiveOperator(
    task_id='HiveOperator_test',
    hql=hql,
    hive_cli_conn_id='hive_cli_connect',  # the connection id created above
    run_as_owner=True,
    dag=test_dag,
)
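Since `hql` is one of HiveOperator's templated fields, Jinja variables such as `{{ ds }}` are rendered per run. A plain-Python approximation of what a date-partitioned statement ends up as (the table and partition column here are hypothetical):

```python
from datetime import date

# In the real DAG you would write "... PARTITION (dt='{{ ds }}')" and let
# Airflow's Jinja templating substitute the logical date of the run.
hql_template = "ALTER TABLE test ADD IF NOT EXISTS PARTITION (dt='{ds}');"
rendered = hql_template.format(ds=date(2023, 6, 1).isoformat())
print(rendered)  # ALTER TABLE test ADD IF NOT EXISTS PARTITION (dt='2023-06-01');
```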
+ This took a few hours 🔥 The official docs aren't all that detailed either, so I leaned on them as much as I could while tracking down and fixing errors one by one. I'm not sure that's the right process, but it's fun when things finally work.. 😊