2023-06-05 (MON) Study notes
#Hadoop #Hive #Zeppelin #Spark
1. Spark definition
An open-source framework that improves efficiency by simplifying the development of data analysis jobs
Provides libraries such as Spark SQL, Spark Streaming for real-time data processing, and Spark MLlib for ML techniques
2. Spark overview
2-1. Spark installation (with Docker)
Port 8080 was already in use, so the install was run with the host port changed to 18080
docker run -p 18080:8080 --name zeppelin apache/zeppelin:0.10.0
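The choice of host port can be checked before starting the container. The following is a minimal sketch using only the Python standard library; `port_is_free` and `pick_host_port` are hypothetical helper names introduced here for illustration, not part of Docker or Zeppelin.

```python
import socket


def port_is_free(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if nothing is currently bound to host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            # Binding fails when another process already holds the port.
            return False
    return True


def pick_host_port(candidates=(8080, 18080)) -> int:
    """Return the first free port among candidates (hypothetical helper)."""
    for port in candidates:
        if port_is_free(port):
            return port
    raise RuntimeError("no free port among candidates")
```

The value returned by `pick_host_port()` would go on the left side of `-p <host port>:8080`; the right side stays 8080, matching the container port used in the command above.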
2-2. zeppelin notebook
- Uploading and moving files
docker cp local-path/filename zeppelin:/opt/zeppelin # copy a local file into the container
docker exec -it zeppelin bash # open a bash shell in the container
$ mkdir -p seoul/parquet # create the folder
$ mv filename seoul/parquet # move the file into that folder
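The same folder creation and move can also be scripted in Python, for example when staging several files at once. This is only a sketch; `stage_file` is a hypothetical helper, and the `seoul/parquet` default simply mirrors the folder used in these notes.

```python
import shutil
from pathlib import Path


def stage_file(src: Path, base_dir: Path, subdir: str = "seoul/parquet") -> Path:
    """Move src into base_dir/subdir, creating the folder first."""
    target_dir = base_dir / subdir
    # parents=True + exist_ok=True behaves like `mkdir -p`.
    target_dir.mkdir(parents=True, exist_ok=True)
    # shutil.move returns the destination path, like `mv` into the folder.
    return Path(shutil.move(str(src), str(target_dir / src.name)))
```

Inside the container this would be called with `base_dir=Path("/opt/zeppelin")`, the directory the file was copied into.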
- Creating a notebook
Open 'http://localhost:18080' - top menu 'Notebook - Create new note' - enter any note name, then create


- Using the notebook
%spark.pyspark
df = spark.read.parquet('/opt/zeppelin/seoul/parquet/*')  # read the parquet files
df.createOrReplaceTempView("temp_tb")  # register the DataFrame as a temp view (table)
%spark.sql
SELECT * FROM temp_tb limit 10;

- Usage example
Results fetched with GROUP BY can also be visualized as charts using Zeppelin's built-in feature
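As an illustration of the kind of GROUP BY query meant here, the sketch below builds a row count per value of one column. The column name `district` is an assumption made up for this example (the notes do not list the columns of `temp_tb`), and `group_by_count_sql` is a hypothetical helper.

```python
def group_by_count_sql(view: str, column: str) -> str:
    """Build a GROUP BY query that counts rows per value of `column`."""
    return (
        f"SELECT {column}, COUNT(*) AS cnt "
        f"FROM {view} "
        f"GROUP BY {column} "
        f"ORDER BY cnt DESC"
    )


# `district` is a hypothetical column; replace it with a real column of temp_tb.
query = group_by_count_sql("temp_tb", "district")
print(query)

# Inside a Zeppelin %spark.pyspark paragraph, where `spark` and `z` are predefined:
#   z.show(spark.sql(query))  # shows the result with Zeppelin's table/chart view
```

The generated statement can equally be pasted into a `%spark.sql` paragraph, where the chart buttons appear above the result.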

