===== Getting Started with Apache Sqoop 2 - v1.99 =====
With Apache Sqoop (see => http://sqoop.apache.org/ ), data can be exchanged between Hadoop and a relational database.
Architecture:
{{ :hadoop:sqoop_v01.png?500 |Apache Sqoop 1.99 overview }}
Version 2 (v1.99) implements a client-server architecture.
==== Installing Apache Sqoop ====
Prerequisites:
  * Hadoop installed on the server, at least as a client
  * see for example this Hadoop environment: [[nosql:hadoop_first_setup|Installing and configuring a Hadoop 2.4 test environment]]
Steps:
  * Download Sqoop from [[http://mirror.netcologne.de/apache.org/sqoop/1.99.3/sqoop-1.99.3-bin-hadoop200.tar.gz|sqoop-1.99.3-bin-hadoop200.tar]] for Hadoop 2.x or newer
  * Unpack to /opt/hadoop/product
  * Set the environment
  * Download the Oracle JDBC driver: [[http://www.oracle.com/technetwork/database/features/jdbc/jdbc-drivers-12c-download-1958347.html|Oracle Database 12c Release 1 JDBC Driver Downloads]]
  * Copy the JDBC driver into the Sqoop server lib directory
  * Test
In detail:
# unpack
cd /opt/hadoop/product
wget http://mirror.netcologne.de/apache.org/sqoop/1.99.3/sqoop-1.99.3-bin-hadoop200.tar.gz
tar xfvz sqoop-1.99.3-bin-hadoop200.tar.gz
rm sqoop-1.99.3-bin-hadoop200.tar.gz
# create a symlink
ln -s sqoop-1.99.3-bin-hadoop200 sqoop
# set the environment
vi ~/.bashrc
..
# Sqoop
export SQOOP_HOME=/opt/hadoop/product/sqoop
export PATH=$SQOOP_HOME/bin:$PATH
..
. ~/.bashrc
# copy the Oracle JDBC driver into the server lib directory
cp ~/ojdbc7.jar $SQOOP_HOME/server/lib/
# create the lib directory
mkdir $SQOOP_HOME/lib
# first test of the client
sqoop.sh client
Sqoop home directory: /opt/hadoop/product/sqoop
Sqoop Shell: Type 'help' or '\h' for help.
sqoop:000> show version
client version:
Sqoop 1.99.3 revision 2404393160301df16a94716a3034e31b03e27b0b
Compiled by mengweid on Fri Oct 18 14:15:53 EDT 2013
sqoop:000>
=== Configuring Apache Sqoop ===
To give the server access to the Hadoop libraries, adjust the common.loader path in the file catalina.properties in $SQOOP_HOME/server/conf and add **ALL** Hadoop jar directories to the path!
vi $SQOOP_HOME/server/conf/catalina.properties
common.loader=${catalina.base}/lib,${catalina.base}/lib/*.jar,${catalina.home}/lib,${catalina.home}/lib/*.jar,${catalina.home}/../lib/*.jar,/opt/hadoop/product/hadoop/share/hadoop/common/*.jar,/opt/hadoop/product/hadoop/share/hadoop/common/lib/*.jar,/opt/hadoop/product/hadoop/share/hadoop/hdfs/*.jar,/opt/hadoop/product/hadoop/share/hadoop/hdfs/lib/*.jar,/opt/hadoop/product/hadoop/share/hadoop/mapreduce/*.jar,/opt/hadoop/product/hadoop/share/hadoop/mapreduce/lib/*.jar,/opt/hadoop/product/hadoop/share/hadoop/tools/*.jar,/opt/hadoop/product/hadoop/share/hadoop/tools/lib/*.jar,/opt/hadoop/product/hadoop/share/hadoop/yarn/*.jar,/opt/hadoop/product/hadoop/share/hadoop/yarn/lib/*.jar
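The long common.loader line is tedious to assemble by hand. As a sketch (assuming the Hadoop installation path /opt/hadoop/product/hadoop used in this article), the Hadoop part of the path can be generated like this:

```shell
# Build the comma-separated list of Hadoop jar patterns for common.loader.
# The Hadoop location below is the one assumed throughout this article.
HADOOP_SHARE=/opt/hadoop/product/hadoop/share/hadoop
LOADER=""
for d in common hdfs mapreduce tools yarn; do
  LOADER="$LOADER,$HADOOP_SHARE/$d/*.jar,$HADOOP_SHARE/$d/lib/*.jar"
done
# Append this string to the default common.loader entries in catalina.properties
# (the quotes keep the *.jar patterns literal, as Tomcat expects).
echo "$LOADER"
```
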
The Sqoop server configuration is located under **$SQOOP_HOME/server/conf** in the files **sqoop_bootstrap.properties** and **sqoop.properties**.
In **sqoop.properties** the Hadoop configuration directory must be set:
vi $SQOOP_HOME/server/conf/sqoop.properties
# Hadoop configuration directory
org.apache.sqoop.submission.engine.mapreduce.configuration.directory=/opt/hadoop/product/hadoop/etc/hadoop/
The ports can be set via **SQOOP_HTTP_PORT** and **SQOOP_ADMIN_PORT** in the file **$SQOOP_HOME/server/bin/setenv.sh**; the defaults are 12000 and 12001.
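For example, to move the server to non-default ports, setenv.sh could contain the following (the port numbers 12100/12101 are illustrative, not from this setup):

```shell
# $SQOOP_HOME/server/bin/setenv.sh -- override the default Sqoop ports
export SQOOP_HTTP_PORT=12100    # HTTP/REST port (default 12000)
export SQOOP_ADMIN_PORT=12101   # admin/shutdown port (default 12001)
```
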
Start the server:
$SQOOP_HOME/bin/sqoop.sh server start
The URL http://localhost:12000/sqoop/version can be used to check whether the server is running and which version it reports.
curl http://localhost:12000/sqoop/version
Stop the server:
$SQOOP_HOME/bin/sqoop.sh server stop
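Building on the version URL above, a minimal reachability check could look like this (host and port as in this setup; the check itself is a sketch, not part of Sqoop):

```shell
# Probe the Sqoop server's version endpoint:
# -s silences progress output, -f makes curl fail on HTTP errors such as 404.
if curl -sf http://localhost:12000/sqoop/version >/dev/null 2>&1; then
  SQOOP_STATE="up"
else
  SQOOP_STATE="down"
fi
echo "Sqoop server is $SQOOP_STATE"
```
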
=== Testing the Server and Fixing Possible First Errors ===
Log on to the server with the client and test:
$SQOOP_HOME/bin/sqoop.sh client
Sqoop Shell: Type 'help' or '\h' for help.
# log on to the server
sqoop:000> set server --host nosqldb01 --port 12000 --webapp sqoop
Server is set successfully
sqoop:000> show version
# query the server
sqoop:000> show version --all
client version:
Sqoop 1.99.3 revision 2404393160301df16a94716a3034e31b03e27b0b
Compiled by mengweid on Fri Oct 18 14:15:53 EDT 2013
Exception has occurred during processing command
Exception: com.sun.jersey.api.client.UniformInterfaceException Message: GET http://nosqldb01:12000/sqoop/version returned a response status of 404 Not Found
If this error occurs, there is most likely a configuration problem.
Analyze the log file under $SQOOP_HOME/server/logs:
vi localhost.2014-08-09.log
...
Caused by: org.apache.sqoop.common.SqoopException: MAPREDUCE_0002:Failure on submission engine initialization - Invalid Hadoop configuration directory (not a directory or permission issues): /etc/hadoop/conf/
...
Solution:
cd $SQOOP_HOME/server/conf
vi sqoop.properties
# Hadoop configuration directory
org.apache.sqoop.submission.engine.mapreduce.configuration.directory=/opt/hadoop/product/hadoop/etc/hadoop/
Error when stopping the server:
Aug 09, 2014 2:44:41 PM org.apache.catalina.startup.ClassLoaderFactory validateFile
WARNING: Problem with directory [/opt/hadoop/product/sqoop-1.99.3-bin-hadoop200/lib], exists: [false], isDirectory: [false], canRead: [false]
# solution
mkdir /opt/hadoop/product/sqoop-1.99.3-bin-hadoop200/lib
==== Repository Database ====
In a default installation the repository database is located under $SQOOP_HOME/@BASEDIR@/repository/db
==== Copying the First Data Between the Database and Hadoop ====
The table **SYSTEM.AUD$** from the Oracle database is to be imported into the Hadoop cluster as a text file.
Steps:
  * Start the client and connect to the Sqoop server
  * Create a connection object
  * Define a job object
  * Start the job and import the data into Hadoop
**Start the client and connect to the Sqoop server**
$SQOOP_HOME/bin/sqoop.sh client
sqoop:000> set server --host nosqldb01 --port 12000 --webapp sqoop
**Create the connection object**
The connection object requires the following information:
  * driver name - oracle.jdbc.driver.OracleDriver
  * JDBC URL - jdbc:oracle:thin:@//10.10.10.1:1521/gpi
  * username and password
sqoop:000> create connection --cid 1
Creating connection for connector with id 1
Please fill following values to create new connection object
Name: GPI DB
Connection configuration
JDBC Driver Class: oracle.jdbc.driver.OracleDriver
JDBC Connection String: jdbc:oracle:thin:@//10.10.10.1:1521/gpi
Username: system
Password: ******
JDBC Connection Properties:
There are currently 0 values in the map:
entry#
Security related configuration options
Max connections: 10
New connection was successfully created with validation status FINE and persistent id 1
sqoop:000> show connection --xid 1
1 connection(s) to show:
Connection with id 2 and name GPI DB (Enabled: true, Created by hadoop at 8/9/14 5:27 PM, Updated by hadoop at 8/9/14 5:27 PM)
Using Connector id 1
Connection configuration
JDBC Driver Class: oracle.jdbc.driver.OracleDriver
JDBC Connection String: jdbc:oracle:thin:@//10.10.10.1:1521/gpi
Username: system
Password:
JDBC Connection Properties:
Security related configuration options
Max connections: 10
**Define the job object**
The job object requires the following information:
  * name and schema of the table
  * target location in HDFS
Create the target directory in HDFS first:
hdfs dfs -mkdir /auditVault
hdfs dfs -chmod 777 /auditVault
sqoop:000> create job --xid 1 --type import
Creating job for connection with id 1
Please fill following values to create new job object
Name: GPI AUD Import
Database configuration
Schema name: SYSTEM
Table name: AUD$
Table SQL:
Table column names:
Partition column name:USERID
Nulls in partition column:
Boundary query:
Output configuration
Storage type:
0 : HDFS
Choose: 0
Output format:
0 : TEXT_FILE
1 : SEQUENCE_FILE
Choose: 1
Compression format:
0 : NONE
1 : DEFAULT
2 : DEFLATE
3 : GZIP
4 : BZIP2
5 : LZO
6 : LZ4
7 : SNAPPY
Choose: 0
Output directory: /auditVault/GPI_AUD
Throttling resources
Extractors:
Loaders:
New job was successfully created with validation status FINE and persistent id 1
sqoop:000> show job
+----+----------------+--------+-----------+---------+
| Id | Name | Type | Connector | Enabled |
+----+----------------+--------+-----------+---------+
| 1 | GPI AUD Import | IMPORT | 1 | true |
+----+----------------+--------+-----------+---------+
**Start the job and import the data into Hadoop**
sqoop:000> submission start --jid 1
Exception has occurred during processing command
Unknown command: No such property: start for class: groovysh_evaluate
Wrong command - in this shell, jobs are started with **start job**:
sqoop:000> start job --jid 1
Exception has occurred during processing command
Exception: org.apache.sqoop.common.SqoopException Message: CLIENT_0001:Server has returned exception
# make the error message visible:
sqoop:000> set option --name verbose --value true
sqoop:000> start job --jid 1
...
Caused by: Exception: java.lang.Throwable Message: GENERIC_JDBC_CONNECTOR_0005:No column is found to partition data
Stack trace:
...
sqoop:000> update job --jid 1
# keep all previous values except for
Partition column name: TIMESTAMP#
# and
Output directory: /auditVault/gpi_aud
sqoop:000> start job --jid 1
2014-08-09 17:56:17 CEST: FAILURE_ON_SUBMIT
Exception: java.io.FileNotFoundException: File does not exist: hdfs://nosqldb01:9000/opt/hadoop/product/sqoop-1.99.3-bin-hadoop200/server/webapps/sqoop/WEB-INF/lib/sqoop-common-1.99.3.jar
The following change to the Hadoop configuration resolves this behavior:
vi $HADOOP_HOME/etc/hadoop/mapred-site.xml
<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
</property>
Next test:
sqoop:000> start job --jid 1
2014-08-09 18:01:18 CEST: FAILURE_ON_SUBMIT
Exception: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.SafeModeException): Cannot create directory /tmp/hadoop-yarn/staging/hadoop/.staging. Name node is in safe mode.
The reported blocks 10 has reached the threshold 0.9990 of total blocks 10. The number of live datanodes 1 has reached the minimum number 0. In safe mode extension
# Solution: force the namenode out of safe mode
hdfs dfsadmin -safemode leave
Next test:
sqoop:000> start job --jid 1
Submission details
Job ID: 1
Server URL: http://nosqldb01:12000/sqoop/
Created by: hadoop
Creation date: 2014-08-09 18:06:13 CEST
Lastly updated by: hadoop
External ID: job_1407600059752_0001
http://nosqldb01:8088/proxy/application_1407600059752_0001/
2014-08-09 18:06:13 CEST: BOOTING - Progress is not available
# what is happening:
sqoop:000> start job --jid 1
Submission details
Job ID: 1
Server URL: http://nosqldb01:12000/sqoop/
Created by: hadoop
Creation date: 2014-08-11 21:17:42 CEST
Lastly updated by: hadoop
External ID: job_1407784150673_0001
http://nosqldb01:8088/proxy/application_1407784150673_0001/
Connector schema: Schema{name=AUD$,columns=[
Decimal{name=SESSIONID,nullable=null,precision=null,scale=null},
Decimal{name=ENTRYID,nullable=null,precision=null,scale=null},
Decimal{name=STATEMENT,nullable=null,precision=null,scale=null},
Date{name=TIMESTAMP#,nullable=null,fraction=null,timezone=null},
Text{name=USERID,nullable=null,size=null},
Text{name=USERHOST,nullable=null,size=null},
Text{name=TERMINAL,nullable=null,size=null},
Decimal{name=ACTION#,nullable=null,precision=null,scale=null},
Decimal{name=RETURNCODE,nullable=null,precision=null,scale=null},
Text{name=OBJ$CREATOR,nullable=null,size=null},
Text{name=OBJ$NAME,nullable=null,size=null},
Text{name=AUTH$PRIVILEGES,nullable=null,size=null},
Text{name=AUTH$GRANTEE,nullable=null,size=null},
Text{name=NEW$OWNER,nullable=null,size=null},
Text{name=NEW$NAME,nullable=null,size=null},
Text{name=SES$ACTIONS,nullable=null,size=null},
Decimal{name=SES$TID,nullable=null,precision=null,scale=null},
Decimal{name=LOGOFF$LREAD,nullable=null,precision=null,scale=null},
Decimal{name=LOGOFF$PREAD,nullable=null,precision=null,scale=null},
Decimal{name=LOGOFF$LWRITE,nullable=null,precision=null,scale=null},
Decimal{name=LOGOFF$DEAD,nullable=null,precision=null,scale=null},
Date{name=LOGOFF$TIME,nullable=null,fraction=null,timezone=null},
Text{name=COMMENT$TEXT,nullable=null,size=null},
Text{name=CLIENTID,nullable=null,size=null},
Text{name=SPARE1,nullable=null,size=null},
Decimal{name=SPARE2,nullable=null,precision=null,scale=null},
Binary{name=OBJ$LABEL,nullable=null,size=null},
Binary{name=SES$LABEL,nullable=null,size=null},
Decimal{name=PRIV$USED,nullable=null,precision=null,scale=null},
Decimal{name=SESSIONCPU,nullable=null,precision=null,scale=null},
Date{name=NTIMESTAMP#,nullable=null,fraction=null,timezone=null},
Decimal{name=PROXY$SID,nullable=null,precision=null,scale=null},
Text{name=USER$GUID,nullable=null,size=null},
Decimal{name=INSTANCE#,nullable=null,precision=null,scale=null},
Text{name=PROCESS#,nullable=null,size=null},
Binary{name=XID,nullable=null,size=null},
Text{name=AUDITID,nullable=null,size=null},
Decimal{name=SCN,nullable=null,precision=null,scale=null},
Decimal{name=DBID,nullable=null,precision=null,scale=null},
FloatingPoint{name=SQLBIND,nullable=null,byteSize=null},
FloatingPoint{name=SQLTEXT,nullable=null,byteSize=null},
Text{name=OBJ$EDITION,nullable=null,size=null}]}
2014-08-11 21:17:42 CEST: BOOTING - Progress is not available
Status in YARN:
yarn application -list
14/08/11 21:21:33 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
14/08/11 21:21:33 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Total number of applications (application-types: [] and states: [SUBMITTED, ACCEPTED, RUNNING]):1
Application-Id Application-Name Application-Type User Queue State Final-State Progress Tracking-URL
application_1407784150673_0001 Sqoop: GPI AUD Import MAPREDUCE hadoop default RUNNING UNDEFINED 14% http://nosqldb01:35645
Unfortunately, the load always aborts towards the end:
Caused by: org.apache.sqoop.common.SqoopException: GENERIC_JDBC_CONNECTOR_0001:Unable to get a connection
at org.apache.sqoop.connector.jdbc.GenericJdbcExecutor.(GenericJdbcExecutor.java:51)
at org.apache.sqoop.connector.jdbc.GenericJdbcImportExtractor.extract(GenericJdbcImportExtractor.java:42)
at org.apache.sqoop.connector.jdbc.GenericJdbcImportExtractor.extract(GenericJdbcImportExtractor.java:31)
at org.apache.sqoop.job.mr.SqoopMapper.run(SqoopMapper.java:96)
... 7 more
Caused by: java.sql.SQLRecoverableException: No more data to read from socket
?? The search for the cause continues ??
More success with: [[nosql:oracle_hadoop_big_data_appl_erste_schritte##daten_aus_oracle_mit_sqoop_2_-_19_in_das_hdfs_exportieren|Getting Started with the Oracle Big Data Appliance]]
==== Using OraOop ====
OraOop is a special plugin for Sqoop that provides faster access to Oracle's RDBMS by using custom protocols that are not publicly available. Quest Software partnered with Oracle to obtain those protocols, implemented them, and created OraOop.
See => [[https://github.com/QuestSoftwareTCD/OracleSQOOPconnector|Quest® Data Connector for Oracle and Hadoop]]
See => http://blog.cloudera.com/wp-content/uploads/2011/01/oraoopuserguide-With-OraHive.pdf
??? Not maintained for more than two years ???
==== Sqoop Client API ====
The client API also provides direct access to the server, for example to integrate Sqoop into your own programs.
=> http://sqoop.apache.org/docs/1.99.2/ClientAPI.html
==== Sources ====
* http://sqoop.apache.org/docs/1.99.3/Sqoop5MinutesDemo.html
Web:
* http://hadooped.blogspot.de/2013/06/apache-sqoop-part-3-data-transfer.html
* http://hadooped.blogspot.de/2013/05/apache-sqoop-for-data-integration.html