=====Mit Python 3.4 Referenzen alle Dateien in einer Verzeichnisstruktur in der Oracle Datenbank speichern=====
**Aufgabe:**
Eine Verzeichnisstruktur auf einen Windows Server soll über Oracle Text in der Oracle Datenbank indiziert werden.
Dazu muss eine Referenz auf jede Datei in der Datenbank gespeichert werden.
Beim Laden soll auch der Datei Typ, die Datei Endung, der MIME Type, der MD5 Hash, die Größe der Datei und die Datumsfelder der Datei bestimmt und in die Ziel Tabelle geladen werden.
Später soll noch ein Vorschaubild und eine Zusammenfassung der Meta Daten der Datei, die Autor etc und die Sprache des Dokuments ermittelt werden.
Nach dem Laden werden die Daten mit Oracle Text indiziert, siehe dazu => [[dba:oracle_text_index_binary_data|Oracle Text für die Indizierung binärer Daten verwenden]]
Über eine APEX Oberfläche kann dann eine Suche über die Dateien durchgeführt werden siehe dazu => [[prog:oracle_apex_oracle_text_document_archive|Mit Oracle APEX 5 und Oracle Text ein Dokumentenarchive für technische Dokumentation aufbauen]].
----
==== Vorbereitung ====
=== Mit Unix Magic (file Befehl) den File Type bestimmen ====
Leider ist es mir unter Windows unter Python nicht gelungen, Magic direkt einzusetzen.
Normalerweise müssen nur die Windows DDL' von Magic zur Verfügung stehen und mit python-magic Package von https://github.com/ahupp/python-magic kann dann direkt in Python der File Typ und weitere Meta Daten ausgelesen werden.
Leider hat das aber bei mir nicht funktioniert.
.\python -m pip install python-magic
Libs von herunterladen:
* http://gnuwin32.sourceforge.net/packages/file.htm
* file über http://sourceforge.net/projects/gnuwin32/files/file/5.03/
* Dependencies wie lib1.dll und regex2.dll über http://gnuwin32.sourceforge.net/downlinks/file-dep-zip.php
Test:
import magic
magic_file_path= "D:\\tools\\file\\share\misc"
magic.Magic(magic_file =magic_file_path)
#Fehler:
...
D:\tools\file\share\misc/magic.mgc, 1: Warning: offset `ñ' invalid
D:\tools\file\share\misc/magic.mgc, 1: Warning: type `ñ' invalid
...
#usw..
Trotz einer weitere Versuche und Internet Recherche ist es mir nicht gelungen das hinzubekommen.
Daher rufe ich nun magic über "file.exe" direkt als Programm auf, dazu alle dll's (magic1.ddl, zlib1.dll und regex2.dll ) in das bin Verzeichnis von file kopieren.
Gerade bei alten Office Formaten kann magic sehr hilfreich sein, es wird zum Beispiel auch der Autor mit ausgelesen.
=== Directory Objekt in der Datenbank anlegen und Zugriffs-Rechte vergeben ===
Über ein Directory Object wird der Einstieg in die Verzeichnisstruktur auf der Festplatte definiert.
In unseren Test Fall liegen die Daten unter "d:\data\info-archiv" in einer Verzeichnisstruktur.
Anlegen als SYS und Rechte vergeben:
sqlplus / as sysdba
#Directory anlegen
create directory INFO_ARCHIVE as 'D:\data\info-archiv';
#Rechte vergeben
grant read,write on directory INFO_ARCHIVE to GPI;
=== Dokumenten Tabelle anlegen ===
In diese Tabelle werden die Informationen eingelesen:
sqlplus gpi/gpi
---------------------------------------------------------
-- Document table
---------------------------------------------------------
CREATE TABLE Documents
(
ID NUMBER (30) NOT NULL ,
Filename VARCHAR2 (512 CHAR) NOT NULL ,
FileTyp VARCHAR2 (32 CHAR) ,
FileDirectory VARCHAR2 (2000 CHAR) NOT NULL ,
FilePointer BFILE NOT NULL ,
MD5Hash VARCHAR2 (32 CHAR) NOT NULL ,
FileCreateDate DATE ,
FileLastModify DATE ,
Language VARCHAR2 (32 CHAR) ,
CreateDate DATE default sysdate,
CreateUser VARCHAR2 (32 CHAR) default user,
ChangeDate DATE default sysdate,
ChangeUser VARCHAR2 (32 CHAR) default user,
Theme_data_avaiable VARCHAR2 (1 CHAR) DEFAULT 'N'
) ;
CREATE UNIQUE INDEX IDX_Documents_ID_PK ON Documents ( ID ASC );
ALTER TABLE Documents ADD CONSTRAINT Document_PK PRIMARY KEY ( ID ) ;
COMMENT ON TABLE Documents IS 'Store the file information' ;
COMMENT ON COLUMN Documents.ID IS 'Primary Key' ;
COMMENT ON COLUMN Documents.Filename IS 'Name of the file on disk' ;
COMMENT ON COLUMN Documents.FileTyp IS 'Typ of the file' ;
COMMENT ON COLUMN Documents.FileDirectory IS 'Directory of the file' ;
COMMENT ON COLUMN Documents.FilePointer IS 'Bfile Pointer to the File' ;
COMMENT ON COLUMN Documents.MD5Hash IS 'Hash of the files to indentify dublicate files' ;
COMMENT ON COLUMN Documents.FileCreateDate IS 'File Create Time from the file' ;
COMMENT ON COLUMN Documents.FileLastModify IS 'Last modificatoin date from the file' ;
COMMENT ON COLUMN Documents.Language IS 'Language of the file (Oracle NLS Format String!)' ;
COMMENT ON COLUMN Documents.CreateDate IS 'Date when the record was created' ;
COMMENT ON COLUMN Documents.CreateUser IS 'User create the record' ;
COMMENT ON COLUMN Documents.ChangeDate IS 'Last Change on the record' ;
COMMENT ON COLUMN Documents.ChangeUser IS 'User change the record' ;
COMMENT ON COLUMN Documents.Theme_data_avaiable IS 'If Themdata is there => Y, if not N' ;
---------------------------------------------------------
create sequence documents_seq;
----
=== Oracle CX für den Datenbank Zugriff für Python 3.4 einrichten ===
Siehe dazu => [[python:python_database|Python 2.7 / 3.4 und die Oracle Datenbank]]
----
==== Code DocumentLoader.py (Python 3.4 !) ====
Aufruf mit "DocumentLoader.py -s -r -d -c "
Beim ersten Aufruf wird das Template für die Konfigurationsdatei "dataLoader.conf" erzeugt, diese Datei dann mit den entsprechenden Info's für die jeweilige Umgebung ausfüllen.
__author__ = 'gpipperr'
import datetime, time
import glob, filecmp, ntpath, shutil
import os, errno, sys, getopt
import hashlib
import subprocess
import io
import cx_Oracle
import locale
import configparser
# get FileType
def getFileType(file, command, magicFilePath):
# magicFilePath = "d:\\tools\\file\\bin\\file.exe"
result = "-2"
# Command + arguments as sequence
args = magicFilePath + " " + command + ' "' + file + '"'
# execute the program and pipe the output to subprocess.PIPE
prog = subprocess.Popen(args, stdout=subprocess.PIPE, shell=True)
# wait until finished
stat = prog.wait()
# connect pipes together
(output, err) = prog.communicate()
result = str(output)
result = str.replace(result, "b'", "")
result = str.replace(result, "\\r", "")
result = str.replace(result, "\\n", "")
escFileName = str.replace(str(file), "\\", "\\\\") + "; "
result = str.replace(result, escFileName, "")
return result
# Get extension
# Get extension
def getFile_ext(filename):
fparts = filename.split('.')
c = len(fparts)
# try to get .tar.gz
v_return = fparts[-1]
if c > 3:
if len(fparts[-2]) < 5:
chars = set('$,()[]_ ')
if any((c in chars) for c in fparts[-2]):
v_return = fparts[-1]
else:
v_return = fparts[-2] + "." + fparts[-1]
if v_return.find('.') < 0:
v_return = "." + v_return
return v_return.lower()
# get the MD5 Hash the file
def getMD5(filename):
blocksize = 65536
if os.path.exists(filename) == False:
return 'N/A'
md5 = hashlib.md5()
with open(filename, "rb") as f:
for block in iter(lambda: f.read(blocksize), b""):
md5.update(block)
return md5.hexdigest()
# Remember the global Size of all copied files
def setStatisticTotalSize(size):
global totalFileSize
totalFileSize += size
# global for the total filesize
totalFileSize = 0
# print the file info
# later insert the record to the database
# fileInfo=dict(filename=filename,filepath=fileDirectoryName,fileADate=fileAccessDate,fileMDate=fileModDate,fileCDate=fileCreateDate,md5=md5checkSum,fileBType=fileType,fileMtype=fileMimeType)
def insertFileInfo(fileInfo, ora_dir_path, connection):
# strip the oracle path
relative_doc_path = str.replace(fileInfo['filepath'], ora_dir_path, "")
try:
print("-- Info for the file ::{0}".format(fileInfo['filename'].encode('utf-8')))
print(" Path::{0}".format(fileInfo['filepath'].encode('utf-8')))
print(" Relativ Path::{0}".format(relative_doc_path + os.sep + str(fileInfo['filename'])))
print(" Access Date::{0}".format(fileInfo['fileADate']))
print(" Modify Date::{0}".format(fileInfo['fileMDate']))
print(" Create Date::{0}".format(fileInfo['fileCDate']))
print(" File Size::{0}".format(fileInfo['fileSize']))
print(" MD5::{0}".format(fileInfo['md5']))
print(" File Binary Type::{0}".format(fileInfo['fileBType']))
print(" File Mime Type::{0}".format(fileInfo['fileMtype']))
print(" File Extention::{0}".format(fileInfo['fileExtention']))
except:
print("Unexpected error with file infos")
# insert into the database
# Cursor auf die DB oeffenen
cursor = connection.cursor()
# filename,filetyp,filedirectory,md5hash, FILECREATEDATE, FILELASTMODIFY
# .encode('utf-8')
# .encode('utf-8')
doc_rows = [(str(fileInfo['filename'])
, fileInfo['fileExtention']
, fileInfo['filepath']
, fileInfo['md5']
, fileInfo['fileCDate']
, fileInfo['fileMDate']
, relative_doc_path + os.sep + str(fileInfo['filename'])
, fileInfo['fileSize']
, fileInfo['fileADate']
)
]
# how many rows to insert
cursor.bindarraysize = len(doc_rows)
# datatype - lenght of the inserted data
# cursor.setinputsizes(int, 14, 13)
try:
# insert the whole record
cursor.executemany(
"insert into documents (ID , FILENAME, FILETYP, FILEDIRECTORY, MD5HASH, FILECREATEDATE, FILELASTMODIFY,FILEPOINTER,filesize,FILELASTACESS) values ( documents_seq.nextval, :1, :2, :3, :4, :5, :6,BFILENAME('INFO_ARCHIVE',replace(:7,';','')),:8,:9)",
doc_rows)
except cx_Oracle.DatabaseError as e:
connection.rollback()
print("-- Error - Oracle Database Error:" + str(e))
except Exception as e:
print("-- Error - Unexpected error:" + str(e))
connection.rollback()
# raise
connection.commit()
# Main Script part
def main(argv):
global totalFileSize
# Parameter 1 - Import Directory
# Parameter 2 - Subfolder Level
# Remember the start time of the program
start_time = time.clock()
usage_string = " DocumentLoader.py -s -r -d -c "
path_name = '-'
dest_name = '-'
recursiveLevel = 0
config_path = 'dataLoader.conf'
# -- Parameter from the config file
oracle_port = '-'
oracle_host = '-'
oracle_service = '-'
oracle_user = '-'
oracle_pwd = '-'
magicFilePath = '-'
ingoreFileExt = []
# Path of the oracle Info Archive
ora_dir_path = '-'
try:
opts, args = getopt.getopt(argv, "hs:d:r:d:c:", ["src=", "rec=", "dir=", "config="])
except getopt.GetoptError:
print("usage: {0}".format(usage_string))
sys.exit(2)
# read the parameter
for opt, arg in opts:
if opt == '-h':
print("usage: {0}").format(usage_string)
sys.exit()
elif opt in ("-s", "--src"):
path_name = arg
elif opt in ("-d", "--dir"):
ora_dir_path = arg
elif opt in ("-c", "--config"):
config_path = arg
elif opt in ("-r", "--rec"):
recursiveLevel = int(arg)
# read the config file
config = configparser.ConfigParser()
# check if the file exits
if os.path.exists(config_path) == False:
# use normal configparser to write the template
print("--" + 80 * "!")
print("-- Error to read file {0}".format(config_path))
print("-- Error usage: {0}".format(usage_string))
config['DEFAULT'] = {'MagicFile': 'd:\\tools\\file\\bin\\file.exe', 'ignoreFileExt': '.iso'}
config['ORACLE_DB_CONNECT'] = {'Host': 'localhost', 'Port': '1521', 'Service': 'ORCL', 'DB_User': 'USER',
'DB_Password': 'xxxxx'}
with open(config_path, 'w') as configfile:
config.write(configfile)
print("--" + 80 * "!")
print("-- Info create Configuration Template :: {0}".format(config_path))
print("-- Info fillout the configuration file with your personal values and start again!")
print("--" + 80 * "!")
sys.exit(2)
else:
print("-- Info read config file {0}".format(config_path))
config.read(config_path)
# Parameter of the application
general_configuration = config['DEFAULT']
magicFilePath = general_configuration['MagicFile']
ingoreFileExtString = general_configuration['ignoreFileExt']
ingoreFileExt = str.split(str.replace(ingoreFileExtString, ' ', ''), ',')
# Oracle DB Connect
oracle_db_configuration = config['ORACLE_DB_CONNECT']
oracle_port = oracle_db_configuration['Port']
oracle_host = oracle_db_configuration['Host']
oracle_service = oracle_db_configuration['Service']
oracle_user = oracle_db_configuration['DB_User']
oracle_pwd = oracle_db_configuration['DB_Password']
# check if Directory exists and if the * is necessary
# BUG ! if more then 1 then the * not match the documents on root level??
# FIX IT!
# Source
if os.path.isdir(path_name):
if path_name.endswith(os.path.sep):
path_name += ("*" + os.path.sep) * recursiveLevel
path_name += "*.*"
else:
path_name += os.path.sep
path_name += ("*" + os.path.sep) * recursiveLevel
path_name += "*.*"
else:
print("-- Error :: 05 Source Directory (-s) {0} not found".format(path_name))
print("usage: {0}").format(usage_string)
sys.exit(2)
# connect to the database
print("--" + 40 * "=")
print("-- Info :: Oracle Client Library Version :: {0}".format(cx_Oracle.clientversion()))
# get the connection to the database
print("-- Info :: oracle_host {0} oracle_port {1} oracle_service {2} oracle_user {3} oracle_pwd ********".format(
oracle_host, oracle_port, oracle_service, oracle_user, oracle_pwd))
connection = cx_Oracle.connect(
oracle_user + '/' + oracle_pwd + '@' + oracle_host + ':' + oracle_port + '/' + oracle_service)
# Version der DB ausgeben
print("-- Info :: Oracle Database Version :: {0}".format(connection.version))
print("--" + 40 * "=")
print("-- Info :: Environment Settings :: Language :: {0} - Char Set ::{1}".format(locale.getdefaultlocale()[0],
locale.getdefaultlocale()[1]))
print("--" + 40 * "=")
print("-- Info :: Read all files from {0}".format(path_name))
print("-- Info :: Copy files to {0}".format(dest_name))
print("-- Info :: Not index this file types ::" + str(ingoreFileExt))
print("--" + 40 * "=")
fileCount = 0
fileExistsCount = 0
dirCount = 0
dirPathList = []
totalFileSize = 0
# Get the list of all Files
fileList = glob.glob(path_name)
# remove Thumbs.db if exist from the list
thumbsDBFile = "Thumbs.db"
for file in fileList:
if file.endswith(thumbsDBFile):
fileList.remove(file)
# Loop one read files in Import Directory
for file in fileList:
fileCount += 1
# do the work
try:
# check if directoy
if ntpath.isdir(file):
print("-- Info :: found dirctory {0} ::".format(file))
dirCount += 1
else:
# get only the filename without the path
filename = ntpath.basename(file)
# get directory
fileDirectoryName = ntpath.dirname(file)
# get Create date
fileAccessDate = datetime.datetime.fromtimestamp(ntpath.getatime(file))
fileModDate = datetime.datetime.fromtimestamp(ntpath.getmtime(file))
fileCreateDate = datetime.datetime.fromtimestamp(ntpath.getctime(file))
# get md5 hash
md5checkSum = getMD5(file)
# get File Type over file from external, Python Lib magic not working, error with magic file!
# now I implement this stupid solution
fileType = getFileType(file, " ", magicFilePath)
# Call file with -i to get the mime Type
fileMimeType = getFileType(file, "-i", magicFilePath)
# get Extenstion
fileExt = getFile_ext(filename)
# getFileSize
fileSize = os.path.getsize(file)
# Remember for statistic
setStatisticTotalSize(fileSize)
# not add url files to the index
# endswith(".url")
if fileExt in ingoreFileExt:
# encode the output with UTF-8 to avoid errors with stange things in filenames
print("-- Info :: Not index this file types ::" + str(ingoreFileExt))
print("-- Info :: Not index this file::{0}".format(repr(filename.encode('utf-8'))))
print("-- Info :: Not index this Dir::{0}".format(repr(fileDirectoryName.encode('utf-8'))))
print("-- --")
else:
# record
fileInfo = dict(filename=str(filename), filepath=str(fileDirectoryName), fileADate=fileAccessDate,
fileMDate=fileModDate, fileCDate=fileCreateDate, md5=md5checkSum,
fileBType=fileType,
fileMtype=fileMimeType, fileExtention=fileExt, fileSize=fileSize)
##print("-- Index this Dir::{0}".format(repr(fileDirectoryName.encode('utf-8'))))
# encode the output with UTF-8 to avoid errors with stange things in filenames
##print("-- Index this file ::{0}".format(repr(filename.encode('utf-8'))))
# write to DB
insertFileInfo(fileInfo, ora_dir_path, connection)
except OSError as exception:
if exception.errno != errno.EEXIST:
print("-- Error :: Error read file :: see error {1}".format(file, sys.exc_info()[0]))
# print statistics
print("--" + 40 * "=")
print("-- Finish with :: {0} files in {1} new directories".format(fileCount, dirCount))
print("-- The run needs :: {0:5.4f} seconds".format(time.clock() - start_time))
print("-- Read size :: {0:5.3f} MB".format(totalFileSize / 1024 / 1024))
print("--" + 40 * "=")
# Close the DB Connection
connection.close()
if __name__ == "__main__":
main(sys.argv[1:]);
Config file:
[DEFAULT]
# Path and name of the magic file program
MagicFile=d:\tools\file\bin\file.exe
# Do Not read files with this extension
# use , to separate the file extensions
ignoreFileExt=.cmd ,.iso ,.url,.exe,.class,.dbf
[ORACLE_DB_CONNECT]
Port = 1521
Host = 10.10.10.1
Service = GPI
DB_User= GPI
DB_Password = xxxxxx