I am needing to get full path and file name, file hash and several other properties of files in to a database. I was hoping some one had some good reading and suggestions on database creation that would do so safely. the import script will be python3.
I have to deal with all kinds of file types some with language specific characters as well as spaces and glyphs? in the names.
example /home/test/Music/RÜFÜS DU SOL ●● Solace (Lastlings Remix) [Official Audio] -.webm
or /home/test/Music/超好聽的中國古典音樂 笛子名曲 古箏 - Hermosa Música de Flauta- Música Para Meditación.mp4
I am sure there are some rules to creating the database that would improve the experience as well as I am sure there are some practices in python3 that would need to be used to not make a giant mess of things. At present copying all non duplicate files to a specific location preserving file structure would be the most difficult task the python script would need to do or possibly making a file link back to the initial duplicate file scanned.
I have not been able to find any thing specifically related when searching. I have usually found how to make a table with a strange name but not much about the table contents having strange characters.
Side note: one folder has over 4M files across 20k+ sub folders hence the need for indexing in a database to start sorting out the folder/subfolder mess in a systematic way as ls can take 60-90 seconds to return results there.
I use filename varchar(4096) collate utf8mb4_unicode_ci
for this. If case of letters are important (for e.g. searching or sorting), then use collate utf8mb4_bin
. Number 4096 is the value of linux PATH_MAX
, the maximum supported chars in a filename.
When you must save space, then use filename varchar(4096) collate binary
. In this case, the app must interpret the data as UTF-8. A char of binary
needs 1 byte, a char of utf8mb4_*
4 bytes space.
Instead of varchar(4096) collate binary
you can use varbinary(4096)
.
If you don't want to store the path, but only the base name (last part of the path), then varchar(255)
is good. This value is set by linux NAME_MAX
.