Search code examples
pythonpython-3.xfilemariadbspecial-characters

import Full file path/name and attributes in to mariadb safely


I am needing to get full path and file name, file hash and several other properties of files in to a database. I was hoping some one had some good reading and suggestions on database creation that would do so safely. the import script will be python3.

I have to deal with all kinds of file types some with language specific characters as well as spaces and glyphs? in the names.

example /home/test/Music/RÜFÜS DU SOL ●● Solace (Lastlings Remix) [Official Audio] -.webm or /home/test/Music/超好聽的中國古典音樂 笛子名曲 古箏 - Hermosa Música de Flauta- Música Para Meditación.mp4

I am sure there are some rules to creating the database that would improve the experience as well as I am sure there are some practices in python3 that would need to be used to not make a giant mess of things. At present copying all non duplicate files to a specific location preserving file structure would be the most difficult task the python script would need to do or possibly making a file link back to the initial duplicate file scanned.

I have not been able to find any thing specifically related when searching. I have usually found how to make a table with a strange name but not much about the table contents having strange characters.

Side note: one folder has over 4M files across 20k+ sub folders hence the need for indexing in a database to start sorting out the folder/subfolder mess in a systematic way as ls can take 60-90 seconds to return results there.


Solution

  • I use filename varchar(4096) collate utf8mb4_unicode_ci for this. If case of letters are important (for e.g. searching or sorting), then use collate utf8mb4_bin. Number 4096 is the value of linux PATH_MAX, the maximum supported chars in a filename.

    When you must save space, then use filename varchar(4096) collate binary. In this case, the app must interpret the data as UTF-8. A char of binary needs 1 byte, a char of utf8mb4_* 4 bytes space.

    Instead of varchar(4096) collate binary you can use varbinary(4096).

    If you don't want to store the path, but only the base name (last part of the path), then varchar(255) is good. This value is set by linux NAME_MAX.