Directory set-up for online file storage service


I'm developing an online file storage service, mainly in PHP and MySQL, where users will be able to upload files of up to 10-20 GB in size.

Unregistered users will be able to upload files, but not to a personal storage space; all of their uploads will go into one shared directory.

Registered users will get a fixed amount (that might increase in the future) of personal storage space and access to a file manager to easily manage and organize all their files. They'll also be able to set their files private (not downloadable by anyone but themselves) or public.


What would be a good possible directory set-up?

I'm thinking about a "personal" directory that will contain folders with the user's id as the folder name for each registered user.

Alongside the personal directory, there will be an "other" folder which will just contain every file that's been uploaded by unregistered users.

Both will contain the uploaded files, each named after its corresponding row id from the files table in the database.

ROOT
  FOLDER uploads
    FOLDER personal
      FOLDER 1
        FILE file_id1
        FILE file_id2
             (...)
      FOLDER 2
        FILE file_id3
        FILE file_id4
             (...)
        (...)
    FOLDER other
      FILE file_id5
      FILE file_id6
           (...)

This is the first time I'm dealing with a situation like this, and this concept is all I could come up with so far. Any suggestions are welcome!


Solution

  • Basically you need to address the following topics:

    1. Security: From your description it is unclear who is allowed to read the files. If the rule is always "everybody can read everything", you can set up the file structure inside a web server virtual host. Otherwise, set up the folder structure in a "hidden" area and access the files only through server-side scripts (e.g. copy on demand). The secure approach consumes more resources, but it leaves room for a technically optimized folder structure.

    2. OS constraints: Every OS limits the number of items and/or files per folder. The actual limits depend on the file system and its configuration. If I remember correctly, there are Linux setups that support only about 32,000 items per folder. The exact figure is not important; what matters is that your planned utilization does not exceed the limits of your servers. If you plan to serve 10 users, a single "other" folder will probably do; if you are targeting a million users, you will probably need many "other" folders. If you also do not want to restrict the number of files a user can upload, you probably need a way to extend the folders per user. Personally, I follow a policy of never having more than 1000 items in a folder.

    3. SEO requirements: If your service needs to be SEO compliant, it must be able to present meaningful ("speaking") names to users, ideally without general categories such as "Personal"/"Other". Your proposed structure may meet this requirement. However, the OS constraints may force you into a more technical physical structure (e.g. one where you chunk the item ID into groups of 3 digits and use those to build your folder and file structure). On top of that you can implement a logical structure that converts IDs into names. Such an implementation means file access goes through server-side scripts and therefore demands more resources. Alternatively, you could experiment with web server URL rewrites...

    4. Consistency + Availability + Partition tolerance: Turning this into a real service will likely require a setup that balances these three. Separating the whole thing into a physical and a logical layer helps a lot here; consistency, availability, and partition tolerance would then be dealt with at the logical layer. NoSQL (http://en.wikipedia.org/wiki/NoSQL) might be your way forward. See http://en.wikipedia.org/wiki/CAP_theorem for details on the topic.
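
The ID chunking mentioned in point 3, combined with the 1000-items-per-folder policy from point 2, could look roughly like the sketch below. It is written in Python only to keep it short and testable (the question uses PHP); the names `sharded_path`, `CHUNK` and `WIDTH`, and the fixed 12-digit padding, are assumptions made for illustration and not part of the original proposal.

```python
from pathlib import PurePosixPath

CHUNK = 3   # digits per path component, so at most 1000 entries per folder
WIDTH = 12  # assumed fixed ID width; wide enough for a MySQL unsigned INT


def sharded_path(file_id: int, root: str = "uploads") -> PurePosixPath:
    """Map a numeric file ID to a nested path built from 3-digit chunks."""
    if file_id < 0:
        raise ValueError("file_id must be non-negative")
    digits = str(file_id).zfill(WIDTH)
    if len(digits) > WIDTH:
        # A fixed width keeps file names and folder names from colliding.
        raise ValueError("file_id does not fit into the assumed ID width")
    chunks = [digits[i:i + CHUNK] for i in range(0, WIDTH, CHUNK)]
    # The leading chunks become folders, the last chunk is the file name.
    return PurePosixPath(root, *chunks)


if __name__ == "__main__":
    print(sharded_path(1234567))  # uploads/000/001/234/567
```

Because every path component has exactly three digits, no folder can ever hold more than 1000 entries, regardless of how many files are uploaded.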

    ====================== UPDATE

    From the comments we now know that you store metadata in a relational database, that you have a physical layer (files on disk) and a logical layer (access via PHP scripts), and that you base your physical file/folder layer on IDs.

    This makes it possible to move all structural considerations into the relational database, and perhaps to improve the physical layer from the very beginning. Here are the tables I would create in the SQL database:

     ======
     users
     ======
     id (unsigned INT, primary key)
     username
     password
     isregisteredflag
     ...any other columns not relevant to the topic...
    
     ======
     files
     ======     
     id (unsigned INT,primary key)
     filename
     _userid (foreign key to users.id)
     createddate
     fileattributes
     ...any other columns not relevant to the topic...
    
     ======
     tag2file
     ======
     _fileid (foreign key to files.id)
     _tagid (foreign key to tags.id)
    
     ======
     tags
     ======
     id  (unsigned INT,primary key)
     tagname
    

    Since this structure lets you look up a user's files from the user ID, and the owning user from a file ID, you do not need to encode that relation in your folder structure. You simply name each file on the physical layer after its files.id, a numeric value generated by the database. Because the database generates the ID, it is guaranteed to be unique. You can also offer tags, which give your users a richer way to categorize files (if you do not like tags, you could model folders instead, also in the database).
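
To make the two lookups concrete, here is a minimal sketch of the users and files tables with both queries. It uses Python with an in-memory SQLite database purely so that it runs on its own; the original setup is MySQL, and the column types as well as the helper names `open_db`, `add_file`, `files_of_user` and `owner_of_file` are assumptions for illustration.

```python
import sqlite3

SCHEMA = """
CREATE TABLE users (
    id               INTEGER PRIMARY KEY,
    username         TEXT NOT NULL,
    isregisteredflag INTEGER NOT NULL DEFAULT 1
);
CREATE TABLE files (
    id       INTEGER PRIMARY KEY,
    filename TEXT NOT NULL,
    _userid  INTEGER NOT NULL REFERENCES users(id)
);
"""


def open_db() -> sqlite3.Connection:
    """Create an in-memory database with the sketched tables."""
    db = sqlite3.connect(":memory:")
    db.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only on request
    db.executescript(SCHEMA)
    return db


def add_file(db: sqlite3.Connection, filename: str, user_id: int) -> int:
    """Insert a file row and return the database-generated ID."""
    cur = db.execute(
        "INSERT INTO files (filename, _userid) VALUES (?, ?)",
        (filename, user_id),
    )
    return cur.lastrowid


def files_of_user(db: sqlite3.Connection, user_id: int) -> list:
    """User ID -> list of (file ID, original file name)."""
    rows = db.execute(
        "SELECT id, filename FROM files WHERE _userid = ? ORDER BY id",
        (user_id,),
    )
    return rows.fetchall()


def owner_of_file(db: sqlite3.Connection, file_id: int):
    """File ID -> owning user ID, or None if the file is unknown."""
    row = db.execute(
        "SELECT _userid FROM files WHERE id = ?", (file_id,)
    ).fetchone()
    return row[0] if row else None
```

The value returned by `add_file` is the number you would use as the name of the file on disk; the original file name lives only in the database.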

    How you deal with point 4 has a big impact on your design. If you address it only after everything has been set up, you may end up doing the work twice. Since files are already named after numeric IDs, it is a very small step to store the physical files in a key-value store in a NoSQL database (rather than on the file system), which makes your system highly scalable. You would then use an SQL database for metadata and structure, and a NoSQL database for file contents.
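
One way to keep that step small is to hide the storage behind a tiny put/get interface keyed by the file ID, so the file system can later be swapped for a key-value store without touching the rest of the code. The sketch below is in Python and only illustrates the idea: `BlobStore`, `DiskStore`, `MemoryStore` and `round_trip` are hypothetical names, `MemoryStore` is a stand-in and not a real NoSQL client, and a real service handling 10-20 GB files would stream data rather than pass whole files around as bytes.

```python
import tempfile
from pathlib import Path
from typing import Dict, Protocol


class BlobStore(Protocol):
    """Anything that can save and load file contents by numeric ID."""

    def put(self, file_id: int, data: bytes) -> None: ...

    def get(self, file_id: int) -> bytes: ...


class DiskStore:
    """Keeps each blob in a file named after its ID (flat layout for brevity)."""

    def __init__(self, root: str) -> None:
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)

    def put(self, file_id: int, data: bytes) -> None:
        (self.root / str(file_id)).write_bytes(data)

    def get(self, file_id: int) -> bytes:
        return (self.root / str(file_id)).read_bytes()


class MemoryStore:
    """In-memory stand-in for a key-value database."""

    def __init__(self) -> None:
        self._blobs: Dict[int, bytes] = {}

    def put(self, file_id: int, data: bytes) -> None:
        self._blobs[file_id] = data

    def get(self, file_id: int) -> bytes:
        return self._blobs[file_id]


def round_trip(store: BlobStore, file_id: int, data: bytes) -> bytes:
    """Application code depends only on the BlobStore interface."""
    store.put(file_id, data)
    return store.get(file_id)


if __name__ == "__main__":
    print(round_trip(MemoryStore(), 42, b"hello"))
    print(round_trip(DiskStore(tempfile.mkdtemp()), 42, b"hello"))
```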

    By the way, to cover your public files I would assume a user "public" with ID=1. This means hardcoding some data, which is generally considered ugly. However, since the "public" functionality is such a central element of your application, you can make this convention acceptable by documenting it properly. Alternatively, you can add some more tables and extend your code to cover the two cases in a 'clean' way.
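
A rough sketch of how the reserved "public" user could be used in the access check, under two assumptions of mine: uploads from unregistered users are attributed to that user, and the private/public setting from the question is available as a boolean (for example derived from fileattributes). It is written in Python for brevity; `PUBLIC_USER_ID`, `owner_for_upload`, `can_download` and `is_private` are illustrative names only.

```python
from typing import Optional

# Reserved account from the answer above. This is a hardcoded convention,
# so it should be documented wherever the users table is described.
PUBLIC_USER_ID = 1


def owner_for_upload(uploader_id: Optional[int]) -> int:
    """Uploads without a logged-in user are owned by the public user."""
    return PUBLIC_USER_ID if uploader_id is None else uploader_id


def can_download(owner_id: int, is_private: bool,
                 requester_id: Optional[int]) -> bool:
    """Decide whether a requester may download a file."""
    if owner_id == PUBLIC_USER_ID:
        return True  # files of the public user are open to everybody
    if not is_private:
        return True  # registered user marked the file as public
    return requester_id == owner_id  # private: only the owner may read


if __name__ == "__main__":
    print(can_download(owner_for_upload(None), True, None))  # True
    print(can_download(7, True, 8))                          # False
```

A server-side download script would call a check like this before sending any bytes, which is what the "hidden area" approach from point 1 relies on.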