
Sharing Drupal's public file storage

On a recent client project we faced the challenge of having no persistent file storage for public/private files shared across multiple instances.

The only realistic solution at the moment is using a module such as S3 File System to employ a different storage backend. The S3 module for Drupal isn’t perfect, but before getting to that, let’s look at why this is difficult in the first place.

Drupal’s public storage structure

ls -l sites/default/files
drwxrwxr-x 2 www-data www-data 20480 Dec 20 13:57 css
drwxrwxr-x 2 www-data www-data 16384 Dec 20 13:56 js
drwxrwxr-x 3 www-data www-data  4096 Jul 28  2018 media
drwxrwxr-x 3 www-data www-data  4096 Dec 20 13:56 php
drwxrwxr-x 9 www-data www-data  4096 Jun 24 21:29 styles

This minimal example already shows four distinct behaviors which should really be addressed by distinct storage solutions. Unfortunately we are currently stuck with a single one for historical reasons. I’ll address css/js at the end, since it’s the tricky one.

files/media (or any upload folder)

You can define arbitrary folders in Drupal’s Field UI to manage uploads coming in from end users (even though that’s risky and should probably live in private storage), as well as images, videos, etc. uploaded by content editors.
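
All of these uploads go through Drupal’s public:// stream wrapper, which is why swapping the storage backend underneath is transparent to the field configuration. A rough illustration (the path is made up; older 8.x cores used file_unmanaged_save_data() instead of the file_system service):

// Writes through the public:// scheme; whichever stream wrapper backs it
// (local disk, S3, ...) decides where the bytes physically end up.
\Drupal::service('file_system')->saveData(
  'example payload',
  'public://media/2018-12/example.txt',
  \Drupal\Core\File\FileSystemInterface::EXISTS_REPLACE
);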

This content has to be identical across all instances, so here we absolutely need shared storage such as an S3 bucket or NFS.
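
To give an idea of what that looks like with S3 File System, the backend switch lives in settings.php. This is only a sketch: the key names below follow the module’s README at the time of writing and may differ between releases, so verify them against the version you actually install.

// settings.php — point public:// (and private://) at an S3 bucket via s3fs.
$settings['s3fs.access_key'] = getenv('S3_ACCESS_KEY');
$settings['s3fs.secret_key'] = getenv('S3_SECRET_KEY');
// Hypothetical bucket name.
$config['s3fs.settings']['bucket'] = 'example-drupal-files';
$settings['s3fs.use_s3_for_public'] = TRUE;
$settings['s3fs.use_s3_for_private'] = TRUE;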

files/php

I’m assuming that you are not keeping config folders in here (keep them in your repository) but only have Twig writing its compiled template cache to this folder.

This is a pure cache which could just as well live in a temporary directory. End users never receive its contents, and files deleted from this folder are regenerated on demand without a manual cache clear. Nothing to do here; it is already correct on every instance.
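
If you want to make that explicit, the PhpStorage backends can be pointed at instance-local storage from settings.php. A minimal sketch, assuming the default file-based backend (the directory is just an example):

// settings.php — keep Twig's compiled templates on instance-local disk.
// Safe because they are rebuilt on demand and never served to end users.
$settings['php_storage']['twig'] = [
  'directory' => '/var/tmp/drupal-php',
];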

files/styles

This folder is basically a cache: one we want to persist if at all possible, but if it were to vanish and be empty, we would have no problem. Incoming requests for missing derivatives simply regenerate them on the fly (just like php/twig), at the cost of a bit more processing.

It would be nice if this were available on shared storage, but it is not strictly necessary. It should persist in some form, since always recomputing all images is not really acceptable; how you achieve that depends on what your infrastructure makes available when additional instances are spun up.
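
The on-demand behavior comes from the image module itself: a missing derivative is rebuilt the first time its URL is hit, and you can also warm derivatives yourself, e.g. from a deploy script. A sketch (style name and source path are made up):

// Requesting this URL regenerates the derivative even if files/styles is empty.
$style = \Drupal\image\Entity\ImageStyle::load('thumbnail');
$url = $style->buildUrl('public://media/2018-12/example.jpg');

// Or generate it eagerly to avoid the first-hit processing cost.
$style->createDerivative(
  'public://media/2018-12/example.jpg',
  $style->buildUri('public://media/2018-12/example.jpg')
);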

files/css & files/js

During cache rebuilding (e.g. drush cr) the asset dumper writes the relevant assets to disk. Afterwards they can be served directly by the web server without involving PHP.

This process makes sense insofar as we need everything bootstrapped to know which assets to build. It works mostly fine with the S3 module, unless you switch buckets while caches need regenerating and end up with messy metadata (e.g. prod ➡ dev).

Unfortunately these files cannot be regenerated on demand. This means that if I deploy and the cache is rebuilt on instance 1 (assuming a change in assets), the end user will get a 404 if the asset is requested from instance 3.

At this point there does not seem to be a separate shorthand available to just dump the assets, so one would have to run drush cr on every instance, which is not an acceptable penalty.

Thus, we need this on persistent shared storage as well. Or do we?

There has been some work on aggregation in core: Race conditions, stampedes and cold cache performance issues with css/js aggregation (#1014086). However, advagg promises to do just that in terms of lazy loading, and without the file I/O, so it might even be worthwhile when you do have shared storage but no CDN.

I’ve used advagg in the past just to get a slightly better asset preprocessor, but clearly the module provides far more essential features and should probably be on your list of standard performance optimizations.