
How to ensure access to the right backend M3U8 file in origin cluster mode


From the SRS wiki on how to transmux HLS, we know that SRS generates the corresponding M3U8 playlist in hls_path. Here is my config file:

http_server {
    enabled         on;
    listen          8080;
    dir             ./objs/nginx/html;
}
vhost __defaultVhost__ {
    hls {
        enabled         on;
        hls_path        /data/hls-records;
        hls_fragment    10;
        hls_window      60;
    }
}

With a single SRS server, every client playing the HLS stream accesses the same SRS server that the stream was pushed to, so that works. But in origin cluster mode there are many SRS servers, and each stream lives on only one of them. When a client plays an HLS stream, we can't guarantee that the request reaches the right origin SRS server (it gets an HTTP 404 if the stream is not on that server). This is unlike RTMP and HTTP-FLV streams, where SRS uses the coworkers feature, via the HTTP API, to redirect to the right origin SRS server.

To fix this issue, I can think of the following two solutions:

  • Use a dedicated backend HLS segmenting SRS server:
    Don't generate the M3U8 on the origin SRS servers. Instead, every stream is forwarded to this dedicated SRS server, all the M3U8 files are generated there, and all HLS requests are proxied to it (using nginx). The downsides of this solution are that it is limited to one instance, cannot scale, and is a single point of failure.

The forward config in the origin srs.conf looks like this:

vhost same.vhost.forward.srs.com {
    # forward stream to other servers.
    forward {
        enabled on;
        destination 192.168.1.120:1935;
    }
}

where 192.168.1.120 is the backend HLS segmenting SRS server.

  • Use cloud storage such as NFS/K8S PV/a distributed file system:
    Mount the cloud storage as a local folder on every SRS server. Whichever SRS server a stream is on, its M3U8 file and ts segments are written to the same large storage, so an HTTP server can serve any HLS request from it as static files. In my tests, this is a good solution as long as the storage write speed is reliable. But if the network is unstable, or the write speed can't keep up with the incoming stream, the writes block the other coroutines and SRS behaves abnormally.

The hls_path config looks like this:

vhost __defaultVhost__ {
    hls {
        enabled         on;
        hls_path        /shared_storage/hls-records;
        hls_fragment    10;
        hls_window      60;
    }
}

Here 'shared_storage' is an NFS/CephFS/PV mount point.

In my view, neither of these solutions really resolves the access issue. Is there a better, more reliable, production-grade solution for this case?


Solution

  • Since you use Origin Cluster, you presumably have lots of streams to serve, with lots of encoders publishing streams to your media servers. The keys to solving the problem:

    1. Never use a single server; use a cluster for elasticity, because you might get many more streams in the future. So forward is not a good fit, because you must configure a specific set of streams to forward to each server, which amounts to a manual hash algorithm.
    2. Besides bandwidth, disk IO is also a bottleneck. You definitely need a high-performance network storage cluster. But be careful: never let SRS write directly to that storage, because it will block the SRS coroutines.

    So the best solution, as far as I know, is to:

    1. Use SRS Origin Cluster and write HLS to local disk, or better, a RAM disk, to make sure disk IO never blocks the SRS coroutines (which are driven by state-threads network IO).
    2. Use a network storage cluster to store the HLS files, for example cloud storage like AWS S3, or NFS/K8S PV/a distributed file system. Use nginx or a CDN to deliver the HLS.

    Now the problem is: how do you move data from memory/disk to the network storage cluster?

    You must build a service, in Python or Go:

    • Use the on_hls callback to notify your service to move the HLS files.
    • Use the on_publish callback to notify your service to start FFmpeg to convert RTMP to HLS.
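
The on_hls option could look roughly like the Python sketch below. It is only an illustration, not the answerer's implementation. It assumes SRS posts a JSON body with the `cwd`, `file`, `url`, `m3u8` and `m3u8_url` fields to the hook URL and expects an HTTP 200 response with body `0`. The `STORAGE_ROOT` path, the port 8085 and the names `move_hls`, `HookHandler` and `serve` are hypothetical, and a mounted directory stands in for whatever network storage you use.

```python
import json
import shutil
from http.server import BaseHTTPRequestHandler, HTTPServer
from pathlib import Path

# Hypothetical mount point of the network storage (assumption, adjust as needed).
STORAGE_ROOT = Path("/shared_storage/hls-records")


def resolve(cwd, path):
    """Resolve a callback path against the SRS working directory."""
    p = Path(path)
    return p if p.is_absolute() else Path(cwd) / p


def move_hls(payload, storage_root):
    """Copy the finished segment, then the playlist, under storage_root.

    Assumes the payload has the cwd/file/url/m3u8/m3u8_url fields of the
    SRS on_hls callback. Returns the list of destination paths.
    """
    cwd = payload.get("cwd", ".")
    copied = []
    # Segment first, playlist second, so the stored playlist never
    # references a segment that has not been copied yet.
    for src_key, dst_key in (("file", "url"), ("m3u8", "m3u8_url")):
        src = resolve(cwd, payload[src_key])
        dst = storage_root / payload[dst_key].lstrip("/")
        dst.parent.mkdir(parents=True, exist_ok=True)
        tmp = dst.with_name(dst.name + ".tmp")
        shutil.copyfile(src, tmp)
        tmp.replace(dst)  # rename into place so readers never see a partial file
        copied.append(dst)
    return copied


class HookHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length) or b"{}")
        if payload.get("action") == "on_hls":
            move_hls(payload, STORAGE_ROOT)
        body = b"0"  # assumed success reply expected by the SRS hook
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8085):
    """Run the hook service; the port is an arbitrary choice."""
    HTTPServer(("0.0.0.0", port), HookHandler).serve_forever()
```

Calling `serve()` starts the service; SRS would then need an `on_hls` hook in its `http_hooks` section pointing at it. The slow write to network storage now happens in this process rather than inside SRS, which is the point of the design. A production service would probably acknowledge the callback quickly and do the copy (or an S3 upload) in a worker queue.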

    Note that FFmpeg should pull the stream from an SRS edge, never directly from an origin server.
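
For the on_publish option, a service might build and launch an FFmpeg command along these lines. This is a minimal Python sketch under stated assumptions: the callback payload carries `app` and `stream`, the stream is on the default vhost, and `EDGE_HOST`, `HLS_ROOT`, `build_ffmpeg_cmd` and `start_transmux` are hypothetical names and values that do not come from the original post. The segment length and playlist size mirror the `hls_fragment 10` and `hls_window 60` settings from the question.

```python
import os
import subprocess

# Hypothetical values (assumptions): an SRS edge address and the output root.
EDGE_HOST = "127.0.0.1:1935"
HLS_ROOT = "/shared_storage/hls-records"


def build_ffmpeg_cmd(payload, edge_host=EDGE_HOST, hls_root=HLS_ROOT):
    """Build an FFmpeg command that pulls RTMP from the edge and writes HLS."""
    app, stream = payload["app"], payload["stream"]
    source = f"rtmp://{edge_host}/{app}/{stream}"  # pull from the edge, not the origin
    playlist = f"{hls_root}/{app}/{stream}.m3u8"
    return [
        "ffmpeg",
        "-i", source,
        "-c", "copy",                      # transmux only, no re-encoding
        "-f", "hls",
        "-hls_time", "10",                 # 10-second segments
        "-hls_list_size", "6",             # 6 x 10 s = 60-second window
        "-hls_flags", "delete_segments",   # drop segments that leave the window
        playlist,
    ]


def start_transmux(payload):
    """Start FFmpeg for a newly published stream and return the process."""
    cmd = build_ffmpeg_cmd(payload)
    # Create the output directory up front rather than relying on FFmpeg.
    os.makedirs(os.path.dirname(cmd[-1]), exist_ok=True)
    return subprocess.Popen(cmd)
```

A real service would also need to stop the process when the stream is unpublished and restart it if FFmpeg exits early. Neither is shown here.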