
What is Tag-matching interface?


I heard that PSM is a library that supports tag matching. What is a tag-matching interface, and why is tag matching important for performance in the context of MPI?


Solution

  • Short intro to tag matching for MPI: https://www.hpcwire.com/2006/08/18/a_critique_of_rdma-1/ section "Matching"

    MPI is a two-sided interface with a large matching space: a MPI Recv is associated with a MPI Send according to several criteria such as Sender, Tag, and Context, with the first two possibly ignored (wildcard). Matching is not necessarily in order and, worse, a MPI Send can be posted before the matching MPI Recv ... MPI requires 64 bits of matching information, and MX, Portals, and QsNet provide such a matching capability.

    InfiniBand Verbs and other RDMA-based APIs do not support matching at all

    So it sounds like PSM is a way to add fast matching to InfiniBand-style network adapters (the first versions did the matching in software, with the possibility of moving part of it into hardware).

    I can't find public documentation for PSM (the User Guide at http://storusint.com/pdf/qlogic/InfiniPath%20User%20Guide%202_0.pdf has no details), but the library sources are available: https://github.com/01org/psm

    Some details are given in the PSM2 presentation https://www.openfabrics.org/images/eventpresos/2016presentations/304PSM2Features.pdf

    What is PSM? Matched Queue (MQ) component

    • Semantically matched to the needs of MPI using tag matching
    • Provides calls for communication progress guarantees
    • MQ completion semantics (standard vs. synchronized)

    PSM API

    • Global tag matching API with 64-bit tags
    • Scale up to 64K processes per job
    • MQ APIs provide point-to-point message passing between endpoints
    • e.g. psm_mq_send, psm_mq_irecv
    • No “recvfrom” functionality – needed by some applications

    So there are 64-bit tags. Every message carries a tag, and every receive posted to the Matched Queue has a tag as well (some tag-matching implementations also have a tag mask). According to mq_req_match() in psm_mq_internal.h, https://github.com/01org/psm/blob/67c0807c74e9d445900d5541358f0f575f22a630/psm_mq_internal.h#L381, PSM does have such a mask, called tagsel:

    typedef struct psm_mq_req {
    ...
        /* Tag matching vars */
        uint64_t    tag;
        uint64_t    tagsel;     /* used for receives */
    ...
    } psm_mq_req_t;
    
    psm_mq_req_t
    mq_req_match(struct mqsq *q, uint64_t tag, int remove)
    {
        psm_mq_req_t *curp;   /* address of the link that points at cur */
        psm_mq_req_t cur;
    
        /* Walk the queue in order; the first matching request wins. */
        for (curp = &q->first; (cur = *curp) != NULL; curp = &cur->next) {
          if (!((tag ^ cur->tag) & cur->tagsel)) { /* match! */
            if (remove) {
              if ((*curp = cur->next) == NULL) /* fix tail */
                q->lastp = curp;
              cur->next = NULL;
            }
            return cur;
          }
        }
        return NULL; /* no match */
    }
    

    So a match is computed as follows: the incoming tag is XORed with the tag of a receive posted to the MQ, and the result is ANDed with that receive's tagsel. If the result is zero, the receive matches; otherwise the next posted receive is checked.
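    The rule is easy to try in isolation. Below is a minimal sketch, not PSM code: tag_matches and tag_matches_examples are hypothetical names chosen for illustration, and only the expression itself comes from the PSM source quoted above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper, not a PSM function: true when the send tag agrees
 * with the receive tag on every bit that is set in rtagsel. */
static inline bool tag_matches(uint64_t stag, uint64_t rtag, uint64_t rtagsel)
{
    return ((stag ^ rtag) & rtagsel) == 0;
}

/* A few cases that follow directly from the rule. */
static inline void tag_matches_examples(void)
{
    /* Full mask: every bit must agree. */
    assert(tag_matches(0x1234, 0x1234, UINT64_MAX));
    assert(!tag_matches(0x1234, 0x1235, UINT64_MAX));

    /* Zero mask: no bit is compared, so any tag matches. */
    assert(tag_matches(0xdeadbeef, 0, 0));

    /* Partial mask: only the low 8 bits are compared. */
    assert(tag_matches(0xab34, 0xcd34, 0xff));
    assert(!tag_matches(0xab34, 0xcd35, 0xff));
}
```

    The XOR leaves a 1 wherever the two tags differ, and the AND discards the differences the receiver said it does not care about.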

    Comment from psm_mq.h, psm_mq_irecv() function, https://github.com/01org/psm/blob/4abbc60ab02c51efee91575605b3430059f71ab8/psm_mq.h#L206

    /* Post a receive to a Matched Queue with tag selection criteria
     *
     * Function to receive a non-blocking MQ message by providing a preposted
     * buffer. For every MQ message received on a particular MQ, the tag and @c
     * tagsel parameters are used against the incoming message's send tag as
     * described in tagmatch.
     *
     * [in] mq Matched Queue Handle
     * [in] rtag Receive tag
     * [in] rtagsel Receive tag selector
     * [in] flags Receive flags (None currently supported)
     * [in] buf Receive buffer 
     * [in] len Receive buffer length
     * [in] context User context pointer, available in psm_mq_status_t
     *                    upon completion
     * [out] req PSM MQ Request handle created by the preposted receive, to
     *                 be used for explicitly controlling message receive
     *                 completion.
     *
     * [post] The supplied receive buffer is given to MQ to match against incoming
     *       messages unless it is cancelled via psm_mq_cancel @e before any
     *       match occurs.
     *
     * The following error code is returned.  Other errors are handled by the PSM
     * error handler (psm_error_register_handler).
     *
     * [retval] PSM_OK The receive buffer has successfully been posted to the MQ.
     */
    psm_error_t
    psm_mq_irecv(psm_mq_t mq, uint64_t rtag, uint64_t rtagsel, uint32_t flags,
             void *buf, uint32_t len, void *context, psm_mq_req_t *req);
    

    Example of encoding data into tag:

     *     uint64_t tag = ( ((context_id & 0xffff) << 48) |
     *                      ((my_rank & 0xffff) << 32)    |
     *                      ((send_tag & 0xffffffff)) );
    

    With the tagsel mask a receive can express "match everything" (tagsel of all zeros), "match only selected bits or bytes and accept anything in the others", and "match exactly" (tagsel of all ones).
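    As a sketch of how such masks could look for the context/rank/tag layout from the encoding example, assuming 16 bits of context id, 16 bits of sender rank and 32 bits of user tag. The names make_tag, sel_matches and the SEL_* macros are hypothetical, and how a particular MPI library actually maps its wildcards onto PSM is not shown in the sources quoted here.

```c
#include <assert.h>
#include <stdint.h>

/* Layout assumed from the encoding example:
 * bits 63..48 context id, bits 47..32 sender rank, bits 31..0 user tag. */
static inline uint64_t make_tag(uint64_t context_id, uint64_t rank,
                                uint64_t user_tag)
{
    return ((context_id & 0xffff) << 48) |
           ((rank & 0xffff) << 32) |
           (user_tag & 0xffffffff);
}

/* Hypothetical selector names. A zero bit means "do not compare this bit". */
#define SEL_EXACT      UINT64_C(0xffffffffffffffff) /* compare everything   */
#define SEL_ANY_SOURCE UINT64_C(0xffff0000ffffffff) /* ignore the rank      */
#define SEL_ANY_TAG    UINT64_C(0xffffffff00000000) /* ignore the user tag  */
#define SEL_ANY        UINT64_C(0xffff000000000000) /* compare context only */

static inline int sel_matches(uint64_t stag, uint64_t rtag, uint64_t rtagsel)
{
    return ((stag ^ rtag) & rtagsel) == 0;
}

static inline void wildcard_examples(void)
{
    uint64_t sent = make_tag(3, 7, 42); /* context 3, rank 7, user tag 42 */

    assert(sel_matches(sent, make_tag(3, 7, 42), SEL_EXACT));
    assert(sel_matches(sent, make_tag(3, 0, 42), SEL_ANY_SOURCE));
    assert(sel_matches(sent, make_tag(3, 7, 0), SEL_ANY_TAG));
    /* A different context never matches while its bits are selected. */
    assert(!sel_matches(sent, make_tag(4, 0, 0), SEL_ANY));
}
```

    The context bits stay selected in every mask, so messages from different communicators cannot match each other even when both source and tag are wildcards.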

    There is also a newer PSM2 API, open source as well: https://github.com/01org/opa-psm2. Its programmer's guide is published at http://www.intel.com/content/dam/support/us/en/documents/network/omni-adptr/sb/Intel_PSM2_PG_H76473_v1_0.pdf

    In PSM2 the tags are longer and the matching rule is stated explicitly (stag is the "Message Send Tag", the tag value sent with the message, and rtag is the tag of the receive request): https://www.openfabrics.org/images/eventpresos/2016presentations/304PSM2Features.pdf#page=7

    Tag matching improvement

    • Increased tag size to 96 bits
    • Fundamentally ((stag ^ rtag) & rtagsel) == 0
    • Supports wildcards such as MPI_ANY_SOURCE or MPI_ANY_TAG using zero bits in rtagsel
    • Allows for practically unlimited scalability
    • Up to 64M processes per job

    PSM2 TAG MATCHING

    #define PSM_MQ_TAG_ELEMENTS 3
    typedef struct psm2_mq_tag {
        union {
            uint32_t tag[PSM_MQ_TAG_ELEMENTS] __attribute__((aligned(16)));
            struct {
                uint32_t tag0;
                uint32_t tag1;
                uint32_t tag2;
            };
        };
    } psm2_mq_tag_t;
    
    • Application fills ‘tag’ array or ‘tag0/tag1/tag2’ and passes to PSM
    • Both tag and tag mask use the same 96 bit tag type
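    The 96-bit rule is the same XOR-and-mask test applied to each of the three 32-bit words. A minimal sketch, using a local stand-in type (tag96_t, a hypothetical name) so that it compiles without the PSM2 headers:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TAG_ELEMENTS 3

/* Local stand-in for psm2_mq_tag_t; not the PSM2 type itself. */
typedef struct tag96 {
    uint32_t tag[TAG_ELEMENTS];
} tag96_t;

/* True when stag and rtag agree on every bit selected by rtagsel,
 * checked one 32-bit word at a time. */
static inline bool tag96_matches(const tag96_t *stag, const tag96_t *rtag,
                                 const tag96_t *rtagsel)
{
    assert(stag != NULL && rtag != NULL && rtagsel != NULL);
    for (int i = 0; i < TAG_ELEMENTS; i++) {
        if ((stag->tag[i] ^ rtag->tag[i]) & rtagsel->tag[i])
            return false; /* a selected bit differs */
    }
    return true;
}
```

    Setting one whole word of the selector to zero ignores that word completely, which is one way a field such as the source rank or the user tag could be wildcarded if it is stored in its own word.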

    In addition, the psm2_mq_req struct stores the source peer address next to the matching variables: https://github.com/01org/opa-psm2/blob/master/psm_mq_internal.h#L180

       /* Tag matching vars */
       psm2_epaddr_t peer;
       psm2_mq_tag_t tag;
       psm2_mq_tag_t tagsel;    /* used for receives */
    

    The match is still found by scanning a list in software: mq_list_scan(), called from mq_req_match() https://github.com/01org/opa-psm2/blob/85c07c656198204c4056e1984779fde98b00ba39/psm_mq_recv.c#L188:

    psm2_mq_req_t
    mq_list_scan(struct mqq *q, psm2_epaddr_t src, psm2_mq_tag_t *tag, int which, uint64_t *time_threshold)
    {
        psm2_mq_req_t *curp, cur;
    
        for (curp = &q->first;
             ((cur = *curp) != NULL) && (cur->timestamp < *time_threshold);
             curp = &cur->next[which]) {
            if ((cur->peer == PSM2_MQ_ANY_ADDR || src == cur->peer) &&
                !((tag->tag[0] ^ cur->tag.tag[0]) & cur->tagsel.tag[0]) &&
                !((tag->tag[1] ^ cur->tag.tag[1]) & cur->tagsel.tag[1]) &&
                !((tag->tag[2] ^ cur->tag.tag[2]) & cur->tagsel.tag[2])) {
                *time_threshold = cur->timestamp;
                return cur;
            }
        }
        return NULL;
    }
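
    To see the scan work end to end, here is a simplified, self-contained model of a posted-receive queue. It is a sketch under stated assumptions, not the PSM2 implementation: the types and functions (recv_req_t, recv_queue_t, queue_init, queue_post, queue_match) and ANY_PEER are hypothetical, the peer is a plain int, the tag is 64 bits for brevity, and the timestamp threshold and the multiple next[] links of the real code are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ANY_PEER (-1) /* hypothetical wildcard, stands in for PSM2_MQ_ANY_ADDR */

typedef struct recv_req {
    int peer;             /* expected sender, or ANY_PEER */
    uint64_t tag;         /* receive tag */
    uint64_t tagsel;      /* receive tag selector (mask) */
    struct recv_req *next;
} recv_req_t;

typedef struct recv_queue {
    recv_req_t *first;
    recv_req_t **lastp;   /* address of the terminating NULL link */
} recv_queue_t;

static inline void queue_init(recv_queue_t *q)
{
    q->first = NULL;
    q->lastp = &q->first;
}

/* Append a receive; older receives stay closer to the head. */
static inline void queue_post(recv_queue_t *q, recv_req_t *req)
{
    req->next = NULL;
    *q->lastp = req;
    q->lastp = &req->next;
}

/* Return the oldest posted receive that accepts (src, tag) and unlink it,
 * or NULL when nothing matches. */
static inline recv_req_t *queue_match(recv_queue_t *q, int src, uint64_t tag)
{
    recv_req_t **curp;
    recv_req_t *cur;

    assert(q != NULL);
    for (curp = &q->first; (cur = *curp) != NULL; curp = &cur->next) {
        if ((cur->peer == ANY_PEER || cur->peer == src) &&
            !((tag ^ cur->tag) & cur->tagsel)) {
            if ((*curp = cur->next) == NULL)
                q->lastp = curp; /* removed the tail, fix the tail pointer */
            cur->next = NULL;
            return cur;
        }
    }
    return NULL;
}
```

    The cost of a match is linear in the number of receives that sit in front of the matching one, which is why long queues of unmatched receives hurt message latency and why moving matching closer to the hardware is attractive.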