<feed xmlns='http://www.w3.org/2005/Atom'>
<title>kernel/include/rdma/rdmavt_qp.h, branch linux-5.2.y</title>
<subtitle>Hosts the 0x221E linux distro kernel.</subtitle>
<id>https://universe.0xinfinity.dev/distro/kernel/atom?h=linux-5.2.y</id>
<link rel='self' href='https://universe.0xinfinity.dev/distro/kernel/atom?h=linux-5.2.y'/>
<link rel='alternate' type='text/html' href='https://universe.0xinfinity.dev/distro/kernel/'/>
<updated>2019-09-16T06:23:21Z</updated>
<entry>
<title>IB/hfi1: Unreserve a flushed OPFN request</title>
<updated>2019-09-16T06:23:21Z</updated>
<author>
<name>Kaike Wan</name>
<email>kaike.wan@intel.com</email>
</author>
<published>2019-07-15T16:45:28Z</published>
<link rel='alternate' type='text/html' href='https://universe.0xinfinity.dev/distro/kernel/commit/?id=c31f3dc4e07536c7cf0034684be6a847f26b8197'/>
<id>urn:sha1:c31f3dc4e07536c7cf0034684be6a847f26b8197</id>
<content type='text'>
When an OPFN request is flushed, the request is completed without
being unreserved from the send queue. Subsequently, when a new
request is posted, the following warning is triggered:

WARNING: CPU: 4 PID: 8130 at rdmavt/qp.c:1761 rvt_post_send+0x72a/0x880 [rdmavt]
Call Trace:
[&lt;ffffffffbbb61e41&gt;] dump_stack+0x19/0x1b
[&lt;ffffffffbb497688&gt;] __warn+0xd8/0x100
[&lt;ffffffffbb4977cd&gt;] warn_slowpath_null+0x1d/0x20
[&lt;ffffffffc01c941a&gt;] rvt_post_send+0x72a/0x880 [rdmavt]
[&lt;ffffffffbb4dcabe&gt;] ? account_entity_dequeue+0xae/0xd0
[&lt;ffffffffbb61d645&gt;] ? __kmalloc+0x55/0x230
[&lt;ffffffffc04e1a4c&gt;] ib_uverbs_post_send+0x37c/0x5d0 [ib_uverbs]
[&lt;ffffffffc04e5e36&gt;] ? rdma_lookup_put_uobject+0x26/0x60 [ib_uverbs]
[&lt;ffffffffc04dbce6&gt;] ib_uverbs_write+0x286/0x460 [ib_uverbs]
[&lt;ffffffffbb6f9457&gt;] ? security_file_permission+0x27/0xa0
[&lt;ffffffffbb641650&gt;] vfs_write+0xc0/0x1f0
[&lt;ffffffffbb64246f&gt;] SyS_write+0x7f/0xf0
[&lt;ffffffffbbb74ddb&gt;] system_call_fastpath+0x22/0x27

This patch fixes the problem by moving rvt_qp_wqe_unreserve() into
rvt_qp_complete_swqe() to simplify the code and make it less
error-prone.
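
The resulting inline, as a condensed sketch (the helper names follow
the rdmavt convention, but this is illustrative rather than the exact
upstream code; completion generation from opcode/status is elided):

        static inline u32 rvt_qp_complete_swqe(struct rvt_qp *qp,
                                               struct rvt_swqe *wqe,
                                               enum ib_wc_opcode opcode,
                                               enum ib_wc_status status)
        {
                u32 last;
                int flags = wqe-&gt;wr.send_flags;

                /* Unreserve unconditionally here, so a flushed OPFN
                 * request can no longer skip the step.
                 */
                rvt_qp_wqe_unreserve(qp, flags);
                rvt_put_qp_swqe(qp, wqe);

                /* ... read the SWQE fields needed for the completion,
                 * then publish the new index (see the completion API
                 * entries below) ...
                 */
                last = (qp-&gt;s_last + 1) % qp-&gt;s_size;
                smp_store_release(&amp;qp-&gt;s_last, last);
                return last;
        }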

Fixes: ca95f802ef51 ("IB/hfi1: Unreserve a reserved request when it is completed")
Link: https://lore.kernel.org/r/20190715164528.74174.31364.stgit@awfm-01.aw.intel.com
Cc: &lt;stable@vger.kernel.org&gt;
Reviewed-by: Mike Marciniszyn &lt;mike.marciniszyn@intel.com&gt;
Reviewed-by: Dennis Dalessandro &lt;dennis.dalessandro@intel.com&gt;
Signed-off-by: Kaike Wan &lt;kaike.wan@intel.com&gt;
Signed-off-by: Mike Marciniszyn &lt;mike.marciniszyn@intel.com&gt;
Signed-off-by: Jason Gunthorpe &lt;jgg@mellanox.com&gt;
</content>
</entry>
<entry>
<title>IB/{rdmavt, qib, hfi1}: Convert to new completion API</title>
<updated>2019-09-16T06:23:21Z</updated>
<author>
<name>Mike Marciniszyn</name>
<email>mike.marciniszyn@intel.com</email>
</author>
<published>2019-06-13T12:30:52Z</published>
<link rel='alternate' type='text/html' href='https://universe.0xinfinity.dev/distro/kernel/commit/?id=bef755d188f5482221c5253ba3a515e8e93bc944'/>
<id>urn:sha1:bef755d188f5482221c5253ba3a515e8e93bc944</id>
<content type='text'>
Convert all completions to use the new completion routine, which
fixes a race between post send and completion where fields from
a SWQE can be read after the SWQE has been freed.

This patch also addresses issues reported in
https://marc.info/?l=linux-kernel&amp;m=155656897409107&amp;w=2.

The reserved operation path has no need for any barrier.

The ordering requirement for the other path is addressed by the
smp_load_acquire() barrier.
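
For reference, a minimal sketch of that acquire side, modeled on
rvt_qp_is_avail() (the function name and the availability math are
illustrative):

        static u32 sq_free_entries(struct rvt_qp *qp)
        {
                u32 head = qp-&gt;s_head;
                /* Pairs with the smp_store_release() of s_last on the
                 * completion side: every SWQE read made by the
                 * completer is ordered before the new index becomes
                 * visible here.
                 */
                u32 slast = smp_load_acquire(&amp;qp-&gt;s_last);

                return (head &gt;= slast) ? qp-&gt;s_size - (head - slast)
                                       : slast - head;
        }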

Cc: Andrea Parri &lt;andrea.parri@amarulasolutions.com&gt;
Reviewed-by: Michael J. Ruhl &lt;michael.j.ruhl@intel.com&gt;
Signed-off-by: Mike Marciniszyn &lt;mike.marciniszyn@intel.com&gt;
Signed-off-by: Dennis Dalessandro &lt;dennis.dalessandro@intel.com&gt;
Signed-off-by: Doug Ledford &lt;dledford@redhat.com&gt;
</content>
</entry>
<entry>
<title>IB/rdmavt: Add new completion inline</title>
<updated>2019-09-16T06:23:21Z</updated>
<author>
<name>Mike Marciniszyn</name>
<email>mike.marciniszyn@intel.com</email>
</author>
<published>2019-06-13T12:30:44Z</published>
<link rel='alternate' type='text/html' href='https://universe.0xinfinity.dev/distro/kernel/commit/?id=9d3d11319bd20c7d8935f9368dc5566867b5e203'/>
<id>urn:sha1:9d3d11319bd20c7d8935f9368dc5566867b5e203</id>
<content type='text'>
There is open-coded send completion logic spread across all of
the drivers.

Convert to this routine to enforce the required ordering for
completions.  The routine fixes an ordering issue where the read of
the SWQE fields necessary for creating the completion can race with
a post send if the post send catches the send queue at the edge of
being full.  It is possible in that situation to read SWQE fields
that are still being written.

This new routine ensures that the SWQE fields are read prior to
advancing the index that post send uses to determine queue fullness.
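
The heart of the ordering, as a hedged sketch (the function is
hypothetical; the real inline also builds and posts the CQ entry):

        static inline u32 complete_swqe_core(struct rvt_qp *qp,
                                             struct rvt_swqe *wqe,
                                             u64 *wr_id, u32 *byte_len)
        {
                u32 last = (qp-&gt;s_last + 1) % qp-&gt;s_size;

                /* Read everything the completion needs from the SWQE
                 * first ...
                 */
                *wr_id = wqe-&gt;wr.wr_id;
                *byte_len = wqe-&gt;length;

                /* ... and only then advance s_last.  The release
                 * guarantees a post send that acquires the new index
                 * cannot see these SWQE fields mid-update.
                 */
                smp_store_release(&amp;qp-&gt;s_last, last);
                return last;
        }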

Reviewed-by: Michael J. Ruhl &lt;michael.j.ruhl@intel.com&gt;
Signed-off-by: Mike Marciniszyn &lt;mike.marciniszyn@intel.com&gt;
Signed-off-by: Dennis Dalessandro &lt;dennis.dalessandro@intel.com&gt;
Signed-off-by: Doug Ledford &lt;dledford@redhat.com&gt;
</content>
</entry>
<entry>
<title>IB/{rdmavt, qib, hfi1}: Use new routine to release reference counts</title>
<updated>2019-04-24T14:31:49Z</updated>
<author>
<name>Mike Marciniszyn</name>
<email>mike.marciniszyn@intel.com</email>
</author>
<published>2019-04-12T13:41:42Z</published>
<link rel='alternate' type='text/html' href='https://universe.0xinfinity.dev/distro/kernel/commit/?id=d40f69c9b9dff3e47d9647943db267b5792ae215'/>
<id>urn:sha1:d40f69c9b9dff3e47d9647943db267b5792ae215</id>
<content type='text'>
The reference count adjustments made when a request completes are
open coded throughout.

Add a routine that performs all of the reference count adjustments
and use it.
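
A hedged sketch of such a helper (the UD address-handle branch is
illustrative and may differ in detail from the upstream routine):

        static inline void rvt_put_qp_swqe(struct rvt_qp *qp,
                                           struct rvt_swqe *wqe)
        {
                rvt_put_swqe(wqe);      /* drop the SWQE's MR refs */
                if (qp-&gt;allowed_ops == IB_OPCODE_UD)
                        /* UD requests also hold an AH reference */
                        atomic_dec(&amp;ibah_to_rvtah(wqe-&gt;ud_wr.wr.ah)-&gt;refcount);
        }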

Reviewed-by: Michael J. Ruhl &lt;michael.j.ruhl@intel.com&gt;
Signed-off-by: Mike Marciniszyn &lt;mike.marciniszyn@intel.com&gt;
Signed-off-by: Dennis Dalessandro &lt;dennis.dalessandro@intel.com&gt;
Signed-off-by: Jason Gunthorpe &lt;jgg@mellanox.com&gt;
</content>
</entry>
<entry>
<title>IB/rdmavt: Fix ab/ba include issues</title>
<updated>2019-04-24T14:31:49Z</updated>
<author>
<name>Mike Marciniszyn</name>
<email>mike.marciniszyn@intel.com</email>
</author>
<published>2019-04-11T14:16:11Z</published>
<link rel='alternate' type='text/html' href='https://universe.0xinfinity.dev/distro/kernel/commit/?id=715ab1a862c85b08a9881851c7b1fba84b0dc26b'/>
<id>urn:sha1:715ab1a862c85b08a9881851c7b1fba84b0dc26b</id>
<content type='text'>
The current include file ordering for the rdmavt headers has an
ab/ba include issue that precludes using inlines from rdma_vt.h
in rdmavt_qp.h.

At the heart of the issue is that rdma_vt.h includes rdmavt_qp.h.

Fix the ordering issue by adjusting rdma_vt.h to not require rdmavt_qp.h
and move qp related inlines to rdmavt_qp.h.

Additionally, promote rvt_mmap_info to rdma_vt.h since it is shared
by rdmavt_cq.h and rdmavt_qp.h.
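
The shape of the problem, reduced to two hypothetical headers, a.h
standing in for rdma_vt.h and b.h for rdmavt_qp.h:

        /* a.h -- stands in for rdma_vt.h */
        #ifndef A_H
        #define A_H
        #include "b.h"  /* a.h pulls b.h in first ... */
        static inline int a_helper(void) { return 0; }
        #endif

        /* b.h -- stands in for rdmavt_qp.h */
        #ifndef B_H
        #define B_H
        /* ... so nothing here can call a_helper(): it is not yet
         * declared at this point, and including "a.h" from b.h would
         * be an a/b-a cycle that the include guards turn into an
         * empty include.  The fix is the same as in this patch: stop
         * including b.h from a.h and move the inlines that b.h needs
         * into b.h itself.
         */
        #endif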

Reviewed-by: Michael J. Ruhl &lt;michael.j.ruhl@intel.com&gt;
Signed-off-by: Mike Marciniszyn &lt;mike.marciniszyn@intel.com&gt;
Signed-off-by: Dennis Dalessandro &lt;dennis.dalessandro@intel.com&gt;
Signed-off-by: Jason Gunthorpe &lt;jgg@mellanox.com&gt;
</content>
</entry>
<entry>
<title>IB/{rdmavt, hfi1}: Miscellaneous comment fixes</title>
<updated>2019-04-24T14:31:48Z</updated>
<author>
<name>Kaike Wan</name>
<email>kaike.wan@intel.com</email>
</author>
<published>2019-04-11T14:15:49Z</published>
<link rel='alternate' type='text/html' href='https://universe.0xinfinity.dev/distro/kernel/commit/?id=ea752bc5e50a03e337dfa5c8940d357c62300f8a'/>
<id>urn:sha1:ea752bc5e50a03e337dfa5c8940d357c62300f8a</id>
<content type='text'>
This patch fixes miscellaneous comment errors.

Reviewed-by: Mike Marciniszyn &lt;mike.marciniszyn@intel.com&gt;
Signed-off-by: Kaike Wan &lt;kaike.wan@intel.com&gt;
Signed-off-by: Dennis Dalessandro &lt;dennis.dalessandro@intel.com&gt;
Signed-off-by: Jason Gunthorpe &lt;jgg@mellanox.com&gt;
</content>
</entry>
<entry>
<title>IB/hfi1: Add TID RDMA WRITE functionality into RDMA verbs</title>
<updated>2019-02-05T23:07:44Z</updated>
<author>
<name>Kaike Wan</name>
<email>kaike.wan@intel.com</email>
</author>
<published>2019-01-24T05:51:39Z</published>
<link rel='alternate' type='text/html' href='https://universe.0xinfinity.dev/distro/kernel/commit/?id=3c6cb20a0d17d7a75778fb0935d6fa427c8177af'/>
<id>urn:sha1:3c6cb20a0d17d7a75778fb0935d6fa427c8177af</id>
<content type='text'>
This patch integrates the TID RDMA WRITE protocol into the normal
RDMA verbs framework. The TID RDMA WRITE protocol is an end-to-end
protocol between the hfi1 drivers on two OPA nodes that converts a
qualified RDMA WRITE request into a TID RDMA WRITE request to avoid
data copying on the responder side.

Reviewed-by: Mike Marciniszyn &lt;mike.marciniszyn@intel.com&gt;
Signed-off-by: Mitko Haralanov &lt;mitko.haralanov@intel.com&gt;
Signed-off-by: Kaike Wan &lt;kaike.wan@intel.com&gt;
Signed-off-by: Dennis Dalessandro &lt;dennis.dalessandro@intel.com&gt;
Signed-off-by: Doug Ledford &lt;dledford@redhat.com&gt;
</content>
</entry>
<entry>
<title>IB/hfi1: Add an s_acked_ack_queue pointer</title>
<updated>2019-02-05T23:07:43Z</updated>
<author>
<name>Kaike Wan</name>
<email>kaike.wan@intel.com</email>
</author>
<published>2019-01-24T05:48:48Z</published>
<link rel='alternate' type='text/html' href='https://universe.0xinfinity.dev/distro/kernel/commit/?id=4f9264d156dc6c154a8a6cfae780730bad45c6f8'/>
<id>urn:sha1:4f9264d156dc6c154a8a6cfae780730bad45c6f8</id>
<content type='text'>
The s_ack_queue is managed by two pointers into the ring:
r_head_ack_queue and s_tail_ack_queue. r_head_ack_queue is the index of
where the next received request is going to be placed and s_tail_ack_queue
is the entry of the request currently being processed. This works
perfectly fine for normal Verbs as the requests are processed one at a
time and the s_tail_ack_queue is not moved until the request that it
points to is fully completed.

In this fashion, s_tail_ack_queue constantly chases r_head_ack_queue and
the two pointers can easily be used to determine "queue full" and "queue
empty" conditions.

The detection of these two conditions is important in determining
when an old entry can safely be overwritten with a newly received
request and when the resources associated with the old request can
safely be released.

When pipelined TID RDMA WRITE is introduced into this mix, things look
very different. r_head_ack_queue is still the point at which a newly
received request will be inserted, s_tail_ack_queue is still the
currently processed request. However, with pipelined TID RDMA WRITE
requests, s_tail_ack_queue moves to the next request once all TID RDMA
WRITE responses for that request have been sent. The rest of the protocol
for a particular request is managed by other pointers specific to TID RDMA
- r_tid_tail and r_tid_ack - which point to the entries for which the next
TID RDMA DATA packets are going to arrive and the request for which
the next TID RDMA ACK packets are to be generated, respectively.

What this means is that entries in the ring which are "behind"
s_tail_ack_queue (entries that s_tail_ack_queue has gone past) can no
longer be assumed complete. This is where the problem lies: a newly
received request could potentially overwrite a still active TID RDMA
WRITE request.

The reason why the TID RDMA pointers trail s_tail_ack_queue is that the
normal Verbs send engine uses s_tail_ack_queue as the pointer for the next
response. Since TID RDMA WRITE responses are processed by the normal Verbs
send engine, s_tail_ack_queue had to be moved to the next entry once all
TID RDMA WRITE response packets were sent to get the desired pipelining
between requests. Doing otherwise would mean that the normal Verbs send
engine would not be able to send the TID RDMA WRITE responses for the next
TID RDMA request until the current one is fully completed.

This patch introduces the s_acked_ack_queue index to point to the next
request to complete on the responder side. For requests other than TID
RDMA WRITE, s_acked_ack_queue should always be kept in sync with
s_tail_ack_queue. For TID RDMA WRITE request, it may fall behind
s_tail_ack_queue.
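
As a sketch of the resulting invariant (the function is hypothetical;
the two indices are the rvt_qp fields named above):

        static void advance_ack_queue_tail(struct rvt_qp *qp,
                                           bool tid_rdma_write, u32 n)
        {
                qp-&gt;s_tail_ack_queue = (qp-&gt;s_tail_ack_queue + 1) % n;
                if (!tid_rdma_write)
                        /* everything else completes in order, so keep
                         * the acked index in lock step with the tail */
                        qp-&gt;s_acked_ack_queue = qp-&gt;s_tail_ack_queue;
        }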

Reviewed-by: Mike Marciniszyn &lt;mike.marciniszyn@intel.com&gt;
Signed-off-by: Mitko Haralanov &lt;mitko.haralanov@intel.com&gt;
Signed-off-by: Kaike Wan &lt;kaike.wan@intel.com&gt;
Signed-off-by: Dennis Dalessandro &lt;dennis.dalessandro@intel.com&gt;
Signed-off-by: Doug Ledford &lt;dledford@redhat.com&gt;
</content>
</entry>
<entry>
<title>IB/hfi1: Increment the retry timeout value for TID RDMA READ request</title>
<updated>2019-02-05T22:53:55Z</updated>
<author>
<name>Kaike Wan</name>
<email>kaike.wan@intel.com</email>
</author>
<published>2019-01-24T03:31:57Z</published>
<link rel='alternate' type='text/html' href='https://universe.0xinfinity.dev/distro/kernel/commit/?id=039cd3daf19b9acbf080054d765cbceac842b6a0'/>
<id>urn:sha1:039cd3daf19b9acbf080054d765cbceac842b6a0</id>
<content type='text'>
The RC retry timeout value is based on the estimated time for the
response packet to come back. However, for a TID RDMA READ request,
due to the use of header suppression, the driver is normally not
notified of each incoming response packet, only of the last TID RDMA
READ response packet.
cover the transaction time for the entire length of a segment (default
256K) instead of that for a single packet. This patch addresses the
issue by introducing new retry timer functions to account for multiple
packets and wrapper functions for backward compatibility.
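
A back-of-the-envelope sketch of the scaling (the function name is
hypothetical; the actual computation lives in the new timer
functions):

        #include &lt;linux/kernel.h&gt;       /* DIV_ROUND_UP() */

        /* Wait for the whole segment, not a single response packet,
         * before declaring a TID RDMA READ response lost.
         */
        static u32 tid_rdma_read_timeout_packets(u32 segment_len,
                                                 u32 mtu)
        {
                return DIV_ROUND_UP(segment_len, mtu); /* 256K/4K = 64 */
        }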

Reviewed-by: Mike Marciniszyn &lt;mike.marciniszyn@intel.com&gt;
Signed-off-by: Kaike Wan &lt;kaike.wan@intel.com&gt;
Signed-off-by: Dennis Dalessandro &lt;dennis.dalessandro@intel.com&gt;
Signed-off-by: Doug Ledford &lt;dledford@redhat.com&gt;
</content>
</entry>
<entry>
<title>IB/hfi1: TID RDMA RcvArray programming and TID allocation</title>
<updated>2019-02-05T22:53:55Z</updated>
<author>
<name>Kaike Wan</name>
<email>kaike.wan@intel.com</email>
</author>
<published>2019-01-24T03:30:07Z</published>
<link rel='alternate' type='text/html' href='https://universe.0xinfinity.dev/distro/kernel/commit/?id=838b6fd2d9ca29998869e4d1ecf4566efe807666'/>
<id>urn:sha1:838b6fd2d9ca29998869e4d1ecf4566efe807666</id>
<content type='text'>
TID entries are used by hfi1 hardware to receive data payload from
incoming packets directly into a user buffer and thus avoid data copying
by software. This patch implements the functions for TID allocation,
freeing, and programming TID RcvArray entries in hardware for kernel
clients. TID entries are managed via lists of TID groups similar to PSM.
Furthermore, to track TID resource allocation for each request,
software flows are also allocated and freed as needed. Since software
flows consume a large amount of memory for tracking TID allocation and
freeing, it is generally desirable to allocate them dynamically in the
send queue, and only for TID RDMA requests, but to pre-allocate them
for the receive queue, because the send queue could have thousands of
entries while the receive queue has only a limited number of entries.
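
The allocation policy above, as a hedged sketch (the struct and
function names are hypothetical):

        /* hypothetical per-request TID tracking state */
        struct tid_flow {
                u32 npagesets;
                u32 tidcnt;
        };

        /* Send queue: thousands of entries, so allocate on demand
         * and only when the request is actually TID RDMA. */
        static struct tid_flow *sq_flow_alloc(bool is_tid_rdma)
        {
                return is_tid_rdma ?
                        kzalloc(sizeof(struct tid_flow), GFP_KERNEL) :
                        NULL;
        }

        /* Receive queue: small and bounded, so pre-allocate a flow
         * for every ring entry up front. */
        static int rq_flows_prealloc(struct tid_flow **flows,
                                     u32 rq_size)
        {
                u32 i;

                for (i = 0; i &lt; rq_size; i++) {
                        flows[i] = kzalloc(sizeof(**flows), GFP_KERNEL);
                        if (!flows[i])
                                return -ENOMEM;
                }
                return 0;
        }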

Signed-off-by: Mitko Haralanov &lt;mitko.haralanov@intel.com&gt;
Signed-off-by: Ashutosh Dixit &lt;ashutosh.dixit@intel.com&gt;
Signed-off-by: Mike Marciniszyn &lt;mike.marciniszyn@intel.com&gt;
Signed-off-by: Kaike Wan &lt;kaike.wan@intel.com&gt;
Signed-off-by: Dennis Dalessandro &lt;dennis.dalessandro@intel.com&gt;
Signed-off-by: Doug Ledford &lt;dledford@redhat.com&gt;
</content>
</entry>
</feed>
