Mercury enables remote procedure calls (RPC) for high performance computing

Abstract

Remote Procedure Call (RPC) is a widely used technology for distributed services. This technology is now increasingly used in the context of High Performance Computing (HPC), allowing the execution of routines to be delegated to remote nodes that can be set aside and dedicated to specific tasks. However, existing RPC frameworks employ a socket-based network interface (usually on top of TCP/IP), which is not suitable for HPC systems because this API does not usually map well to the native network transport used on these systems, resulting in low network performance. In addition, existing RPC frameworks usually do not support the handling of large data parameters, such as those found in read or write calls. In this paper, we propose an asynchronous RPC interface, specifically designed for use in HPC systems, that allows asynchronous transmission of parameters and execution requests and direct support for large data parameters. The interface is generic and allows the transmission of any function call. In addition, the network implementation is abstracted, allowing easy porting to future systems and efficient use of existing native transport mechanisms.

I. Introduction

When working in a heterogeneous environment, it is often very useful for an engineer or scientist to be able to distribute the various steps of an application workflow; especially in high-performance computing, it is common to see systems or nodes that embed different types of resources and libraries that can be specialized for specific tasks, such as computation, storage, or analysis and visualization. Remote Procedure Call (RPC) [1] is a technique that follows the client/server model and allows local calls to be transparently performed on remote resources. It consists of serializing local function arguments into a memory buffer and sending that buffer to the remote target, which in turn deserializes the arguments and performs the corresponding function call. Libraries implementing this technique can be found in a variety of areas, such as web services using Google Protocol Buffers [2] or Facebook Thrift [3], or grid computing using GridRPC [4]. RPC can also be implemented using more object-oriented approaches and frameworks, such as CORBA [5] or Java RMI [6], where abstract objects and methods can be distributed across a range of nodes or machines.

However, there are two major limitations to using these standard RPC frameworks on HPC systems: 1) data cannot be transferred efficiently, because these frameworks are designed primarily on top of the TCP/IP protocol rather than the native transport mechanisms of such systems; 2) very large amounts of data cannot be transferred, as the limits imposed by the RPC interface are typically on the order of megabytes (MB). Furthermore, even when no limits are enforced, transferring large amounts of data through RPC libraries is generally discouraged, mainly because of the overhead introduced by serialization and encoding, which causes the data to be copied multiple times before reaching the remote node.

This paper is organized as follows: We first discuss related work in Section II, and then discuss the network abstraction layer on which the interface is built, as well as the architecture defined for efficient transmission of small and large data in Section III. Section IV gives an overview of the API and shows its advantages in supporting the use of pipelining techniques. We then describe the development of a network transport plugin for our interface and the performance evaluation results. Section V presents conclusions and future work directions.

II. Related Work

The Network File System (NFS) [7] is a good example of using RPC to handle bulk data transfers and is therefore very close to using RPC on HPC systems. It exploits XDR [8] to serialize arbitrary data structures and create a system-independent description, then sends the resulting byte stream to a remote resource, which can deserialize and retrieve the data. It can also transfer data over the RDMA protocol using a separate transport mechanism (on recent versions of NFS), in which case the data is processed outside of the XDR stream. The interface we present in this paper follows similar principles, but additionally handles bulk data directly. It is also not restricted to using XDR for data encoding, which may affect performance, especially when the sender and receiver share a common system architecture. By providing a network abstraction layer, the RPC interface we define enables users to efficiently send both small and large data using small messages or remote memory access (RMA) type transports that fully support the single-sided semantics present on recent HPC systems. In addition, all presented interfaces are non-blocking, thus allowing asynchronous operation modes that prevent the caller from waiting for one operation to execute before issuing another.

The I/O Forwarding Extensibility Layer (IOFSL) [9] is another project on which part of the work presented in this paper is based. IOFSL specifically forwards I/O calls using RPC. It defines an API called ZOIDFS that serializes function arguments locally and sends them to a remote server, where they can in turn be mapped to file system specific I/O operations. One of the main motivations for extending the existing work in IOFSL is to be able to send not only a specific set of calls (such as those defined via the ZOIDFS API), but a wide variety of calls that can be defined dynamically and generically. It is also worth noting that IOFSL is built on top of the BMI [10] network transport layer used in the Parallel Virtual File System (PVFS) [11]. It allows support for dynamic connections and fault tolerance, and also defines two types of messaging, unexpected and expected (described in Section III-B), that enable asynchronous modes of operation. However, BMI is limited in its design in that it does not directly expose the RMA semantics required to explicitly implement RDMA operations from client memory to server memory, which can be a problem and performance limitation (the main advantages of using an RMA approach are described in Section III-B). In addition, while BMI does not provide one-sided operations, it does provide a set of relatively high-level network operations. This makes porting BMI to new network transports (such as the Cray Gemini interconnect [12]) a nontrivial task and more time-consuming than it should be, since implementing RPC in our context only requires a subset of the functionality provided by BMI.

Another project, the Sandia National Laboratories Network Scalable Service Interface (Nessie) [13] system, provides a simple RPC mechanism that was originally developed for the Lightweight File Systems [14] project. It provides an asynchronous RPC solution designed primarily for overlapping computation and I/O. Nessie's RPC interface directly relies on the Sun XDR solution, which is designed primarily for communication between heterogeneous architectures, even though in practice the HPC systems targeted are homogeneous. Nessie provides a separate mechanism for handling bulk data transfers that can efficiently transfer data from one memory to another using RDMA and supports a variety of network transports. Nessie clients use the RPC interface to push control messages to the server. In addition, Nessie exposes a separate, one-sided API (similar to Portals [15]) that users can use to push or pull data between clients and servers. Mercury is different because its interface also natively supports RDMA and can transparently handle providing bulk data to users by automatically generating abstract memory handles that represent remote large data parameters. These handles are easier to operate and do not require any additional work from the user. Mercury also provides fine-grained control over data transfer if needed (for example, to implement pipelining). In addition, Mercury provides a higher-level interface than Nessie, greatly reducing the amount of user code required to implement RPC functionality.

Another similar approach can be seen in the Decoupled and Asynchronous Remote Transfers (DART) [16] project. While DART is not defined as an explicit RPC framework, it allows the transfer of large amounts of data from applications running on compute nodes of an HPC system to local storage or remote locations using a client/server model to enable remote application monitoring, data analysis, code coupling, and data archiving. The key requirements that DART attempts to meet include minimizing the data transfer overhead of applications, achieving high throughput, low latency data transfer, and preventing data loss. To achieve these goals, DART is designed so that dedicated nodes (i.e., separate from the application compute nodes) use RDMA to asynchronously fetch data from the compute nodes' memory. In this way, expensive data I/O and streaming operations from the application compute nodes to the dedicated nodes are offloaded, allowing applications to proceed while data is being transferred. While using DART is not transparent and thus requires the user to send explicit requests, there is no inherent limitation to integrating such a framework within our network abstraction layer; wrapping it within the RPC layer we define would therefore allow users to use DART to transfer data on the platforms it supports.

III. Architecture

As described in the previous sections, Mercury's interface relies on three main components: a network abstraction layer, an RPC interface that handles calls in a generic way, and a bulk data interface that complements the RPC layer and is designed to easily transfer large amounts of data through abstract memory segments. We present the overall architecture and each of its components in this section.

A. Overview

The RPC interface follows a client/server architecture. As described in Figure 1, issuing a remote call results in different steps depending on the size of the data associated with the call. We distinguish two types of transfers: transfers containing typical function parameters, which are usually small, referred to as metadata, and transfers of function parameters describing large amounts of data, referred to as bulk data.

Each RPC call sent over the interface results in the serialization of the function arguments into a memory buffer (whose size is typically limited to 1 KB, depending on the interconnect), which is then sent to the server using the Network Abstraction Layer interface. One of the key requirements is to limit memory copies at any stage of the transfer, especially when transferring large amounts of data. So, if the data being sent is small, it is serialized and sent using a small message, otherwise a description of the memory area to be transferred is sent to the server in the same small message, and the server can then start pulling the data (if the data is input to the remote call) or pushing the data (if the data is output from the remote call). Limiting the size of the initial RPC request to the server also helps with scalability, as it avoids unnecessary consumption of server resources in cases where a large number of clients are accessing the same server simultaneously. Depending on the degree of control required, all of these steps can be handled transparently by Mercury or exposed directly to the user.

Figure 1: Architecture overview. Each party uses an RPC handler to serialize and deserialize the parameters sent through the interface. Calling functions with small parameters results in using the short message mechanism exposed by the network abstraction layer, while functions containing large data parameters additionally use the RMA mechanism.
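
To make the decision between these two paths concrete, the sketch below illustrates it in C. It is only a hypothetical illustration: none of the helper functions (encode_args, encode_inline_data, encode_bulk_descriptor, na_send_unexpected) are part of the Mercury API, and the 1 KB limit simply reflects the order of magnitude mentioned above.

/* Hypothetical sketch of the metadata vs. bulk decision; none of the helper
 * functions below exist in Mercury, they only name the steps described above. */
#include <stddef.h>

#define SMALL_MSG_MAX 1024   /* ~1 KB request buffer, interconnect dependent */

/* Hypothetical helpers (declarations only): */
size_t encode_args(void *msg, size_t cap, const void *args, size_t args_size);
size_t encode_inline_data(void *msg, size_t cap, const void *data, size_t size);
size_t encode_bulk_descriptor(void *msg, size_t cap, const void *data, size_t size);
void   na_send_unexpected(const void *msg, size_t len);  /* small message send */

void send_rpc(const void *args, size_t args_size,
              const void *data, size_t data_size)
{
    char msg[SMALL_MSG_MAX];
    size_t used = encode_args(msg, sizeof(msg), args, args_size);

    if (used + data_size <= sizeof(msg)) {
        /* Small payload: serialize it into the same small request message. */
        used += encode_inline_data(msg + used, sizeof(msg) - used, data, data_size);
    } else {
        /* Large payload: only a memory descriptor travels in the request;
         * the server later pulls (input) or pushes (output) the data via RMA. */
        used += encode_bulk_descriptor(msg + used, sizeof(msg) - used, data, data_size);
    }
    na_send_unexpected(msg, used);
}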

B. Network Abstraction Layer

The main purpose of the network abstraction layer is, as the name suggests, to abstract the network protocols exposed to the user, allowing multiple transports to be integrated through a plugin system. A direct consequence of this architecture is the need for a lightweight interface, so that only a reasonable effort is required to implement a new plugin. The interface itself must define three main types of data transfer mechanisms: unexpected messaging, expected messaging, and remote memory access, as well as the additional setup required to dynamically establish a connection between client and server (although a dynamic connection may not always be feasible, depending on the underlying network implementation used).

Unexpected and expected messaging are limited to the transfer of short messages and use a two-sided approach. For performance reasons, the maximum message size is determined by the interconnect and can be as small as a few kilobytes. The concept of unexpected messaging is also used in other communication protocols, such as BMI [10]. Sending an unexpected message through the network abstraction layer does not require a matching receive to be posted before it can complete. With this mechanism the client does not block, and every time the server issues an unexpected receive it can pick up new messages that have been posted. Another difference between expected and unexpected messages is that unexpected messages can arrive from any remote source, while expected messages require the remote source to be known.

The remote memory access (RMA) interface allows remote blocks of memory (both contiguous and non-contiguous) to be accessed. In most one-sided interfaces and RDMA protocols, memory must be registered with the network interface controller (NIC) before it can be used. The purpose of this part of the network abstraction layer is to create a first level of abstraction and define an API compatible with most RMA protocols. Registering a memory segment with the NIC typically results in the creation of a handle for that segment, which contains virtual address information among other things. The local handle needs to be communicated to the remote node before a put or get operation can be initiated, and the network abstraction is responsible for ensuring that these memory handles can be serialized and transferred over the network. Once handles have been exchanged, a non-blocking put or get can be initiated; on most interconnects, these map directly to the put and get operations provided by the interconnect's native API. The network abstraction interface is also designed to allow emulation of one-sided transfers on top of two-sided send/receive protocols (such as TCP/IP, which only supports two-sided messaging). With this network abstraction layer in place, Mercury can easily be ported to support new interconnects, and the relatively limited functionality it requires (e.g., no unbounded-size two-sided messages) ensures near-native performance.
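
To give an idea of how small the surface of such a plugin can be, the sketch below shows one possible shape for a plugin interface in C. The type and function names are ours, not the actual Mercury NA headers, and a real plugin would also have to deal with connection setup, progress, cancellation, and error handling.

/* Hypothetical sketch of a network abstraction plugin vtable; the names below
 * are illustrative and do not correspond to the real Mercury NA headers. */
#include <stddef.h>

typedef struct na_addr na_addr_t;   /* opaque peer address */
typedef struct na_mem  na_mem_t;    /* opaque registered-memory handle */
typedef struct na_req  na_req_t;    /* opaque non-blocking request */

typedef struct na_plugin_ops {
    /* Two-sided short messages (size capped by the interconnect) */
    int (*msg_send_unexpected)(na_addr_t *dest, const void *buf, size_t len,
                               na_req_t **req);
    int (*msg_recv_unexpected)(void *buf, size_t len, na_addr_t **source,
                               na_req_t **req);
    int (*msg_send_expected)(na_addr_t *dest, const void *buf, size_t len,
                             na_req_t **req);
    int (*msg_recv_expected)(na_addr_t *source, void *buf, size_t len,
                             na_req_t **req);

    /* One-sided RMA: register memory, serialize the handle, then put/get */
    int (*mem_register)(void *buf, size_t len, na_mem_t **mem);
    int (*mem_handle_serialize)(const na_mem_t *mem, void *buf, size_t len);
    int (*put)(const na_mem_t *local, na_mem_t *remote, size_t len,
               na_req_t **req);
    int (*get)(na_mem_t *local, const na_mem_t *remote, size_t len,
               na_req_t **req);

    /* Completion of any non-blocking operation */
    int (*wait)(na_req_t *req, unsigned int timeout_ms);
} na_plugin_ops_t;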

C. RPC interface and metadata

Sending a call involving only a small amount of data uses the unexpected/expected messaging defined in III-B. However, at a higher level, sending a function call to a server specifically means that the client must know how to encode the input parameters before it starts sending information, and how to decode the output parameters after receiving the server's response. On the server side, the server must also know what to do when it receives an RPC request, and how to decode and encode the input and output parameters. The framework for describing function calls and encoding/decoding parameters is key to the operation of our interface.

Figure 2: Asynchronous execution flow of an RPC call. The receive buffer is pre-posted, allowing the client to complete other work while the remote call is executing and the response is sent back.

One of the key points is the ability to support a set of function calls that can be sent to the server in a generic way, avoiding the limitation of a hard-coded set of routines. The generic framework is shown in Figure 2. During the initialization phase, the client and server register encoding and decoding functions by using unique function names that are mapped to unique operation IDs shared by both the client and the server. The server also registers the callback that needs to be executed when an operation ID is received through a function call. To send a function call that does not involve bulk data transfer, the client encodes the input parameters together with the operation ID into a buffer and sends it to the server using the non-blocking unexpected messaging protocol. To ensure full asynchrony, the memory buffer used to receive the response from the server is also pre-posted by the client. For reasons of efficiency and resource consumption, the size of these messages is limited (usually to a few kilobytes). However, if the metadata exceeds the size of an unexpected message, the client must transmit it in a separate message, transparently using the bulk data interface described in Section III-D to expose the additional metadata to the server.
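
As a minimal illustration of this flow, a metadata-only call could look like the sketch below, reusing the calls that appear in the full client listing of Section IV (HG_REGISTER, HG_Forward, HG_Wait). The open_in_t and open_out_t structures and the steps marked [...] are placeholders, not actual Mercury definitions.

/* Hypothetical metadata-only RPC: no bulk handle is attached, so the encoded
 * input fits entirely in the small unexpected message. */
hg_id_t open_id;
open_in_t in_struct;     /* placeholder input struct (small parameters only) */
open_out_t out_struct;   /* placeholder output struct */
hg_request_t request;

/* Initialize the interface and look up the server address */
[...]

/* Client and server both register "open" and agree on its operation ID */
open_id = HG_REGISTER("open", open_in_t, open_out_t);

/* Encode the input parameters and send the request; the response buffer is
 * pre-posted, so the call returns immediately */
HG_Forward(server_addr, open_id, &in_struct, &out_struct, &request);

/* Do other work here, then wait for the remote call to complete */
HG_Wait(request, HG_MAX_IDLE_TIME, HG_STATUS_IGNORE);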

When the server receives a new request ID, it looks up the corresponding callback, decodes the input parameters, executes the function call, encodes the output parameters, and starts sending the response back to the client. Sending the response back is also non-blocking, so while receiving new function calls the server can also test the list of pending responses for completion and release the corresponding resources once an operation completes. Once the client knows that the response has been received (using a wait/test call) and that the function call has therefore been executed remotely, it can decode the output parameters and free the resources used for the transmission. With this mechanism in place, it becomes simple to extend it to handle bulk data.

D. Bulk Data Interface

In addition to the previous interface, some function calls may require the transfer of larger amounts of data. For these function calls, the bulk data interface is used; it is built on top of the remote memory access protocol defined in the network abstraction layer. Only the RPC server initiates one-sided transfers, so that it can protect its memory from concurrent accesses while controlling the data flow.

As described in Figure 3, the bulk data transfer interface uses a one-sided communication approach. The RPC client exposes a memory region to the RPC server by creating a bulk data descriptor, which contains the virtual memory address information, the size of the memory region being exposed, and other parameters that may depend on the underlying network implementation. The bulk data descriptor can then be serialized and sent to the RPC server (using the RPC interface defined in Section III-C) along with the parameters of the RPC request. When the server decodes the input parameters, it deserializes the bulk data descriptor and obtains the size of the memory buffer that must be transferred.

In the case of an RPC request that consumes large data parameters, the RPC server may allocate a buffer of the size of the data to be received, expose its local memory region by creating a bulk data block descriptor, and initiate an asynchronous read/get operation on that memory region. The RPC server then waits/tests for completion of the operation and executes the call once the data has been fully received (or only partially received, if the executed call supports that). The response (i.e., the result of the call) is then sent back to the RPC client and the memory handles are released.

Transferring data through this process is transparent to the user, especially since the RPC interface can also take care of serializing and deserializing the memory handle along with the other parameters. This is particularly important when non-contiguous memory segments have to be transferred. In that case, the memory segments are automatically registered on the RPC client and abstracted by the resulting memory handle. The memory handle is then serialized along with the parameters of the RPC call, and transfers involving non-contiguous memory regions follow the same process described above. Note that the handle may in this case be of variable size, since it may contain more information and also depends on the underlying network implementation, which may or may not support direct registration of non-contiguous memory segments.
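
For reference, a minimal sketch of the non-pipelined server-side flow described in this section is given below, using the same calls as the full server listing of Section IV; error handling and the surrounding handler boilerplate are omitted, and the variable names are ours.

/* Inside an RPC handler, after decoding in_struct (sketch, no error checks) */
hg_bulk_t bulk_handle = in_struct.bulk_handle;   /* descriptor sent by client */
hg_bulk_block_t block_handle;
hg_bulk_request_t bulk_request;
size_t nbytes = HG_Bulk_handle_get_size(bulk_handle);
void *buf = malloc(nbytes);

/* Expose the local buffer and pull the whole region from the client */
HG_Bulk_block_handle_create(buf, nbytes, HG_BULK_READWRITE, &block_handle);
HG_Bulk_read(client_addr, bulk_handle, 0, block_handle, 0, nbytes,
        &bulk_request);
HG_Bulk_wait(bulk_request, HG_MAX_IDLE_TIME, HG_STATUS_IGNORE);

/* Data is now available in buf; execute the call, then release resources */
[...]
HG_Bulk_block_handle_free(block_handle);
free(buf);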

IV. Evaluation

The architecture defined previously enables generic RPC calls to be passed along with handles that describe both contiguous and non-contiguous memory regions when bulk data transfers are required. In this section we describe how this architecture can be leveraged to build a pipelining mechanism that requests chunks of data on demand.

A. Pipelining bulk data transfers

Pipelining is a typical use case for overlapping communication and execution. In the architecture described above, a request to process a large amount of data results in an RPC request plus a bulk data transfer from the RPC client to the RPC server. In a common scenario, the server might wait until it has received all the data before executing the requested call. With pipelining, however, processing can start while the data is still being transferred, avoiding the latency cost of the entire RMA transfer. Note that although we focus on this scenario in the examples below, the technique is also particularly useful when the RPC server does not have enough memory to hold all the data that needs to be sent, in which case it must likewise transfer and process the data in pieces.

A simplified version of the RPC client code is shown below:

#define BULK_NX 16
#define BULK_NY 128

int main(int argc, char *argv[])
{
    hg_id_t rpc_id;
    write_in_t in_struct;
    write_out_t out_struct;
    hg_request_t rpc_request;
    int i, ret;
    int buf[BULK_NX][BULK_NY];
    hg_bulk_segment_t segments[BULK_NX];
    hg_bulk_t bulk_handle = HG_BULK_NULL;

    /* Initialize the interface */
    [...]

    /* Register RPC call */
    rpc_id = HG_REGISTER("write", write_in_t, write_out_t);

    /* Provide data layout information */
    for (i = 0; i < BULK_NX; i++) {
        segments[i].address = buf[i];
        segments[i].size = BULK_NY * sizeof(int);
    }

    /* Create bulk handle with segment info */
    HG_Bulk_handle_create_segments(segments, BULK_NX,
            HG_BULK_READ_ONLY, &bulk_handle);

    /* Attach bulk handle to input parameters */
    [...]
    in_struct.bulk_handle = bulk_handle;

    /* Send RPC request */
    HG_Forward(server_addr, rpc_id, &in_struct, &out_struct, &rpc_request);

    /* Wait for RPC completion and response */
    HG_Wait(rpc_request, HG_MAX_IDLE_TIME, HG_STATUS_IGNORE);

    /* Get output parameters */
    [...]
    ret = out_struct.ret;

    /* Free bulk handle */
    HG_Bulk_handle_free(bulk_handle);

    /* Finalize the interface */
    [...]
}

When the client initializes, it registers the RPC call it intends to send. Because this call involves a non-contiguous bulk data transfer, memory segments describing the memory regions are created and registered. The resulting bulk handle is then passed to the HG_Forward call along with the other call parameters. The client can then wait for the response and free the bulk handle once the request has completed (in the future, a notification may also be sent to allow the bulk handle to be freed earlier, so that the memory can be unpinned sooner). The pipelining mechanism itself happens on the server, which takes care of the bulk transfers; the pipeline has a fixed pipeline size and pipeline buffer size. A simplified version of the RPC server code is shown below:

#define PIPELINE_BUFFER_SIZE 256
#define PIPELINE_SIZE 4

int rpc_write(hg_handle_t handle)
{
    write_in_t in_struct;
    write_out_t out_struct;
    hg_bulk_t bulk_handle;
    hg_bulk_block_t bulk_block_handle;
    hg_bulk_request_t bulk_request[PIPELINE_SIZE];
    void *buf;
    size_t nbytes, nbytes_read = 0;
    size_t start_offset = 0;
    int p, ret;

    /* Get input parameters and bulk handle */
    HG_Handler_get_input(handle, &in_struct);
    [...]
    bulk_handle = in_struct.bulk_handle;

    /* Get size of data and allocate buffer */
    nbytes = HG_Bulk_handle_get_size(bulk_handle);
    buf = malloc(nbytes);

    /* Create block handle to read data */
    HG_Bulk_block_handle_create(buf, nbytes, HG_BULK_READWRITE,
            &bulk_block_handle);

    /* Initialize pipeline and start reads */
    for (p = 0; p < PIPELINE_SIZE; p++) {
        size_t offset = p * PIPELINE_BUFFER_SIZE;
        /* Start read of data chunk */
        HG_Bulk_read(client_addr, bulk_handle, offset,
                bulk_block_handle, offset, PIPELINE_BUFFER_SIZE,
                &bulk_request[p]);
    }

    while (nbytes_read != nbytes) {
        for (p = 0; p < PIPELINE_SIZE; p++) {
            size_t offset = start_offset + p * PIPELINE_BUFFER_SIZE;
            /* Wait for data chunk */
            HG_Bulk_wait(bulk_request[p], HG_MAX_IDLE_TIME,
                    HG_STATUS_IGNORE);
            nbytes_read += PIPELINE_BUFFER_SIZE;

            /* Do work (write data chunk) */
            write(buf + offset, PIPELINE_BUFFER_SIZE);

            /* Start another read, PIPELINE_SIZE stages ahead */
            offset += PIPELINE_BUFFER_SIZE * PIPELINE_SIZE;
            if (offset < nbytes) {
                HG_Bulk_read(client_addr, bulk_handle, offset,
                        bulk_block_handle, offset, PIPELINE_BUFFER_SIZE,
                        &bulk_request[p]);
            } else {
                /* Start read with remaining piece */
            }
        }
        start_offset += PIPELINE_BUFFER_SIZE * PIPELINE_SIZE;
    }

    /* Free block handle */
    HG_Bulk_block_handle_free(bulk_block_handle);
    free(buf);

    /* Start sending response back */
    [...]
    out_struct.ret = ret;
    HG_Handler_start_output(handle, &out_struct);
}

int main(int argc, char *argv[])
{
    /* Initialize the interface */
    [...]

    /* Register RPC call */
    HG_HANDLER_REGISTER("write", rpc_write, write_in_t, write_out_t);

    while (!finalized) {
        /* Process RPC requests (non-blocking) */
        HG_Handler_process(0, HG_STATUS_IGNORE);
    }

    /* Finalize the interface */
    [...]
}

Once initialized, each RPC server must loop over an HG_Handler_process call, which waits for new RPC requests and executes the corresponding registered callback (in the same thread or in a new thread, depending on user needs). When a request arrives, the bulk handle parameter is used to obtain the total size of the data to be transferred, so that a buffer of the appropriate size can be allocated and the bulk data transfers started. In this example, the pipeline size is set to 4 and the pipeline buffer size is set to 256, which means that 4 RMA requests of 256 bytes each are started. The server can then wait for the first 256 bytes to arrive and process them; while they are being processed, other pieces may arrive. As soon as one piece has been processed, a new RMA transfer is started for the piece that is PIPELINE_SIZE stages ahead, and the server waits for the next piece, which can then be processed in turn. Note that although the memory region registered on the client is non-contiguous, the HG_Bulk_read call on the server presents it as a contiguous region, simplifying the server code. In addition, logical offsets (relative to the beginning of the data) can be used to move individual pieces of data, with the bulk data interface mapping these contiguous logical offsets onto the client's non-contiguous memory regions. This process continues until all the data has been read and processed, at which point the response (i.e., the result of the function call) can be sent back. Again, sending the response is only started by calling HG_Handler_start_output, and its completion is only tested by calling HG_Handler_process, at which point the resources associated with the response are freed. Note that all these functions are non-blocking and can therefore be used in an event-driven fashion if desired.

B. Network plugins and test environment

As of this writing, two plugins have been developed to illustrate the functionality of the network abstraction layer; neither has been optimized for performance yet. One is built on top of BMI [10]. However, as already pointed out in Section II, BMI does not allow us to make efficient use of the network abstraction layer as defined, since it does not provide one-sided bulk data transfer semantics. The other is built on top of MPI [17], which has only provided full RMA semantics [18] with the recent MPI-3 standard [19]; many MPI implementations, especially those on already installed machines, do not yet provide all MPI-3 functionality. As BMI has not yet been ported to recent HPC systems, in this paper we only consider the MPI plugin to illustrate the functionality and measure performance results. To be able to run on existing HPC systems that are limited to MPI-2 functionality (e.g., Cray systems), this plugin implements bulk data transfers on top of two-sided messaging. In practice, this means that for each bulk data transfer an additional control message must be sent to the client to request sending or receiving data; progress on the transfer can then be made either by a progress thread or by entering progress functions. For the tests, we used two different HPC systems: one is an InfiniBand QDR 4X cluster with MVAPICH [20] 1.8.1, the other a Cray XE6 with Cray MPT [21] 5.6.0.
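
As an illustration of what this emulation involves (this is not Mercury's actual plugin code), the sketch below layers a server-driven get on top of standard MPI-2 two-sided messaging: a small control message asks the client to send the requested region, and the data then arrives through a regular point-to-point receive. Tags, communicator setup, and progress handling are deliberately simplified.

/* Hypothetical emulation of a one-sided "get" over MPI-2 two-sided messaging;
 * an illustration of the concept only, not Mercury's actual MPI plugin. */
#include <mpi.h>

#define TAG_BULK_CTRL 100
#define TAG_BULK_DATA 101

/* Control message describing the remote region the server wants to pull */
typedef struct {
    unsigned long offset;   /* offset into the client's exposed region */
    unsigned long length;   /* number of bytes requested */
} bulk_ctrl_t;

/* Server side: request `length` bytes at `offset` and receive them into buf */
static void emulated_get(MPI_Comm comm, int client_rank,
                         unsigned long offset, unsigned long length, void *buf)
{
    bulk_ctrl_t ctrl = { offset, length };

    MPI_Send(&ctrl, (int) sizeof(ctrl), MPI_BYTE, client_rank, TAG_BULK_CTRL, comm);
    MPI_Recv(buf, (int) length, MPI_BYTE, client_rank, TAG_BULK_DATA, comm,
             MPI_STATUS_IGNORE);
}

/* Client side: answer one control message for a region it has exposed */
static void serve_one_request(MPI_Comm comm, int server_rank, char *region)
{
    bulk_ctrl_t ctrl;

    MPI_Recv(&ctrl, (int) sizeof(ctrl), MPI_BYTE, server_rank, TAG_BULK_CTRL, comm,
             MPI_STATUS_IGNORE);
    MPI_Send(region + ctrl.offset, (int) ctrl.length, MPI_BYTE, server_rank,
             TAG_BULK_DATA, comm);
}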

C. Performance Evaluation


As a first experiment, we measured the time taken by an RPC call (without any bulk data transfer) to an empty function (i.e., a function that returns immediately). On the Cray XE6 machine, the average time over 20 RPC calls was measured; each call took 23 µs. However, as pointed out earlier, most HPC systems are homogeneous and therefore do not need the data portability provided by XDR. When XDR encoding is disabled (performing a simple memory copy instead), the time drops to 20 µs. This non-negligible improvement (15%) demonstrates the benefit of designing an RPC framework specifically for HPC environments. The second experiment consisted of testing the pipelining technique for bulk data transfers explained previously, between one client and one server. As shown in Table I, pipelining transfers on the Cray XE6 is particularly efficient: pieces whose transfer has already completed can be processed while the other pipeline stages are still in progress, allowing us to achieve high bandwidth. However, the high injection bandwidth of this system makes it difficult to obtain good performance for small packets (e.g., the bulk data control messages caused by the emulation of one-sided features), especially when the data flow is not continuous.

Finally, we evaluated the scalability of the RPC server by measuring the total data throughput while increasing the number of clients. Figures 4 and 5 show the results for the QDR InfiniBand system (using MVAPICH) and for the Cray XE6 system, respectively. In both cases, due in part to the server-side bulk data flow-control mechanism, Mercury scales well: throughput increases or remains stable as the number of concurrent clients grows. For comparison, the point-to-point message bandwidth of each system is shown. On the InfiniBand system, Mercury reaches about 70% of the maximum network bandwidth. Considering that the Mercury time represents an RPC call in addition to the data transfer, while the OSU benchmark only measures the time to send a single message, this is a good result. On the Cray system, performance is lower (about 40% of the peak). We expect this to be caused mainly by the poorer small-message performance of this system, as well as by the additional control messages required by the one-sided emulation. However, the lower performance may also be caused by a system limitation, considering that a similar operation (read) on Nessie [22] shows the same low bandwidth even though it bypasses MPI and uses the native uGNI API of the interconnect directly.

V. Conclusion and Future Work

In this paper, we introduced the Mercury framework. Mercury is specifically designed to provide RPC services in high-performance computing environments. Mercury is built on a small, easily portable network abstraction layer that matches the capabilities of contemporary HPC network environments. Unlike most other RPC frameworks, Mercury provides direct support for handling large data arguments to remote calls. Mercury's network protocol is designed to scale to thousands of clients. We demonstrated the power of the framework by implementing remote write functionality, including pipelining of large data arguments, and evaluated our implementation on two different HPC systems, demonstrating both single-client performance and multi-client scalability.

With the availability of the high-performance, portable, general-purpose RPC functionality provided by Mercury, IOFSL can be modernized by replacing its internal, hard-coded RPC code with Mercury calls. Since the network abstraction layer on which Mercury is built already supports BMI, existing deployments of IOFSL can continue to use BMI for network connectivity while taking advantage of the improved scalability and performance of Mercury's network protocol. Future work will include support for cancellation of RPC calls; cancellation is important for resilience in environments where nodes or the network may fail. While Mercury already supports all the functionality required to efficiently execute RPC calls, the amount of user code required for each call can be further reduced. Future versions of Mercury will provide a set of preprocessor macros that reduce the user's effort by automatically generating as much boilerplate code as possible. The network abstraction layer currently has plugins for BMI, MPI-2, and MPI-3. However, given the limitations of MPI RMA functionality when used in a client/server context [23], we intend to add native support for InfiniBand networks, as well as the Cray XT and IBM BG/P and Q networks.

Acknowledgements

The work presented in this paper was supported by the Exascale FastForward project, LLNS subcontract no. B599860, and by the U.S. Department of Energy, Office of Science, Office of Advanced Scientific Computing Research, under contract DE-AC02-06CH11357.

References

Paper link: https://www.mcs.anl.gov/papers/P4082-0613_1.pdf.

[1] A. D. Birrell and B. J. Nelson, “Implementing Remote Procedure Calls,” ACM Trans. Comput. Syst., vol. 2, no. 1, pp. 39–59, Feb. 1984.

[2] Google Inc, “Protocol Buffers,” 2012. [Online]. Available: https://developers.google.com/protocol-buffers.

[3] M. Slee, A. Agarwal, and M. Kwiatkowski, “Thrift: Scalable Cross-Language Services Implementation,” 2007.

[4] K. Seymour, H. Nakada, S. Matsuoka, J. Dongarra, C. Lee, and H. Casanova, “Overview of GridRPC: A Remote Procedure Call API for Grid Computing,” in Grid Computing—GRID 2002, ser. Lecture Notes in Computer Science, M. Parashar, Ed. Springer Berlin Heidelberg, 2002, vol. 2536, pp. 274–278.

[5] Object Management Group, “Common Object Request Broker Architecture (CORBA),” 2012. [Online]. Available: http://www.omg.org/spec/CORBA.

[6] A. Wollrath, R. Riggs, and J. Waldo, “A Distributed Object Model for the Java System,” in Proceedings of the 2nd USENIX Conference on Object-Oriented Technologies (COOTS) - Volume 2, ser. COOTS'96. Berkeley, CA, USA: USENIX Association, 1996, pp. 17–17.

[7] R. Sandberg, D. Goldberg, S. Kleiman, D. Walsh, and B. Lyon, “Innovations in Internetworking,” C. Partridge, Ed. Norwood, MA, USA: Artech House, Inc., 1988, ch. Design and Implementation of the Sun Network Filesystem, pp. 379–390.

[8] Sun Microsystems Inc, “RFC 1014—XDR: External Data Representation Standard,” 1987. [Online]. Available: http://tools.ietf.org/html/rfc1014.

[9] N. Ali, P. Carns, K. Iskra, D. Kimpe, S. Lang, R. Latham, R. Ross, L. Ward, and P. Sadayappan, “Scalable I/O forwarding framework for high-performance computing systems,” in IEEE International Conference on Cluster Computing and Workshops 2009, ser. CLUSTER '09, 2009, pp. 1–10.

[10] P. Carns, W. Ligon III, R. Ross, and P. Wyckoff, “BMI: A Network Abstraction Layer for Parallel I/O,” in 19th IEEE International Parallel and Distributed Processing Symposium, 2005.

[11] P. H. Carns, W. B. Ligon III, R. B. Ross, and R. Thakur, “PVFS: A Parallel File System for Linux Clusters,” in Proceedings of the 4th Annual Linux Showcase and Conference. USENIX Association, 2000, pp. 317–327.

[12] R. Alverson, D. Roweth, and L. Kaplan, “The Gemini System Interconnect,” in IEEE 18th Annual Symposium on High-Performance Interconnects, ser. HOTI, 2010, pp. 83–87.

[13] J. Lofstead, R. Oldfield, T. Kordenbrock, and C. Reiss, “Extending Scalability of Collective IO Through Nessie and Staging,” in Proceedings of the Sixth Workshop on Parallel Data Storage, ser. PDSW '11. New York, NY, USA: ACM, 2011, pp. 7–12.

[14] R. Oldfield, P. Widener, A. Maccabe, L. Ward, and T. Kordenbrock, “Efficient Data-Movement for Lightweight I/O,” in 2006 IEEE International Conference on Cluster Computing, 2006, pp. 1–9.

[15] R. Brightwell, T. Hudson, K. Pedretti, R. Riesen, and K. Underwood, “Implementation and Performance of Portals 3.3 on the Cray XT3,” in 2005 IEEE International Conference on Cluster Computing, 2005, pp. 1–10.

[16] C. Docan, M. Parashar, and S. Klasky, “Enabling High-speed Asynchronous Data Extraction and Transfer Using DART,” Concurr. Comput.: Pract. Exper., vol. 22, no. 9, pp. 1181–1204, Jun. 2010.

[17] W. Gropp, E. Lusk, and R. Thakur, Using MPI-2: Advanced Features of the Message-Passing Interface. Cambridge, MA: MIT Press, 1999.

[18] W. Gropp and R. Thakur, “Revealing the Performance of MPI RMA Implementations,” in Recent Advances in Parallel Virtual Machine and Message Passing Interface, ser. Lecture Notes in Computer Science, F. Cappello, T. Herault, and J. Dongarra, Eds. Springer Berlin / Heidelberg, 2007, vol. 4757, pp. 272–280.

[19] Message Passing Interface Forum, “MPI-3: Extensions to the Message-Passing Interface,” September 2012. [Online]. Available: http://www.mpi-forum.org/docs/docs.html.

[20] The Ohio State University, “MVAPICH: MPI over InfiniBand, 10GigE/iWARP and RoCE.” [Online]. Available: http://mvapich.cse.ohio-state.edu/index.shtml.

[21] H. Pritchard, I. Gorodetsky, and D. Buntinas, “A uGNI-Based MPICH2 Nemesis Network Module for the Cray XE,” in Recent Advances in the Message Passing Interface, ser. Lecture Notes in Computer Science, Y. Cotronis, A. Danalis, D. Nikolopoulos, and J. Dongarra, Eds. Springer Berlin / Heidelberg, 2011, vol. 6960, pp. 110–119.

[22] R. A. Oldfield, T. Kordenbrock, and J. Lofstead, “Developing Integrated Data Services for Cray Systems with a Gemini Interconnect,” Sandia National Laboratories, Tech. Rep., 2012.

[23] J. A. Zounmevo, D. Kimpe, R. Ross, and A. Afsahi, “On the Use of MPI in High-Performance Computing Services,” p. 6, 2013.
