Enhancing fault tolerance in RDMA systems through hybrid protocol models

Santhosh Katragadda *

Independent Researcher, USA.
 
Review Article
Global Journal of Engineering and Technology Advances, 2024, 20(02), 220-231.
Article DOI: 10.30574/gjeta.2024.20.2.0140
 
Publication history: 
Received on 16 June 2024; revised on 15 August 2024; accepted on 17 August 2024
 
Abstract: 
Remote Direct Memory Access (RDMA) has become a critical technology in high-performance computing (HPC), cloud environments, and distributed systems due to its ability to provide low-latency, high-throughput data transfer by bypassing the operating system. However, ensuring fault tolerance in RDMA systems remains a significant challenge, particularly in scenarios involving network failures, node crashes, or memory corruption. This paper proposes a novel approach to enhancing fault tolerance in RDMA systems using hybrid protocol models. By combining the efficiency of RDMA protocols with the reliability mechanisms of traditional transport protocols (e.g., TCP/IP), the proposed hybrid model aims to minimize the performance impact of fault recovery while maintaining system resilience. We explore the design principles behind the hybrid protocol, outline its integration into existing RDMA systems, and present evaluation results comparing its performance and fault tolerance capabilities to existing methods. Our results demonstrate that hybrid protocols can significantly improve fault tolerance in RDMA networks with minimal degradation in performance, making them a promising solution for mission-critical applications requiring both high throughput and high reliability.
 
Keywords: 
RDMA; Fault Tolerance; Hybrid Protocols; High-Performance Computing (HPC); Distributed Systems; Network Fault Recovery; Transport Protocols; Reliability Mechanisms; System Resilience; Error Handling
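
The abstract describes pairing a fast RDMA data path with the reliability mechanisms of a conventional transport such as TCP/IP. The paper's actual protocol design is not reproduced on this page, so the following is only a minimal sketch of one way such a fallback could look. It assumes a hypothetical `HybridTransport` wrapper whose `fast_send` and `reliable_send` callables stand in for an RDMA write and a TCP send; none of these names come from the paper or from any real RDMA library, and the link failure is simulated.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


class FastPathError(Exception):
    """Raised by the simulated fast path when a transfer fails."""


@dataclass
class HybridTransport:
    # Both callables are hypothetical stand-ins, not real RDMA or socket APIs.
    fast_send: Callable[[bytes], None]      # low-latency path, may fail
    reliable_send: Callable[[bytes], None]  # slower path assumed to retransmit
    max_fast_retries: int = 1               # extra fast-path attempts before fallback
    log: List[str] = field(default_factory=list)

    def send(self, payload: bytes) -> str:
        """Try the fast path first; fall back to the reliable path on failure."""
        for _ in range(1 + self.max_fast_retries):
            try:
                self.fast_send(payload)
                self.log.append("fast")
                return "fast"
            except FastPathError:
                continue  # retry the fast path until attempts are exhausted
        self.reliable_send(payload)
        self.log.append("reliable")
        return "reliable"


def make_flaky_fast_path(failures: int, sink: List[Tuple[str, bytes]]):
    """Build a simulated fast path that fails the first `failures` calls."""
    remaining = {"n": failures}

    def fast_send(payload: bytes) -> None:
        if remaining["n"] > 0:
            remaining["n"] -= 1
            raise FastPathError("simulated link failure")
        sink.append(("fast", payload))

    return fast_send


delivered: List[Tuple[str, bytes]] = []
transport = HybridTransport(
    fast_send=make_flaky_fast_path(3, delivered),
    reliable_send=lambda p: delivered.append(("reliable", p)),
)
first = transport.send(b"a")   # both fast attempts fail, so the fallback is used
second = transport.send(b"b")  # one more failure, then the fast path recovers
```

In this sketch the fallback costs latency only when the fast path actually fails, which mirrors the abstract's stated goal of keeping the performance impact of fault recovery small. A real system would also need failure detection, connection re-establishment, and duplicate suppression, which are omitted here.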
 

This paper received the best paper award for Volume 20, Issue 2 (August 2024).
