System and method for race detection and performance optimization in GPU-accelerated applications

ORDINARY APPLICATION

Published

Filed on 8 November 2024

Abstract

The present invention discloses a data race detection and performance optimization system (100) for GPU-accelerated parallel processing applications. The system comprises a processing unit (102) configured to receive, preprocess, and compile source code using a Low-Level Virtual Machine (LLVM) compiler (204) with TSan instrumentation (404). A GPU unit (110), operatively connected to the processing unit (102), executes the instrumented code, monitors memory accesses using a shadow memory module (306), and detects data races dynamically with a TSan detection module (308). Detected data races are communicated to the processing unit (102) for logging and reporting via an error reporting and output module (208). The system is optimized for AMD GPUs through the Radeon Open Compute (ROCm) platform, ensuring efficient execution and compatibility.

Patent Information

Application ID: 202441086106
Invention Field: COMPUTER SCIENCE
Date of Application: 08/11/2024
Publication Number: 46/2024

Inventors

Name: Rupesh Nasre
Address: Department of Computer Science and Engineering, Indian Institute of Technology Madras, Sardar Patel Road, Chennai, Tamil Nadu, India, 600036
Country: India
Nationality: India

Name: Dhruv Maroo
Address: Department of Computer Science and Engineering, Indian Institute of Technology Madras, Sardar Patel Road, Chennai, Tamil Nadu, India, 600036
Country: India
Nationality: India

Applicants

Name: Indian Institute of Technology Madras (IIT Madras)
Address: The Dean, Industrial Consultancy & Sponsored Research (IC&SR), Indian Institute of Technology Madras, Sardar Patel Road, IIT Post, Chennai, Tamil Nadu, India 600036
Country: India
Nationality: India

Specification

Description:

FIELD OF INVENTION
[001] The field of invention generally relates to computing systems and parallel processing architectures. More specifically, it relates to a system and method for race detection and performance optimization in GPU-accelerated applications.

BACKGROUND
[002] The invention pertains to the rapidly advancing field of parallel computing, where Graphics Processing Units (GPUs) are increasingly employed for high-performance applications across diverse sectors, including scientific computing, artificial intelligence, and gaming. GPUs are designed to handle thousands of parallel threads, making them highly suitable for intensive computational tasks. However, with the growth in GPU usage, ensuring efficient and reliable parallel processing has become crucial, as errors in concurrent data access, such as data races, can significantly degrade system performance and reliability.
[003] Existing systems do not provide an efficient, real-time solution for detecting data races on GPUs. Many traditional data race detection tools are designed for Central Processing Units (CPUs) and fail to adapt effectively to the parallel architecture of GPUs. As a result, data races often go undetected in GPU applications, leading to unpredictable behavior, incorrect results, and compromised performance, especially in high-stakes computations.
[004] Other existing systems have tried to address this problem. However, their scope was limited to static analysis or post-execution detection on CPUs, or they provide limited support for specific GPU architectures. Such methods lack the ability to monitor memory accesses dynamically in real time on GPUs, limiting their effectiveness in preventing data races during actual execution. Furthermore, the available GPU-specific tools often introduce significant performance overhead, making them impractical for continuous, real-time detection in complex applications.
[005] CN115599534A discloses a data race detection method, system, cluster, and medium. The application obtains a detection result indicating whether the instructions in a candidate instruction pair generate a data race, or obtains the memory access attribute of an undefined application programming interface (API) and then performs data race detection based on an operand of an instruction in the candidate instruction pair and the memory access attribute. The disadvantage of this application is the complexity of its approach to detecting data races.
[006] Thus, in light of the above discussion, there is a need for a system and method for race detection and performance optimization in GPU-accelerated applications that is reliable and does not suffer from the problems discussed above.

OBJECT OF INVENTION
[007] The principal object of this invention is to provide a system and method for race detection and performance optimization in GPU-accelerated applications.
[008] Another object of the invention is to minimize false positives in data race detection by employing dynamic analysis during runtime, which provides a more accurate understanding of thread interactions compared to traditional static analysis methods.
[009] Another object of the invention is to optimize the execution of data race detection on AMD GPUs, leveraging the Radeon Open Compute (ROCm) platform for efficient integration with AMD's compiler toolchain and runtime environment.
[0010] Another object of the invention is to maintain constant memory overhead during data race detection, ensuring that the system can handle a large number of concurrent threads without excessive memory consumption, thereby preventing memory overload in GPU-accelerated environments.
[0011] Another object of the invention is to utilize a shadow memory mechanism for tracking memory accesses, which enables precise monitoring of memory operations by storing metadata such as thread IDs, access types, and timestamps.
[0012] Another object of the invention is to enable effective communication between the CPU and GPU for data race detection reporting, ensuring that detected data races are logged and reported in real time for timely debugging and optimization by developers.
[0013] Another object of the invention is to integrate with heterogeneous computing interfaces like HIP, enabling the system to be compatible with both AMD and NVIDIA GPU architectures while maintaining its core functionality for data race detection and performance optimization.
[0014] Another object of the invention is to provide a method for compiling and instrumenting code using the LLVM compiler framework, adding specific hooks for data race detection through Thread Sanitizer (TSan), thereby making the system adaptable to various parallel processing workloads.
[0015] Another object of the invention is to enhance overall system performance during GPU execution by managing thread scheduling and memory allocation efficiently through the execution and control module and HIP runtime module in the GPU.
[0016] Another object of the invention is to enable developers to identify and resolve data races more effectively by providing detailed reports on memory access conflicts, which helps in improving the reliability and stability of parallel processing applications.
[0017] Another object of the invention is to leverage dynamic memory management features like XNACK for optimized handling of unified shared memory (USM), enabling dynamic migration of memory between the CPU and GPU for better performance in data-intensive applications.
[0018] Another object of the invention is to provide a scalable solution for high-performance computing environments, ensuring that the data race detection system can handle increasing numbers of threads and complex workloads without significant performance degradation.

BRIEF DESCRIPTION OF FIGURES
[0019] This invention is illustrated in the accompanying drawings, throughout which, like reference letters indicate corresponding parts in the various figures.
[0020] The embodiments herein will be better understood from the following description with reference to the drawings, in which:
[0021] Figure 1 depicts/illustrates a block diagram of a data race detection system, in accordance with an embodiment;
[0022] Figure 2 depicts/illustrates a block diagram of a processing unit of the data race detection system, in accordance with an embodiment;
[0023] Figure 3 depicts/illustrates a block diagram of a GPU unit 110 of the data race detection system, in accordance with an embodiment;
[0024] Figure 4 depicts/illustrates a block diagram of a Low-Level Virtual Machine (LLVM) compiler module 204 of the data race detection system, in accordance with an embodiment;
[0025] Figure 5 depicts/illustrates a process flow of the data race detection system, in accordance with an embodiment;
[0026] Figure 6 illustrates a method for race detection and performance optimization in a GPU-accelerated system, in accordance with an embodiment.

STATEMENT OF INVENTION
[0027] The present invention discloses a race detection and performance optimization system designed for GPU-accelerated parallel processing applications. The system comprises a processing unit configured to receive source code written in a heterogeneous computing interface language, preprocess the code, and compile it using a Low-Level Virtual Machine (LLVM) compiler.
[0028] The compiler module comprises TSan instrumentation that inserts data race detection hooks into the intermediate representation of the code. The system further comprises a GPU unit operatively connected to the processing unit, which executes the instrumented code, monitors memory accesses in real-time using a shadow memory module, and analyzes memory access patterns with a TSan detection module to identify data races dynamically.
[0029] The GPU unit communicates detected data races back to the processing unit for logging and output, enabling developers to address concurrency issues in real time. The system is optimized for use with AMD GPUs through the Radeon Open Compute (ROCm) platform, ensuring compatibility and efficient execution. Additionally, the system maintains constant memory overhead during data race detection, making it suitable for applications with large-scale parallel computations.

DETAILED DESCRIPTION
[0030] The embodiments herein and the various features and advantageous details thereof are explained more fully with reference to the non-limiting embodiments that are illustrated in the accompanying drawings and/or detailed in the following description. Descriptions of well-known components and processing techniques are omitted so as to not unnecessarily obscure the embodiments herein. The examples used herein are intended merely to facilitate an understanding of ways in which the embodiments herein may be practiced and to further enable those of skill in the art to practice the embodiments herein. Accordingly, the examples should not be construed as limiting the scope of the embodiments herein.
[0031] The present invention is a data race detection and performance optimization system designed for GPU-accelerated parallel processing. The system features a processing unit that compiles source code using the LLVM compiler and instruments the code with Thread Sanitizer (TSan) hooks to detect data races. The GPU unit executes the instrumented code while monitoring memory access through a shadow memory module and detecting data races using a TSan detection module. Detected data races are reported to the processing unit for logging, providing real-time insights into concurrency issues. The system is optimized for AMD GPUs using the Radeon Open Compute (ROCm) platform, enabling efficient execution and memory management, and is particularly suited for applications that require high-performance, parallel computing with reliable thread safety.
[0032] Figure 1 depicts/illustrates a data race detection system 100 comprising a processing unit 102, a memory unit 104, an input/output unit 106, a display unit 108, and a GPU unit 110.
[0033] In an embodiment, the processing unit 102 is configured to perform various processing tasks, comprising receiving and preprocessing source code, compiling the code, and instrumenting it with data race detection hooks. The processing unit 102 comprises a HIP processing module that prepares the source code for compilation, an LLVM compiler module that translates the source code into an intermediate representation, and a Thread Sanitizer (TSan) instrumentation submodule that adds hooks for detecting data races during execution. The processing unit 102 interacts with the memory unit 104 to store intermediate data during the preprocessing and compilation stages and communicates with the GPU unit 110 to transfer the compiled code for execution. Alternative configurations of the processing unit 102 may comprise additional processing modules for supporting different programming languages or processing architectures.
[0034] In an embodiment, the memory unit 104 is operatively connected to the processing unit 102 and the GPU unit 110 and is configured to manage the storage of data during the execution of the data race detection process. The memory unit 104 is optimized to maintain a constant memory overhead, enabling efficient memory usage even in scenarios with a large number of concurrent threads. The memory unit 104 comprises at least one of: volatile memory, non-volatile memory, and unified shared memory (USM), among others, to ensure quick access to critical data during real-time data race detection. The memory unit 104 may be implemented using different types of memory technologies such as DRAM, SRAM, or Flash memory, depending on system requirements.
[0035] In an embodiment, the input/output unit 106 enables communication between the data race detection system 100 and external devices, such as user interfaces, storage devices, and network interfaces. The input/output unit 106 is configured to receive inputs such as source code files or configuration parameters from external sources, and it outputs data race detection reports and performance metrics. The input/output unit 106 may comprise at least one of: a USB port, Ethernet port, or wireless communication interface, among others, to support a wide range of input and output methods. In alternative embodiments, the input/output unit 106 may support enhanced communication protocols to facilitate faster data transfer rates between the system and connected devices.
[0036] In an embodiment, the display unit 108 provides a visual representation of the data race detection results and system performance metrics to the user. The display unit 108 enables the user to view real-time updates on detected data races, thread execution details, and performance optimizations made during the execution process. The display unit 108 may comprise at least one of: an LCD screen, an LED display, or a touch screen interface, among others. It is configured to receive graphical data from the processing unit 102 and convert it into a user-readable format. In alternative configurations, the display unit 108 may comprise advanced features like augmented reality interfaces or multi-touch capabilities to enable a more interactive user experience.
[0037] In an embodiment, the GPU unit 110 is operatively connected to the processing unit 102 and is responsible for executing the instrumented code that has been prepared by the processing unit 102. The GPU unit 110 is configured to monitor memory accesses in real-time through a shadow memory module, analyze memory access patterns to detect data races using a TSan detection module and communicate any detected data races to the processing unit 102 for reporting. The GPU unit 110 is optimized for use with AMD GPUs and integrates with the Radeon Open Compute (ROCm) platform, enabling efficient execution and memory management during data race detection. The GPU unit 110 may comprise at least one of: a multi-core GPU, a tensor processing unit, or a graphics processing accelerator, among others. In alternative embodiments, the GPU unit 110 may support different GPU architectures to extend compatibility with a broader range of computing environments.
[0038] Figure 2 depicts/illustrates the processing unit 102 of the data race detection system 100, comprising a HIP processing module 202, an LLVM compiler module 204, a linker module 206, and an error reporting and output module 208.
[0039] In an embodiment, the HIP processing module 202 is configured to preprocess the source code written in a heterogeneous computing interface language, such as HIP, to prepare it for compilation. The HIP processing module 202 enables the conversion of the source code into a format that is compatible with GPU execution while ensuring that the code adheres to the specifications required by the compiler. The HIP processing module 202 handles tasks such as syntax checking, code optimization, and preparing the data structures needed for the compilation process. The HIP processing module 202 may comprise at least one of: a syntax analyzer, a code parser, or an optimization engine, among others, to ensure that the source code is ready for subsequent stages. Alternative embodiments of the HIP processing module 202 may comprise support for other programming interfaces or languages for broader compatibility.
[0040] In an embodiment, the LLVM compiler module 204 is configured to compile the preprocessed source code into an intermediate representation (IR) that can be further optimized and translated into machine code suitable for execution on the GPU unit 110. The LLVM compiler module 204 uses the Low-Level Virtual Machine (LLVM) framework to convert the high-level source code into an IR, which serves as a bridge between the source code and the executable machine code. The LLVM compiler module 204 comprises a frontend submodule, which translates the preprocessed source code into the intermediate representation, and a backend submodule, which optimizes the IR and generates machine code compatible with the GPU unit 110. This modular approach ensures that the LLVM compiler module 204 can adapt to various GPU architectures, providing flexibility in the compilation process.
[0041] In an embodiment, the linker module 206 is responsible for linking the instrumented intermediate representation with the required libraries to generate an executable file that can be run on the GPU unit 110. The linker module 206 handles the integration of both host libraries and device-specific libraries, ensuring that the generated executable is compatible with the runtime environment of the GPU unit 110. The linker module 206 enables the enclosure of necessary runtime libraries, data race detection hooks, and device-specific optimizations into the final executable. The linker module 206 may comprise at least one of: a symbol resolver, a library integrator, or a binary generator, among others, to facilitate the linking process. In alternative configurations, the linker module 206 may support dynamic linking capabilities to enable more efficient memory usage during execution.
[0042] In an embodiment, the error reporting and output module 208 is configured to log and output the results of the data race detection process. It gathers data from the GPU unit 110 regarding detected data races and generates detailed reports for developers to analyze and resolve concurrency issues. The error reporting and output module 208 enables the output of data race logs, performance metrics, and diagnostic information in a user-friendly format, such as text files or graphical reports. The error reporting and output module 208 may comprise at least one of: a data logger, a report generator, or a visualization tool, among others, to ensure that the output is clear and actionable. The error reporting and output module 208 can be customized to support various reporting formats or integrate with external debugging tools for an enhanced analysis experience.
[0043] Figure 3 depicts/illustrates the GPU unit 110 of the data race detection system 100, comprising an execution and control module 302, a HIP runtime module 304, a shadow memory module 306, a TSan detection module 308, and an interconnect and reporting module 310.
[0044] In an embodiment, the execution and control module 302 is configured to manage the scheduling and parallel execution of threads on the GPU unit 110. It controls how the instrumented code, received from the processing unit 102, is executed across multiple cores of the GPU, ensuring that threads are efficiently scheduled to maximize computational throughput. This module enables load balancing among GPU cores and manages synchronization between threads to avoid conflicts. The execution and control module 302 may comprise at least one of: a thread scheduler, a resource allocator, or a synchronization manager, among others, to ensure smooth operation of parallel processing tasks. In alternative embodiments, the execution and control module 302 may support advanced scheduling algorithms tailored to specific workloads for improved performance.
[0045] In an embodiment, the HIP runtime module 304 is responsible for managing runtime operations on the GPU, such as memory allocation, kernel execution, and inter-thread communication. It enables the execution of code written using the Heterogeneous-Compute Interface for Portability (HIP), enabling the GPU unit 110 to run kernels efficiently while managing memory and communication between threads. The HIP runtime module 304 facilitates seamless interaction with the GPU's memory architecture, ensuring that memory is allocated and deallocated as needed during the execution process. This module may comprise at least one of: a memory manager, a kernel dispatcher, or an inter-thread communication handler, among others. In alternative configurations, the HIP runtime module 304 may support runtime features like unified memory access for simplifying memory management between the GPU unit 110 and the processing unit 102.
[0046] In an embodiment, the shadow memory module 306 is configured to monitor memory accesses by creating a shadow copy of the memory space used during execution. This shadow memory tracks metadata associated with each memory access, such as thread IDs, access types (read/write), and timestamps. The shadow memory module 306 enables the detection of concurrent access patterns by maintaining records of how threads interact with shared memory locations. The shadow memory module 306 is essential for identifying potential data races and enables the system to compare memory access histories and detect conflicts. The shadow memory module 306 may comprise at least one of: a metadata tracker, a shadow memory manager, or a conflict detector, among others, to ensure accurate tracking of memory accesses. In alternative embodiments, the shadow memory module 306 may support varying levels of granularity to adapt to different performance requirements.
[0047] In an embodiment, the TSan detection module 308 is configured to analyze memory access patterns captured by the shadow memory module 306 and identify data races dynamically. The TSan detection module 308 uses instrumentation hooks, such as __tsan_read and __tsan_write, to detect instances where multiple threads access the same memory location concurrently without proper synchronization. The TSan detection module 308 provides real-time analysis of thread interactions, enabling the system to flag data races as they occur during execution. The TSan detection module 308 is critical for reducing false positives by focusing on actual runtime behaviour rather than potential issues suggested by static analysis. The TSan detection module 308 may comprise at least one of: a memory access analyzer, a race condition detector, or a thread interaction monitor, among others. Alternative embodiments of the TSan detection module 308 may support customized detection criteria to focus on specific types of data races.
[0048] In an embodiment, the interconnect and reporting module 310 is responsible for facilitating communication between the GPU unit 110 and the processing unit 102, particularly for transmitting detected data race information and performance metrics. The interconnect and reporting module 310 enables the seamless transfer of data race reports generated by the TSan detection module 308 to the error reporting and output module 208 within the processing unit 102. The interconnect and reporting module 310 ensures that the detected data races are reported in real time, enabling immediate analysis and debugging. The interconnect and reporting module 310 may comprise at least one of: a data communication interface, a reporting bus, or a synchronization interface, among others, to support efficient data transfer between the GPU and processing units. In alternative configurations, the interconnect and reporting module 310 may comprise support for high-speed communication protocols to further reduce latency in reporting detected issues.
[0049] Figure 4 depicts/illustrates the LLVM compiler module 204 of the data race detection system 100, comprising an LLVM frontend submodule 402, a TSan instrumentation submodule 404, and an LLVM backend submodule 406.
[0050] In an embodiment, the LLVM frontend submodule 402 is configured to translate the preprocessed source code into an intermediate representation (IR). This submodule serves as the initial stage in the compilation process, converting high-level source code written in a heterogeneous computing interface language into a form that is suitable for further analysis and optimization by the compiler. The frontend submodule 402 comprises components such as a lexer, parser, and syntax tree generator to convert the source code into IR. It enables platform-agnostic translation of source code, making it suitable for execution on various GPU architectures. Alternative embodiments of the LLVM frontend submodule 402 may comprise support for different source languages and enhanced parsing capabilities for greater compatibility.
[0051] In an embodiment, the TSan instrumentation submodule 404 is configured to insert Thread Sanitizer (TSan) hooks into the intermediate representation to enable dynamic data race detection. This submodule adds special instrumentation hooks, such as __tsan_read and __tsan_write, which monitor memory accesses during program execution. These hooks track read and write operations to shared memory locations, facilitating the detection of race conditions when multiple threads access the same memory without proper synchronization. The TSan instrumentation submodule 404 enhances the IR by embedding these hooks, enabling the GPU unit 110 to perform real-time data race detection during execution. It may comprise at least one of: a memory access instrumenter, a hook injector, or a conflict detection enhancer, among others. In alternative configurations, the TSan instrumentation submodule 404 may support different levels of instrumentation granularity to balance detection accuracy and performance.
[0052] In an embodiment, the LLVM backend submodule 406 is responsible for optimizing the intermediate representation and generating machine code that is compatible with the GPU unit 110. After the IR is instrumented by the TSan submodule 404, the backend submodule 406 performs various optimizations, such as loop unrolling, constant propagation, and inlining, to enhance the performance of the final executable. This submodule translates the optimized IR into machine code that can be executed directly by the GPU hardware. The LLVM backend submodule 406 ensures that the generated code is tailored to the target GPU architecture, enabling efficient execution. It may comprise at least one of: an optimization engine, a code generator, or an architecture-specific translator, among others. In alternative embodiments, the LLVM backend submodule 406 may support specialized optimizations for specific GPU models or applications to further improve execution performance.
[0053] Figure 5 depicts/illustrates a process flow of the data race detection system 100 using the LLVM toolchain for compiling and instrumenting code, followed by the execution on a GPU.
[0054] The process begins with the HIP source code, which defines the parallel computation tasks intended for execution on the GPU unit 110. The code is written using a heterogeneous computing interface language, compatible with AMD and NVIDIA GPUs, to leverage GPU acceleration for high-performance computing tasks.
[0055] The preprocessor prepares the HIP source code for compilation by handling tasks like macro expansion and syntax verification. HIP source code may comprise essential headers like hip/hip_runtime.h, which are needed for running GPU kernels and interacting with the HIP runtime environment. The preprocessed source code is then passed to the LLVM compiler module 204.
[0056] The LLVM compiler module 204 is central to the compilation process, comprising an LLVM frontend submodule 402, a TSan instrumentation submodule 404, and an LLVM backend submodule 406. The LLVM frontend submodule 402 translates the preprocessed source code into an intermediate representation (IR), a lower-level abstraction of the code that is suitable for further analysis and optimization. The TSan instrumentation submodule 404 adds hooks to the IR, such as __tsan_read and __tsan_write, enabling the system to track memory accesses during runtime; these hooks are critical for detecting data races by monitoring thread interactions with shared memory. The LLVM backend submodule 406 optimizes the IR and generates machine code tailored for execution on the GPU unit 110, applying optimizations such as loop unrolling and instruction scheduling to improve execution performance.
[0057] The linker module 206 combines the instrumented code with the required libraries to create an executable. The linker module 206 integrates: HIP libraries, which provide runtime support for executing HIP code on the GPU; AMD device libraries, which comprise runtime components such as HIP TSan hooks, are specific to AMD GPUs, and ensure compatibility with the ROCm platform; and host libraries, which comprise TSan and ASan host hooks that support interaction between the host CPU and the GPU during execution of the instrumented code.
[0058] The system generates an executable that contains the compiled and instrumented code. This executable is then deployed onto the GPU unit 110, where it is executed under the control of the execution and control module 302. During execution, the shadow memory module 306 records memory accesses, using shadow words to maintain a history of read and write operations across different threads.
[0059] The TSan detection module 308 analyzes the memory access patterns recorded by the shadow memory module 306 in real time. It detects potential data races by comparing concurrent memory accesses that occur without proper synchronization. If two threads access the same memory location concurrently (where at least one access is a write), the TSan detection module 308 flags this as a data race.
[0060] Detected data races are communicated back to the processing unit 102 via the interconnect and reporting module 310. The error reporting and output module 208 within the processing unit 102 logs these data races and generates detailed reports, providing information on the memory addresses, access types, thread IDs, and timestamps involved in each detected race. The reports are formatted for easy analysis, enabling developers to identify and resolve concurrency issues in their code.
[0061] Alongside data race detection, the GPU unit 110 continues to produce regular program outputs as defined by the HIP source code. These outputs are provided to the user, along with detailed data race detection reports, to offer insights into the performance and reliability of the parallel processing application.
[0062] Figure 6 illustrates a method 600 for race detection and performance optimization in a GPU-accelerated system. The method begins with receiving, by a processing unit, source code written in a heterogeneous computing interface language, as depicted at step 602. Subsequently, the method 600 discloses preprocessing, by the processing unit, the source code using a HIP processing module configured to prepare the source code for compilation, as depicted at step 604. Thereafter, the method 600 discloses compiling, by the processing unit, the preprocessed source code into an intermediate representation using a Low-Level Virtual Machine (LLVM) compiler module, as depicted at step 606. Thereafter, the method 600 discloses instrumenting, by the processing unit, the intermediate representation with data race detection hooks using a Thread Sanitizer (TSan) instrumentation submodule, as depicted at step 608. Thereafter, the method 600 discloses linking, by the processing unit, the instrumented code with required libraries and generating an executable using a linker module, as depicted at step 610. Thereafter, the method 600 discloses executing, by a GPU unit, the executable generated by the processing unit, as depicted at step 612. Thereafter, the method 600 discloses monitoring, by the GPU unit, memory accesses in real-time through a shadow memory module to detect concurrent access conflicts during dynamic execution, as depicted at step 614. Thereafter, the method 600 discloses analyzing, by the GPU unit, memory access patterns to dynamically detect data races using a TSan detection module, as depicted at step 616. Thereafter, the method 600 discloses communicating, by the GPU unit, detected data races to the processing unit for reporting, as depicted at step 618.
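For illustration, the data flow of steps 606 through 616 can be compressed into one toy pipeline: source text becomes IR-like operations, instrumentation wraps them in hooks, and "execution" scans the resulting trace for cross-thread write conflicts. Every function and encoding here is a hypothetical stand-in for the real compile/instrument/execute stages:

```python
# Toy end-to-end model of method 600 (steps 606-616); all names hypothetical.

def compile_to_ir(source):
    # step 606: pretend each source line is one memory operation:
    # "<op> <thread> <address>"
    return [tuple(line.split()) for line in source.strip().splitlines()]

def instrument_ir(ir):
    # step 608: replace each access with its race-detection hook
    hook = {"load": "__tsan_read", "store": "__tsan_write"}
    return [(hook[k], t, a) for k, t, a in ir]

def execute_and_detect(instrumented):
    # steps 612-616: flag same-address, cross-thread pairs with a write
    races = []
    for i, (h1, t1, a1) in enumerate(instrumented):
        for h2, t2, a2 in instrumented[i + 1:]:
            if a1 == a2 and t1 != t2 and "__tsan_write" in (h1, h2):
                races.append(a1)
    return races

source = """load t0 0x100
store t1 0x100"""
detected = execute_and_detect(instrument_ir(compile_to_ir(source)))
print(detected)  # ['0x100']
```

The quadratic pairwise scan is only for readability; the shadow-memory scheme of paragraphs [0058]–[0059] exists precisely to avoid it.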
[0063] The advantages of the current invention include:
[0064] Real-Time Data Race Detection: The invention provides real-time detection of data races during the execution of GPU-accelerated programs, enabling developers to identify and address concurrency issues as they occur. This ensures that potential errors are caught and mitigated during the actual runtime, leading to more stable and reliable applications.
[0065] Dynamic Analysis for Reduced False Positives: Unlike traditional static analysis methods that can produce many false positives, the system uses dynamic analysis to monitor memory accesses as they happen. This approach results in more accurate data race detection, focusing only on issues that manifest during execution, thus reducing unnecessary debugging efforts.
[0066] Optimized for AMD GPUs: The system is specifically optimized for AMD GPUs using the Radeon Open Compute (ROCm) platform, enabling seamless integration with AMD's toolchains and runtime libraries. This results in improved performance and compatibility when executing GPU-accelerated tasks on AMD hardware, offering a tailored solution for AMD's ecosystem.
[0067] Cross-Platform Compatibility: By utilizing HIP (Heterogeneous-Compute Interface for Portability), the system enables developers to write code that can be compiled for both AMD and NVIDIA GPUs. This cross-platform capability ensures that developers can use the same codebase across different GPU architectures, simplifying code maintenance and improving flexibility.
[0068] Efficient Memory Management: The memory unit of the system is designed to maintain a constant memory overhead, even as the number of concurrent threads increases. This ensures that the system can handle large-scale parallel computations without facing memory overload, making it ideal for applications that require intensive memory operations.
[0069] Advanced Thread Sanitizer (TSan) Instrumentation: The system leverages TSan instrumentation to monitor memory access patterns, providing a detailed view of read and write operations across different threads. This advanced instrumentation enables precise detection of data races and helps developers gain deeper insights into potential synchronization issues in their code.
[0070] Detailed Reporting and Logging: The error reporting and output module generates comprehensive reports that detail the detected data races, comprising memory addresses, thread IDs, and timestamps. This makes it easier for developers to pinpoint the exact location and nature of concurrency issues, facilitating faster debugging and optimization processes.
[0071] Scalability for High-Performance Computing: The system is designed to scale with the needs of high-performance computing (HPC) environments, where large numbers of threads and complex workloads are common. Its ability to maintain efficient performance even with increased thread counts makes it suitable for use in scientific computing, machine learning, and other data-intensive fields.
[0072] Enhanced Performance through LLVM Optimizations: The use of the LLVM compiler module enables the system to apply various code optimizations during the compilation process. This ensures that the generated machine code is highly efficient, resulting in faster execution times on the GPU unit and improved overall system performance.
[0073] Seamless Integration with Existing Development Workflows: The system is designed to integrate smoothly into existing GPU development workflows, enabling developers to use familiar tools like the LLVM toolchain, HIP, and ROCm. This minimizes the learning curve and enables developers to adopt the system without disrupting their existing practices.
[0074] High Adaptability with Multiple GPU Architectures: While optimized for AMD GPUs, the system's design enables it to be adapted for use with other GPU architectures. This adaptability makes it a versatile solution for developers working in heterogeneous computing environments, ensuring that the system can be tailored to meet various hardware and software requirements.
[0075] Minimized Overhead for Data Race Detection: The shadow memory module and TSan detection module are designed to monitor memory accesses with minimal performance overhead, ensuring that the data race detection process does not significantly slow down program execution. This is crucial for maintaining high performance in real-time and latency-sensitive applications.
[0076] Applications of the current invention include:
[0077] High-Performance Computing (HPC): The invention is well-suited for HPC environments that involve large-scale computations and require efficient handling of multiple concurrent threads. It can be used in fields like weather modelling, climate simulations, and molecular dynamics, where data races and synchronization issues can affect the accuracy and reliability of simulations.
[0078] Machine Learning and Deep Learning: With its ability to monitor and detect data races in real-time, the system is ideal for training and deploying neural networks on GPU-accelerated platforms. It ensures that parallel processing during model training is free from concurrency issues, thus improving the stability and accuracy of machine learning models in image recognition, natural language processing, and autonomous driving applications.
[0079] Scientific Research and Simulations: The system is valuable in scientific research that relies on parallel computing for simulations, such as particle physics, quantum mechanics, and astrophysics. By detecting data races dynamically, the system helps researchers ensure the correctness of their simulations and models, which often involve a significant number of parallel computations.
[0080] Graphics Rendering and Animation: The invention can be used in graphics rendering, 3D animation, and visual effects where GPUs are employed to perform complex computations in parallel. The invention ensures that threads accessing shared memory locations during rendering processes do not create data races, leading to smoother and more accurate rendering of graphics and animations.
[0081] Real-Time Data Analysis and Processing: In applications such as real-time financial analysis, high-frequency trading, and sensor data processing, where speed and accuracy are critical, the system helps to maintain thread safety and data integrity. It can be used to ensure that concurrent operations on incoming data streams do not result in data races, thus maintaining the reliability of real-time analytics.
[0082] Autonomous Systems and Robotics: The system is applicable in robotics and autonomous systems where parallel processing is used for sensor fusion, path planning, and object recognition. By providing real-time detection of data races, the invention ensures that these systems can operate safely without synchronization issues, making it valuable for applications like self-driving cars and drones.
[0083] Video Processing and Encoding: In video encoding, compression, and real-time streaming, the system can help optimize GPU-based encoding pipelines by detecting concurrency issues that might otherwise affect the quality of the output. This makes it useful for applications in broadcasting, live streaming, and video-on-demand services, where maintaining performance is crucial.
[0084] Augmented Reality (AR) and Virtual Reality (VR): The system can be used in AR and VR applications, where low latency and real-time rendering are essential for a smooth user experience. It ensures that parallel computations related to rendering and tracking are free from data races, contributing to seamless interactions in virtual environments.
[0085] Telecommunications and Network Simulation: In telecommunications and network simulation, where parallel simulations are used to model network traffic, protocols, and infrastructure, the system helps maintain thread safety. It can be employed in 5G simulations, network performance analysis, and signal processing, ensuring that concurrent access to shared data does not result in race conditions.
[0086] Game Development: The invention is applicable in game development, where GPU acceleration is used to manage physics engines, AI behaviours, and real-time graphics. It helps game developers detect and fix data races in multithreaded game logic, thus improving game stability and user experience.
[0087] Healthcare and Medical Imaging: In medical imaging and computational biology, where parallel computing is used for processing MRI scans, CT images, and genomic data, the system ensures accurate data analysis by preventing race conditions during data processing. This is crucial for generating precise diagnostic results and speeding up analysis processes.
[0088] Cloud Computing and Data Centers: The system is highly relevant in cloud computing environments and data centres that use GPU clusters to process large datasets in parallel. It can be used to monitor and optimize GPU-based workloads, ensuring that distributed computations are free from synchronization issues, which is critical for maintaining uptime and performance in cloud services.
[0089] Artificial Intelligence (AI) Framework Development: The invention is beneficial for developers creating AI frameworks that support GPU acceleration, such as TensorFlow, PyTorch, and Keras. It helps to ensure that the underlying parallel operations are free from data races, thus enhancing the stability and performance of these frameworks when running on multi-GPU setups.
[0090] Compiler Development and Toolchains: The system can also be applied in the development of compilers and toolchains that target parallel computing environments. By integrating data race detection into the compilation process, the system helps compiler developers ensure that the code generated for GPUs is optimized and free from concurrency errors.
[0091] The foregoing description of the specific embodiments will so fully reveal the general nature of the embodiments herein that others can, by applying current knowledge, readily modify and/or adapt for various applications such specific embodiments without departing from the generic concept, and, therefore, such adaptations and modifications should and are intended to be comprehended within the range of equivalents of the disclosed embodiments. It is to be understood that the phraseology or terminology employed herein is for the purpose of description and not of limitation. Therefore, while the embodiments herein have been described in terms of preferred embodiments, those skilled in the art will recognize that the embodiments herein can be practiced with modification within the scope of the embodiments as described here.
Claims:
We claim:
1. A race detection and performance optimization system (100), comprising:
a processing unit (102) configured to:
receive source code written in a heterogeneous computing interface language;
preprocess the source code to prepare the preprocessed source code for compilation by using a HIP processing module (202);
compile the preprocessed source code into an intermediate representation using a Low-Level Virtual Machine (LLVM) compiler module (204);
instrument the intermediate representation with data race detection hooks using a Thread Sanitizer (TSan) instrumentation submodule (404), to create instrumented code;
link the instrumented code with one or more libraries and generate an executable using a linker module (206);
a GPU unit (110) operatively connected to the processing unit (102), the GPU unit (110) being configured to:
execute the executable generated by the processing unit (102);
monitor memory accesses in real-time through a shadow memory module (306) configured to detect concurrent access conflicts during dynamic execution at run-time;
analyze memory access patterns to dynamically detect data races using a TSan detection module (308); and
communicate detected data races to the processing unit (102) for reporting; and
a memory unit (104) operatively connected to the processing unit (102) and the GPU unit (110), wherein the memory unit (104) is configured to optimize memory usage during dynamic execution of data race detection by maintaining a constant memory overhead, for enabling better utilization of available memory resources and preventing degraded performance from data race detection in memory-constrained environments.

2. The system as claimed in claim 1, wherein the processing unit (102) comprises an error reporting and output module (208) configured to log and output data race detection reports received from the GPU unit (110).

3. The system as claimed in claim 1, wherein the compiler module (204) comprises:
a frontend submodule (402) configured to translate the preprocessed source code into an intermediate representation; and
a backend submodule (406) configured to optimize the intermediate representation and generate machine code compatible with the GPU unit (110).

4. The system as claimed in claim 1, wherein the shadow memory module (306) in the GPU unit (110) is configured to store metadata associated with each memory access, comprising a thread ID, access type, and timestamp, to facilitate dynamic, real-time detection of data races through comparison of concurrent access patterns.

5. The system as claimed in claim 1, wherein the GPU unit (110) comprises:
an execution and control module (302) configured to manage the scheduling and parallel execution of threads on the GPU unit (110); and
a HIP runtime module (304) configured to handle runtime memory allocation, kernel execution, and inter-thread communication.

6. The system as claimed in claim 1, wherein the linker module (206) is configured to link the instrumented code with both host libraries and device-specific libraries required for execution of the GPU unit (110).

7. The system as claimed in claim 1, wherein the TSan detection module (308) identifies data races by comparing memory access patterns recorded in real time in the shadow memory module (306) and flags concurrent access conflicts occurring without synchronization.

8. The system as claimed in claim 1, wherein the GPU unit (110) comprises an interconnect and reporting module (310) configured to facilitate communication between the GPU unit (110) and the processing unit (102) for real-time reporting of data race detections.

9. The system as claimed in claim 1, wherein the TSan instrumentation submodule (404) within the processing unit (102) inserts hooks that track read and write operations to shared memory locations, enabling runtime detection of data races during GPU execution.

10. The system as claimed in claim 1, wherein the GPU unit (110) is optimized for Advanced Micro Devices (AMD) GPUs using a Radeon Open Compute (ROCm) platform, enabling integration with AMD's compiler toolchain for efficient, real-time data race detection and memory management during parallel execution.


11. The system as claimed in claim 1, wherein the TSan detection module (308) is configured to perform dynamic data race detection by monitoring thread interactions and memory access patterns during execution of the instrumented code on the GPU unit (110), for providing real-time analysis and minimizing false positives associated with static analysis methods.

12. A method for race detection and performance optimization in a GPU-accelerated system (100), comprising:
receiving, by a processing unit (102), source code written in a heterogeneous computing interface language;
preprocessing, by the processing unit (102), the source code using a HIP processing module (202) configured to prepare the source code for compilation;
compiling, by the processing unit (102), the preprocessed source code into an intermediate representation using a Low-Level Virtual Machine (LLVM) compiler module (204);
instrumenting, by the processing unit (102), the intermediate representation with data race detection hooks using a Thread Sanitizer (TSan) instrumentation submodule (404);
linking, by the processing unit (102), the instrumented code with required libraries and generating an executable using a linker module (206);
executing, by a GPU unit (110), the executable generated by the processing unit (102);
monitoring, by the GPU unit (110), memory accesses in real-time through a shadow memory module (306) to detect concurrent access conflicts during dynamic execution;
analyzing, by the GPU unit (110), memory access patterns to dynamically detect data races using a TSan detection module (308); and
communicating, by the GPU unit (110), detected data races to the processing unit (102) for reporting.

13. The method as claimed in claim 12, comprising logging and outputting data race detection reports received from the GPU unit (110) by using an error reporting and output module (208) within the processing unit (102).

14. The method as claimed in claim 12, wherein compiling the preprocessed source code into an intermediate representation comprises:
translating the preprocessed source code into an intermediate representation using a frontend submodule (402) of the compiler module (204); and
optimizing the intermediate representation and generating machine code compatible with the GPU unit (110) using a backend submodule (406) of the compiler module (204).

15. The method as claimed in claim 12, wherein monitoring memory accesses comprises:
storing metadata associated with each memory access in the shadow memory module (306), the metadata comprising thread ID, access type, and timestamp, to facilitate dynamic real-time data race detection.

16. The method as claimed in claim 12, comprising:
managing, by an execution and control module (302) in the GPU unit (110), the scheduling and parallel execution of threads on the GPU; and
handling runtime memory allocation, kernel execution, and inter-thread communication using a HIP runtime module (304) in the GPU unit (110).

17. The method as claimed in claim 12, wherein linking the instrumented code comprises:
linking the instrumented code with host libraries and device-specific libraries required for GPU execution using the linker module (206).

18. The method as claimed in claim 12, wherein analyzing memory access patterns to detect data races comprises:
identifying data races by comparing memory access patterns recorded in the shadow memory module (306) and flagging concurrent access conflicts without synchronization.

19. The method as claimed in claim 12, comprising facilitating real-time communication between the GPU unit (110) and the processing unit (102) by using an interconnect and reporting module (310) in the GPU unit (110) to report data race detections.

20. The method as claimed in claim 12, comprising instrumenting the intermediate representation with data race detection hooks by inserting hooks that track read and write operations to shared memory locations, enabling runtime detection of data races during GPU execution.



Date: 07th November, 2024 Signature:
Name of signatory: Nishant Kewalramani
(Patent Agent) IN/PA number: 1420
