HARDWARE-AWARE WIDTH AND DEPTH SHRINKING WITH CONVOLUTIONAL AND FULLY CONNECTED LAYER MERGING
ORDINARY APPLICATION
Published
Filed on 26 October 2024
Abstract
[0088] The present invention discloses a method and system for hardware-aware width and depth shrinking with convolutional and fully connected layer merging. The method includes receiving (302) a trained neural network having a plurality of Convolutional (CONV) layers and Fully Connected (FC) layers, wherein each CONV layer includes a plurality of filters, and determining (304) a set of retainable filters from the plurality of filters to retain in a specific layer through empirical evaluation, wherein the set of retainable filters satisfies a predetermined resource constraint. The method (300) includes retaining (306) the set of retainable filters and pruning a set of remaining filters, and merging (308) two consecutive CONV layers from the plurality of CONV layers using digital arithmetic techniques. The method (300) further includes removing (310) activation and dropouts between the first and last fully connected layers and merging (312) the fully connected layers using matrix algebra. To be published with Figure 1.
Patent Information
Field | Value
---|---
Application ID | 202441081768
Invention Field | COMPUTER SCIENCE
Date of Application | 26/10/2024
Publication Number | 44/2024
Inventors
Name | Address | Country | Nationality |
---|---|---|---|
PRATIBHA VERMA | IIT Hyderabad Road, Near NH-65, Sangareddy, Kandi, Telangana 502284, India. | India | India |
TARUN GUPTA | IIT Hyderabad Road, Near NH-65, Sangareddy, Kandi, Telangana 502284, India. | India | India |
PABITRA DAS | IIT Hyderabad Road, Near NH-65, Sangareddy, Kandi, Telangana 502284, India. | India | India |
AMIT ACHARYYA | IIT Hyderabad Road, Near NH-65, Sangareddy, Kandi, Telangana 502284, India. | India | India |
Applicants
Name | Address | Country | Nationality |
---|---|---|---|
INDIAN INSTITUTE OF TECHNOLOGY HYDERABAD | IIT Hyderabad Road, Near NH-65, Sangareddy, Kandi, Telangana 502284, India. | India | India |
Specification
Description:
TECHNICAL FIELD
[0001] The present invention relates to optimization techniques for neural networks, with a particular focus on Convolutional Neural Networks (CNNs), aimed at improving their compatibility and efficiency for deployment on specific hardware platforms. More specifically, the invention relates to hardware-aware width and depth shrinking with convolutional and fully connected layer merging.
BACKGROUND
[0002] The background description includes information that may be useful in understanding the present subject matter. It is not an admission that any of the information provided herein is prior art or relevant to the presently claimed invention, or that any publication specifically or implicitly referenced is prior art.
[0003] Deploying deep Convolutional Neural Networks (CNNs) on resource-constrained edge devices presents several significant challenges due to limitations in power, memory, and processing capabilities. CNNs are highly effective for various tasks, but their architecture inherently demands substantial memory and computational resources. Understanding these requirements is crucial for deploying CNNs effectively, especially in resource-constrained environments.
[0004] Convolutional Neural Networks (CNNs) have large memory requirements and computation demands. Deploying them on edge devices is challenging because of limited resources and battery life. Most state-of-the-art compression techniques for CNNs ignore hardware-specific constraints, which leads to ineffective memory management and hampers real-time performance. Moreover, such techniques often ignore the operational conditions and unique architectures of edge devices, resulting in increased energy consumption and latency. This highlights an urgent requirement for compression mechanisms that are both algorithmically effective and hardware-aware.
[0005] Convolutional Neural Networks (CNNs) are employed widely across domains such as speech recognition, image recognition, medical diagnostics, defect detection, and metrology. Their efficacy in feature recognition and pattern matching exceeds that of conventional methods owing to their deep learning capabilities. However, CNNs demand substantial resources for intricate computations. To overcome this inherent complexity, considerable efforts have been made to mitigate the computational burden, especially for edge devices and mobile applications, as highlighted in recent studies.
[0006] Reducing CNN models can be accomplished through a variety of techniques, including model compression, quantization, weight pruning, and the fusion of convolutional layers. The concept of convolutional layer fusion demonstrates how data is efficiently managed on the chip. CNN fusion designs often employ a pyramid-like structure with sliding layers, optimizing data storage and evaluation efficiency, as observed in prior work. One approach discussed in the literature utilizes the Multimodal Transfer Module for CNN fusion. However, these algorithmic compression techniques are primarily implemented at the software level, leading to variable performance outcomes across different hardware platforms. Researchers have established that the performance of a neural network can be affected by platform-specific factors that are difficult to mitigate by algorithmic compression alone. Hence, recent research has focused on integrating hardware awareness into compression techniques. Platform-aware algorithms consider model size and inference speed (e.g., latency) in their optimization processes alongside pruning techniques. This approach aims to generate concise and efficient CNNs while maintaining accuracy, circumventing the platform-dependent limitations in the deployment of neural networks.
[0007] One of the existing solutions addresses the compression of deep neural networks and discloses methods to optimize the implementation of Convolutional Neural Networks on devices with limited memory and computational resources. It focuses on decreasing the size and complexity of the neural network while maintaining accuracy.
[0008] Another state-of-the-art solution addresses neural network compression and discloses techniques for optimizing deep learning models, including quantization and pruning, with an emphasis on enhancing performance on edge devices.
[0009] Another state-of-the-art solution, titled "Compression In Machine Learning And Deep Learning Processing", describes an embodiment of an apparatus for compression of untyped data that includes a graphics processing unit (GPU) with a data compression pipeline. The data compression pipeline includes a data port coupled with one or more shader cores, wherein the data port allows transfer of untyped data without format conversion, and a 3D compression/decompression unit that provides compression of untyped data to be stored to a memory subsystem and decompression of untyped data from the memory subsystem.
[0010] The literature titled "MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications" by Andrew G. Howard et al. introduces MobileNets, a class of CNNs specifically designed for mobile and embedded vision applications. It discusses techniques such as depthwise separable convolutions, which significantly reduce computational cost and model size.
[0011] The publication "EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks" by Mingxing Tan and Quoc V. Le presents EfficientNet, a method for scaling CNNs efficiently by balancing network depth, width, and resolution. It explores the trade-offs between accuracy and resource consumption and offers insights into optimizing networks for deployment on edge devices.
[0012] Many existing methods, such as pruning and quantization, do not incorporate hardware awareness into the optimization process, which can result in suboptimal performance during edge deployment. Techniques like Neural Architecture Search (NAS) often require substantial computational resources and are challenging to implement, making them less practical for edge devices. Additionally, many current optimization solutions fail to adequately address energy consumption, which is a critical concern for battery-powered edge devices.
[0013] Therefore, there is a need for a solution that addresses the above-mentioned drawbacks.
OBJECTS OF THE INVENTION
[0014] The objectives are provided to introduce a selection of concepts in a simplified form that are further described below in the detailed description. These objectives are not intended to identify key features or essential features of the claimed subject matter, nor are they intended to be used as an aid in determining the scope of the claimed subject matter.
[0015] It is one of the primary objectives of the present invention to develop an efficient, hardware-aware optimization framework for deploying deep CNNs on resource-constrained edge platforms.
[0016] It is another objective of the present invention to maintain a trade-off between model complexity, size, and accuracy while performing complex tasks such as image recognition.
[0017] It is another objective of the present invention to provide lower energy consumption and longer battery life on battery-powered edge devices by reducing computational load and eliminating redundant operations.
[0018] It is another objective of the present invention to create a versatile framework which can adapt CNN models to different types of hardware, including CPUs, GPUs, FPGAs, and specialized AI accelerators.
[0019] It is another objective of the present invention to optimize each model according to the specific characteristics and constraints of the target platform.
[0020] These and other objects and advantages of the present subject matter will be apparent to a person skilled in the art upon consideration of the following detailed description, taken in conjunction with the accompanying drawings in which preferred embodiments of the present subject matter are illustrated.
SUMMARY
[0021] Solutions to one or more drawbacks of existing technology, together with additional advantages, are provided through the present disclosure. Additional features and advantages are realized through the techniques of the present disclosure. Other embodiments and aspects of the disclosure are described in detail herein and are considered to be a part of the claimed disclosure.
[0022] According to an embodiment, the present invention discloses a method for hardware-aware width and depth shrinking with convolutional and fully connected layer merging, executed by one or more processors. The method includes providing a trained neural network. The trained neural network includes a plurality of Convolutional (CONV) layers and a plurality of Fully Connected (FC) layers, wherein each CONV layer of the plurality of CONV layers and each FC layer of the plurality of FC layers includes a plurality of filters. The method further includes: determining a set of retainable filters from the plurality of filters to retain in a specific layer of at least one of the plurality of CONV layers or the plurality of FC layers through empirical evaluation, wherein the set of retainable filters satisfies a predetermined resource constraint; retaining the set of retainable filters and pruning a set of remaining filters from the plurality of filters of the plurality of CONV layers and the plurality of FC layers; merging two consecutive CONV layers from the plurality of CONV layers using digital arithmetic techniques; removing activation and dropouts between the first and last fully connected layers; and merging the fully connected layers using matrix algebra.
[0023] In an embodiment, determining the set of retainable filters further includes the steps of: generating N sub-networks per iteration, wherein N is the number of CONV and FC layers and each sub-network proposes a modification to one layer from the previous iteration; retaining a sub-network for the next iteration, wherein the retained sub-network has the maximum accuracy; and fine-tuning the retained sub-network.
[0024] In another embodiment, the sub-network is retained based on L1-norm magnitude.
[0025] In another embodiment, the digital arithmetic techniques compute the output feature map Y and its bias b′ as follows:
wherein P denotes the number of output channels for convolutional layer B, I denotes the input feature map, Q denotes the number of input channels, and R denotes the number of output channels for convolutional layer A.
[0026] In another embodiment, the matrix algebra is as follows
wherein the model includes three fully connected layers labelled as fc1, fc2, and fc3.
[0027] According to another embodiment, the present invention discloses a computing system. The computing system includes a processor and a non-transitory computer-readable medium comprising instructions which, when executed by the processor, perform processing that includes: providing a trained neural network including a plurality of Convolutional (CONV) layers and Fully Connected (FC) layers, wherein each CONV layer includes a plurality of filters; determining a set of retainable filters from the plurality of filters to retain in a specific layer of the plurality of CONV and FC layers through empirical evaluation, wherein the set of retainable filters satisfies a predetermined resource constraint; retaining the set of retainable filters and pruning a set of remaining filters from the plurality of filters of the plurality of CONV and FC layers; merging two consecutive CONV layers from the plurality of CONV layers using digital arithmetic techniques; removing activation and dropouts between the first and last fully connected layers; and merging the fully connected layers using matrix algebra.
BRIEF DESCRIPTION OF THE DRAWINGS
[0028] It is to be noted, however, that the appended drawings illustrate only typical embodiments of the present subject matter and are therefore not to be considered for limiting its scope, for the present disclosure may admit to other equally effective embodiments. The detailed description is described with reference to the accompanying figures. In the figures, a reference number identifies the figure in which the reference number first appears. The same numbers are used throughout the figures to reference like features and components. Some embodiments of system or method or structure in accordance with embodiments of the present subject matter are now described, by way of example, and with reference to the accompanying figures, in which:
[0029] Figure 1 illustrates a basic block diagram of a system for hardware-aware width and depth shrinking with convolutional and fully connected layer merging, in accordance with an embodiment of the present invention.
[0030] Figure 2 illustrates an example of a computing system for hardware-aware width and depth shrinking with convolutional and fully connected layer merging, in accordance with an embodiment of the present invention.
[0031] Figure 3 illustrates a flow chart of a method for hardware-aware width and depth shrinking with convolutional and fully connected layer merging, in accordance with an embodiment of the invention.
[0032] Figure 4 illustrates an exemplary process flow for width shrinking method for neural network, in accordance with an embodiment of the present invention.
[0033] Figure 5 illustrates an exemplary process flow for depth shrinking method for neural network, in accordance with an embodiment of the present invention.
[0034] Figure 6 illustrates an implementation of the proposed width shrinking and depth shrinking (FcMerge and ConvMerge) for the AlexNet model, in accordance with an embodiment of the present invention.
[0035] Figure 7 illustrates another implementation of the proposed width shrinking and depth shrinking (FcMerge and ConvMerge) for the VGG16 model, in accordance with an embodiment of the present invention.
[0036] The figures depict embodiments of the present subject matter for illustration only. A person skilled in the art will easily recognize from the following description that alternative embodiments of the structures and methods illustrated herein may be employed without departing from the principles of the disclosure described herein.
DETAILED DESCRIPTION
[0037] While the embodiments of the disclosure are subject to various modifications and alternative forms, a specific embodiment thereof has been shown by way of example in the figures and will be described below. It should be understood, however, that it is not intended to limit the disclosure to the particular forms disclosed, but on the contrary, the disclosure is to cover all modifications, equivalents, and alternatives falling within the scope of the disclosure.
[0038] The terms "comprises", "comprising", or any other variations thereof used in the disclosure are intended to cover a non-exclusive inclusion, such that a device, system, or assembly that comprises a list of components does not include only those components but may include other components not expressly listed or inherent to such system, assembly, or device. In other words, one or more elements in a system or device preceded by "comprises… a" does not, without more constraints, preclude the existence of other or additional elements in the system or device.
[0039] The present invention provides a solution for deploying deep CNNs on resource-constrained edge devices, such as IoT devices, mobile phones, and embedded systems. These devices often have constraints on power consumption, energy resources, and memory, which make it difficult to run large, complex CNN models efficiently. The present invention provides a hardware-aware optimization framework that uses width and depth shrinking, including layer merging strategies, to generate compact, efficient CNN models tailored for specific hardware platforms. This optimization technique enhances data flow and maintains model performance, making it feasible to run critical AI applications on edge devices in real time.
[0040] For a CNN, width and depth are key architectural parameters that significantly influence its capabilities and performance. The width of a CNN denotes the number of channels or filters in each of its layers. Increasing the width generally enhances the network's ability to recognize patterns within the data, as it enables capturing more complex features at each level. The depth of a CNN refers to the number of layers through which the input data passes; these may include convolutional, pooling, and fully connected layers, among others. Greater depth allows the network to learn hierarchical features, beginning with simple edges in the initial layers and progressing to increasingly complex shapes and objects in deeper layers. Nonetheless, increasing both width and depth considerably increases the requirement for computational resources and can also cause overfitting. When designing effective and efficient CNNs, especially for edge devices with limited computational power and memory, it is crucial to strike a trade-off between the network's parameters and its performance.
[0041] A CNN extracts features through convolutional layers, typically followed by dense layers and a classification layer. Increasing the CNN's depth generally enhances recognition accuracy. However, for edge devices, compact model sizes and quick inference are crucial. This methodology aims to reduce the number of convolutional and fully connected layers while preserving original accuracy. While fully connected (FC) layers have lower computational costs compared to convolutional layers due to their efficient weight usage, merging FC layers further reduces the model size. Additionally, the system and method merge two consecutive convolutional layers in a CNN using digital arithmetic techniques.
[0042] Figure 1 illustrates a basic block diagram of a system for hardware-aware width and depth shrinking with convolutional and fully connected layer merging, in accordance with an embodiment of the present invention. The system (100) includes a width shrinking engine (110) to reduce the number of channels (filters) in each convolutional layer of a trained neural network, thereby decreasing the model's width. The system (100) includes a depth shrinking engine (150) to reduce the number of layers in the network, decreasing the model's depth by using the ConvMerge and FcMerge formulae. The system (100) also includes a hardware-aware optimization module (160) to provide the hardware specification to the width shrinking engine (110). The width shrinking engine (110) includes a filter quantity selection (112), a specific filter decision (114), a filter pruning (116), and a calibration module (118).
[0043] According to an embodiment, the filter quantity selection (112) determines the number of filters to retain in a specific layer through empirical evaluation, gradually reducing the filter count, and evaluates the resource usage of the resulting network. The maximum filter count that meets the required resource constraints is selected. The removal of filters from a specific layer requires the removal of the associated channels in subsequent layers; therefore, adjustments to resource usage in other layers are also performed.
[0044] According to an embodiment, the specific filter decision (114) determines the filters to be retained based on the output of the filter quantity selection (112), which is based on the predetermined resource constraint. The retention of filters is based on the architecture determined in the filter quantity selection. A magnitude-based method is used to simplify the process: a method for reducing the size and complexity of neural networks that considers weights with larger absolute values to be more important. This method assumes that lower-magnitude parameters have minimal influence on the network output.
[0045] According to an embodiment of the present invention, the hardware-aware optimization module (160) provides the hardware specification to the width shrinking engine (110) to reduce the size of the trained neural network (model) width-wise, using hardware latency information to prune redundant weights.
[0046] According to an embodiment, the predetermined resource constraint is the hardware specification of the target platform where the neural network has to be deployed. Typically, the predetermined resource constraint refers to the hardware limits of the device or platform on which the neural network (model) is deployed. These constraints may include memory, processing power, energy consumption, latency, or other resource limits. The optimization process tailors the model to meet these hardware constraints while maintaining acceptable performance.
[0047] According to an example embodiment, consider deploying a neural network on a mobile device, for example a deep learning model for image recognition on a mobile phone. Mobile devices have much stricter hardware specifications than the powerful GPUs used during model training, and the model might be too large or too slow to run efficiently on the phone. In such cases, optimization is performed based on the hardware specification of the mobile device.
[0048] According to an embodiment, the filter pruning (116) selects the top N filters with the largest L1-norm magnitude for retention, wherein N is predefined. In an embodiment, advanced techniques, such as selecting filters based on their collective impact on output feature maps, could further enhance accuracy.
[0049] According to an embodiment, L1-norm magnitude for retention refers to a technique used in neural network optimization, particularly in pruning, where the L1-norm of the model's weights is used to determine which parts of the network should be retained or removed (pruned).
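By way of illustration only, the following minimal PyTorch sketch shows this magnitude-based selection on a single Conv2d layer; the function name prune_conv_by_l1 and the choice to return the retained indices (so that a caller can slice the next layer's input channels) are assumptions of this sketch, not details from the specification.

```python
import torch

def prune_conv_by_l1(conv: torch.nn.Conv2d, n_keep: int):
    """Retain the n_keep filters with the largest L1-norm and drop the rest.
    The returned indices are also needed to slice away the corresponding
    input channels of the following layer (left to the caller in this sketch)."""
    # One L1 score per filter (i.e. per output channel).
    l1_scores = conv.weight.data.abs().sum(dim=(1, 2, 3))
    keep = torch.topk(l1_scores, k=n_keep).indices.sort().values

    pruned = torch.nn.Conv2d(conv.in_channels, n_keep,
                             kernel_size=conv.kernel_size,
                             stride=conv.stride,
                             padding=conv.padding,
                             bias=conv.bias is not None)
    pruned.weight.data.copy_(conv.weight.data[keep])
    if conv.bias is not None:
        pruned.bias.data.copy_(conv.bias.data[keep])
    return pruned, keep
```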
[0050] According to an embodiment, the calibration module (118) performs end-to-end fine-tuning of the entire network. The calibration module (118) performs both quick and prolonged calibration during width shrinking of the neural network. During width reduction, each layer of the neural network is optimized for reduced resource usage through filter pruning, one layer at a time. To maintain accuracy, the width shrinking engine (110) optimizes each layer separately and selects the optimized model with the best accuracy. Once the desired resource threshold is reached, the chosen model undergoes prolonged calibration. According to an embodiment, quick calibration is the process of fine-tuning a model for a small number of iterations (such as 5 to 10). In accordance with an embodiment, prolonged calibration is the process of fine-tuning a model for a large number of iterations, such as 60 to 100.
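As a hedged illustration of these two calibration regimes (the specification does not fix the training hyperparameters, so the optimizer, learning rate, and the reading of an "iteration" as one pass over the calibration data are assumptions of this sketch), a minimal fine-tuning routine might look like:

```python
import torch

def calibrate(model, data_loader, iterations, lr=1e-4, device="cpu"):
    """Fine-tune a pruned or merged model for a fixed iteration budget.
    A small budget (e.g. 5-10) plays the role of quick calibration,
    a large budget (e.g. 60-100) the role of prolonged calibration."""
    model.to(device).train()
    optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    criterion = torch.nn.CrossEntropyLoss()
    for _ in range(iterations):  # assumption: one iteration = one pass over the data
        for inputs, targets in data_loader:
            inputs, targets = inputs.to(device), targets.to(device)
            optimizer.zero_grad()
            criterion(model(inputs), targets).backward()
            optimizer.step()
    return model
```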
[0051] According to an embodiment, the depth shrinking engine (150) reduces the number of convolutional and fully connected layers with minimal impact on the original accuracy. The depth shrinking engine (150) merges two consecutive CONV layers from the plurality of CONV layers using digital arithmetic techniques, removes activation and dropouts between the first and last fully connected layers, and merges the fully connected layers using matrix algebra.
[0052] According to an embodiment, the digital arithmetic techniques compute the output feature map Y and its bias b′′ (derived from b and l) using the ConvMerge formula as follows:
wherein P denotes the number of output channels for convolutional layer B, I denotes the input feature map, Q denotes the number of input channels, and R denotes the number of output channels for convolutional layer A.
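The ConvMerge expression itself is not reproduced in this text, but one standard way of merging two consecutive stride-1 convolutions with no activation between them can be sketched in PyTorch as below. This is a minimal illustration under those assumptions (square kernels, zero-padding border effects ignored); the function name conv_merge is illustrative and the sketch is not asserted to be the patented formula.

```python
import torch
import torch.nn.functional as F

def conv_merge(conv_a: torch.nn.Conv2d, conv_b: torch.nn.Conv2d) -> torch.nn.Conv2d:
    """Merge consecutive Conv2d layers A (Q -> R channels, bias b) and
    B (R -> P channels, bias l) into one equivalent stride-1 layer."""
    w_a, b = conv_a.weight.data, conv_a.bias.data      # (R, Q, kA, kA), (R,)
    w_b, l = conv_b.weight.data, conv_b.bias.data      # (P, R, kB, kB), (P,)
    k_b = w_b.shape[-1]

    # Composite kernel C[p, q] = sum_r B[p, r] * A[r, q] (true convolution),
    # computed here as a cross-correlation against the spatially flipped B kernel.
    merged_w = F.conv2d(w_a.permute(1, 0, 2, 3),        # q as batch, r as channels
                        w_b.flip(-1, -2),
                        padding=k_b - 1).permute(1, 0, 2, 3)  # (P, Q, kA+kB-1, kA+kB-1)

    # Merged bias b'': layer A's bias b is constant over space, so it passes
    # through B as b[r] times the sum of B[p, r]'s taps, added to l.
    merged_b = l + (w_b.sum(dim=(-1, -2)) * b).sum(dim=1)

    merged = torch.nn.Conv2d(conv_a.in_channels, conv_b.out_channels,
                             kernel_size=merged_w.shape[-1],
                             padding=conv_a.padding[0] + conv_b.padding[0])
    merged.weight.data.copy_(merged_w)
    merged.bias.data.copy_(merged_b)
    return merged
```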
[0053] According to an embodiment, the convolutional layer merge and fully connected layer merge together contribute significantly to reducing the overall size of the output model.
[0054] According to an embodiment, the fully connected layers are exclusively responsible for classification tasks, allowing their linear matrices to be merged to form the final classification layer. It is crucial to remove the activation and dropout layers between the fully connected layers beforehand. This leads to the "FcMerge" formula, defined as follows:
Equation 2 assumes the model includes three fully connected layers labelled as fc1, fc2, and fc3. The techniques ConvMerge and FcMerge together contribute significantly to reducing the overall size of the output model.
[0055] According to an embodiment, the FcMerge formula is derived from matrix algebra.
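A minimal PyTorch sketch of this matrix-algebra merge for three Linear layers, with activations and dropout already removed between them, is given below; the function name fc_merge is illustrative.

```python
import torch

def fc_merge(fc1: torch.nn.Linear, fc2: torch.nn.Linear, fc3: torch.nn.Linear) -> torch.nn.Linear:
    """Collapse three consecutive Linear layers into a single equivalent layer:
    y = W3(W2(W1 x + b1) + b2) + b3  =  (W3 W2 W1) x + (W3(W2 b1 + b2) + b3)."""
    w1, b1 = fc1.weight.data, fc1.bias.data
    w2, b2 = fc2.weight.data, fc2.bias.data
    w3, b3 = fc3.weight.data, fc3.bias.data

    merged_w = w3 @ w2 @ w1                 # merged weight matrix
    merged_b = w3 @ (w2 @ b1 + b2) + b3     # merged bias vector

    merged = torch.nn.Linear(fc1.in_features, fc3.out_features)
    merged.weight.data.copy_(merged_w)
    merged.bias.data.copy_(merged_b)
    return merged
```

Because the intermediate dimensions disappear, the merged layer stores a single matrix of size (fc3 outputs × fc1 inputs), which is where the model-size reduction of FcMerge comes from.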
[0056] According to an embodiment of the present invention, the trained neural network is a convolutional neural network (CNN) model.
[0057] Any suitable computing system or group of computing systems may be used for performing the operations described herein. For example, Figure 2 depicts an example of a computing system (200) that executes hardware-aware width and depth shrinking with convolutional and fully connected layer merging. In other embodiments, a separate computing system having devices similar to those depicted in Figure 2 (e.g., a processor, a memory, etc.) executes one or more of the blocks of the width shrinking engine (110) or the depth shrinking engine (150).
[0058] The depicted example of a computing system (200) includes a processor (202) communicatively coupled to one or more memory devices (204). The processor (202) executes computer-executable program code stored in a memory device (204), accesses information stored in the memory device (204), or both. Examples of the processor (202) include a microprocessor, an application-specific integrated circuit ("ASIC"), a field-programmable gate array ("FPGA"), or any other suitable processing device. The processor (202) may include any number of processing devices, including a single processing device.
[0059] The memory device (204) includes any suitable non-transitory computer-readable medium for storing data, program code, or both. A computer-readable medium may include any electronic, optical, magnetic, or other storage device capable of providing a processor with computer-readable instructions or other program code. Non-limiting examples of a computer-readable medium include a magnetic disk, a memory chip, a ROM, a RAM, an ASIC, optical storage, magnetic tape or other magnetic storage, or any other medium from which a processing device can read instructions. The instructions may include processor-specific instructions generated by a compiler or an interpreter from code written in any suitable computer-programming language, including, for example, C, C++, C#, Visual Basic, Java, Python, Perl, JavaScript, and ActionScript.
[0060] The computing system (200) may also include a number of external or internal devices, such as input or output devices. For example, the computing system (200) is shown with one or more input/output ("I/O") interfaces (208). An I/O interface (208) may receive input from input devices or provide output to output devices. One or more buses (206) are also included in the computing system (200). The bus (206) communicatively couples one or more components of a respective one of the computing systems (200).
[0061] The computing system (200) executes program code that configures the processor (202) to perform one or more of the operations described herein. The program code includes, for example, the width shrinking engine (110) to reduce the number of channels (filters) in each convolutional layer of the trained neural network, thereby decreasing the model's width; the depth shrinking engine (150) to reduce the number of layers in the network, decreasing the model's depth by using the ConvMerge and FcMerge formulae; the hardware-aware optimization module (160) to provide the hardware specification to the width shrinking engine (110); and other suitable applications that perform one or more operations described herein. The program code may be resident in the memory device (204) or any suitable computer-readable medium and may be executed by the processor (202) or any other suitable processor. In some embodiments, the width shrinking engine (110) and the depth shrinking engine (150) are stored in the memory device (204), as depicted in Figure 2. In additional or alternative embodiments, one or more of the inputs for the width shrinking engine or the depth shrinking engine are stored in different memory devices of different computing systems. In additional or alternative embodiments, the program code described above is stored in one or more other memory devices accessible via a data network.
[0062] The computing system (200) may access data in any suitable manner. In some embodiments, some or all of one or more of these data sets, models, and functions are stored in the memory device (204), as in the example depicted in Figure 2. For example, a computing system (200) that executes the width shrinking engine or the depth shrinking engine may access input data stored by an external system.
[0063] In additional or alternative embodiments, one or more of these data sets, models, and functions are stored in the same memory device (e.g., one of the memory devices (204)). For example, a common computing system can host the width shrinking engine and the depth shrinking engine. In additional or alternative embodiments, one or more of the programs, data sets, models, and functions described herein are stored in one or more other memory devices accessible via a data network.
[0064] The computing system (200) also includes a network interface device (210). The network interface device (210) includes any device or group of devices suitable for establishing a wired or wireless data connection to one or more data networks. Non-limiting examples of the network interface device (210) include an Ethernet network adapter, a modem, and the like. The computing system (200) is able to communicate with one or more other computing devices via a data network using the network interface device (210).
[0065] Figure 3 illustrates a flow chart of a method for hardware-aware width and depth shrinking with convolutional and fully connected layer merging, in accordance with an embodiment of the invention. The method (300) includes the step (302) of receiving a trained neural network comprising a plurality of Convolutional (hereinafter referred to as CONV) layers and a plurality of Fully Connected (hereinafter referred to as FC) layers, wherein each CONV layer of the plurality of CONV layers and each FC layer of the plurality of FC layers includes a plurality of filters. The method (300) includes the step (304) of determining a set of retainable filters from the plurality of filters to retain in a specific layer of at least one of the plurality of CONV or the plurality of FC layers through empirical evaluation, wherein the set of retainable filters satisfies a predetermined resource constraint. The method (300) also includes the step (306) of retaining the set of retainable filters and pruning a set of remaining filters from the plurality of filters of the plurality of CONV and the plurality of FC layers, and the step (308) of merging two consecutive CONV layers from the plurality of CONV layers using digital arithmetic techniques. The method (300) further includes the step (310) of removing activation and dropouts between the first and last fully connected layers and the step (312) of merging the fully connected layers using matrix algebra.
[0066] According to an embodiment, the step of determining the set of retainable filters includes the sub-steps of: generating N sub-networks per iteration, wherein N is the number of CONV layers and FC layers and each sub-network proposes a modification to one layer from the previous iteration; retaining a sub-network for the next iteration, wherein the retained sub-network has the maximum accuracy; and fine-tuning the retained sub-network.
[0067] According to an embodiment, the sub step of retaining the subnetwork is based on L1-norm magnitude.
[0068] According to an embodiment, the step (308) of merging two consecutive CONV layers uses digital arithmetic techniques to compute the output feature map Y and its bias b as follows:
wherein P denotes the number of output channels for convolutional layer B, I denotes the input feature map, Q denotes the number of input channels, and R denotes the number of output channels for convolutional layer A.
[0069] According to an embodiment, the step (312) of merging fully connected layers uses the following matrix algebra:
[0070] According to an embodiment, the empirical evaluation refers to testing or experimenting with different configurations of these convolutional layers to see which setup performs best. Empirical evaluation involves running experiments to evaluate the model's accuracy, efficiency, or other performance metrics rather than relying solely on theoretical design.
[0071] Figure 4 illustrates an exemplary process flow for the width shrinking method (400) for a neural network, in accordance with an embodiment of the present invention. The width shrinking method (400) receives the trained neural network (402) and the hardware specification of the target platform (404). The trained neural network (402) includes a plurality of convolutional layers. The width shrinking method (400) processes one convolutional layer at a time. To maintain accuracy, the width shrinking method (400) simplifies each layer separately and selects the optimized model with the best accuracy. Once the desired resource threshold is reached, the chosen model undergoes prolonged calibration. Each convolutional layer is subjected to filter quantity selection, specific filter decision, filter pruning, and quick calibration. For each of layer 1, layer 2, ..., layer n, the width shrinking method (400) performs filter quantity selection (406-1, 406-2, ..., 406-n), specific filter decision (408-1, 408-2, ..., 408-n), filter pruning (410-1, 410-2, ..., 410-n), and quick calibration (412-1, 412-2, ..., 412-n). The output from the quick calibration (412-1, 412-2, ..., 412-n) is fed back to the filter quantity selection (406-1, 406-2, ..., 406-n). The width shrinking method (400) further includes the step (414) of measuring accuracy and the resource threshold, and at step (416) chooses the best-accuracy model. The width shrinking method (400) at step (418) checks whether the trained neural network meets the constraint of the target hardware; if yes, the method (400) performs prolonged calibration at (420) and provides the width-shrunk model (421). If at (418) the trained neural network does not meet the constraint, optimization of each layer is performed again from (406) to (416). According to an embodiment, meeting the constraints of the target hardware in turn ensures optimal use of processing units and memory hierarchies.
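A compact sketch of this per-layer loop is given below, assuming a PyTorch model and hypothetical helpers meets_constraint, prune_one_layer_by_l1, quick_calibrate, prolonged_calibrate, and evaluate_accuracy; none of these names come from the specification.

```python
import copy

def width_shrink(model, layer_names, constraint, train_loader, val_loader):
    """Narrow one layer per round, keep the best-accuracy candidate after quick
    calibration, and repeat until the hardware constraint (418) is satisfied."""
    while not meets_constraint(model, constraint):                  # step (418)
        candidates = []
        for name in layer_names:                                    # one sub-network per layer
            candidate = copy.deepcopy(model)
            prune_one_layer_by_l1(candidate, name)                  # steps (406)-(410)
            quick_calibrate(candidate, train_loader, iterations=5)  # step (412)
            candidates.append((evaluate_accuracy(candidate, val_loader), candidate))
        _, model = max(candidates, key=lambda pair: pair[0])        # steps (414)-(416)
    prolonged_calibrate(model, train_loader, iterations=80)         # step (420)
    return model                                                    # width-shrunk model (421)
```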
[0072] According to an embodiment, during the magnitude-based pruning method for the fully connected layers, the magnitude of each neuron/weight, which is a filter of dimension [1x1], is checked. In convolutional layers, filter dimensions may be [3x3], [5x5], [11x11], etc.
[0073] Figure 5 illustrates an exemplary process flow for the depth shrinking method (500) for a neural network, in accordance with an embodiment of the present invention. The depth shrinking method (500) receives the trained neural network (502), removes the activation between two consecutive layers at (504), and performs quick calibration at (506). The depth shrinking method (500) then merges two consecutive convolutional layers using the ConvMerge formula at (508), followed by quick calibration. The depth shrinking method (500) checks at (512) whether other convolutional layers need to be merged; if no further convolutional layers need to be merged, the depth shrinking method (500) removes all activations and dropouts between the first and last fully connected layers at (514), merges all fully connected layers using the FcMerge formula at (516), and performs prolonged calibration at (518), resulting in the depth-shrunk model at (520).
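Under the same assumptions (hypothetical helper names; conv_merge and fc_merge as sketched above), the depth-shrinking flow of Figure 5 can be outlined as:

```python
def depth_shrink(model, conv_pairs_to_merge, train_loader):
    """Outline of Figure 5: merge selected consecutive CONV layers, then
    collapse the fully connected head, with calibration after each change."""
    for first, second in conv_pairs_to_merge:                    # pairs chosen per memory budget
        remove_activation_between(model, first, second)          # step (504), hypothetical helper
        quick_calibrate(model, train_loader, iterations=5)       # step (506)
        merged = conv_merge(get_layer(model, first), get_layer(model, second))
        replace_pair(model, first, second, merged)               # step (508), hypothetical helper
        quick_calibrate(model, train_loader, iterations=5)
    remove_fc_activations_and_dropout(model)                     # step (514), hypothetical helper
    collapse_fc_layers(model, fc_merge)                          # step (516), hypothetical helper
    prolonged_calibrate(model, train_loader, iterations=80)      # step (518)
    return model                                                 # depth-shrunk model (520)
```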
[0074] According to an embodiment, the neural network's (model's) accuracy after quick calibration needs to remain at a reasonable level. For example, if the original accuracy is 89%, the accuracy after quick calibration should still be above 75%; otherwise, it may be challenging to recover the original accuracy during further merging or optimization steps. In another embodiment, when the model is implemented on hardware that has sufficient memory, there may be no need to merge multiple convolutional layers.
[0075] Figure 6 illustrates an example implementation of width shrinking and depth shrinking (FcMerge and ConvMerge), in accordance with an embodiment of the present invention. The model is AlexNet trained on the CIFAR10 dataset. The original trained model (602) includes five convolutional layers (CONV1, CONV2, CONV3, CONV4, CONV5) and three fully connected layers (FC1, FC2, FC3). During the initial depth shrinking stage (604), the convolutional layers CONV4 and CONV5 merge to form the merged convolutional layer CONV4_5, and the fully connected layers FC1, FC2, and FC3 merge to form FC'. In the next stage (606), the convolutional layers CONV3 and CONV4_5 merge to form CONV3_4_5. Hence, the depth shrinking (DS) method yields the convolutional layers CONV1, CONV2, CONV3_4_5 and the merged fully connected layer FC'. The width shrinking (WS) method is applied to the convolutional layers (CONV1, CONV2, CONV3, CONV4, CONV5) and the fully connected layers FC1, FC2, and FC3. The original trained model (602) undergoes width shrinking, resulting in width-shrunk CONV layers (608) CONV1, CONV2, CONV3, CONV4, CONV5 and FC layers FC1, FC2, FC3. The width-shrunk CONV and FC layers then undergo FcMerge and ConvMerge at (610) to generate width-shrunk and depth-shrunk CONV and FC layers, which undergo one more round of ConvMerge at (612).
[0076] Figure 7 illustrates another example implementation of width shrinking and depth shrinking (FcMerge and ConvMerge), in accordance with an embodiment of the present invention. The implementation is on a VGG16 model trained on the Fruit dataset (131 classes). The original trained model (702) includes thirteen convolutional layers (CONV1 through CONV13) and three fully connected layers (FC1, FC2, FC3). During the initial depth shrinking stage (704), CONV1 and CONV2 merge to form CONV1_2; CONV3 and CONV4 merge to form CONV3_4; CONV5, CONV6, and CONV7 merge to form CONV5_6_7; CONV8, CONV9, and CONV10 merge to form CONV8_9_10; and CONV11, CONV12, and CONV13 merge to form CONV11_12_13. The fully connected layers FC1, FC2, and FC3 merge to form FC'. Further, the original trained model (702) undergoes width shrinking, creating width-shrunk CONV and FC layers at (706). The width-shrunk CONV and FC layers then undergo FcMerge at (708) to generate width-shrunk, depth-shrunk layers, which further undergo ConvMerge to generate the next level of width-shrunk, depth-shrunk layers at (710).
[0077] According to an embodiment, by strategically manipulating the width and depth of the CNN model, the present invention considerably compresses its size, making the CNN model more compatible with edge devices. This optimization reduces the memory footprint as well as the need to store intermediate data off the chip. As a result, the efficiency and responsiveness of the CNN model are enhanced in resource-constrained environments. Such adaptability is vital to enable real-time AI applications on IoT and mobile devices, on which hardware limitations are a major concern.
[0078] According to an embodiment, the present invention provides valuable utility across numerous domains where deploying deep Convolutional Neural Networks (CNNs) on resource-constrained edge devices is a necessity. Its capacity to optimize CNN models for efficiency, real-time performance, and energy consumption makes it applicable to various key sectors.
[0079] According to an embodiment, the present invention is particularly advantageous for IoT devices, which are often limited in computational power and memory. By reducing the width and depth of CNNs, the invention enables real-time data processing and decision-making at the edge, a critical requirement for applications such as smart home systems, industrial automation, and environmental monitoring. This allows IoT devices to perform complex tasks, including anomaly detection, predictive maintenance, and environmental sensing, without relying on cloud-based processing. As a result, it reduces latency and enhances data privacy, making edge computing more efficient and responsive.
[0080] According to an embodiment, the present invention is used in mobile and wearable devices. In mobile phones, smartwatches, and other wearable technologies, the present invention provides platform-aware optimization techniques to facilitate the deployment of advanced AI features like facial recognition, gesture recognition, and health monitoring in a resource-efficient manner. The reduced model size, along with improved energy efficiency, extends battery life and boosts performance. This enhances the user experience by enabling continuous operation of AI-driven functionalities, which is particularly beneficial for devices requiring long periods of uptime or intensive processing capabilities.
[0081] According to an embodiment, the present invention finds applications in autonomous vehicles, where the real-time processing of visual data is critical for navigation, obstacle detection, and decision-making. By optimizing CNNs to be compatible with specific hardware platforms, the invention ensures that these models can operate effectively under the strict time and resource constraints characteristic of automotive environments. This enhances the safety and reliability of autonomous driving systems, which need to process large volumes of data in real time while minimizing power consumption, an essential factor in overall vehicle efficiency.
[0082] According to another embodiment, the present invention enables the deployment of AI models on portable medical devices, such as diagnostic tools, wearable health monitors, and point-of-care systems. These devices often require on-device processing to provide real-time feedback to healthcare professionals and patients. The present invention's ability to optimize CNNs allows for accurate and fast analysis of medical data, including imaging and bio signal monitoring, without the need for constant cloud connectivity. This capability leads to better patient outcomes, enabling timely interventions and improved medical decision-making.
[0083] According to an embodiment, the present invention is highly compatible with edge computing platforms, which are designed to bring processing closer to the data sources. By optimizing CNNs for these platforms, it ensures that complex AI tasks can be processed locally, reducing the reliance on central servers and lowering latency. This is critical for applications like smart cities, real-time analytics, and distributed AI systems, as it enables more efficient and scalable edge computing solutions.
[0084] According to an embodiment, present invention enables the deployment of complex deep CNN models on a wide range of devices, from edge devices to cloud servers, while ensuring efficiency and real-time performance.
ADVANTAGES OF THE INVENTION
[0085] In summary, the present invention has the following advantages:
• Model Complexity Reduction: By shrinking the width and depth of the network, the number of parameters and computations is reduced, which leads to a smaller and less complex model.
• Increased Hardware Efficiency: Since the adaptation is hardware-aware, it takes hardware resource limitations into account, so the model optimization makes the best use of available resources and ensures efficient execution.
• Inference Time Acceleration: By reducing the number of operations through width and depth shrinking, the model's inference time is significantly decreased. This is crucial for real-time applications where latency is a critical factor.
• Reduced Memory Usage: Compression of the network reduces the need for storing intermediate data, which is crucial for edge devices with limited on-chip memory.
• Reduced Compile Time: Since the model size is reduced drastically, the compile time is also reduced.
• Suitability for Resource-Constrained Edge Devices: Edge devices mostly have limited memory, computational power, and energy resources. Deploying large, complex CNNs on such devices without proper optimization can lead to inefficiency, increased power consumption, and reduced battery life.
• Real-Time Processing: Applications such as autonomous driving, surveillance, and robotics require real-time processing. High computational demands and large models can introduce delays, making it difficult to meet these strict timing requirements.
• Energy Efficiency: As AI applications increase every day, especially on mobile and battery-powered devices, energy efficiency becomes a critical concern. Reducing model size and complexity helps conserve energy, extending the operational life of the device.
• Improved Deployment Feasibility: Due to hardware constraints, large models may not be feasible to deploy on every hardware platform. Shrinking the model size and optimizing for specific hardware constraints makes deployment more practical and accessible.
• Maintained Performance: The invention aims to maintain the model's performance and accuracy despite the reduction in its size and complexity.
[0086] It will be understood by those within the art that, in general, terms used herein, and especially in the appended claims (e.g., bodies of the appended claims) are generally intended as "open" terms (e.g., the term "including" should be interpreted as "including but not limited to," the term "having" should be interpreted as "having at least," the term "includes" should be interpreted as "includes but is not limited to," etc.). It will be further understood by those within the art that if a specific number of an introduced claim recitation is intended, such an intent will be explicitly recited in the claim, and in the absence of such recitation, no such intent is present. For example, as an aid to understanding, the following appended claims may contain usage of the introductory phrases "at least one" and "one or more" to introduce claim recitations. However, the use of such phrases should not be construed to imply that the introduction of a claim recitation by the indefinite articles "a" or "an" limits any particular claim containing such introduced claim recitation to disclosures containing only one such recitation, even when the same claim includes the introductory phrases "one or more" or "at least one" and indefinite articles such as "a" or "an" (e.g., "a" and/or "an" should typically be interpreted to mean "at least one" or "one or more"); the same holds true for the use of definite articles used to introduce claim recitations. Also, even if a specific number of an introduced claim recitation is explicitly recited, those skilled in the art will recognize that such recitation should typically be interpreted to mean at least the recited number (e.g., the bare recitation of "two recitations," without other modifiers, typically means at least two recitations or two or more recitations). Furthermore, in those instances where a convention analogous to "at least one of A, B, and C, etc." is used, in general, such construction is intended in the sense one having skill in the art would understand the convention (e.g., "a system having at least one of A, B, and C" would include but not be limited to systems that have A alone, B alone, C alone, A and B together, A and C together, B and C together, and/or A, B, and C together, etc.). In those instances, where a convention analogous to "at least one of A, B, or C, etc." is used, in general, such construction is intended in the sense one having skill in the art would understand the convention (e.g., "a system having at least one of A, B, or C" would include but not be limited to systems that have A alone, B alone, C alone, A and B together, A and C together, B and C together, and/or A, B, and C together, etc.). It will be further understood by those within the art that virtually any disjunctive word and/or phrase presenting two or more alternative terms, whether in the description, claims, or drawings, should be understood to contemplate the possibilities of including one of the terms, either of the terms or both terms. For example, the phrase "A or B" will be understood to include the possibilities of "A" or "B" or "A and B."
[0087] It will be further appreciated that functions or structures of a plurality of components or steps may be combined into a single component or step, or the functions or structures of one-step or component may be split among plural steps or components. The present disclosure contemplates all of these combinations. Unless stated otherwise, dimensions and geometries of the various structures depicted herein are not intended to be restrictive of the disclosure, and other dimensions or geometries are possible. Also, while a feature of the present disclosure may have been described in the context of only one of the illustrated embodiments, such feature may be combined with one or more other features of other embodiments, for any given application. It will also be appreciated from the above that the fabrication of the unique structures herein and the operation thereof also constitute methods in accordance with the present disclosure. The present disclosure also encompasses intermediate and end products resulting from the practice of the methods herein. It will be apparent to one of ordinary skill in the art that methods, devices, device elements, materials, procedures, and techniques other than those specifically described herein may be applied to the practice of the invention as broadly disclosed herein without resort to undue experimentation. All art-known functional equivalents of methods, devices, device elements, materials, procedures, and techniques described herein are intended to be encompassed by this invention. The use of "comprising" or "including" also contemplates embodiments that "consist essentially of" or "consist of" the recited feature.
Claims:
We claim:
1) A method (300) for hardware-aware width and depth shrinking with convolutional and fully connected layer merging, the method comprising:
receiving (302) a trained neural network comprising a plurality of Convolutional (CONV) layers and a plurality of Fully Connected (FC) layers, wherein each CONV layer of the plurality of CONV layers and each FC layer of the plurality of FC layers comprises a plurality of filters;
determining (304) a set of retainable filters from the plurality of filters to retain in a specific layer of at least one of the plurality of CONV layers or the plurality of FC layers through empirical evaluation, wherein the set of retainable filters satisfies a predetermined resource constraint;
retaining (306) the set of retainable filters and pruning a set of remaining filters from the plurality of filters from at least one of the plurality of CONV layers or the plurality of FC layers;
merging (308) two consecutive CONV layers from a plurality of CONV layers using digital arithmetic techniques;
removing (310) activation and dropouts between first and last fully connected layers; and
merging (312) fully connected layers using matrix algebra.
2) The method (300) as claimed in claim 1, wherein determining the set of retainable filters comprises:
generating N sub-networks per iteration, wherein N is the number of CONV layers and FC layers, wherein each sub-network proposes a modification to one layer from the previous iteration;
retaining a sub-network for the next iteration, wherein the retained sub-network has maximum accuracy; and
fine tuning the retained sub-network.
3) The method (300) as claimed in claim 2, wherein the sub-network is retained based on L1-norm magnitude.
4) The method as claimed in claim 1, wherein the digital arithmetic techniques compute the output feature map Y and its bias b′ as follows:
wherein P denotes the number of output channels for convolutional layer B, I denotes the input feature map, Q denotes the number of input channels, and R denotes the number of output channels for convolutional layer A.
5) The method (300) as claimed in claim 1, wherein the matrix algebra is as follows:
wherein the model comprises three fully connected layers labelled as fc1, fc2, and fc3.
6) A computing system comprising:
a processor (402);
a non-transitory computer-readable medium comprising instructions which, when executed by the processor (402), perform processing comprising:
receiving a trained neural network comprising a plurality of Convolutional (CONV) layers and Fully Connected (FC) layers, wherein each CONV layer comprises a plurality of filters;
determining a set of retainable filters from the plurality of filters to retain in a specific layer of the plurality of CONV and FC layers through empirical evaluation, wherein the set of retainable filters satisfies a predetermined resource constraint;
retaining the set of retainable filters and pruning a set of remaining filters from the plurality of filters from a plurality of CONV and FC layers;
merging two consecutive CONV layers from a plurality of CONV layers using digital arithmetic techniques;
removing activation and dropouts between first and last fully connected layers; and
merging fully connected layers using matrix algebra.
7) The system as claimed in claim 6, wherein determining the set of retainable filters comprises:
generating N sub-networks per iteration, wherein N is the number of CONV and FC layers, wherein each sub-network proposes a modification to one layer from the previous iteration;
retaining a sub-network for the next iteration, wherein the retained sub-network has maximum accuracy; and
fine tuning the retained sub-network.
8) The system as claimed in claim 7, wherein the sub-network is retained based on L1-norm magnitude.
9) The system as claimed in claim 6, wherein the digital arithmetic techniques compute the output feature map Y and its bias b′ as follows:
wherein P denotes the number of output channels for convolutional layer B, I denotes the input feature map, Q denotes the number of input channels, and R denotes the number of output channels for convolutional layer A.
10) The system as claimed in claim 6, wherein the matrix algebra is as follows:
wherein the model comprises three fully connected layers labelled as fc1, fc2, and fc3.
Dated this 26th day of October 2024
Documents
Name | Date |
---|---|
202441081768-COMPLETE SPECIFICATION [26-10-2024(online)].pdf | 26/10/2024 |
202441081768-DECLARATION OF INVENTORSHIP (FORM 5) [26-10-2024(online)].pdf | 26/10/2024 |
202441081768-DRAWINGS [26-10-2024(online)].pdf | 26/10/2024 |
202441081768-EDUCATIONAL INSTITUTION(S) [26-10-2024(online)].pdf | 26/10/2024 |
202441081768-EVIDENCE FOR REGISTRATION UNDER SSI(FORM-28) [26-10-2024(online)].pdf | 26/10/2024 |
202441081768-EVIDENCE OF ELIGIBILTY RULE 24C1f [26-10-2024(online)].pdf | 26/10/2024 |
202441081768-FIGURE OF ABSTRACT [26-10-2024(online)].pdf | 26/10/2024 |
202441081768-FORM 1 [26-10-2024(online)].pdf | 26/10/2024 |
202441081768-FORM 18A [26-10-2024(online)].pdf | 26/10/2024 |
202441081768-FORM FOR SMALL ENTITY(FORM-28) [26-10-2024(online)].pdf | 26/10/2024 |
202441081768-FORM-9 [26-10-2024(online)].pdf | 26/10/2024 |
202441081768-POWER OF AUTHORITY [26-10-2024(online)].pdf | 26/10/2024 |
202441081768-PROOF OF RIGHT [26-10-2024(online)].pdf | 26/10/2024 |