A SYSTEM AND A METHOD FOR CANCER CLASSIFICATION USING SEQUENTIAL GENE EXPRESSION DATA

ORDINARY APPLICATION

Published

Filed on 18 November 2024

Abstract

The present disclosure discloses a system and a method for cancer classification using sequential gene expression data. The system (100) comprises: a data import module (102) comprising an encoding module (104) to receive gene expression data and a label encoder (104a) to apply class label encoding to expression values; a data preprocessing module (106) to normalize gene expression values and encode class labels into a numerical format suitable for processing by at least one deep learning model; a data splitting module (108) to divide the preprocessed data into a training dataset and a testing dataset; a recurrent neural network (RNN) training module (110) to initialize and compile a first deep learning model and a second deep learning model to build an RNN model, and to receive the training dataset to identify long-term dependencies; a classification module (112) to classify the preprocessed gene expression data into a cancer class; and a user interface (114) to display the classified cancer class on a display unit (114a).
Figure 1

Patent Information

Application ID: 202441089264
Invention Field: COMPUTER SCIENCE
Date of Application: 18/11/2024
Publication Number: 48/2024

Inventors

Name | Address | Country | Nationality
KOTA SRAVANI | SRM University-AP, Neerukonda, Mangalagiri Mandal, Guntur-522502, Andhra Pradesh, India | India | India
DONTHIREDDY CHANDRASAI REDDY | SRM University-AP, Neerukonda, Mangalagiri Mandal, Guntur-522502, Andhra Pradesh, India | India | India
GHANTA SWETHA | SRM University-AP, Neerukonda, Mangalagiri Mandal, Guntur-522502, Andhra Pradesh, India | India | India
ASHOK KUMAR PRADHAN | SRM University-AP, Neerukonda, Mangalagiri Mandal, Guntur-522502, Andhra Pradesh, India | India | India

Applicants

Name | Address | Country | Nationality
SRM UNIVERSITY | Amaravati, Mangalagiri, Andhra Pradesh-522502, India | India | India

Specification

Description:
FIELD OF INVENTION
The present disclosure generally relates to the field of recognition systems. More particularly, the present disclosure relates to a system and a method for cancer classification using sequential gene expression data.
DEFINITIONS
As used in the present disclosure, the following terms are generally intended to have the meaning as set forth below, except to the extent that the context in which they are used indicates otherwise.
The term "gene expression data" refers to the information that represents the levels of gene activity within a cell or tissue at a specific time.
The term "set of preprocessing rules" refers to a structured collection of guidelines or algorithms applied to raw data before it is used in further analysis or modeling. These rules are designed to transform the data into a format that is more suitable for processing by machine learning models, ensuring consistency, accuracy, and reliability of the result.
The term "machine learning processor" refers to a specialized computational unit or hardware designed to efficiently execute machine learning algorithms, particularly those involving large-scale data processing and complex mathematical computations.
The term "iterative optimization method" refers to a computational technique used to find the best solution (often the minimum or maximum) of a function or a set of parameters by gradually improving an initial estimate through a series of iterations.
The above definitions are in addition to those expressed in the art.

BACKGROUND
The background information herein below relates to the present disclosure but is not necessarily prior art.
Sequential data analysis has become a cornerstone in various domains, particularly in fields like natural language processing (NLP) and time-series forecasting. In NLP, sequential data analysis underpins tasks such as language translation, sentiment analysis, and speech recognition, enabling machines to understand and generate human language in a meaningful way. Similarly, in time-series forecasting, sequential data analysis is used to predict future trends based on past observations, which is critical for applications ranging from financial market analysis to weather prediction.
One of the primary challenges is the vanishing and exploding gradient problem that occurs during the training of deep learning models. Another significant limitation is the computational complexity associated with training deep learning models on large sequential datasets. Overfitting is another concern in sequential data analysis. When models are trained on limited or noisy data, they may become overly tuned to the training data and fail to generalize well to new, unseen data. Furthermore, the interpretability of models used for sequential data analysis is often limited. Deep learning models, particularly those used in NLP and time-series analysis, are often considered "black boxes," making it difficult for users to understand how decisions are made or why certain predictions are generated. This lack of transparency is a significant drawback, particularly in fields where understanding the rationale behind predictions is crucial.
Therefore, there is felt a need for a system and a method for cancer classification using sequential gene expression data that alleviates the aforementioned drawbacks.
OBJECTS
Some of the objects of the present disclosure, which at least one embodiment herein satisfies, are as follows:
It is an object of the present disclosure to ameliorate one or more problems of the prior art or to at least provide a useful alternative.
An object of the present disclosure is to provide a system for cancer classification using sequential gene expression data.
Another object of the present disclosure is to provide a system that ensures accurate and reliable predictions.
Still another object of the present disclosure is to provide a system that is able to capture complex patterns and long-term dependencies in the data, leading to more accurate cancer classification.
Yet another object of the present disclosure is to provide a system that offers enhanced scalability and flexibility.
Still another object of the present disclosure is to provide a system that has robust and well-generalized models, minimizing the risk of overfitting.
Yet another object of the present disclosure is to provide a system that classifies gene expression data into cancer classes.
Other objects and advantages of the present disclosure will be more apparent from the following description, which is not intended to limit the scope of the present disclosure.
SUMMARY
The present disclosure envisages a system and a method for cancer classification using sequential gene expression data. The system comprises a data import module, an encoding module, a data preprocessing module, a data splitting module, a recurrent neural network (RNN) training module, a classification module, and a user interface.
The data import module comprises a network interface configured to access and retrieve gene expression data from a data source, including databases, file systems, and remote or online repositories.
The encoding module is configured to cooperate with the data import module to receive the gene expression data and includes a label encoder to apply class label encoding to expression values in the gene expression data for generating an encoded gene expression data.
The data preprocessing module is configured to cooperate with the encoding module to receive the encoded gene expression data, further configured to normalize gene expression values in the encoded gene expression data by means of a set of preprocessing rules, and further configured to encode class labels into a numerical format suitable for processing by at least one deep learning model based on the normalized gene expression values for generating a pre-processed gene expression data.
The data splitting module is configured to cooperate with the data preprocessing module to receive the preprocessed gene expression data and divide the preprocessed data into training dataset and testing dataset.
The recurrent neural network (RNN) training module is configured to cooperate with the data splitting module, and comprises a machine learning processor to:
• initialize a first deep learning model and a second deep learning model,
• compile the first deep learning model and the second deep learning model to build an RNN model using an iterative optimization method and a categorical cross-entropy loss (CCE) function, and
• receive the training dataset for training the RNN model to identify long-term dependencies within the preprocessed gene expression data for a predefined number of epochs.
The classification module is configured to cooperate with the data splitting module and the RNN training module to receive the testing dataset, and further configured to implement the testing dataset on the trained RNN model to classify the preprocessed gene expression data into a cancer class.
The user interface is configured to cooperate with the classification module to receive the classified cancer class and display the classified cancer class on a display unit.
In an embodiment, the RNN training module further includes:
• an evaluation module configured to cooperate with the classification module to evaluate the performance of the trained RNN model based on classification metrics for each cancer class and generate results based on evaluation; and
• an optimization module configured to cooperate with the evaluation module to receive the evaluation result, and further configured to optimize the performance of the trained RNN model by adjusting hyperparameters including learning rates, number of neurons, batch sizes, number of epochs, and dropout layer.
In an embodiment, the data preprocessing module further comprises:
• a normalization module configured to normalize the gene expression data to a specific range before being processed by the RNN training module; and
• a feature extraction module configured to identify and extract relevant features from the gene expression data before being processed by the RNN training module.
In an embodiment, the RNN training module is configured to train the RNN model on the training dataset to identify patterns in the preprocessed gene expression data that correlate with cancer prediction, and further configured to implement early stopping techniques to prevent overfitting during the training process.
In an embodiment, the first deep learning model is a Long Short-Term Memory (LSTM) model, and the second deep learning model is a Gated Recurrent Unit (GRU) model.
In an embodiment, the user interface further comprises a visualization tool configured to display model performance metrics, such as receiver operating characteristic (ROC) curves, precision-recall curves, and other relevant charts, to aid in the analysis of the results.
In an embodiment, the user interface is further configured to allow users to export the model predictions and performance metrics for further analysis or reporting.
In an embodiment, the set of preprocessing rules is a set of instructions used to perform normalization, scaling, or other preprocessing techniques to ensure consistency and accuracy.
In an embodiment, the classified cancer class is selected from a group of breast cancer (BRCA), kidney renal clear cell carcinoma (KIRC), prostate adenocarcinoma (PRAD), lung adenocarcinoma (LUAD), and colon adenocarcinoma (COAD).
The present disclosure also envisages a method for cancer classification using sequential gene expression data. The method comprises the following steps:
• accessing and retrieving, by a data import module comprising a network interface, gene expression data from a data source, including databases, file systems, and remote or online repositories;
• receiving, by an encoding module, the gene expression data, and applying, by a label encoder of the encoding module, class label encoding to expression values in the gene expression data for generating an encoded gene expression data;
• receiving, by a data preprocessing module, the encoded gene expression data, normalizing gene expression values in the encoded gene expression data by means of a set of preprocessing rules, and encoding class labels into a numerical format suitable for processing by at least one deep learning model based on the normalized gene expression values for generating a pre-processed gene expression data;
• receiving, by a data splitting module, the preprocessed gene expression data and dividing the preprocessed data into a training dataset and a testing dataset;
• cooperating, by a recurrent neural network (RNN) training module, with the data splitting module;
• initializing, by the recurrent neural network (RNN) training module, a first deep learning model and a second deep learning model;
• compiling, by the recurrent neural network (RNN) training module, the first deep learning model and the second deep learning model to build an RNN model using an iterative optimization method and a categorical cross-entropy loss (CCE) function;
• receiving, by the recurrent neural network (RNN) training module, the training dataset for training the RNN model to identify long-term dependencies within the preprocessed gene expression data for a predefined number of epochs;
• receiving and implementing, by a classification module, the testing dataset on the trained RNN model to classify the preprocessed gene expression data into a cancer class; and
• receiving and displaying, by a user interface, the classified cancer class on a display unit.
BRIEF DESCRIPTION OF THE ACCOMPANYING DRAWING
A system and a method for cancer classification using sequential gene expression data of the present disclosure will now be described with the help of the accompanying drawing, in which:
Figure 1 illustrates a block diagram of a system for cancer classification using sequential gene expression data;
Figures 2A-2B illustrate a flow chart depicting the steps involved in a method for cancer classification using sequential gene expression data in accordance with an embodiment of the present disclosure; and
Figure 3 illustrates a flowchart of testing of unseen data in accordance with an embodiment of the present disclosure.
LIST OF REFERENCE NUMERALS
100 - System
102 - Data Import Module
104 - Encoding Module
106 - Data Preprocessing Module
106a - Normalization Module
106b - Feature Extraction Module
108 - Data Splitting Module
110 - Recurrent Neural Network (RNN) Training Module
112 - Classification Module
114 - User Interface
114a - Display Unit
116 - Evaluation Module
118 - Optimization Module
DETAILED DESCRIPTION
Embodiments of the present disclosure will now be described with reference to the accompanying drawing.
Embodiments are provided so as to thoroughly and fully convey the scope of the present disclosure to the person skilled in the art. Numerous details relating to specific components and methods are set forth to provide a complete understanding of embodiments of the present disclosure. It will be apparent to the person skilled in the art that the details provided in the embodiments should not be construed to limit the scope of the present disclosure. In some embodiments, well-known processes, well-known apparatus structures, and well-known techniques are not described in detail.
The terminology used in the present disclosure is only for the purpose of explaining a particular embodiment, and such terminology shall not be considered to limit the scope of the present disclosure. As used in the present disclosure, the forms "a," "an," and "the" may be intended to include the plural forms as well, unless the context clearly suggests otherwise. The terms "including" and "having" are open-ended transitional phrases and therefore specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not forbid the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. The particular order of steps disclosed in the method and process of the present disclosure is not to be construed as necessarily requiring their performance as described or illustrated. It is also to be understood that additional or alternative steps may be employed.
When an element is referred to as being "engaged to," "connected to," or "coupled to" another element, it may be directly engaged, connected, or coupled to the other element. As used herein, the term "and/or" includes any and all combinations of one or more of the associated listed elements.
Sequential data analysis has significantly advanced fields like natural language processing (NLP) and time-series forecasting, particularly with the development of Recurrent Neural Networks (RNNs) and their variants like LSTMs and GRUs. However, existing systems face technical limitations, including issues with vanishing and exploding gradients, high computational complexity, overfitting, and limited model interpretability. Additionally, the steep learning curve required to develop and implement these models restricts their accessibility. These challenges highlight the need for further improvements to enhance the effectiveness, scalability, and usability of sequential data analysis techniques.
To address the issues of the existing systems and methods, the present disclosure envisages a system (hereinafter referred to as "system 100") for cancer classification using sequential gene expression data and a method for cancer classification using sequential gene expression data (hereinafter referred to as "method 200"). The system 100 will now be described with reference to Figure 1 and the method 200 will be described with reference to Figures 2A-2B.
Referring to Figure 1, the system 100 comprises a data import module 102, an encoding module 104, a data preprocessing module 106, a data splitting module 108, a recurrent neural network (RNN) training module 110, a classification module 112, and a user interface 114.
The data import module 102 comprises a network interface 102a configured to access and retrieve gene expression data from a data source, including databases, file systems, and remote or online repositories.
The encoding module 104 is configured to cooperate with the data import module 102 to receive the gene expression data and includes a label encoder 104a to apply class label encoding to expression values in the gene expression data for generating encoded gene expression data.
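As an illustration only, the label encoding step can be sketched as a mapping from class names to integer indices. The sketch assumes that the label encoder 104a assigns indices in sorted order of the class names; the function names `fit_label_encoder` and `encode_labels` and the sample labels are hypothetical and do not appear in the disclosure.

```python
def fit_label_encoder(labels):
    """Build a class-name -> integer mapping (sorted so the result is deterministic)."""
    classes = sorted(set(labels))
    return {name: index for index, name in enumerate(classes)}


def encode_labels(labels, mapping):
    """Replace every class name with its integer index."""
    return [mapping[name] for name in labels]


# Hypothetical sample labels drawn from the five classes named in the disclosure.
sample_labels = ["BRCA", "KIRC", "PRAD", "LUAD", "COAD", "BRCA"]
mapping = fit_label_encoder(sample_labels)
encoded = encode_labels(sample_labels, mapping)
print(mapping)
print(encoded)
```

Keeping the mapping as a separate object allows the same encoding to be reused when unseen samples are classified later.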
The data preprocessing module 106 is configured to cooperate with the encoding module 104 to receive the encoded gene expression data, further configured to normalize gene expression values in the encoded gene expression data by means of a set of preprocessing rules, and further configured to encode class labels into a numerical format suitable for processing by at least one deep learning model based on the normalized gene expression values for generating a pre-processed gene expression data.
In an embodiment, the set of preprocessing rules is a set of instructions used to perform normalization, scaling, or other preprocessing techniques to ensure consistency and accuracy.
In an embodiment, the data preprocessing module 106 further comprises:
• a normalization module 106a configured to normalize the gene expression data to a specific range before being processed by the RNN training module 110; and
• a feature extraction module 106b configured to identify and extract relevant features from the gene expression data before being processed by the RNN training module 110.
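A minimal sketch of the two sub-modules above is given below, under stated assumptions: the normalization module 106a is assumed to apply per-gene min-max scaling to the range [0, 1], and the feature extraction module 106b is assumed to keep the genes with the highest variance. The disclosure does not fix either rule, so both choices, the function names, and the toy matrix are illustrative only.

```python
from statistics import pvariance


def min_max_normalize(samples, low=0.0, high=1.0):
    """Scale each gene (column) to [low, high]; a constant column maps to low."""
    scaled_columns = []
    for column in zip(*samples):
        col_min, col_max = min(column), max(column)
        span = col_max - col_min
        if span == 0:
            scaled_columns.append([low] * len(column))
        else:
            scaled_columns.append(
                [low + (value - col_min) * (high - low) / span for value in column]
            )
    # Transpose back so that each row is one sample again.
    return [list(row) for row in zip(*scaled_columns)]


def top_variance_genes(samples, k):
    """Return the column indices of the k genes with the highest variance."""
    variances = [pvariance(column) for column in zip(*samples)]
    ranked = sorted(range(len(variances)), key=lambda i: variances[i], reverse=True)
    return sorted(ranked[:k])


# Toy expression matrix: 3 samples (rows) x 3 genes (columns); values are made up.
expression = [
    [2.0, 10.0, 5.0],
    [4.0, 10.0, 1.0],
    [6.0, 10.0, 3.0],
]
normalized = min_max_normalize(expression)
selected = top_variance_genes(normalized, k=2)
print(normalized)
print(selected)
```

In this toy matrix the second gene is constant across samples, so it carries no class information and is dropped by the variance rule.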
The data splitting module 108 is configured to cooperate with the data preprocessing module 106 to receive the preprocessed gene expression data and divide the preprocessed data into a training dataset and a testing dataset.
The recurrent neural network (RNN) training module 110 is configured to cooperate with the data splitting module 108, and comprises a machine learning processor to:
• initialize a first deep learning model and a second deep learning model,
• compile the first deep learning model and the second deep learning model to build an RNN model using an iterative optimization method and a categorical cross-entropy loss (CCE) function, and
• receive the training dataset for training the RNN model to identify long-term dependencies within the preprocessed gene expression data for a predefined number of epochs.
In an embodiment, the first deep learning model is a Long Short-Term Memory (LSTM) model, and the second deep learning model is a Gated Recurrent Unit (GRU) model.
In an embodiment, the iterative optimization method is an Adam optimizer method.
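To make the iterative optimization concrete, the following sketch applies the standard Adam update rule to a single scalar parameter. It is not the training code of the disclosure: the toy loss (theta - 3)^2, the learning rate, and the function name `adam_minimize` are assumptions chosen only to show how each iteration refines the previous estimate.

```python
import math


def adam_minimize(grad_fn, theta, steps=300, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    """Minimize a scalar function with Adam, given a function returning its gradient."""
    m = 0.0  # running mean of gradients (first moment)
    v = 0.0  # running mean of squared gradients (second moment)
    for t in range(1, steps + 1):
        g = grad_fn(theta)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)  # bias correction for the first moment
        v_hat = v / (1 - beta2 ** t)  # bias correction for the second moment
        theta -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta


# Toy loss (theta - 3)^2 with gradient 2 * (theta - 3); its minimum is at theta = 3.
theta_star = adam_minimize(lambda th: 2.0 * (th - 3.0), theta=0.0)
print(theta_star)
```

In an RNN the same update is applied element-wise to every weight, with the gradient obtained by backpropagation through time.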
In an embodiment, the RNN training module 110 further includes:
• an evaluation module 116 configured to cooperate with the classification module 112 to evaluate the performance of the trained RNN model based on classification metrics for each cancer class and generate results based on evaluation; and
• an optimization module 118 configured to cooperate with the evaluation module 116 to receive the evaluation result, and further configured to optimize the performance of the trained RNN model by adjusting hyperparameters including learning rates, number of neurons, batch sizes, number of epochs, and dropout layer.
In an embodiment, the predefined number of epochs ranges from 10 to 100 epochs.
In an embodiment, the RNN training module 110 is configured to train the RNN model on the training dataset to identify patterns in the preprocessed gene expression data that correlate with cancer prediction, and further configured to implement early stopping techniques to prevent overfitting during the training process.
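One common early stopping rule is patience-based: training halts once the validation loss has failed to improve for a fixed number of consecutive epochs. The sketch below assumes that rule; the disclosure does not specify which early stopping technique is used, and the loss history, the patience value, and the function name are hypothetical.

```python
def early_stopping_epoch(val_losses, patience=3, min_delta=0.0):
    """Return (stop_epoch, best_epoch) as 1-based epoch numbers.

    Training stops after `patience` consecutive epochs without an improvement
    of more than `min_delta` over the best validation loss seen so far.
    """
    best_loss = float("inf")
    best_epoch = 0
    waited = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss - min_delta:
            best_loss, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                return epoch, best_epoch
    # Patience never ran out: training ran for every recorded epoch.
    return len(val_losses), best_epoch


# Hypothetical validation losses, one per epoch.
history = [0.90, 0.70, 0.60, 0.62, 0.61, 0.63, 0.59]
stop_epoch, best_epoch = early_stopping_epoch(history, patience=3)
print(stop_epoch, best_epoch)
```

In practice the weights saved at `best_epoch` would be restored, so the later, slightly overfitted epochs are discarded.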
The classification module 112 is configured to cooperate with the data splitting module 108 and the RNN training module 110 to receive the testing dataset, and further configured to implement the testing dataset on the trained RNN model to classify the preprocessed gene expression data into a cancer class.
In an embodiment, the classified cancer class is selected from a group of breast cancer (BRCA), kidney renal clear cell carcinoma (KIRC), prostate adenocarcinoma (PRAD), lung adenocarcinoma (LUAD), and colon adenocarcinoma (COAD).
In an embodiment, the classification metrics include accuracy, precision, recall, and F1-score.
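The metrics named above can be computed per cancer class from the true and predicted labels, treating each class in turn as the positive class. The sketch below uses the standard definitions; the label lists are made up for illustration and are not results from the disclosure.

```python
def per_class_metrics(y_true, y_pred, classes):
    """Compute precision, recall and F1-score for each class (one-vs-rest)."""
    report = {}
    for cls in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        report[cls] = {"precision": precision, "recall": recall, "f1": f1}
    return report


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted class equals the true class."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


# Illustrative labels only.
y_true = ["BRCA", "BRCA", "KIRC", "LUAD", "LUAD", "COAD"]
y_pred = ["BRCA", "KIRC", "KIRC", "LUAD", "LUAD", "COAD"]
report = per_class_metrics(y_true, y_pred, ["BRCA", "KIRC", "LUAD", "COAD"])
overall_accuracy = accuracy(y_true, y_pred)
print(report["BRCA"], overall_accuracy)
```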
The user interface 114 is configured to cooperate with the classification module 112 to receive the classified cancer class and display the classified cancer class on a display unit 114a.
In an embodiment, the user interface 114 further comprises a visualization tool configured to display model performance metrics, such as receiver operating characteristic (ROC) curves, precision-recall curves, and other relevant charts, to aid in the analysis of the results.
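An ROC curve for a multi-class model is usually drawn one class at a time (one class versus the rest), and the area under that curve (AUC) summarizes it in a single number. The sketch below computes the AUC directly as the probability that a positive sample receives a higher score than a negative one; the scores are illustrative, and the one-vs-rest framing is an assumption rather than something the disclosure states.

```python
def roc_auc(y_true, scores):
    """Area under the ROC curve for binary labels (1 = positive, 0 = negative).

    Computed as the fraction of (positive, negative) pairs in which the
    positive sample has the higher score; ties count as half.
    """
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Illustrative one-vs-rest scores for a single cancer class.
auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(auc)
```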
In one embodiment, the user interface 114 is further configured to allow users to export the model predictions and performance metrics for further analysis or reporting.
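The export function could, for example, write the predictions as CSV. This is a sketch under that assumption; the disclosure does not name an export format, and the column names and sample identifiers below are hypothetical.

```python
import csv
import io


def export_predictions(sample_ids, predicted_classes, stream):
    """Write one CSV row per sample to an open text stream."""
    writer = csv.writer(stream, lineterminator="\n")
    writer.writerow(["sample_id", "predicted_class"])
    for sample_id, label in zip(sample_ids, predicted_classes):
        writer.writerow([sample_id, label])


# An in-memory buffer stands in for a file opened with open(path, "w", newline="").
buffer = io.StringIO()
export_predictions(["sample_1", "sample_2"], ["BRCA", "LUAD"], buffer)
csv_text = buffer.getvalue()
print(csv_text)
```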
In an embodiment, the system 100 classifies gene expression data into five different cancer classes: breast cancer (BRCA), kidney renal clear cell carcinoma (KIRC), prostate adenocarcinoma (PRAD), lung adenocarcinoma (LUAD), and colon adenocarcinoma (COAD). The features of the system are detailed below:
• Application of Recurrent Neural Networks (RNNs): The application of RNNs, particularly LSTM (Long Short-Term Memory) networks and GRUs (Gated Recurrent Units), to gene expression data. These neural network architectures are typically used in fields such as natural language processing and time-series analysis, where sequential data modelling is crucial. Utilizing the inherent sequential nature of gene expression data, which evolves over different stages of cancer, to improve classification accuracy is an unusual feature. This contrasts with the traditional machine learning models that may not fully exploit temporal dependencies in the data.
• Sequential Data Processing: Gene expression data is treated as sequential data, which is not a common approach in the field. Traditional methods often treat gene expression profiles as static vectors. By leveraging the temporal dynamics and dependencies between gene expressions, the RNN models can potentially capture subtle patterns and variations that are indicative of different cancer types.
• Multi-Class Cancer Classification: The task involves classifying gene expression data into five distinct cancer classes. Most existing approaches might focus on binary classification (e.g., cancer vs. no cancer) or differentiate between fewer cancer types. The ability to accurately distinguish between multiple cancer types using a single RNN model showcases the robustness and capability of the approach. This multi-class classification enhances the utility of the model in practical clinical diagnostics.
• Integration of Advanced Preprocessing Techniques: Preprocessing gene expression data to make it suitable for RNNs involves advanced techniques that may include normalization and sequencing. The integrated preprocessing pipeline ensures that the data fed into the RNN model retains its sequential characteristics and is optimized for learning, which is a significant improvement over traditional preprocessing methods.
• Improved Prediction Accuracy and Robustness: The use of LSTM and GRU networks improves prediction accuracy by capturing long-range dependencies and mitigating the vanishing gradient problem commonly encountered in standard RNNs.
This approach results in a more accurate and reliable model for cancer classification, which is crucial for early diagnosis and treatment planning. The robustness of the model against variations in gene expression data from different patients is an additional benefit.
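The disclosure does not state how a gene expression profile is arranged into a sequence. One possible arrangement, shown here purely as an assumption, is to split each sample's gene vector into consecutive fixed-size chunks so that every chunk becomes one time step of the RNN input; the function name `to_sequence`, the chunk size, and the zero padding are all hypothetical.

```python
def to_sequence(expression_vector, features_per_step):
    """Split one sample's gene vector into consecutive time steps.

    The tail is zero-padded so that every time step has the same width.
    """
    if features_per_step <= 0:
        raise ValueError("features_per_step must be positive")
    padded = list(expression_vector)
    remainder = len(padded) % features_per_step
    if remainder:
        padded.extend([0.0] * (features_per_step - remainder))
    return [
        padded[start:start + features_per_step]
        for start in range(0, len(padded), features_per_step)
    ]


# Five normalized expression values become three time steps of two features each.
sequence = to_sequence([0.1, 0.4, 0.3, 0.9, 0.5], features_per_step=2)
print(sequence)
```

The resulting shape (time steps, features per step) is the per-sample input layout that recurrent layers such as LSTM and GRU expect.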
Figures 2A-2B illustrate a flow chart depicting the steps involved in a method for cancer classification using sequential gene expression data in accordance with an embodiment of the present disclosure. The order in which method 200 is described is not intended to be construed as a limitation, and any number of the described method steps may be combined in any order to implement method 200, or an alternative method. Furthermore, method 200 may be implemented by processing resource or computing device(s) through any suitable hardware, non-transitory machine-readable medium/instructions, or a combination thereof. The method 200 comprises the following steps:
At step 202, the method 200 includes accessing and retrieving, by a data import module 102 comprising a network interface 102a, gene expression data from a data source, including databases, file systems, and remote or online repositories.
At step 204, the method 200 includes receiving, by an encoding module 104, the gene expression data, and applying, by a label encoder 104a of the encoding module 104, class label encoding to expression values in the gene expression data for generating an encoded gene expression data.
At step 206, the method 200 includes receiving, by a data preprocessing module 106, the encoded gene expression data, normalizing gene expression values in the encoded gene expression data by means of a set of preprocessing rules, and encoding class labels into a numerical format suitable for processing by at least one deep learning model based on the normalized gene expression values for generating a pre-processed gene expression data.
At step 208, the method 200 includes receiving, by a data splitting module 108, the preprocessed gene expression data and dividing the preprocessed data into a training dataset and a testing dataset.
At step 210, the method 200 includes cooperating, by a recurrent neural network (RNN) training module 110, with the data splitting module 108.
At step 212, the method 200 includes initializing, by the recurrent neural network (RNN) training module 110, a first deep learning model and a second deep learning model.
At step 214, the method 200 includes compiling, by the recurrent neural network (RNN) training module 110, the first deep learning model and the second deep learning model to build an RNN model using an iterative optimization method and a categorical cross-entropy loss (CCE) function.
At step 216, the method 200 includes receiving, by the recurrent neural network (RNN) training module 110, the training dataset for training the RNN model to identify long-term dependencies within the preprocessed gene expression data for a predefined number of epochs.
At step 218, the method 200 includes receiving and implementing, by a classification module 112, the testing dataset on the trained RNN model to classify the preprocessed gene expression data into a cancer class.
At step 220, the method 200 includes receiving and displaying, by a user interface 114, the classified cancer class on a display unit 114a.
Figure 3 illustrates a flowchart of testing of unseen data in accordance with an embodiment of the present disclosure. Figure 3 shows a systematic process for cancer classification using sequential gene expression data, employing deep learning models such as LSTM (Long Short-Term Memory) and GRU (Gated Recurrent Unit).
• Import Libraries: The process begins with importing the necessary libraries. These typically include libraries for data manipulation (like Pandas and NumPy), machine learning (like TensorFlow or PyTorch), and data visualization (like Matplotlib or Seaborn).
• Reading Data: After importing the libraries, the system reads the gene expression data from various sources such as CSV files, databases, or online repositories. This data forms the foundation for subsequent processing and analysis.
• Label Encoding: The next step involves converting categorical labels (e.g., cancer types) into a numerical format using label encoding techniques. This step is crucial for enabling machine learning models to process the categorical data.
• Data Preprocessing: In this stage, the data undergoes preprocessing to make it suitable for model training. Preprocessing steps typically include normalization or standardization of gene expression values, handling missing data, and possibly further encoding or transformation of features.
• Splitting Data: The preprocessed data is then divided into training and testing datasets. The training dataset is used to train the model, while the testing dataset is reserved for evaluating the model's performance.
• Build the Model: The flowchart shows the model-building phase, where two types of recurrent neural networks, LSTM and GRU, are constructed. These models are designed to capture the sequential nature of gene expression data, which is essential for accurate cancer classification.
• Compile the Model: After building the models, the next step is to compile them. This involves configuring the models with an optimizer (such as Adam or SGD), a loss function (commonly categorical cross-entropy for classification tasks), and evaluation metrics (like accuracy).
• Train the Model: The compiled models are then trained using the training dataset. During this phase, the models learn to recognize patterns in the gene expression data that are indicative of different cancer types. Training typically involves running the data through the models for multiple epochs, with iterative optimization to minimize the loss function.
• Test the Model: Finally, the trained models are tested using the testing dataset to evaluate their performance. This step helps in assessing the accuracy and generalization ability of the models in classifying cancer types based on gene expression data.
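To illustrate what the GRU constructed in the model-building step computes at each time step, the sketch below implements a single GRU cell with a scalar input and a scalar hidden state. It is a teaching example and not the model of the disclosure: real layers use weight matrices, the parameter names are hypothetical, and the blending convention used here (new state = (1 - z) * previous state + z * candidate) is one of two equivalent conventions found in the literature.

```python
import math


def sigmoid(x):
    """Logistic function, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gru_step(x, h_prev, params):
    """Advance a scalar GRU cell by one time step and return the new hidden state."""
    # Update gate: how much of the candidate state replaces the old state.
    z = sigmoid(params["w_z"] * x + params["u_z"] * h_prev + params["b_z"])
    # Reset gate: how much of the old state feeds into the candidate.
    r = sigmoid(params["w_r"] * x + params["u_r"] * h_prev + params["b_r"])
    h_candidate = math.tanh(
        params["w_h"] * x + params["u_h"] * (r * h_prev) + params["b_h"]
    )
    return (1.0 - z) * h_prev + z * h_candidate


def run_gru(inputs, params, h0=0.0):
    """Feed a sequence through the cell and return the final hidden state."""
    h = h0
    for x in inputs:
        h = gru_step(x, h, params)
    return h


# With every parameter at zero, both gates equal 0.5 and the candidate state is 0.
zero_params = {key: 0.0 for key in
               ["w_z", "u_z", "b_z", "w_r", "u_r", "b_r", "w_h", "u_h", "b_h"]}
h_next = gru_step(1.0, 0.8, zero_params)
print(h_next)
```

Because the previous state is carried forward through the additive term (1 - z) * h_prev, information and gradients can persist across many time steps, which is how gated cells mitigate the vanishing gradient problem.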
In an operative configuration, the cancer classification system 100 using sequential gene expression data is designed to perform a series of operations, starting with the retrieval of data and culminating in the classification of cancer types, with results presented to the user. The system is composed of several interconnected modules, each playing a crucial role in ensuring accurate and efficient classification. The process begins with the Data Import Module 102, which is equipped with a network interface that allows the system to access and retrieve gene expression data from various sources, including local file systems, remote servers, databases, and online repositories. This module ensures that the data, which might be in formats like CSV or JSON, is securely and efficiently fetched for further processing.
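As a minimal sketch of the import step, assuming one sample per row and a label column named `Class` (a hypothetical name, as are `gene_0` and `gene_1`), CSV or JSON text could be parsed into feature rows and labels as follows. Retrieval over a network interface is omitted; the text is supplied in memory.

```python
import csv
import io
import json

def read_expression_csv(text, label_column="Class"):
    """Parse CSV text into (feature_rows, labels); one row per sample."""
    features, labels = [], []
    for row in csv.DictReader(io.StringIO(text)):
        labels.append(row.pop(label_column))
        features.append([float(v) for v in row.values()])
    return features, labels

def read_expression_json(text, label_key="Class"):
    """Parse a JSON array of sample objects into (feature_rows, labels)."""
    features, labels = [], []
    for record in json.loads(text):
        record = dict(record)
        labels.append(record.pop(label_key))
        features.append([float(v) for v in record.values()])
    return features, labels

# Made-up sample content, for illustration only.
sample_csv = "gene_0,gene_1,Class\n2.5,0.0,BRCA\n1.0,4.2,LUAD\n"
X, y = read_expression_csv(sample_csv)
print(X)  # [[2.5, 0.0], [1.0, 4.2]]
print(y)  # ['BRCA', 'LUAD']
```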
Next, the encoding module 104 takes over, receiving the raw gene expression data and applying necessary transformations. This module includes a label encoder that converts categorical labels, particularly class labels like cancer types, into numerical formats suitable for machine learning models. This transformation is crucial as it prepares the data for the subsequent stages, ensuring that the categorical data is encoded into a format that the deep learning models can process.
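For illustration only, a label encoder of the kind described may be sketched as a mapping from class names to integer codes and back. The class name `SimpleLabelEncoder` and the sorted ordering of classes are assumptions of this example, not features stated in the disclosure.

```python
class SimpleLabelEncoder:
    """Maps class names to integer codes and back (sorted for determinism)."""

    def fit(self, labels):
        self.classes_ = sorted(set(labels))
        self._index = {name: i for i, name in enumerate(self.classes_)}
        return self

    def transform(self, labels):
        return [self._index[name] for name in labels]

    def inverse_transform(self, codes):
        return [self.classes_[c] for c in codes]

labels = ["LUAD", "BRCA", "KIRC", "BRCA"]
encoder = SimpleLabelEncoder().fit(labels)
codes = encoder.transform(labels)
print(codes)                             # [2, 0, 1, 0]
print(encoder.inverse_transform(codes))  # ['LUAD', 'BRCA', 'KIRC', 'BRCA']
```

Keeping the inverse mapping allows predicted integer codes to be converted back into readable cancer type names when results are displayed.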
The data preprocessing module 106 then receives the encoded data and applies several preprocessing steps. These include normalization of the gene expression values so that they fall within a standard range, which is essential for the effective functioning of deep learning models. The module additionally encodes the class labels into a format suitable for categorical cross-entropy loss computation, resulting in a preprocessed dataset ready for splitting.
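The two preprocessing operations can be sketched as follows, assuming min-max scaling to the range [0, 1] as the normalization and one-hot vectors as the label format used for categorical cross-entropy. Both choices are assumptions of this example; the disclosure does not fix a particular normalization.

```python
def min_max_normalize(rows):
    """Scale each column (gene) to [0, 1]; constant columns map to 0.0."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    spans = [max(c) - min(c) for c in columns]
    return [
        [(v - lo) / sp if sp else 0.0 for v, lo, sp in zip(row, lows, spans)]
        for row in rows
    ]

def one_hot(codes, num_classes):
    """Turn integer class codes into one-hot vectors."""
    return [[1.0 if k == c else 0.0 for k in range(num_classes)] for c in codes]

X = [[2.0, 10.0], [4.0, 10.0], [6.0, 30.0]]
print(min_max_normalize(X))  # [[0.0, 0.0], [0.5, 0.0], [1.0, 1.0]]
print(one_hot([2, 0], 3))    # [[0.0, 0.0, 1.0], [1.0, 0.0, 0.0]]
```

A common refinement, not stated in the disclosure, is to compute the scaling minimum and range from the training subset only and reuse them on the testing subset, so that no information from the test data influences training.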
Following preprocessing, the data splitting module 108 divides the dataset into training and testing subsets. This division is typically done in a ratio such that 70-80% of the data is used for training the model, while the remaining 20-30% is reserved for testing. The splitting is done carefully to ensure that both the training and testing sets are representative of the entire dataset, which is critical for the model's ability to generalize.
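One way to keep both subsets representative is a stratified split, in which each class contributes roughly the same fraction of its samples to the testing subset. The sketch below assumes an 80/20 ratio and a fixed random seed; the function name and parameters are hypothetical, and stratification is one possible reading of "representative" rather than a method stated in the disclosure.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_indices, test_indices) with similar class proportions."""
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # At least one test sample per class, otherwise the rounded fraction.
        n_test = max(1, round(len(indices) * test_fraction))
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

labels = ["BRCA"] * 10 + ["LUAD"] * 5
train_idx, test_idx = stratified_split(labels, test_fraction=0.2)
print(len(train_idx), len(test_idx))  # 12 3
```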
The core of the system lies in the RNN Training Module 110, where a recurrent neural network (RNN) model is trained using the training dataset. This module initializes and compiles two deep learning models, applying an iterative optimization method and categorical cross-entropy loss function. The training process involves feeding the training data into the RNN model, which then learns to identify long-term dependencies within the sequential gene expression data across multiple epochs. This learning process is vital for the model to accurately classify different types of cancer based on gene expression patterns.
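To illustrate how a gated recurrent unit carries information across a sequence, the following sketch implements the forward pass of a single GRU cell in plain Python. It is a teaching example under stated assumptions, not the trained model of the disclosure: the weights are random, the hidden size is 2, and the way a gene expression profile is arranged into time steps (here, 4 steps of 3 values each) is hypothetical. The update rule follows the convention in which an update gate value near 1 keeps the previous state.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def affine(W, x, U, h, b):
    """Compute W @ x + U @ h + b for small list-based matrices."""
    return [
        sum(w * xi for w, xi in zip(W[j], x))
        + sum(u * hi for u, hi in zip(U[j], h))
        + b[j]
        for j in range(len(b))
    ]

def gru_step(x, h_prev, p):
    """One GRU update: gates decide how much of h_prev to keep or overwrite."""
    z = [sigmoid(a) for a in affine(p["Wz"], x, p["Uz"], h_prev, p["bz"])]  # update gate
    r = [sigmoid(a) for a in affine(p["Wr"], x, p["Ur"], h_prev, p["br"])]  # reset gate
    gated = [ri * hi for ri, hi in zip(r, h_prev)]
    cand = [math.tanh(a) for a in affine(p["Wh"], x, p["Uh"], gated, p["bh"])]
    # Convention used here: z close to 1 keeps the previous state.
    return [zi * hi + (1.0 - zi) * ci for zi, hi, ci in zip(z, h_prev, cand)]

def init_params(n_in, n_hidden, seed=0):
    """Random small weights and zero biases for the three GRU transforms."""
    rng = random.Random(seed)

    def matrix(rows, cols):
        return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

    p = {}
    for gate in "zrh":
        p["W" + gate] = matrix(n_hidden, n_in)
        p["U" + gate] = matrix(n_hidden, n_hidden)
        p["b" + gate] = [0.0] * n_hidden
    return p

# A toy "sequence": 4 steps, each with 3 made-up expression values.
sequence = [[0.1, 0.5, 0.9], [0.2, 0.4, 0.8], [0.0, 0.7, 0.3], [0.6, 0.1, 0.2]]
params = init_params(n_in=3, n_hidden=2)
h = [0.0, 0.0]
for x_t in sequence:
    h = gru_step(x_t, h, params)
print(h)  # final hidden state; each value lies between -1 and 1
```

Because the new state is a gated blend of the previous state and a candidate state, information from early steps can persist over many steps when the update gate stays close to 1, which is the mechanism behind the long-term dependencies mentioned above. An LSTM achieves a similar effect with a separate cell state and three gates.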
Once the model is trained, the classification module 112 uses the testing dataset to evaluate the model's performance. The trained RNN model processes the testing data and outputs predicted class labels, representing the classified cancer types. This module may also provide confidence scores, offering insights into the model's certainty regarding its predictions.
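As a hedged sketch of how predicted class labels and confidence scores may be derived, the example below applies a softmax to a vector of raw model outputs and reports the most probable class together with its probability. The class names are those listed in claim 9; the raw scores and the helper names `classify` and `accuracy` are made up for the example.

```python
import math

def softmax(scores):
    """Turn raw model outputs into probabilities that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def classify(scores, class_names):
    """Return (predicted class name, confidence) for one sample."""
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return class_names[best], probs[best]

def accuracy(predicted, actual):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)

class_names = ["BRCA", "COAD", "KIRC", "LUAD", "PRAD"]
label, confidence = classify([0.2, 0.1, 3.0, 0.4, 0.3], class_names)
print(label, round(confidence, 3))
print(accuracy(["KIRC", "BRCA", "LUAD"], ["KIRC", "BRCA", "PRAD"]))
```

A softmax probability is only a rough indicator of certainty; it reflects the model's relative preference among the known classes and is not a calibrated clinical probability.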
Finally, the results are presented to the user through the user interface 114. This interface displays the classified cancer types and any associated confidence scores in a clear and accessible manner. The interface may include data visualization features, allowing users to explore the results in greater detail and export them for further analysis or reporting. The user interface ensures that the complex processes of data processing, model training, and classification are accessible and interpretable to the end-user, completing the system's workflow.
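The export feature mentioned above could, for example, write one row per sample to a CSV file. The sketch below is an assumption about one possible export format; the column names, sample identifiers, and values are hypothetical, and an in-memory buffer stands in for an open file.

```python
import csv
import io

def export_predictions(rows, stream):
    """Write (sample_id, predicted_class, confidence) rows as CSV."""
    writer = csv.writer(stream, lineterminator="\n")
    writer.writerow(["sample_id", "predicted_class", "confidence"])
    for sample_id, predicted_class, confidence in rows:
        writer.writerow([sample_id, predicted_class, f"{confidence:.4f}"])

# Made-up results, for illustration only.
results = [("sample_001", "BRCA", 0.9731), ("sample_002", "LUAD", 0.6125)]
buffer = io.StringIO()  # stands in for an open file
export_predictions(results, buffer)
print(buffer.getvalue())
```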

This structured approach ensures that the cancer classification system can accurately classify cancer types based on gene expression data, leveraging deep learning techniques, particularly recurrent neural networks, to handle the sequential nature of the data.
Advantageously, the system 100 offers high accuracy in identifying cancer types by leveraging Recurrent Neural Networks (RNNs) to capture long-term dependencies in sequential gene expression data. It is highly scalable and flexible, capable of handling large datasets from diverse sources and automating crucial preprocessing tasks like normalization and label encoding. The system 100 supports real-time classification and user interaction through an intuitive interface, providing immediate and interpretable results with confidence scores. Additionally, its customizable training and testing workflows, combined with iterative optimization methods, enhance model performance and adaptability, making it a powerful tool for both research and clinical applications.
The functions described herein may be implemented in hardware, software executed by a processor, firmware, or any combination thereof. If implemented in software executed by a processor, the functions may be stored on or transmitted over as one or more instructions or codes on a computer-readable medium. Other examples and implementations are within the scope and spirit of the disclosure and appended claims. For example, due to the nature of software, functions described above can be implemented using software executed by a processor, hardware, firmware, hardwiring, or combinations of any of these. Features implementing functions may also be physically located at various positions, including being distributed such that portions of functions are implemented at different physical locations.
The foregoing description of the embodiments has been provided for purposes of illustration and is not intended to limit the scope of the present disclosure. Individual components of a particular embodiment are generally not limited to that particular embodiment, but are interchangeable. Such variations are not to be regarded as a departure from the present disclosure, and all such modifications are considered to be within the scope of the present disclosure.
TECHNICAL ADVANCEMENTS
The present disclosure described herein above has several technical advantages including, but not limited to, the realization of a system and a method for cancer classification using sequential gene expression data that:
• provide real-time processing;
• provide a user-friendly interface for result interpretation;
• handle complex sequential data;
• provide robust model training through iterative optimization;
• provide scalability and flexibility in data integration;
• provide multi-class cancer classification; and
• provide an accurate and reliable model.
The embodiments herein and the various features and advantageous details thereof are explained with reference to the non-limiting embodiments in the following description. Descriptions of well-known components and processing techniques are omitted so as to not unnecessarily obscure the embodiments herein. The examples used herein are intended merely to facilitate an understanding of ways in which the embodiments herein may be practiced and to further enable those of skill in the art to practice the embodiments herein. Accordingly, the examples should not be construed as limiting the scope of the embodiments herein.
The foregoing description of the specific embodiments so fully reveals the general nature of the embodiments herein that others can, by applying current knowledge, readily modify and/or adapt for various applications such specific embodiments without departing from the generic concept, and, therefore, such adaptations and modifications should and are intended to be comprehended within the meaning and range of equivalents of the disclosed embodiments. It is to be understood that the phraseology or terminology employed herein is for the purpose of description and not of limitation. Therefore, while the embodiments herein have been described in terms of preferred embodiments, those skilled in the art will recognize that the embodiments herein can be practiced with modification within the spirit and scope of the embodiments as described herein.
The use of the expression "at least" or "at least one" suggests the use of one or more elements or ingredients or quantities, as the use may be in the embodiment of the disclosure to achieve one or more of the desired objects or results.
While considerable emphasis has been placed herein on the components and component parts of the preferred embodiments, it will be appreciated that many embodiments can be made and that many changes can be made in the preferred embodiments without departing from the principles of the disclosure. These and other changes in the preferred embodiment as well as other embodiments of the disclosure will be apparent to those skilled in the art from the disclosure herein, whereby it is to be distinctly understood that the foregoing descriptive matter is to be interpreted merely as illustrative of the disclosure and not as a limitation.
WE CLAIM:
1. A system (100) for cancer classification using sequential gene expression data, said system (100) comprises:
• a data import module (102) comprising a network interface (102a) configured to access and retrieve gene expression data from a data source, including databases, file systems, and remote or online repositories;
• an encoding module (104) configured to cooperate with said data import module (102) to receive said gene expression data, and including a label encoder (104a) to apply class label encoding to expression values in said gene expression data for generating an encoded gene expression data;
• a data preprocessing module (106) configured to cooperate with said encoding module (104) to receive said encoded gene expression data, further configured to normalize gene expression values in said encoded gene expression data by means of a set of preprocessing rules, and further configured to encode class labels into a numerical format suitable for processing by at least one deep learning model based on the normalized gene expression values for generating a pre-processed gene expression data;
• a data splitting module (108) configured to cooperate with said data preprocessing module (106) to receive said preprocessed gene expression data and divide the preprocessed data into training dataset and testing dataset;
• a recurrent neural network (RNN) training module (110) configured to cooperate with said data splitting module (108), and comprising a machine learning processor to:
o initialize a first deep learning model and a second deep learning model,
o compile the first deep learning model and the second deep learning model to build an RNN model using an iterative optimization method and a categorical cross-entropy loss (CCE) function, and
o receive said training dataset for training the RNN model to identify long-term dependencies within the preprocessed gene expression data for a predefined number of epochs;
• a classification module (112) configured to cooperate with said data splitting module (108) and said RNN training module (110) to receive said testing dataset, and further configured to implement said testing dataset on the trained RNN model to classify said preprocessed gene expression data into a cancer class; and
• a user interface (114) configured to cooperate with said classification module (112) to receive said classified cancer class and display said classified cancer class on a display unit (114a).
2. The system (100) as claimed in claim 1, wherein said RNN training module (110) further includes:
• an evaluation module (116) configured to cooperate with said classification module (112) to evaluate the performance of said trained RNN model based on classification metrics for each cancer class and generate results based on evaluation; and
• an optimization module (118) configured to cooperate with said evaluation module (116) to receive the evaluation result, and further configured to optimize the performance of said trained RNN model by adjusting hyperparameters including learning rates, number of neurons, batch sizes, number of epochs, and dropout layer.
3. The system (100) as claimed in claim 1, wherein the data preprocessing module (106) further comprises:
a. a normalization module (106a) configured to normalize the gene expression data to a specific range before being processed by the RNN training module (110); and
b. a feature extraction module (106b) configured to identify and extract relevant features from the gene expression data before being processed by the RNN training module (110).
4. The system (100) as claimed in claim 1, wherein said RNN training module (110) is configured to train said RNN model on the training dataset to identify patterns in said preprocessed gene expression data that correlate with cancer prediction, and further configured to implement early stopping techniques to prevent overfitting during the training process.
5. The system (100) as claimed in claim 1, wherein said first deep learning model is a Long Short-Term Memory (LSTM) model and said second deep learning model is a Gated Recurrent Unit (GRU) model.
6. The system (100) as claimed in claim 1, wherein said user interface (114) further comprises a visualization tool configured to display model performance metrics, such as receiver operating characteristic (ROC) curves, precision-recall curves, and other relevant charts, to aid in the analysis of the results.
7. The system (100) as claimed in claim 1, wherein said user interface (114) is further configured to allow users to export the model predictions and performance metrics for further analysis or reporting.
8. The system (100) as claimed in claim 1, wherein said set of preprocessing rules is a set of instructions used to perform normalization, scaling, or other preprocessing techniques to ensure consistency and accuracy.
9. The system (100) as claimed in claim 1, wherein said classified cancer class is selected from a group of breast cancer (BRCA), kidney renal clear cell carcinoma (KIRC), prostate adenocarcinoma (PRAD), lung adenocarcinoma (LUAD), and colon adenocarcinoma (COAD).
10. A method (200) for cancer classification using sequential gene expression data, said method (200) comprises the following steps:
• accessing and retrieving, by a data import module (102) comprising a network interface (102a), gene expression data from a data source, including databases, file systems, and remote or online repositories;
• receiving, by an encoding module (104), said gene expression data, and applying, by a label encoder (104a) of said encoding module (104), class label encoding to expression values in said gene expression data for generating an encoded gene expression data;
• receiving, by a data preprocessing module (106), said encoded gene expression data, normalizing gene expression values in said encoded gene expression data by means of a set of preprocessing rules, and encoding class labels into a numerical format suitable for processing by at least one deep learning model based on the normalized gene expression values for generating a pre-processed gene expression data;
• receiving, by a data splitting module (108), said preprocessed gene expression data and dividing the preprocessed data into a training dataset and a testing dataset;
• cooperating, by a recurrent neural network (RNN) training module (110), with said data splitting module (108);
• initializing, by said recurrent neural network (RNN) training module (110), a first deep learning model and a second deep learning model;
• compiling, by said recurrent neural network (RNN) training module (110), the first deep learning model and the second deep learning model to build an RNN model using an iterative optimization method and a categorical cross-entropy loss (CCE) function;
• receiving, by said recurrent neural network (RNN) training module (110), said training dataset for training the RNN model to identify long-term dependencies within the preprocessed gene expression data for a predefined number of epochs;
• receiving and implementing, by a classification module (112) said testing dataset on the trained RNN model to classify said preprocessed gene expression data into a cancer class; and
• receiving and displaying, by a user interface (114), said classified cancer class on a display unit (114a).
Dated this 18th Day of November, 2024

_______________________________
MOHAN RAJKUMAR DEWAN, IN/PA - 25
OF R. K. DEWAN & CO.
AUTHORIZED AGENT OF APPLICANT

TO,
THE CONTROLLER OF PATENTS
THE PATENT OFFICE, AT CHENNAI

Documents

Name | Date
202441089264-FORM-26 [19-11-2024(online)].pdf | 19/11/2024
202441089264-COMPLETE SPECIFICATION [18-11-2024(online)].pdf | 18/11/2024
202441089264-DECLARATION OF INVENTORSHIP (FORM 5) [18-11-2024(online)].pdf | 18/11/2024
202441089264-DRAWINGS [18-11-2024(online)].pdf | 18/11/2024
202441089264-EDUCATIONAL INSTITUTION(S) [18-11-2024(online)].pdf | 18/11/2024
202441089264-EVIDENCE FOR REGISTRATION UNDER SSI [18-11-2024(online)].pdf | 18/11/2024
202441089264-FORM 1 [18-11-2024(online)].pdf | 18/11/2024
202441089264-FORM 18 [18-11-2024(online)].pdf | 18/11/2024
202441089264-FORM FOR SMALL ENTITY(FORM-28) [18-11-2024(online)].pdf | 18/11/2024
202441089264-FORM-9 [18-11-2024(online)].pdf | 18/11/2024
202441089264-PROOF OF RIGHT [18-11-2024(online)].pdf | 18/11/2024
202441089264-REQUEST FOR EARLY PUBLICATION(FORM-9) [18-11-2024(online)].pdf | 18/11/2024
202441089264-REQUEST FOR EXAMINATION (FORM-18) [18-11-2024(online)].pdf | 18/11/2024
