IMAGE CAPTION GENERATOR USING VGG MODEL AND LSTM
ORDINARY APPLICATION
Published
Filed on 23 November 2024
Abstract
ABSTRACT “IMAGE CAPTION GENERATOR USING VGG MODEL AND LSTM” The present invention provides image caption generator using VGG model and LSTM. The task of crafting captions with appropriate linguistic qualities poses a significant challenge, as it demands a nuanced level of comprehension of images that extends far beyond mere categorization and recognition of objects within them. Subsequently, the user is furnished with an audio file containing a spoken rendition of the written image description, thereby offering them a perceptual understanding of their surroundings. To assess the quality and accuracy of the generated captions, the BLEU score metric is employed for evaluation purposes. The primary focus of this study revolves around the realms of image captioning, deep learning methodologies, neural networks, and the assessment of captioning accuracy through the BLEU Score metric. Figure 1
Patent Information
Field | Value |
---|---|
Application ID | 202431091365 |
Invention Field | COMPUTER SCIENCE |
Date of Application | 23/11/2024 |
Publication Number | 48/2024 |
Inventors
Name | Address | Country | Nationality |
---|---|---|---|
Anushka Raj | School of Computer Science and Engineering, Kalinga Institute of Industrial Technology (Deemed to be University), Patia Bhubaneswar Odisha India 751024 | India | India |
Abhishek Saurav | School of Computer Science and Engineering, Kalinga Institute of Industrial Technology (Deemed to be University), Patia Bhubaneswar Odisha India 751024 | India | India |
Raghav Naulakha | School of Computer Science and Engineering, Kalinga Institute of Industrial Technology (Deemed to be University), Patia Bhubaneswar Odisha India 751024 | India | India |
Akhilesh Prakash | School of Computer Science and Engineering, Kalinga Institute of Industrial Technology (Deemed to be University), Patia Bhubaneswar Odisha India 751024 | India | India |
Prof. Mohit Ranjan Panda | School of Computer Science and Engineering, Kalinga Institute of Industrial Technology (Deemed to be University), Patia Bhubaneswar Odisha India 751024 | India | India |
Applicants
Name | Address | Country | Nationality |
---|---|---|---|
Kalinga Institute of Industrial Technology (Deemed to be University) | Patia Bhubaneswar Odisha India 751024 | India | India |
Specification
Description:
TECHNICAL FIELD
[0001] The present invention relates to the field of artificial intelligence and automated systems, and more particularly, to an image caption generator using a VGG model and LSTM.
BACKGROUND ART
[0002] The following discussion of the background of the invention is intended to facilitate an understanding of the present invention. However, it should be appreciated that the discussion is not an acknowledgment or admission that any of the material referred to was published, known, or part of the common general knowledge in any jurisdiction as of the application's priority date. Any publication details provided in this background are referenced only to describe the problems and the general terminology or principles of science and technology in the associated prior art.
[0003] The aim of this project revolves around the creation of precise and suitable captions corresponding to a given image with a specific focus on ensuring that these captions accurately reflect the contextual information encapsulated within the images themselves. To accomplish this ambitious objective, modern techniques heavily depend on the use of cutting-edge technologies like convolutional neural networks (CNNs) and recurrent neural networks (RNNs) or their variations, which collectively function within an encoder-decoder framework to support the creation of precise linguistic explanations. In this intricate process, RNNs function as decoders responsible for providing detailed textual descriptions, while VGGs play a pivotal role in encoding the visual content of the image into structured feature vectors, highlighting the intricate interplay between diverse components within the system architecture.
[0004] In light of the foregoing, there is a need for an image caption generator using a VGG model and LSTM that overcomes the problems prevalent in the prior art associated with traditionally available methods or systems, and that can be used with the presently disclosed technique with or without modification.
[0005] All publications herein are incorporated by reference to the same extent as if each individual publication or patent application were specifically and individually indicated to be incorporated by reference. Where a definition or use of a term in an incorporated reference is inconsistent or contrary to the definition of that term provided herein, the definition of that term provided herein applies, and the definition of that term in the reference does not apply.
OBJECTS OF THE INVENTION
[0006] The principal object of the present invention is to overcome the disadvantages of the prior art by providing an image caption generator using a VGG model and LSTM.
[0007] Another object of the present invention is to provide an image caption generator using a VGG model and LSTM that develops an automated image caption generator leveraging a VGG16 model and LSTM for accurate caption generation.
[0008] Another object of the present invention is to provide an image caption generator using a VGG model and LSTM that employs the Flickr8k dataset for creating diverse caption descriptions, efficiently processing images for effective caption representation.
[0009] Another object of the present invention is to provide an image caption generator using a VGG model and LSTM that utilizes data generators, reducing memory usage and enhancing computational efficiency, especially with limited storage and processing power.
[0010] Another object of the present invention is to provide an image caption generator using a VGG model and LSTM that achieves high accuracy in image captioning by using BLEU scores as a performance metric, incorporating batch normalization and dropout layers to prevent overfitting.
[0011] Another object of the present invention is to provide an image caption generator using a VGG model and LSTM that establishes a framework using image feature extraction and text processing for efficient caption generation on limited hardware resources.
[0012] Another object of the present invention is to provide an image caption generator using a VGG model and LSTM whose image captioning framework can be extended to other datasets and increased in scale, allowing the use of both small and large image datasets such as Flickr8k and Flickr30k.
[0013] The foregoing and other objects of the present invention will become readily apparent upon further review of the following detailed description of the embodiments as illustrated in the accompanying drawings.
SUMMARY OF THE INVENTION
[0014] The present invention relates to image caption generator using VGG model and LSTM. The methodology employed for addressing the challenge of image captioning can be categorized into several key stages. Firstly, it involves the extraction of features from the image, followed by the loading of the pre-trained Convolutional Neural Network (CNN) VGG16 model. Additionally, there is a crucial step of preprocessing the textual data, which serves as the foundational phase within the image captioning workflow. Subsequently, the network architectures that exhibit significant potential in delivering high-quality outcomes are established to commence the training process. The preprocessed dataset is then partitioned into training and testing subsets, with the training data being employed to train the designated network architectures. Subsequently, the model undergoes evaluation using a distinct set of test data to assess its performance accurately. The project extensively leveraged Google's TensorFlow library along with its deep learning counterpart, Keras, across all phases of the project. The decision to utilize Jupyter Notebook was motivated by its user-friendly interface, providing a convenient mechanism for data storage, and its compatibility with GPUs, which significantly accelerated the training of network models.
[0015] The observation reveals that the initial stage involves preprocessing a given image, followed by the encoder processing the input image in the jpg format utilizing pre-existing computer vision models such as the VGG-16. Subsequently, the third stage entails employing the LSTM model to produce captions for the image processed by the encoder. This process enables the identification and labeling of the objects depicted in the image, culminating in the creation of a coherent and grammatically correct sentence. The decoder assumes the responsibility of merging both the image and the captions generated by the LSTM model to formulate a comprehensive caption. Despite the limited number of conversion stages, each stage necessitates an equivalent level of data processing to effectively predict a caption that meets the primary objective of this study. Initially, at the outset of the process, each image designated for captioning undergoes scrutiny with an input image in the format of .jpg being taken into consideration, accompanied by a corresponding list of words linked to said image. The previously mentioned input image is then fed into a feature extraction module where a Convolutional Neural Network (CNN) model such as VGG16 is utilized to extract unique features from the image, subsequently saving these features in a vector format. Through the utilization of layers like Dense, the node count is reduced down to 256, aiming to streamline the data processing. Conversely, the list of associated words is processed by a sequence processor, which in turn forwards this list as input to the system; here, the embedding layer effectively reduces the dimensions to 256 nodes before passing the data onto an LSTM layer, thereby generating a more precise and contextually accurate sentence corresponding to the image in question. The resultant outputs from the feature extractor and the Sequence processor are then amalgamated via a Dense layer, with a SoftMax layer serving as the activation function prior to reaching the output layer, ultimately culminating in the provision of a coherent and effective caption for the image.
[0016] The study proposes an enhanced picture captioning model based on encoder-decoder technology, aiming for versatility in recognizing objects and colors, particularly beneficial for visually impaired individuals. It analyzes training procedures, metrics, and model adaptability, emphasizing the importance of diverse datasets for optimal performance. Future directions include expanding datasets and possibly integrating a generative adversarial network (GAN) to improve caption accuracy, highlighting the role of deep learning in fostering inclusivity and accessibility.
[0017] While the invention has been described and shown with reference to the preferred embodiment, it will be apparent that variations might be possible that would fall within the scope of the present invention.
BRIEF DESCRIPTION OF DRAWINGS
[0018] So that the manner in which the above-recited features of the present invention can be understood in detail, a more particular description of the invention, briefly summarized above, may have been referred by embodiments, some of which are illustrated in the appended drawings. It is to be noted, however, that the appended drawings illustrate only typical embodiments of this invention and are therefore not to be considered limiting of its scope, for the invention may admit to other equally effective embodiments.
[0019] These and other features, benefits, and advantages of the present invention will become apparent by reference to the following text figure, with like reference numbers referring to like structures across the views, wherein:
[0020] Fig. 2: Schematic representation of the image captioning model.
[0021] Fig. 3: Bi-directional multi-model diagram.
[0022] Fig. 4: Dataset collection.
[0023] Fig. 5: Captions dataset.
[0024] Fig. 6: Summary of VGG16 model layers.
[0025] Fig. 7: LSTM layer.
[0026] Fig. 8: LSTM architecture.
[0027] Fig. 9: Training loss over epochs.
[0028] Fig. 10: Block diagram.
[0029] Fig. 11: BLEU score.
DETAILED DESCRIPTION OF THE INVENTION
[0030] While the present invention is described herein by way of example using embodiments and illustrative drawings, those skilled in the art will recognize that the invention is not limited to the embodiments or drawings described, which are not intended to represent the scale of the various components. Further, some components that may form a part of the invention may not be illustrated in certain figures, for ease of illustration, and such omissions do not limit the embodiments outlined in any way. It should be understood that the drawings and the detailed description thereto are not intended to limit the invention to the particular form disclosed, but on the contrary, the invention is to cover all modifications, equivalents, and alternatives falling within the scope of the present invention as defined by the appended claims.
[0031] As used throughout this description, the word "may" is used in a permissive sense (i.e. meaning having the potential to), rather than the mandatory sense, (i.e. meaning must). Further, the words "a" or "an" mean "at least one" and the word "plurality" means "one or more" unless otherwise mentioned. Furthermore, the terminology and phraseology used herein are solely used for descriptive purposes and should not be construed as limiting in scope. Language such as "including," "comprising," "having," "containing," or "involving," and variations thereof, is intended to be broad and encompass the subject matter listed thereafter, equivalents, and additional subject matter not recited, and is not intended to exclude other additives, components, integers, or steps. Likewise, the term "comprising" is considered synonymous with the terms "including" or "containing" for applicable legal purposes. Any discussion of documents, acts, materials, devices, articles, and the like are included in the specification solely for the purpose of providing a context for the present invention. It is not suggested or represented that any or all these matters form part of the prior art base or were common general knowledge in the field relevant to the present invention.
[0032] In this disclosure, whenever a composition or an element or a group of elements is preceded with the transitional phrase "comprising", it is understood that we also contemplate the same composition, element, or group of elements with the transitional phrases "consisting of", "consisting", "selected from the group consisting of", "including", or "is" preceding the recitation of the composition, element, or group of elements, and vice versa.
[0033] The present invention is described hereinafter by various embodiments with reference to the accompanying drawing, wherein reference numerals used in the accompanying drawing correspond to the like elements throughout the description. This invention may, however, be embodied in many different forms and should not be construed as limited to the embodiment set forth herein. Rather, the embodiment is provided so that this disclosure will be thorough and complete and will fully convey the scope of the invention to those skilled in the art. In the following detailed description, numeric values and ranges are provided for various aspects of the implementations described. These values and ranges are to be treated as examples only and are not intended to limit the scope of the claims. In addition, several materials are identified as suitable for various facets of the implementations. These materials are to be treated as exemplary and are not intended to limit the scope of the invention.
[0034] The present invention relates to an image caption generator using a VGG model and LSTM. The dataset used in this framework comprises 8,000 images obtained from Kaggle for caption-focused image representation and investigation, each paired with five distinct captions that emphasize the main aspects and interactions within the image. This collection is commonly known as Flickr8k. The images generally do not contain well-known people or locations and were manually chosen to encompass a range of events and scenarios from six discrete Flickr groups. Flickr datasets consist of images, and the size of the dataset can be chosen based on the platform. Flickr8k and Flickr30k are the two most prevalent image captioning training datasets in the computer vision research domain; they comprise 8,091 and 31,000 caption-annotated images respectively, with each image associated with five diverse descriptions. Owing to our restricted storage capacity and computational capabilities, we presently opt for the Flickr8k dataset, which features the fewest images, as our primary data source.
[0035] Importing Necessary Modules: Initially, we import multiple libraries, including those essential for data representation, visualization, and configuration, as well as those required for model execution. All these libraries are employed at various stages of the project: each package plays a role in building the models, tracking progress with tqdm, and constructing the graphs we use. For later use, we also import libraries such as Keras.
- NumPy: For numerical and computational operations.
- tqdm: Provides a smart progress bar for loop iterations to indicate advancement.
- Pickle: Used for serializing and deserializing Python object structures.
- TensorFlow: A Python package that includes TensorFlow optimizer implementations for constructing machine learning models.
- Keras: A high-level interface focused on modern deep learning, for tackling machine learning problems efficiently.
- Matplotlib: A plotting library for the Python programming language. The pyplot interface applies modifications to generated figures, establishes a plotting region within a figure, and decorates plots with labels.
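By way of illustration only, a minimal import block covering the modules listed above might look as follows; the exact submodule paths assume a TensorFlow 2.x / tf.keras environment, which is an assumption rather than something stated in the filing:

```python
# Illustrative imports for the packages listed above (TensorFlow 2.x assumed).
import os
import pickle

import numpy as np
import matplotlib.pyplot as plt
from tqdm import tqdm

import tensorflow as tf
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input
from tensorflow.keras.preprocessing.image import load_img, img_to_array
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.layers import Input, Dense, Dropout, Embedding, LSTM, add
from tensorflow.keras.models import Model
from tensorflow.keras.utils import to_categorical
```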
[0036] Import Data Directory and Work Directory: We set the base directory to the location of the Flickr8k collection, designating where the Flickr dataset is stored, and set the working directory to the location of the code, denoting the present directory where the code is run.
[0037] Extracting Features from an Image: In this section, the VGG16 model is imported first, followed by the establishment of an image mapping and the loading of each image from the directory. The image is then resized to 224 by 224 pixels. After resizing, the image is converted into a NumPy array. This data must be reshaped to extract the characteristics essential for model formulation; since the image is in RGB format, it has three channels. The image is then preprocessed for the VGG16 model to obtain its inherent characteristics, i.e. the features extracted for that specific image. Each image is assigned an image identifier so that its features can be retained. Every image in the dataset undergoes this process, which demands a considerable amount of time to complete. To expedite it, the use of a GPU is crucial, enabling simultaneous processing of multiple images rather than a sequential approach. After feature extraction, the features are stored in a pickle file to preserve the established mapping. These features can be readily accessed whenever needed, resulting in significant time savings. This phase employs a predefined CNN network model renowned for its precision in addressing computer vision tasks.
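A hedged sketch of this feature-extraction step, reusing the imports shown earlier and assuming the Flickr8k images live in an Images folder inside a base directory (the directory and file names here are illustrative, not taken from the filing):

```python
BASE_DIR = "flickr8k"   # assumed location of the Flickr8k dataset
WORK_DIR = "."          # assumed working directory for saved artifacts

# Load VGG16 and drop its final classification layer; the remaining fc2
# layer produces a 4096-dimensional feature vector per image.
vgg = VGG16()
vgg = Model(inputs=vgg.inputs, outputs=vgg.layers[-2].output)

features = {}
image_dir = os.path.join(BASE_DIR, "Images")
for name in tqdm(os.listdir(image_dir)):
    img = load_img(os.path.join(image_dir, name), target_size=(224, 224))
    arr = img_to_array(img)                    # (224, 224, 3) RGB array
    arr = arr.reshape((1, 224, 224, 3))        # add a batch dimension
    arr = preprocess_input(arr)                # VGG16-specific preprocessing
    feature = vgg.predict(arr, verbose=0)      # shape (1, 4096)
    image_id = name.split(".")[0]              # identifier without .jpg
    features[image_id] = feature

# Persist the id-to-feature mapping so it can be reloaded without recomputation.
with open(os.path.join(WORK_DIR, "features.pkl"), "wb") as f:
    pickle.dump(features, f)
```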
[0038] Text Data Preprocessing: We read the textual information contained in the captions.txt document and split the captions sequentially. Following the division of the captions, each picture is linked with five different captions. Using the image identifier, we construct a mapping dictionary whose keys are the image titles without the .jpg suffix and whose values are the collections of captions linked to each image. Hence, for 8,091 pictures, each having five captions, we have a total of 40,455 captions to process. Subsequently, the mapping dictionary is populated with captions, and each caption is systematically preprocessed. First, the captions are converted to lowercase; then all special characters are removed so that sentences read fluently without them. Any excess spaces between words are also eliminated. Adding start and end tags helps the model determine where a sentence begins and ends.
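A minimal sketch of this preprocessing, assuming a captions.txt file whose lines have the form "image_name.jpg,caption" with a header row (the exact file layout is an assumption):

```python
import re

# Build the image-id -> list-of-captions mapping.
mapping = {}
with open(os.path.join(BASE_DIR, "captions.txt"), "r") as f:
    next(f)                                      # skip header row, if present
    for line in f:
        image_name, caption = line.strip().split(",", 1)
        image_id = image_name.split(".")[0]      # drop the .jpg suffix
        mapping.setdefault(image_id, []).append(caption)

def clean_caption(caption):
    caption = caption.lower()
    caption = re.sub(r"[^a-z ]", "", caption)        # remove special characters
    caption = re.sub(r"\s+", " ", caption).strip()   # collapse extra spaces
    return "startseq " + caption + " endseq"         # sentence start/end tags

for image_id in mapping:
    mapping[image_id] = [clean_caption(c) for c in mapping[image_id]]
```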
[0039] Create Data Generator Function: The data has been divided into two subsets, a training set and a testing set, where 90 percent of the dataset is allocated for training the model and the remaining 10 percent is reserved for testing its performance. To ensure that the model can be trained without overwhelming memory resources, a data generator function is implemented, which loads data into the model incrementally instead of all at once. This prevents excessive memory occupation and reduces the likelihood of system crashes during training. The caption generator operates in a loop that analyzes each caption individually and employs a Tokenizer to transform the textual data into a sequence of encoded tokens. Subsequently, the input sequence is padded to a fixed length, and the output sequence is encoded for processing. The resulting output sequence is then stored for future reference.
[0040] Within the function, an infinite loop is established along with three distinct lists to facilitate the encoding and processing of sequences. These sequences are segmented into X, y pairs, where the input sequence is padded and the output sequence is encoded accordingly. These processed sequences are retained to generate new captions within the model. A data generator is employed for both training and validation purposes, ensuring that the model is trained effectively using the data generated. The data generator function plays a crucial role in producing batches of training and validation data to feed into the model during the training phase. The rationale behind utilizing a generator function lies in the need to have flexibility in modifying the data input for the models to enhance performance.
[0041] Given the impracticality of using a predetermined data shape, a specialized output sequence is constructed to align with the requirements of the prediction process. The data generator iterates through the data batch-size times, retrieving the captions for each iteration and creating a new window over the number of captions. Subsequently, a sequence variable is assigned to each caption, leading to another iteration window that spans the maximum caption length (31 elements). This window sequentially populates the training inputs and labels across the 31 time steps required by the LSTM. An output sequence (out_seq) is formulated as a one-hot-encoded vector representing the predicted output of the final softmax layer, while an input sequence (in_seq) comprises the last k words of the caption padded with zeros. These sequences are defined and added to a list during each iteration within this window.
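A minimal generator along the lines described in paragraphs [0039] to [0041] might be sketched as follows; the variable names (tokenizer, max_length, vocab_size) and the batching scheme are illustrative assumptions rather than details taken from the filing:

```python
def data_generator(image_ids, mapping, features, tokenizer,
                   max_length, vocab_size, batch_size):
    X1, X2, y = [], [], []
    n = 0
    while True:                                       # infinite loop, as described
        for image_id in image_ids:
            n += 1
            for caption in mapping[image_id]:
                seq = tokenizer.texts_to_sequences([caption])[0]
                for i in range(1, len(seq)):
                    in_seq, out_seq = seq[:i], seq[i]
                    in_seq = pad_sequences([in_seq], maxlen=max_length)[0]
                    out_seq = to_categorical([out_seq], num_classes=vocab_size)[0]
                    X1.append(features[image_id][0])  # image feature vector
                    X2.append(in_seq)                 # padded partial caption
                    y.append(out_seq)                 # one-hot next word
            if n == batch_size:
                yield (np.array(X1), np.array(X2)), np.array(y)
                X1, X2, y = [], [], []
                n = 0
```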
[0042] Model Creation: Here we create the model layers. Initially, we establish the image feature layers with an input size of 4096, as recommended for the VGG model. We also create the sequence feature layers, which employ an LSTM layer, a Dropout layer, a Dense layer, and others. This constitutes the encoder part; moving to the decoder, we use Dense layers, and the model design is formulated, as sketched below.
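Read together with the encoder and decoder descriptions in paragraphs [0046] to [0055] below, the architecture can be sketched with the Keras functional API roughly as follows. The layer sizes follow the 4096/256 values stated in the specification; the dropout rates and the names vocab_size and max_length (vocabulary size and maximum caption length) are assumptions:

```python
# Encoder branch 1: image feature layers (4096-dimensional VGG16 features).
inputs1 = Input(shape=(4096,))
fe1 = Dropout(0.4)(inputs1)                    # dropout rate is an assumption
fe2 = Dense(256, activation="relu")(fe1)

# Encoder branch 2: sequence feature layers (embedding + LSTM over tokens).
inputs2 = Input(shape=(max_length,))
se1 = Embedding(vocab_size, 256, mask_zero=True)(inputs2)
se2 = Dropout(0.4)(se1)
se3 = LSTM(256)(se2)

# Decoder: merge both branches and predict the next word over the vocabulary.
decoder1 = add([fe2, se3])
decoder2 = Dense(256, activation="relu")(decoder1)
outputs = Dense(vocab_size, activation="softmax")(decoder2)

model = Model(inputs=[inputs1, inputs2], outputs=outputs)
model.compile(loss="categorical_crossentropy", optimizer="adam")
```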
[0043] Herein, the process of model training encompasses the meticulous configuration of various parameters such as epochs, batch size, among others. The preparation of a data generator beforehand proves to be advantageous during the training phase, as it efficiently manages memory usage. Upon completion of the training process, the model is then saved for future use and reference.
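A hedged training sketch combining the generator and model above; the epoch count of 20 appears later in the specification, while the batch size, the train_ids split, and the checkpoint naming are assumptions:

```python
epochs = 20
batch_size = 32                                   # assumed batch size
steps = len(train_ids) // batch_size              # train_ids: assumed 90% split

for epoch in range(epochs):
    generator = data_generator(train_ids, mapping, features, tokenizer,
                               max_length, vocab_size, batch_size)
    model.fit(generator, epochs=1, steps_per_epoch=steps, verbose=1)
    model.save(f"model_epoch_{epoch}.h5")         # save a checkpoint every epoch
```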
[0044] Additionally, an alternative model has been developed through the modification of specific layers, with a particular focus on comparing Batch normalization and layer normalization methodologies. Notably, Batch normalization is widely utilized, especially within Convolutional Neural Networks (CNNs). Despite the preprocessing of input images, the iterative nature of training instigates changes in parameters, thereby necessitating a fresh normalization approach for layer inputs. Moreover, the incorporation of Batch normalization renders models more resilient to high learning rates and suboptimal parameter initialization, thereby mitigating challenges associated with saturating nonlinearities.
[0045] On the other hand, layer normalization has emerged as a viable alternative, showcasing superiority over batch normalization particularly within Recurrent Neural Networks (RNNs). The implementation of normalization techniques in RNNs, especially layer normalization, is deemed more straightforward and efficient. Furthermore, the performance comparison between layer normalization and batch normalization in RNNs distinctly favors the former, as it not only streamlines training procedures but also enhances the establishment of hidden states within the network.
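As a purely illustrative comparison (the exact placement of the normalization layers in the alternative model is not stated in the filing), swapping the two techniques in Keras is a one-line change on a given branch:

```python
from tensorflow.keras.layers import BatchNormalization, LayerNormalization

features_in = Input(shape=(4096,))
h = Dense(256, activation="relu")(features_in)
h_bn = BatchNormalization()(h)   # variant typically used with CNN-style features
h_ln = LayerNormalization()(h)   # variant often preferred for recurrent branches
```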
[0046] ENCODER MODEL: The encoder is required to extract visual attributes of varying dimensions and encode them into a vector space for subsequent presentation to Recurrent Neural Networks (RNNs). The model expects input features of length 4096.
[0047] The Dropout layer serves as a mechanism to mask the impact of certain neurons on the subsequent layer while preserving the remaining neurons unchanged. Utilizing a Dropout layer on the input vector results in the suppression of specific features; alternatively, it can be implemented on a hidden layer to neutralize particular hidden neurons. The addition of Dropout layers in Convolutional Neural Network (CNN) training is pivotal in addressing the risk of overfitting to the training dataset. In the absence of Dropout layers, the initial set of training instances exerts a disproportionately high influence on the learning process. This, in turn, would prevent the learning of features that appear only in later samples or batches.
[0048] A Dense layer with 256 units and a ReLU activation function is applied to the output of the dropout layer, further processing the image features. Sequential data, such as text or time-series data, is processed by the sequence feature layers in the model.
[0049] The embedding layer converts integer-encoded tokens into dense vectors of fixed size, called embeddings. Each token is represented by a unique vector, and similar tokens have similar embeddings. This layer helps the model learn meaningful representations of words. Dropout is applied after the embedding layer to prevent overfitting. It randomly drops a fraction of the input units during training, forcing the model to learn more robust representations.
[0050] An LSTM layer with 256 units is applied to the output of the dropout layer to process the sequence features.
[0051] Long Short-Term Memory (LSTM) is a type of recurrent neural network (RNN) architecture designed to address the vanishing gradient problem and capture long-term dependencies in sequential data.
[0052] The mitigation of overfitting is achieved by incorporating the dropout layer into the model. Also termed a fully connected layer, a dense layer is an important component of a neural network; it is called "dense" because each neuron in the preceding layer connects directly to every neuron in the subsequent layer.
[0053] The rectified linear activation function, or ReLU, is characterized by its piecewise linear behavior: it passes a positive input through unchanged, while any negative input produces an output of zero. ReLU has emerged as the preferred activation function in many neural network applications owing to the easier model training and superior performance it affords.
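In the usual notation, this piecewise linear behavior is written as:

```latex
f(x) = \max(0, x) =
\begin{cases}
x & \text{if } x > 0,\\
0 & \text{otherwise.}
\end{cases}
```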
[0054] An RNN is a type of neural network for sequential data. It is used in language translation, NLP, speech recognition, and image captioning. RNNs, like CNNs, learn from training data, and they have memory. They differ from traditional networks by considering prior inputs: RNN outputs depend on prior elements, unlike those of traditional networks. Unidirectional RNNs cannot make predictions based on future events.
[0055] DECODER MODEL: The decoder takes combined features as input, processes them through a dense layer with ReLU activation, and then generates a probability distribution over the vocabulary using a softmax activation function. The model is compiled with categorical cross-entropy loss and Adam optimizer for training.
[0056] Mathematically, softmax normalizes the decoder's output scores into a probability distribution over the vocabulary (its standard form is given below). The data generator incorporates the loaded textual descriptions, photo features, tokenizer, and maximum caption length; the fit_generator() function can then be applied to the model for training with this data generator. Upon completion, the model had been trained for 20 epochs, and it was saved after each training epoch. The models with the lowest loss are identified as the optimal models and are subsequently used for evaluation and testing.
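The softmax definition itself does not appear in the published text; the standard form, stated here for reference, is:

```latex
\mathrm{softmax}(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{V} e^{z_j}}, \qquad i = 1, \dots, V,
```

where z is the vector of raw scores produced by the final dense layer and V is the vocabulary size.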
[0057] Training and Testing the Model: The division between training and testing datasets holds great significance within the realm of machine learning and data analysis as it serves as a critical tool for assessing the efficacy of predictive models. This division enables the evaluation of model performance through the training of the model on one segment of the data (training set) and the subsequent assessment of its performance on the other segment (test set), offering insights into its ability to generalize to new and unseen datasets. One of the primary benefits of this practice is its role in preventing overfitting. Additionally, the test set helps in adjusting model hyperparameters while ensuring the credibility of the evaluation process, thereby boosting model performance. Furthermore, the separation of data allows for the confirmation of assumptions formed during model training, with a high-performing model on the test set indicating the validity of the underlying assumptions. The application of the train-test split is a foundational principle in model development, presenting a reliable method for evaluating the performance of machine learning algorithms on data that has not been previously seen.
[0058] EVALUATION: Although the pipeline has been established, there remains a lack of clarity regarding its complete accuracy and effectiveness in delivering the intended learning outcomes. It is essential to proceed with the next phase, which involves the training of our model to guarantee its ability to acquire knowledge accurately and effectively. Once the model has achieved a high level of accuracy, there is still the requirement to complete another crucial aspect of this project. The most optimal method to showcase the effectiveness of our caption generator is by engaging in the process of sentence generation and conducting a thorough evaluation of its performance.
[0059] Evaluation Metrics: Evaluation criteria play a vital role in assessing the quality of text generation in natural language processing tasks. We utilize BLEU (Bilingual Evaluation Understudy), a metric that compares a group of candidate sentences against source ground-truth sentences. It evaluates the resemblance between the candidate and reference text in terms of n-grams and is commonly used to judge the quality of machine-generated sentences. The BLEU score lies between 0 and 1, with higher values indicating closer agreement with the references. In the BLEU equation, BP stands for the brevity penalty factor; it penalizes sentences that are too short, to prevent trivially short outputs (the standard form of the equation is given below). BLEU-1 is among the most commonly used variants; in addition to this unigram-based evaluation, the BLEU-2 metric is also considered essential for determining how effective a model is at producing text. Moving forward, it is crucial to explore additional strategies aimed at optimizing the performance of our model and aligning it more effectively with the outcomes reported in the reference articles.
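The BLEU equation referred to above is not reproduced in the published text; its standard form, stated here for reference, is:

```latex
\mathrm{BLEU} = \mathrm{BP} \cdot \exp\!\left(\sum_{n=1}^{N} w_n \log p_n\right),
\qquad
\mathrm{BP} =
\begin{cases}
1 & \text{if } c > r,\\
e^{\,1 - r/c} & \text{if } c \le r,
\end{cases}
```

where p_n is the modified n-gram precision, w_n the weight assigned to each n-gram order (typically 1/N), c the length of the candidate sentence, and r the effective reference length.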
[0060] Model Performance via BLEU-1 and BLEU-2 scores: Currently, our team has achieved a BLEU-1 score of 0.55, the highest score attained within 20 epochs, noting that the BLEU-1 metric operates within a defined range of 0 to 1. Comparison with the BLEU-2 score of 0.33 for the Flickr8k dataset reveals that our model's performance falls below that of a cutting-edge caption generator. The feedback provided in the critique of our progress report prompted a strategic decision to refrain from allocating excessive time and resources merely to raising our BLEU-1 score to match the results outlined in a specific article, given that those results may have been subject to multiple alterations by various individuals, possibly incorporating unverified methodologies.
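For context, BLEU-1 and BLEU-2 scores of this kind are commonly computed with NLTK's corpus_bleu; the snippet below is a hedged illustration rather than the filing's own code, and assumes actual and predicted hold tokenized reference and generated captions:

```python
from nltk.translate.bleu_score import corpus_bleu

# actual: list of reference lists, e.g. [[["a", "dog", "runs"], ...], ...]
# predicted: list of generated captions, e.g. [["a", "dog", "is", "running"], ...]
bleu1 = corpus_bleu(actual, predicted, weights=(1.0, 0, 0, 0))
bleu2 = corpus_bleu(actual, predicted, weights=(0.5, 0.5, 0, 0))
print(f"BLEU-1: {bleu1:.2f}  BLEU-2: {bleu2:.2f}")
```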
[0061] RESULT AND DISCUSSION: The process of encoding data into integers involves the utilization of a dictionary containing the various words, as mentioned previously. After the data is consumed, the resulting output within the pipeline is also presented in encoded form. Subsequently, the encoded output must be translated back into understandable English phrases for human comprehension. It is important to note that the RNN network generates an output consisting of a sequence of word likelihoods, which plays a crucial role in the overall process. Simply opting for the word with the highest likelihood at each stage of decoding within an RNN typically yields suboptimal outcomes; instead, the widely favored LSTM is utilized to determine the most effective strategy for interpreting words within natural language constructs. Presented below are several captions that have been generated for test photographs. It is evident that certain captions automatically generated by the network tend to omit essential details of the image, while in other cases there are inaccuracies in the identification of particular visual elements.
[0062] Various modifications to these embodiments will be apparent to those skilled in the art from the description and the accompanying drawings. The principles associated with the various embodiments described herein may be applied to other embodiments. Therefore, the description is not intended to be limited to the embodiments shown along with the accompanying drawings but is to be accorded the broadest scope consistent with the principles and the novel and inventive features disclosed or suggested herein. Accordingly, the invention is anticipated to embrace all such alternatives, modifications, and variations that fall within the scope of the present invention and the appended claims.
CLAIMS
We Claim:
1) A method for generating captions for images, the method comprising:
- Importing an image dataset and extracting image features using a VGG16 convolutional neural network (CNN),
- Processing text captions using natural language preprocessing techniques to establish a mapping between images and text descriptions,
- Utilizing an LSTM-based Recurrent Neural Network (RNN) to generate sequential text data based on the extracted image features, and
- Creating a prediction model that outputs captions based on combined CNN and LSTM feature data, wherein said method improves the accuracy and relevance of generated captions.
2) The method as claimed in claim 1, wherein the VGG16 CNN is configured to resize each input image to a 224x224 pixel dimension for standardization, enhancing the model's ability to extract meaningful visual features from each image.
3) The method as claimed in claim 1, wherein the LSTM model processes text input after tokenizing and padding the sequences, thereby allowing the RNN to maintain temporal consistency and context between words within captions.
4) The method as claimed in claim 1, wherein the method further comprises the use of a dropout layer in both the CNN and RNN components to mitigate overfitting, wherein the dropout layer randomly deactivates neurons to enhance generalization and model robustness.
5) A caption generation system, the system comprising:
- A feature extraction module using a pre-trained VGG16 CNN model to obtain feature vectors of images,
- A text processing module that reads captions, processes them into lowercase, removes special characters, and tokenizes each word to form a sequence,
- A data generator function that provides training and testing batches, configured to handle input-output pairs for model training,
- An LSTM-based RNN module configured to generate text descriptions for images by predicting word sequences based on previously generated words.
6) The system as claimed in claim 5, wherein the model architecture further includes a fully connected dense layer using a ReLU activation function, positioned after the feature extraction module to ensure effective integration of image features before processing in the LSTM network.
7) The system as claimed in claim 5, wherein the data generator function divides the dataset into training and testing subsets, ensuring 90% of the data is used for training, while 10% is used for validation to optimize model accuracy and generalization.
Documents
Name | Date |
---|---|
202431091365-COMPLETE SPECIFICATION [23-11-2024(online)].pdf | 23/11/2024 |
202431091365-DECLARATION OF INVENTORSHIP (FORM 5) [23-11-2024(online)].pdf | 23/11/2024 |
202431091365-DRAWINGS [23-11-2024(online)].pdf | 23/11/2024 |
202431091365-EDUCATIONAL INSTITUTION(S) [23-11-2024(online)].pdf | 23/11/2024 |
202431091365-EVIDENCE FOR REGISTRATION UNDER SSI [23-11-2024(online)].pdf | 23/11/2024 |
202431091365-EVIDENCE FOR REGISTRATION UNDER SSI(FORM-28) [23-11-2024(online)].pdf | 23/11/2024 |
202431091365-FORM 1 [23-11-2024(online)].pdf | 23/11/2024 |
202431091365-FORM FOR SMALL ENTITY(FORM-28) [23-11-2024(online)].pdf | 23/11/2024 |
202431091365-FORM-9 [23-11-2024(online)].pdf | 23/11/2024 |
202431091365-POWER OF AUTHORITY [23-11-2024(online)].pdf | 23/11/2024 |
202431091365-REQUEST FOR EARLY PUBLICATION(FORM-9) [23-11-2024(online)].pdf | 23/11/2024 |