SIGN LANGUAGE CONVERSION INTO VOICE AND TEXT BY CNN

ORDINARY APPLICATION
Status: Published
Filed on 4 November 2024

Abstract

ABSTRACT OF THE INVENTION: Sign Language conversion into voice and text using CNN. People in the speech and hearing-impaired community face difficulty in conveying their needs through sign language to other people in public places such as hospitals, police stations, and railway stations. At places of inquiry, the public has only a short period to listen to a person's needs, but speech and hearing-impaired people are not able to convey their needs in that time, and others do not have enough time to understand them. This creates a communication gap between the two groups: the impaired person's needs go unmet, which builds low self-esteem and leads impaired people to withdraw from conversations. This can be solved by converting sign language into text and voice, making it easy for both hearing and impaired people to understand and satisfy their needs in a short time. The approach involves gathering a diverse dataset of sign language gestures recorded through video or sensors, covering various gestures, expressions, and variations; preparing the data for training, which may involve normalization, segmentation of gestures, and annotation of the corresponding textual or spoken language; and using the collected and preprocessed data to train a CNN model to recognize and classify different sign language gestures. The CNN learns to extract features and patterns from the visual input. Once the CNN model identifies a gesture, it converts it into either text or voice. This translation step can involve mapping recognized gestures to a pre-defined vocabulary or converting them into spoken language through text-to-speech systems. In this way, both the speech and hearing-impaired community and the general public have a solution to their problem.

Patent Information

Application ID: 202441083988
Invention Field: PHYSICS
Date of Application: 04/11/2024
Publication Number: 46/2024

Inventors

Name | Address | Country | Nationality
Reshma B S | SRI SAI RAM ENGINEERING COLLEGE, SAI LEO NAGAR, WEST TAMBARAM, CHENNAI-600044 | India | India
Karthika Auraum AS | DEPARTMENT OF INFORMATION TECHNOLOGY, SRI SAI RAM ENGINEERING COLLEGE, SAI LEO NAGAR, WEST TAMBARAM, CHENNAI-600044 | India | India
Saradha K R | DEPARTMENT OF INFORMATION TECHNOLOGY, SRI SAI RAM ENGINEERING COLLEGE, SAI LEO NAGAR, WEST TAMBARAM, CHENNAI-600044 | India | India

Applicants

Name | Address | Country | Nationality
SRI SAI RAM ENGINEERING COLLEGE | SARADHA K R, ASSISTANT PROFESSOR, DEPARTMENT OF INFORMATION TECHNOLOGY, SRI SAI RAM ENGINEERING COLLEGE, SAI LEO NAGAR, WEST TAMBARAM, CHENNAI-600044, 9843727494, saradha.it@edu.in | India | India
Reshma B S | DEPARTMENT OF INFORMATION TECHNOLOGY, SRI SAI RAM ENGINEERING COLLEGE, SAI LEO NAGAR, WEST TAMBARAM, CHENNAI-600044 | India | India
Karthika Auraum AS | DEPARTMENT OF INFORMATION TECHNOLOGY, SRI SAI RAM ENGINEERING COLLEGE, SAI LEO NAGAR, WEST TAMBARAM, CHENNAI-600044 | India | India
Saradha K R | DEPARTMENT OF INFORMATION TECHNOLOGY, SRI SAI RAM ENGINEERING COLLEGE, SAI LEO NAGAR, WEST TAMBARAM, CHENNAI-600044 | India | India

Specification

COMPLETE SPECIFICATION
FIELD DESCRIPTION
The invention relates to the domain of deep learning and communication systems for the speech and
hearing-impaired community. Specifically, it focuses on gesture recognition and conversion
technologies that use Convolutional Neural Networks (CNN) for real-time translation of sign language
into text and voice outputs. This invention bridges communication gaps by enabling the seamless
interpretation of sign language and facilitating interaction between the impaired community and the
general public in various public and social settings, such as hospitals, police stations, and service
centers. It integrates machine learning, computer vision, and speech synthesis technologies for
accessible, inclusive communication.
BACKGROUND OF THE INVENTION
Communication is a fundamental human need, but for individuals who are speech- and hearing-impaired,
effective communication with the general public can often be challenging. Sign language is
the primary mode of communication for many in the hearing- and speech-impaired community.
However, most people who do not have such impairments are unfamiliar with sign language, which
leads to a significant communication gap between these two groups. This gap is especially problematic
in public spaces where timely and clear communication is critical, such as hospitals, police stations,
and transport hubs. In such settings, speech and hearing-impaired individuals may find it difficult to
convey their needs, often resulting in misunderstandings, frustration, and unaddressed needs.
The lack of understanding of sign language by the majority of the population leads to communication
barriers that can have profound consequences for the speech- and hearing-impaired community:
• In hospitals, for instance, an inability to communicate medical symptoms or concerns quickly
and accurately can delay treatment or lead to incorrect diagnoses.
• In police stations or legal settings, misunderstandings between the impaired individual and
authorities may lead to frustration, miscommunication, or even mishandling of critical
situations.
• In public transportation hubs like railway stations or airports, impaired individuals may
struggle to inquire about schedules, directions, or services.
Recent advances in machine learning and computer vision have opened up new possibilities for
solving this problem. Specifically, Convolutional Neural Networks (CNNs) have proven to be highly
effective in recognizing and classifying images, making them an ideal technology for recognizing
hand gestures and body movements used in sign language. By using CNNs to identify sign language
gestures in real time, it is now possible to develop a solution that automatically translates these
gestures into text and speech for seamless communication.
This invention addresses a significant gap in communication for the speech- and hearing-impaired
community by using CNN-based gesture recognition to translate sign language into text and voice. By
integrating machine learning, computer vision, and speech synthesis technologies, this system bridges
the communication gap in real time, enabling the impaired community to communicate more
effectively with the general public in various social and public settings.
DETAILED DESCRIPTION OF THE INVENTION
The proposed system is organized into six interdependent modules, each contributing specific tasks
toward the seamless translation of sign language gestures into text and voice output. These modules
work collaboratively to ensure a comprehensive and efficient conversion process.
1. Data Acquisition Module:
The Data Acquisition Module forms the foundation of the system by gathering a diverse and
representative dataset of sign language gestures from platforms like Kaggle. This module focuses on
collecting and curating image and video data of individuals performing various sign language gestures,
including different hand positions, movements, and expressions. The emphasis is on ensuring the
inclusion of a wide range of gestures to enhance the system's ability to recognize complex sign
language vocabularies. The dataset is designed to capture the intricacies of sign language gestures,
which is essential for training the CNN to learn and accurately classify the patterns necessary for
effective gesture recognition.
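As one concrete illustration of this module, a Kaggle-style gesture dataset arranged as one folder per gesture class can be loaded with standard tooling. This is a minimal sketch only; the paths, class names, and the use of torchvision are assumptions, not part of the specification.

# Illustrative sketch: loading a gesture image dataset arranged as one
# sub-folder per gesture class (paths and class names are assumed).
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

to_tensor = transforms.Compose([
    transforms.Resize((64, 64)),  # uniform input size for the CNN
    transforms.ToTensor(),        # pixel values scaled to [0, 1]
])

# ImageFolder infers the gesture labels from the sub-directory names,
# e.g. data/train/hello/, data/train/thanks/, data/train/help/ ...
train_set = datasets.ImageFolder("data/train", transform=to_tensor)
train_loader = DataLoader(train_set, batch_size=32, shuffle=True)

print(train_set.classes)  # the gesture vocabulary discovered from the folders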
2. Preprocessing Module:
The Preprocessing Module is crucial in optimizing raw sign language data for CNN training. This
stage involves several preprocessing techniques, such as image resizing, normalization, and
augmentation. Resizing the images ensures uniformity in input dimensions, while normalization
standardizes pixel values to reduce data variations, facilitating better model convergence. Data
augmentation techniques, such as flipping, rotation, and adding noise, increase the robustness of the
model by introducing variability that simulates real-world scenarios. The Preprocessing Module
refines the dataset, making it well-suited for the gesture recognition task.
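A minimal sketch of the preprocessing steps named above (resizing, normalization, flipping, rotation, additive noise), assuming torchvision transforms; the image size, normalization statistics, and noise level are illustrative values the specification does not fix.

# Sketch of the preprocessing pipeline described above; sizes, normalization
# statistics, and noise level are illustrative assumptions.
import torch
from torchvision import transforms

def add_gaussian_noise(img, std=0.02):
    """Add small Gaussian noise to a tensor image in [0, 1] to simulate sensor noise."""
    return (img + torch.randn_like(img) * std).clamp(0.0, 1.0)

preprocess = transforms.Compose([
    transforms.Resize((64, 64)),               # uniform input dimensions
    transforms.RandomHorizontalFlip(p=0.5),    # augmentation: flipping
    transforms.RandomRotation(degrees=10),     # augmentation: small rotations
    transforms.ToTensor(),                     # convert to a [0, 1] tensor
    transforms.Lambda(add_gaussian_noise),     # augmentation: additive noise
    transforms.Normalize(mean=[0.5, 0.5, 0.5], # standardize pixel values
                         std=[0.5, 0.5, 0.5]),
])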
3. Gesture Recognition Module:
At the core of the system is the Gesture Recognition Module, responsible for identifying and
interpreting sign language gestures in real time. Leveraging the trained Convolutional Neural Network
(CNN), this module detects hand gestures from images or video streams, recognizing specific hand
positions and movements. The CNN extracts features and patterns from the input data, and using
techniques like object localization and tracking, it identifies the areas of interest (e.g., hands) and
determines the corresponding gesture. This recognition forms the basis for further translation into
textual and voice formats. The precision and efficiency of this module are critical to ensuring accurate
gesture identification.
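The specification does not disclose a particular network architecture, so the following is only an assumed, minimal CNN classifier sketch (64x64 RGB inputs, an illustrative class count) of the kind this module could use.

# Minimal, assumed CNN gesture classifier: three conv/pool blocks followed by
# a small fully connected head. Input size and class count are illustrative.
import torch
import torch.nn as nn

class GestureCNN(nn.Module):
    def __init__(self, n_classes: int = 26):  # assumption: 26 static gestures
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),   # 64 -> 32
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # 32 -> 16
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # 16 -> 8
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 8 * 8, 128), nn.ReLU(),
            nn.Linear(128, n_classes),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x))

model = GestureCNN()
logits = model(torch.randn(1, 3, 64, 64))   # one dummy preprocessed frame
predicted_label_index = logits.argmax(dim=1).item()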
4. Translation Module:
The Translation Module serves as a bridge between the recognized gestures and their corresponding
textual representations. Upon receiving the recognized gestures from the CNN, this module converts
them into meaningful text using mapping techniques that link the gestures to specific words or
phrases. By interpreting sequences of gestures, the Translation Module generates coherent sentences
that capture the intended meaning of the sign language input. Its ability to provide contextually
accurate text is essential for ensuring that the converted output is clear and comprehensible.
5. Translation to Text Module:
Building upon the Translation Module, the Translation to Text Module is specifically responsible for
transforming the recognized gestures into structured textual output. By mapping gestures to their
corresponding linguistic equivalents, this module generates accurate, meaningful text that represents
the sign language gestures in a readable format. It interprets the recognized gestures as a sequence,
producing sentences or phrases that convey the full message of the sign language input. This module
plays a critical role in ensuring that the final textual output is precise and easily interpretable, serving
as the basis for further voice synthesis.
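To illustrate the gesture-to-text mapping described by these two modules, the sketch below uses a small, purely hypothetical lookup table and assembles a sentence from a sequence of recognized labels; none of the labels or phrases are taken from the patent.

# Hypothetical gesture-label-to-text lookup table and sentence assembly.
GESTURE_TO_TEXT = {
    "hello": "hello",
    "need": "I need",
    "doctor": "a doctor",
    "thanks": "thank you",
}

def gestures_to_sentence(labels):
    """Map a sequence of recognized gesture labels to a readable sentence."""
    words = [GESTURE_TO_TEXT[label] for label in labels if label in GESTURE_TO_TEXT]
    if not words:
        return ""
    sentence = " ".join(words)
    return sentence[0].upper() + sentence[1:] + "."

print(gestures_to_sentence(["hello", "need", "doctor"]))  # -> "Hello I need a doctor."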
6. Text-to-Speech (TTS) Module:
The final component is the Text-to-Speech (TTS) Module, which converts the generated text into
spoken words. This module leverages TTS engines and linguistic processing techniques to synthesize
natural-sounding speech from the textual input. The TTS Module processes the text output to produce
clear and coherent spoken language, ensuring that the final result is both understandable and
contextually accurate. The quality of the generated speech is crucial for delivering a seamless auditory
output for end users, enabling effective communication through both text and voice.
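The specification does not name a TTS engine. As one possible realization, the sketch below uses the offline pyttsx3 library to speak the generated text; the engine choice and speaking rate are assumptions.

# One possible TTS realization using the offline pyttsx3 engine (an assumption;
# the specification does not prescribe a particular engine).
import pyttsx3

def speak(text: str) -> None:
    """Synthesize the translated text as audible speech."""
    engine = pyttsx3.init()
    engine.setProperty("rate", 150)  # speaking rate in words per minute
    engine.say(text)
    engine.runAndWait()

speak("Hello, I need a doctor.")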
SUMMARY
Individuals from the speech and hearing-impaired community often face difficulties communicating
their needs in public places like hospitals, police stations, and railway stations. The reliance on sign
language creates a communication gap, as most people are unable to understand it, leading to
frustration, unmet needs, and social isolation. To bridge this gap, sign language can be translated into
text and voice, facilitating smoother interactions.
This solution involves collecting a diverse dataset of sign language gestures using video recordings or
sensors. The data is then pre-processed through normalization, segmentation, and annotation. A
Convolutional Neural Network (CNN) is trained to recognize and classify these gestures by extracting
visual features and patterns. Once a gesture is identified, it is translated into text or voice, enabling
seamless communication between the impaired and the general public. This approach not only
addresses communication barriers but also promotes inclusivity and self-confidence for the speech and
hearing-impaired community.
OBJECTIVES
1. Objective:
To accurately identify and recognize hand gestures associated with different sign language
symbols.
Rationale:
CNNs are highly effective at processing images, making them suitable for identifying patterns
in hand shapes, movements, and orientations in sign language. The system should be able to
learn various signs and differentiate between them with high accuracy.
2. Objective:
To convert the recognized hand gestures into corresponding text in real time.
Rationale:
The system should enable smooth communication by translating sign language gestures into
readable text, bridging the gap between sign language users and those who do not understand
sign language. This objective focuses on ensuring the text accurately represents the meaning of
the sign gestures.
3. Objective:
To generate real-time voice output from recognized sign language gestures.
Rationale:
The system should not only convert signs into text but also convert them into spoken language,
allowing signers to communicate more effectively with people who use voice communication.
Real-time response is critical for natural and fluid communication.
4. Objective: To achieve high accuracy in recognizing signs across different lighting conditions,
angles, and environments.
Rationale: For practical usability, the system must be reliable under varying conditions such as
different backgrounds, lighting, or camera angles. CNNs should be trained to generalize well in
these environments to avoid misclassification or errors.
5. Objective:
To minimize delays in processing sign language input and converting it into voice or text
output.
Rationale:
Real-time communication systems require fast data processing, so the CNN model should be
optimized for quick recognition and response. Low latency is crucial to ensure a natural flow
of conversation without noticeable delays.
6. Objective:
To create an intuitive and accessible interface for both sign language users and non-sign
language users.
Rationale:
The system should be designed with a user-friendly interface that allows users to easily capture
their gestures (e.g., through a webcam or mobile camera) and receive instant text or voice
feedback. This objective also ensures that the system is easy to operate, even for non-experts.
7. Objective:
To support multiple sign languages and dialects, making the system scalable and adaptable to
various regions.
Rationale:
Different countries or regions have their own versions of sign language (e.g., ASL, BSL, ISL).
The system should aim to incorporate multiple sign languages and dialects to broaden its
usability and inclusivity across different communities.
8. Objective:
To continuously improve the CNN model's performance through training on large datasets of
sign language gestures.
Rationale:
The success of the system relies on a well-trained CNN that can effectively recognize a wide
variety of gestures. The objective is to use diverse and extensive training data, with regular
updates, to ensure the model remains accurate and improves over time.
9. Objective:
To create an affordable and accessible solution that can be used by people in daily life.
Rationale:
The system should be economically viable so it can be implemented on common devices such
as smartphones or computers, making it accessible to a large population of users, including
those from underprivileged backgrounds.
BRIEF DESCRIPTION OF THE DIAGRAMS
FIGURE 1: COMPREHENSIVE ARCHITECTURE DIAGRAM
1. Webcam Video Input
• Purpose: The system begins by capturing video input from a webcam, where a person is
performing sign language gestures.
• How it Works: The webcam continuously records the user's hand, arm, and body movements,
which are essential for recognizing sign language gestures. It acts as the raw input for the
entire process.
2. Video to Pose Generation
• Purpose: The video feed is divided into individual frames (images), allowing the system to
analyze the user's movements frame by frame.
• How it Works:
o Frame Extraction: The continuous video stream is segmented into distinct images or
"frames" for easier processing.
o Pose Detection: Algorithms like OpenPose, or similar pose estimation models, are
applied to each frame to identify key points in the user's body, focusing on hand
movements and body posture that correspond to sign language gestures.
o Key Information: Each frame provides a "pose", which is a set of key points (like the
position of the hands, fingers, elbows, and shoulders) needed to interpret the sign
accurately.
3. Open Poses (Pose Estimation)
• Purpose: This step involves extracting precise information about hand and body positions from
the video frames.
• How it Works:
o Keypoints Mapping: OpenPose or other similar pose detection techniques create a
skeletal map of the user's body. This skeletal map represents the position and
orientation of key body parts, such as joints and fingertips, which are crucial for
understanding the meaning behind each sign language gesture.
o Focus on Hand Gestures: In sign language, hand shapes and movements are particularly
important. Thus, the system focuses specifically on hand-related keypoints, ensuring
the gestures are accurately identified from frame to frame.
4. Mapping Hand Gestures with the Dataset
• Purpose: To identify the specific sign language gesture being performed by comparing the
detected hand poses with an existing dataset of known signs.
• How it Works:
o Sign Language Database: The system contains a pre-annotated database (a dataset) of
sign language gestures. Each gesture in this database is associated with a corresponding
textual meaning.
o Gesture Recognition (CNN): The extracted pose data is processed by a Convolutional
Neural Network (CNN), which is trained to recognize and classify different sign
language gestures. The CNN looks for patterns in the hand shapes, movements, and
positions and then matches these patterns to the gestures stored in the sign language
dataset.
5. Learned Lookup Table
• Purpose: The recognized gesture is mapped to a pre-defined text through a lookup table, which
stores the mapping between gestures and their corresponding text or spoken equivalents.
• How it Works:
o Gesture-Text Mapping: The lookup table stores pre-learned mappings between gestures
and their equivalent text or words. For example, the gesture for "thank you" will map to
the text "thank you" in the lookup table.
o Training: The system is trained using large datasets to learn these mappings accurately,
meaning it can recognize complex signs and produce appropriate translations.
6. Text Conversion and Processing
• Purpose: Once the gesture is classified and mapped to text, this text is processed to ensure it is
ready for display or conversion to speech.
• How it Works:
o Text Generation: The raw gesture data is converted into meaningful text based on the
lookup table mappings. This ensures that each gesture is converted into an accurate and
grammatically correct word, phrase, or sentence.
o Natural Language Processing (NLP): If needed, the system applies basic NLP
techniques to refine the text output, making sure it's contextually appropriate. For
example, multiple gestures combined in sequence can be processed into sentences
rather than individual words.
7. Voice Processing and Voice Conversion
• Purpose: The text that has been generated from the recognized gesture is converted into audible
speech, making it accessible to non-sign language users.
• How it Works:
o Text-to-Speech (TTS): The processed text is sent to a text-to-speech (TTS) engine. This
TTS system takes the text input and synthesizes it into a natural-sounding voice output.
o Voice Customization: Depending on the system's setup, the voice output can be
customized to different tones, genders, or accents to make it sound more natural or
suited to specific use cases.
8. Text and Voice Output
• Purpose: The final step is the output, where both the recognized text and voice are presented to
the user or an audience.
• How it Works:
o Text Display: The recognized sign is shown as text on a screen or any display device,
allowing people to read the message. This is crucial for environments where audio may
not be practical or accessible.
o Voice Output: Simultaneously, the TTS system produces the spoken version of the text,
allowing non-signers to hear the translated sign language. The voice output is typically
delivered through speakers or an audio device.
A condensed end-to-end code sketch of this pipeline is given after this list.
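As a rough illustration of how the eight stages above could fit together, the sketch below captures webcam frames, extracts hand keypoints with MediaPipe Hands (used here as a lightweight stand-in for the OpenPose-style estimation named in the diagram), classifies the gesture with a placeholder function, maps it through an illustrative lookup table, and speaks the result. The classifier, label set, and lookup table are hypothetical.

# Condensed end-to-end sketch of the Figure 1 pipeline. MediaPipe Hands stands
# in for the OpenPose-style pose estimation named in the diagram; the classifier
# and lookup table below are placeholders, not the patented method itself.
import cv2
import mediapipe as mp
import pyttsx3

GESTURE_TO_TEXT = {"hello": "hello", "thanks": "thank you"}  # illustrative lookup table
tts = pyttsx3.init()
hands = mp.solutions.hands.Hands(max_num_hands=1)

def classify_gesture(hand_landmarks) -> str:
    """Placeholder for the trained CNN / keypoint classifier."""
    return "hello"  # illustrative fixed label

cap = cv2.VideoCapture(0)                         # 1. webcam video input
while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
    result = hands.process(rgb)                   # 2-3. pose / keypoint estimation
    if result.multi_hand_landmarks:
        label = classify_gesture(result.multi_hand_landmarks[0])  # 4. gesture recognition
        text = GESTURE_TO_TEXT.get(label, label)                  # 5-6. lookup and text
        print(text)                                               # 8. text display
        tts.say(text)                                             # 7. text-to-speech
        tts.runAndWait()                                          # 8. voice output
    if cv2.waitKey(1) & 0xFF == ord("q"):         # press 'q' to stop
        break
cap.release()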
FIGURE 2: Data Flow of Text Module
1. Data Preprocessing:
• At the very beginning, raw text input is prepared for further use. This involves tasks like
cleaning up unnecessary symbols, normalizing the text format (e.g., converting everything to
lowercase), or splitting it into smaller units for analysis. This step ensures the data is in a
usable state.
2. Mapping to Datasets:
• Once the text is preprocessed, it is checked against stored datasets. These datasets could be
large collections of text or patterns that the system will use to find matches. The goal is to see
if the input data can be directly linked or associated with anything in the existing database.
3. Decision Stage - Is There a Match?:
• Here, the system asks a crucial question: does the input text match anything in the dataset?
o If Yes, it proceeds to postprocessing the matched data.
o If No, it moves on to searching for data that might not be an exact match but is similar.
4. Finding Similar Data:
• If the input doesn't have an exact match in the database, the system doesn't stop there. Instead,
it looks for related or similar data that could still be relevant. This might involve pattern
recognition or similarity algorithms to find approximate matches.
5. Postprocessing:
• After the data (whether exact or similar) is identified, it undergoes postprocessing. This step
refines the results to make them more meaningful or better suited to the final output format. It
may include formatting or making final adjustments based on context.
6. Generating Text Output:
• The final product of this process is a text-based result. This could be a response to a user query,
a translated text, or any form of textual data that the system was designed to produce.
A minimal sketch of the matching decision in this flow is given after this list.
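The sketch below illustrates the exact-match / similar-data decision using Python's difflib for approximate matching; the dataset entries and threshold are illustrative assumptions, and the module could equally use other similarity techniques.

# Sketch of the text module's match / similar-data decision using difflib.
from difflib import get_close_matches

DATASET = ["thank you", "hello", "i need help", "where is the doctor"]

def resolve_text(raw: str) -> str:
    query = raw.strip().lower()                                   # 1. preprocessing
    if query in DATASET:                                          # 3. exact match found
        return query
    similar = get_close_matches(query, DATASET, n=1, cutoff=0.6)  # 4. similar data
    return similar[0] if similar else query                       # 5-6. postprocess and output

print(resolve_text("Thank yuo"))  # -> "thank you"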
FIGURE 3: Data Flow of Voice Module
1. Preprocessing Data:
o This is the first step where raw data (possibly voice or other input data) undergoes
preprocessing. Preprocessing could include cleaning, filtering, or transforming the data
into a suitable format for further processing.
2. Mapping with Datasets:
o After preprocessing, the data is mapped with existing datasets, most likely stored in a
database. The purpose of this step is to compare or match the input data with existing
data to identify patterns or relevant information.
3. If Data Matched (Decision Point):
o At this stage, a decision is made based on whether the input data matches existing data
in the dataset:
• Yes: If a match is found, the flow proceeds to the next step.
• No: If no match is found, it initiates a search for other similar data.
4. Looks for Other Similar Data:
o If the exact match isn't found, this step involves looking for other similar data in the
database. The module may use techniques like approximate matching, machine learning
algorithms, or pattern recognition to find relevant data.
5. Postprocessing Data:
o Once the matching or similar data is identified, the next step is postprocessing. This step could
involve formatting, enhancing, or preparing the matched data to be suitable for generating the
output.
6. Voice Output:
o The final step in the flow is generating the voice output. Based on the processed and
postprocessed data, the module generates an appropriate voice response or sound.

CLAIMS
We claim
Claim 1:
• Bridging Communication Barriers:
• This helps the speech and hearing-impaired community bridge their communication barriers
with society and helps them to overcome their insecurities.
• Enhanced Independence for Deaf Users:
• This conversion can allow users to participate in conversations without relying on interpreters
or intermediaries, enhancing independence.
Claim 2:
• Instant Translation of Gestures:
• Many sign language conversion technologies claim to offer real-time conversion of sign
language gestures into text or speech, allowing for immediate communication without delays.
Claim 3:
• Multi-language Support:
• Support for Different Sign Languages:
• Since different regions use different forms of sign language (e.g., American Sign Language
(ASL), British Sign Language (BSL), etc.), technologies claim to provide support for multiple
sign languages, enabling global communication.
Claim 4:
• User-friendly Interfaces:
• Many sign language conversion tools are claimed to be intuitive and easy to use, often relying
on wearable devices, mobile applications, or cameras that capture hand movements.
Claim 5:
• Educational Tools for Learning Sign Language:
• These systems claim to provide a new way for people to learn sign language, making it easier
for non-signers to communicate with deaf individuals.
Claim 6:
• Support for Professional Settings:
• Conversion tools claim to enhance accessibility in professional settings such as meetings,
customer service, and conferences by making sign language users more included in real-time
discussions.
Claim 7:
• Portability: Some systems claim that users can easily carry or wear the devices that convert
sign language, making the technology practical for daily use.

Documents

Name | Date
202441083988-Form 1-041124.pdf | 07/11/2024
202441083988-Form 2(Title Page)-041124.pdf | 07/11/2024
202441083988-Form 3-041124.pdf | 07/11/2024
202441083988-Form 5-041124.pdf | 07/11/2024
202441083988-Form 9-041124.pdf | 07/11/2024
