Explored Approaches - Indian Sign Language

Data Collection and Preprocessing

For the WL-ASL model, we used the WL-ASL dataset, a large collection of videos containing various ASL gestures.
For the NLA-SLR model, we used MS-ASL500 (Microsoft American Sign Language Dataset), which contains 500 different American Sign Language (ASL) gestures.
Our primary dataset is INCLUDE, which is designed specifically for Indian Sign Language (ISL). It contains 4,287 sign videos covering 262 words.
Initially, the INCLUDE50 dataset, consisting of 900 videos for 50 ISL words, was used for model training.
Later, the dataset was expanded to INCLUDE100, covering 1,695 videos for 100 words.
Preprocessing involved extracting 3D landmarks (hand and lip features), which were transformed into feature vectors (tensors) for training the model.
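
The landmark-to-tensor step can be pictured with a minimal sketch. It assumes that each frame provides lists of (x, y, z) tuples for the hand and lip landmarks. The function names, the dictionary keys and the landmark counts in the toy data are illustrative only and are not taken from the project code.

```python
def frame_to_feature_vector(hand_landmarks, lip_landmarks):
    """Flatten the (x, y, z) landmarks of one frame into a single feature vector."""
    vector = []
    for x, y, z in list(hand_landmarks) + list(lip_landmarks):
        vector.extend([float(x), float(y), float(z)])
    return vector


def video_to_feature_matrix(frames):
    """Stack per-frame vectors into a (num_frames, num_features) nested list."""
    matrix = [frame_to_feature_vector(f["hand"], f["lips"]) for f in frames]
    if len({len(row) for row in matrix}) > 1:
        raise ValueError("every frame must have the same number of landmarks")
    return matrix


# Two toy frames with 2 hand landmarks and 1 lip landmark each (hypothetical values).
frames = [
    {"hand": [(0.1, 0.2, 0.0), (0.3, 0.4, 0.1)], "lips": [(0.5, 0.6, 0.0)]},
    {"hand": [(0.2, 0.2, 0.0), (0.3, 0.5, 0.1)], "lips": [(0.5, 0.7, 0.0)]},
]
features = video_to_feature_matrix(frames)
print(len(features), len(features[0]))  # number of frames, features per frame
```

In practice the nested list would be converted to a tensor by whichever deep learning framework is used for training.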

SHUWA Model Approach

The SHUWA model by Google is designed for sign language recognition, specifically optimized for Japanese Sign Language (JSL), but its underlying principles can be applied to other sign languages like Indian Sign Language (ISL). SHUWA leverages a combination of computer vision techniques and deep learning models to interpret sign language gestures in real time.

Data Pre-Processing: The model uses the MediaPipe library to extract key points from video data, which are stored in .h5 files.

Working of SHUWA Model: After key point extraction, a machine learning model is trained on the stored key points to learn sign gestures. For inference, the model generates key points from a new input video and applies K-Nearest Neighbours (K-NN) to find the closest matching gestures in the training set.
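
The K-NN lookup at inference time can be illustrated with a small sketch. It assumes each gesture has already been reduced to a fixed-length feature vector and that Euclidean distance with majority voting is used; the source does not state the distance metric or the value of K, and the names and toy data below are hypothetical.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(query, train_features, train_labels, k=3):
    """Return the majority label among the k training vectors closest to query."""
    ranked = sorted(
        zip(train_features, train_labels),
        key=lambda pair: euclidean(query, pair[0]),
    )
    top_labels = [label for _, label in ranked[:k]]
    return Counter(top_labels).most_common(1)[0][0]


# Toy training set: two gestures with two example vectors each.
train_features = [[0.0, 0.0], [0.1, 0.1], [1.0, 1.0], [0.9, 1.1]]
train_labels = ["hello", "hello", "thanks", "thanks"]
prediction = knn_predict([0.2, 0.0], train_features, train_labels, k=3)
print(prediction)
```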

NLA-SLR Based Approach

NLA-SLR (Natural Language-Assisted Sign Language Recognition) is a project focused on recognizing and interpreting sign language using deep learning models, particularly leveraging large-scale datasets such as MS-ASL500 (Microsoft American Sign Language Dataset). The project processes videos of sign language gestures to generate key points, which are used to train or fine-tune models for sign language recognition.

Key Components:

The project uses MS-ASL500, a dataset containing 500 different American Sign Language (ASL) gestures. Each gesture is represented by multiple videos of different individuals performing the sign.
The model works by extracting key points (coordinates representing important body parts, such as hands and facial features) from each frame of the video. These key points are stored in a .json file and used as input for the recognition model.
A deep learning model (RNN and CNN) is trained for sign language recognition using the key points extracted from video frames. The project supports both training from scratch and inference using a pretrained model.
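
As an illustration of the key point storage step, the sketch below writes per-frame key points to a .json file and reads them back. The file layout (a "video_id" field and a "frames" list) and the function names are assumptions made for this example; the actual schema used by the project is not described here.

```python
import json
import os
import tempfile


def save_keypoints(path, video_id, frames):
    """Write the key points of one video to a JSON file."""
    payload = {"video_id": video_id, "frames": frames}
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(payload, handle)


def load_keypoints(path):
    """Read back the video id and the per-frame key points."""
    with open(path, "r", encoding="utf-8") as handle:
        payload = json.load(handle)
    return payload["video_id"], payload["frames"]


# Two toy frames with two (x, y) key points each (hypothetical values).
frames = [
    {"index": 0, "keypoints": [[0.10, 0.20], [0.30, 0.40]]},
    {"index": 1, "keypoints": [[0.12, 0.21], [0.31, 0.42]]},
]
with tempfile.TemporaryDirectory() as tmp_dir:
    json_path = os.path.join(tmp_dir, "sample.json")
    save_keypoints(json_path, "sample", frames)
    loaded_id, loaded_frames = load_keypoints(json_path)
print(loaded_id, len(loaded_frames))
```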

WL-ASL Based Approach

This project focuses on sign language recognition using the WL-ASL dataset (Word Level American Sign Language Dataset) and deep learning models like the I3D (Inflated 3D) pretrained model. The main objective is to train and test models for recognizing American Sign Language (ASL) gestures from videos, utilizing WL-ASL as the dataset source.

Key Components:

The WL-ASL dataset is a large collection of videos containing various ASL gestures. For initial testing, the project extracts the first 100 video samples by running the default video extractor script and manually stopping it after 100 entries. Videos are stored in the following directory structure: ‘WL-ASL/Dataset/…’.
The I3D (Inflated 3D) model, pretrained on large-scale datasets, is used for recognizing gestures. This model is designed to process spatiotemporal information, which makes it well suited to video-based tasks such as gesture recognition. The training scripts train the model on the WL-ASL dataset, after which the trained model can be evaluated on unseen data to measure its performance.
We encountered a memory size limitation error, which was resolved by reducing the batch size. The training and inference scripts now work on the ASL dataset, and the accuracy is close to that reported in the paper. As a next step, the model needs to be evaluated on the INCLUDE dataset.
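
The batch size was reduced by hand in this work. One possible way to automate the same idea is sketched below: the batch size is halved whenever an epoch fails with a memory error. The function names are hypothetical, and the exception types to catch depend on the training framework, so they are passed in as a parameter.

```python
def train_with_batch_fallback(train_one_epoch, batch_size, min_batch_size=1,
                              oom_errors=(MemoryError,)):
    """Halve the batch size until one epoch completes without a memory error."""
    while batch_size >= min_batch_size:
        try:
            train_one_epoch(batch_size)
            return batch_size
        except oom_errors:
            batch_size //= 2  # retry with half the batch
    raise RuntimeError("training does not fit in memory at the minimum batch size")


def fake_epoch(batch_size):
    # Stand-in for a real training epoch: anything above 8 simulates running out of memory.
    if batch_size > 8:
        raise MemoryError("simulated out-of-memory")


working_batch_size = train_with_batch_fallback(fake_epoch, batch_size=32)
print(working_batch_size)
```

A smaller batch size can change training dynamics, so the learning rate may need to be revisited after such a reduction.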

Transformer Model for INCLUDE50 and INCLUDE100

The INCLUDE model is designed for Indian Sign Language (ISL) recognition and is a key component of the INCLUDE (Indian Lexicon Sign Language Dataset) project. The model captures and interprets ISL from video data, with the aim of supporting real-time sign language translation and communication systems.

Data Pre-Processing: Pre-processing for the INCLUDE model consists of generating key points from videos. Each video is read frame by frame using OpenCV. Every frame is pre-processed (resized, normalized) and passed through a MediaPipe key point detection model, which detects the 2D or 3D coordinates of key body parts (such as joints or facial landmarks). The extracted key points are stored in formats such as JSON for later use in action or sign language recognition tasks. This is repeated for all frames in the video.
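
The frame-by-frame loop described above can be sketched as follows. The key point detector is passed in as a plain callable (here the hypothetical dummy_detector) rather than calling MediaPipe directly, and the 256x256 frame size is an assumption. Pixel normalization is left to the detector, since the expected input range depends on the model used.

```python
def read_frames(video_path, size=(256, 256)):
    """Yield resized RGB frames from a video file, one at a time."""
    import cv2  # imported lazily so the rest of the sketch runs without OpenCV

    capture = cv2.VideoCapture(video_path)
    try:
        while True:
            ok, frame = capture.read()
            if not ok:  # end of video or read failure
                break
            frame = cv2.resize(frame, size)
            yield cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
    finally:
        capture.release()


def extract_keypoints(frames, detect_keypoints):
    """Apply the detector to every frame and collect one record per frame."""
    return [
        {"frame": index, "keypoints": detect_keypoints(frame)}
        for index, frame in enumerate(frames)
    ]


def dummy_detector(frame):
    # Hypothetical stand-in for the key point model: one fixed (x, y, z) point.
    return [[0.5, 0.5, 0.0]]


# Placeholder frames keep the demo independent of any video file.
records = extract_keypoints(["frame-0", "frame-1"], dummy_detector)
print(records[1])
```

With a real video, the frames argument would be read_frames(path) for some video path, and the resulting records could be written out with json.dump.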

Transformer Model Working: The Transformer model processes an input sequence by first embedding it and adding positional encodings. It uses the self-attention mechanism to compute relationships between sequence elements (here, per-frame key point vectors rather than words), capturing contextual information across the sequence. The output of self-attention is passed through a feed-forward network for further transformation. Multiple layers of self-attention and feed-forward networks are stacked to learn deeper patterns, with residual connections and layer normalization stabilizing training. Finally, the model produces either a sequence or a classification output.
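
The two core operations, positional encoding and self-attention, can be illustrated with a small plain-Python sketch. It assumes sinusoidal positional encodings and a single attention head with identity query, key and value projections. A real model would use learned projections, multiple heads and a tensor library, and the configuration of the INCLUDE Transformer is not specified here.

```python
import math


def positional_encoding(num_positions, dim):
    """Sinusoidal encodings: sine on even feature indices, cosine on odd ones."""
    table = []
    for pos in range(num_positions):
        row = []
        for i in range(dim):
            angle = pos / (10000 ** (2 * (i // 2) / dim))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(sequence):
    """Scaled dot-product self-attention with identity Q/K/V projections."""
    dim = len(sequence[0])
    output, weights = [], []
    for query in sequence:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in sequence]
        row = softmax(scores)
        weights.append(row)
        output.append([sum(w * value[j] for w, value in zip(row, sequence))
                       for j in range(dim)])
    return output, weights


# Three toy "frames" with four features each (hypothetical values).
frames = [[0.1, 0.2, 0.3, 0.4], [0.0, 0.1, 0.0, 0.1], [0.5, 0.4, 0.3, 0.2]]
encoded = [[x + p for x, p in zip(frame, pe)]
           for frame, pe in zip(frames, positional_encoding(len(frames), 4))]
attended, attention_weights = self_attention(encoded)
print(len(attended), len(attended[0]))
```

Each row of attention_weights sums to 1 and describes how strongly one frame attends to every frame in the sequence.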

CSLR (Continuous Sign Language Recognition)

Continuous Sign Language Recognition (CSLR) focuses on the real-time interpretation of sign language by recognizing sequences of signs instead of isolated gestures. This approach is particularly important for Indian Sign Language (ISL), where fluent signers combine signs seamlessly, much as words are combined in spoken language. Understanding these continuous sequences is crucial for effective communication within the deaf community. Below is an overview of the CSLR approach pursued in this project.

The key objectives of the CSLR project include:

Recognizing Sequences of Signs:

Interpret signs in context to capture fluidity and transitions between signs.
Understand the semantics of continuous signing, where meaning depends on combinations and order of signs rather than isolated gestures.

Handling Variations in Signing Style and Speed:

Adapt to variability in signing techniques influenced by regional differences, individual styles, and signing speed.
Ensure accurate recognition across different users through diverse training data and advanced modelling techniques that generalize well.

Ensuring High Accuracy and Low Latency:

Achieve high accuracy levels essential for effective real-world applications.
Maintain low latency for applications requiring real-time feedback, such as live interpretation during conversations.
Process and recognize signs swiftly to facilitate seamless communication.

Current Research Phase:

Currently, the project is in an active research phase, dedicated to exploring various models and reviewing existing literature in the field of sign language recognition. This phase is critical for laying the foundation for the CSLR system.