Microsoft Research recently demonstrated Project Rumi, which combines text, audio, and video data in a multimodal approach to help AI systems better understand human intent.
Artificial intelligence systems have made great strides in recent years, especially in natural language processing (NLP). However, existing NLP systems rely heavily on text for both input and output, ignoring the intonation, facial expressions, gestures, and body language that humans use in natural communication; this omission can lead to gaps and errors in understanding.
In AI terminology, these nonverbal cues are collectively referred to as paralinguistics, or paralanguage.
Microsoft Research addresses this problem with Project Rumi, a framework designed to enhance AI understanding through multimodal paralinguistic cues. The project consists of two main components: a multimodal paralinguistic encoder and a multimodal paralinguistic decoder.
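To make the idea concrete, the sketch below shows one way such a pipeline could be wired up: an "encoder" step that turns audio and video features into paralinguistic cues, and a fusion step that folds those cues back into the text channel so a standard text-only language model can condition on them. All class and function names here are hypothetical illustrations under assumed semantics, not the actual Project Rumi architecture or API.

```python
# Minimal sketch of a multimodal paralinguistic pipeline in the spirit of
# Project Rumi. All names are hypothetical; a real system would use trained
# audio and vision models rather than the toy averaging shown here.

from dataclasses import dataclass


@dataclass
class ParalinguisticCues:
    """Non-verbal signals extracted from audio and video streams."""
    vocal_sentiment: float   # -1.0 (negative) .. 1.0 (positive), from prosody
    facial_sentiment: float  # same scale, from facial-expression analysis
    engagement: float        # 0.0 .. 1.0, e.g. based on gaze or posture


def encode_cues(audio_features: list, video_features: list) -> ParalinguisticCues:
    """Hypothetical 'encoder' step: map raw audio/video features to cues."""
    vocal = sum(audio_features) / max(len(audio_features), 1)
    facial = sum(video_features) / max(len(video_features), 1)
    return ParalinguisticCues(vocal_sentiment=vocal,
                              facial_sentiment=facial,
                              engagement=0.5)


def augment_prompt(user_text: str, cues: ParalinguisticCues) -> str:
    """Hypothetical fusion step: fold the cues back into the text channel
    so a text-only LLM can take the user's apparent state into account."""
    avg_sentiment = (cues.vocal_sentiment + cues.facial_sentiment) / 2
    tone = "frustrated" if avg_sentiment < 0 else "calm"
    return (f"[User appears {tone}; engagement={cues.engagement:.2f}]\n"
            f"{user_text}")


if __name__ == "__main__":
    cues = encode_cues(audio_features=[-0.4, -0.2], video_features=[-0.3, -0.5])
    print(augment_prompt("Why did my build fail again?", cues))
```

In this toy example, negative vocal and facial signals cause the same question to reach the language model annotated as coming from a frustrated user, which is the kind of intent signal a text-only system would otherwise miss.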