ABOUT THE TALK Natural human-human interaction is inherently multi-modal: we use a variety of modalities, including verbal commands, gestures and facial expressions, visual cues, gaze, and even vocal nuances (e.g., tone and rhythm), to mutually convey our intent. Motivated by such human-human interaction scenarios, this thesis investigates methods to enable multi-modal sense-making for human-AI interaction tasks on resource-constrained wearable and edge devices. In particular, we consider object acquisition as an exemplary task for human-AI collaboration that can benefit from naturalistic multi-modal interaction. To address this, we leverage Referring Expression Comprehension (REC), or Visual Grounding, models developed in the computer vision and NLP literature. Given an image along with verbal and/or gestural inputs, these models identify the bounding box of the referred object. We then introduce a number of sense-making models and optimization techniques to support low-latency inference with such models on pervasive devices.
In this thesis, our emphasis is predominantly on exploring diverse dynamic optimizations for the comprehension of task instructions. Throughout these investigations, we rely on a common guiding principle: not all instructions pose the same level of task complexity. To illustrate, consider the varying complexities introduced by different types of instructions. In a cluttered environment, identifying a target object often necessitates a more intricate execution pipeline to ensure accurate identification. Users may employ a combination of language instructions and pointing gestures, which can aid the model in disambiguating among closely situated objects; the presence of multiple modalities thus helps alleviate task complexity. Conversely, in a less cluttered space, a simple pointing gesture may suffice for object identification, requiring a less complex execution pipeline. This nuanced understanding of task complexity serves as the foundation for the dynamic optimizations explored in this thesis.
This dissertation is organized into two parts. Part 1 focuses on model optimizations applied to REC models, which process a single static image along with language and, optionally, gestural modalities. In Part 2, we extend our methodologies to more complex scenarios involving videos as vision input, moving beyond single static images.
ABOUT THE SPEAKER Dulanga WEERAKOON is a PhD candidate at the School of Computing and Information Systems (SCIS), Singapore Management University (SMU). Under the supervision of Prof. Archan Misra and the co-supervision of Dr. Vigneshwaran Subbaraju, his research focuses on developing multi-modal human instruction understanding models that integrate language, vision, and gestural cues. He primarily aims to optimize these models for pervasive and mobile devices, emphasizing low latency and low energy overheads.