Audio-Visual Processing and Behavioral Modeling


  1. Multi-person localization and tracking from visual input
  2. Gesture recognition
  3. Speech recognition and understanding
  4. Human action recognition
  5. Social, cognitive and affective state tracking from audio-visual data (linguistic, paralinguistic and visual information)
  6. Intent recognition

These technologies will help achieve our first technical goal of creating robots that analyze and track human behavior over time in the context of their surroundings (situated awareness), in order to establish common ground and intention-reading capabilities.

Description of Work

  • Human Localization and Visual Tracking: Starting from person detection based on state-of-the-art computer vision methods, we shall explore human localization in the child-robot interaction framework of the proposed work by employing improved modern techniques for moving single-person detection, as well as recent real-time methods for object and pedestrian detection. For visual tracking we shall employ particle filtering, which is capable of both single- and multi-person detection and tracking.
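To make the tracking approach concrete, the following is a minimal sketch of a bootstrap particle filter over 2D image positions. The detection input format, noise parameters and random-walk motion model are illustrative assumptions, not the project's actual pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)

def particle_filter_track(detections, n_particles=500, motion_std=2.0, obs_std=5.0):
    """Track a single person from noisy 2D detector outputs.

    `detections` is an iterable of (x, y) positions (a hypothetical
    detector output format). Returns the filtered trajectory.
    """
    detections = np.asarray(detections, dtype=float)
    # Initialize the particle cloud around the first detection.
    particles = rng.normal(detections[0], obs_std, size=(n_particles, 2))
    track = []
    for z in detections:
        # Predict: simple random-walk motion model.
        particles += rng.normal(0.0, motion_std, size=particles.shape)
        # Update: weight particles by Gaussian likelihood of the detection.
        sq_dist = np.sum((particles - z) ** 2, axis=1)
        weights = np.exp(-0.5 * sq_dist / obs_std ** 2)
        weights /= weights.sum()
        # Estimate: weighted mean of the particle cloud.
        track.append(weights @ particles)
        # Systematic resampling to counter weight degeneracy.
        positions = (rng.random() + np.arange(n_particles)) / n_particles
        idx = np.minimum(np.searchsorted(np.cumsum(weights), positions),
                         n_particles - 1)
        particles = particles[idx]
    return np.array(track)
```

Extending this to multiple people amounts to running one filter per track (or one joint filter), plus a data-association step assigning detections to tracks.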
  • Gesture Recognition: We will build a gesture recognition module for the robotic platform that will involve the recognition of a relatively small vocabulary of gestures. Training data will be collected from the target audience of both TD and ASD children in the domain of the application scenarios. Classifiers and recognizers will be set up, adapted, trained and evaluated using both low- and high-level features. The main goal is to build a real-time system that analyzes the users’ gesturing behavior in the context of child-robot interaction.
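For a small gesture vocabulary, template matching is a reasonable baseline. Below is a sketch of 1-nearest-neighbour classification under dynamic time warping (DTW), which tolerates the same gesture being performed at different speeds; the feature sequences and gesture names are invented for illustration and stand in for whatever low- or high-level features the module extracts.

```python
import numpy as np

def dtw_distance(a, b):
    """Dynamic time warping distance between two feature sequences
    (frames x feature-dim), allowing non-linear temporal alignment."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

def classify_gesture(query, templates):
    """1-nearest-neighbour over (label, template_sequence) pairs."""
    return min(templates, key=lambda t: dtw_distance(query, t[1]))[0]
```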
  • Speech Recognition and Situated Multimodal Understanding: The overall mission of this task is to improve speech recognition levels for TD and, especially, ASD children. However, even after adaptation and additional training we expect the speech recognition accuracy on autistic children’s speech to remain relatively low. Hence, we will investigate additional means of assessing the intended meaning of children’s multimodal input. The task will also develop methods for computing input understanding confidence scores (beyond ASR confidence). For semantic interpretation of the speech recognition result, we will develop a statistical semantic chunking parser that uses a domain model to produce conceptual tree structures.
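As a toy illustration of the chunking idea: map recognized words onto slots of a (hypothetical) domain model and derive an understanding confidence that combines per-word ASR confidence with how much of the utterance the model covered. A real chunking parser would be statistical and produce conceptual tree structures rather than this flat frame; the vocabulary and scoring rule below are assumptions for the sketch.

```python
# Hypothetical domain model: surface words mapped to (slot, value) pairs.
DOMAIN_MODEL = {
    "give": ("action", "give"),
    "show": ("action", "show"),
    "red":  ("color", "red"),
    "blue": ("color", "blue"),
    "ball": ("object", "ball"),
    "car":  ("object", "car"),
}

def chunk_parse(tokens):
    """Parse a list of (word, asr_confidence) pairs into a concept frame.

    Returns the frame plus an understanding confidence: the mean ASR
    confidence of the matched words, discounted by domain coverage of
    the utterance (one crude way to score "beyond ASR confidence").
    """
    frame, matched_confs = {}, []
    for word, conf in tokens:
        if word in DOMAIN_MODEL:
            slot, value = DOMAIN_MODEL[word]
            frame[slot] = value
            matched_confs.append(conf)
    coverage = len(matched_confs) / max(len(tokens), 1)
    score = (sum(matched_confs) / len(matched_confs)) * coverage if matched_confs else 0.0
    return frame, score
```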
  • Human Action Recognition: For action recognition we intend first to exploit the results of Task 2.1, improving action classification by enhancing the features and object representations currently employed. Moreover, we shall extend our recent research work in the EU project MOBOT on action recognition in untrimmed videos. We shall also explore the problem of group activity detection and recognition by clustering individual person action characteristics. A significant amount of effort will be invested in developing a system capable of operating close to real time and dealing with the complexity of the BabyRobot visual data.
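The group-activity idea can be sketched as clustering people by the similarity of their per-person action descriptors and labelling each cluster with its majority action. The greedy single-link grouping and the fixed distance threshold below are simplifying assumptions for illustration, not the clustering method the project will necessarily adopt.

```python
import numpy as np

def group_by_action(features, labels, thresh=1.0):
    """Cluster people whose action-feature vectors lie within `thresh`
    of an existing group member (greedy single-link pass), then label
    each group with the majority individual action.

    Returns a list of (member_indices, majority_action) pairs.
    """
    features = [np.asarray(f, dtype=float) for f in features]
    groups = []
    for i, f in enumerate(features):
        for g in groups:
            if any(np.linalg.norm(f - features[j]) < thresh for j in g):
                g.append(i)
                break
        else:
            groups.append([i])
    result = []
    for g in groups:
        acts = [labels[j] for j in g]
        result.append((g, max(set(acts), key=acts.count)))
    return result
```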
  • Socio-Affective State Modeling and Intent Recognition from Multimodal Cues: In this task, we will combine input from the previous tasks to estimate: 1) the affective state of the child (arousal, valence, dominance, surprise), 2) the cognitive state of the child (engagement, cognitive load), and 3) the intentions of the child (intent recognition). For multiparty interaction we will also investigate the status and dominance of each child. Finally, we will investigate child-robot entrainment at different levels (prosodic, lexical, gesture, posture, affective).
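As one concrete example of the entrainment measures mentioned above, prosodic entrainment can be approximated by correlating the child's and robot's pitch contours over a shared interaction window. This Pearson-correlation proxy is an illustrative simplification; lexical, gesture, posture and affective entrainment would each need their own measures.

```python
import numpy as np

def prosodic_entrainment(child_f0, robot_f0):
    """Pearson correlation between two time-aligned pitch (F0) contours,
    used here as a crude proxy for prosodic entrainment: values near +1
    indicate the two contours rise and fall together."""
    c = np.asarray(child_f0, dtype=float)
    r = np.asarray(robot_f0, dtype=float)
    c = c - c.mean()
    r = r - r.mean()
    denom = np.sqrt((c @ c) * (r @ r))
    return float(c @ r / denom) if denom > 0 else 0.0
```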