ADHA: A Benchmark for Recognizing Adverbs describing Human Actions in Videos

Who we are

We are part of the video understanding group within the Machine Vision and Intelligence Group at SJTU. We work on making AI understand video faster and more intelligently. Our long-term mission is to move computer vision beyond pattern recognition toward real AI, making AI more human-like.

This dataset is a first step toward human-level pattern recognition: adverb recognition. If we can teach an AI to understand the adverbs of an action, it implies the AI can understand the attitude and mood of the person performing the action, which is necessary for interactive robots. We also believe this is preliminary work toward making AI understand the purpose and intent behind actions.

Unlike actions, human action adverbs (HAAs) describe concepts with very subtle visual patterns that are difficult to recognize. For example, is the person drinking happy or sad? Is a handshake an expression of excitement or of politeness? Extensive experiments show that understanding adverbs is very challenging for current state-of-the-art deep learning architectures.

Note that in image captioning, adverbs may appear in the language material. However, existing captioning work does not treat adverbs as a target, and we believe adverb recognition will become an important tool for further advancing image captioning. Our paper describes the dataset in more detail, including the collection and annotation methods, and will soon be released on arXiv.
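Since one action can carry several adverbs at once (e.g. drinking both "quickly" and "happily"), adverb recognition naturally fits a multi-label formulation. The following is a minimal sketch of that framing only; the adverb list, feature dimension, and classifier here are illustrative assumptions, not the benchmark's actual label set or model:

```python
import numpy as np

# Hypothetical adverb vocabulary; the real ADHA label set ships with the dataset.
ADVERBS = ["happily", "sadly", "quickly", "slowly", "politely"]

def predict_adverbs(clip_feature, weights, bias, threshold=0.5):
    """Multi-label adverb prediction from a pooled video clip feature.

    Each adverb gets an independent sigmoid score, so several adverbs
    can be predicted for the same action at once.
    """
    logits = weights @ clip_feature + bias       # shape: (num_adverbs,)
    probs = 1.0 / (1.0 + np.exp(-logits))        # independent sigmoid per adverb
    return [a for a, p in zip(ADVERBS, probs) if p >= threshold]

# Toy example with random features and weights (stands in for a trained model).
rng = np.random.default_rng(0)
feat = rng.standard_normal(8)                    # pooled clip feature
W = rng.standard_normal((len(ADVERBS), 8)) * 0.1
b = np.zeros(len(ADVERBS))
print(predict_adverbs(feat, W, b))
```

In contrast to single-label action recognition (one softmax over action classes), the per-adverb sigmoid lets the model express that an action is simultaneously, say, fast and polite.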