ADHA: A Benchmark for Recognizing Adverbs describing Human Actions in Videos

Download

Framework of the dataset

Videos

ADHA videos are split into 32 folders, one folder per action class. Since the folder name already identifies the action, we do not provide a separate action label.
./videos
    /brush_hair
    /chew
    /clap
    ...
    /wave
    /walk
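
To get an overview of the data, one can iterate over the action folders. Below is a minimal sketch, assuming the layout above and .avi clips (the root path is a placeholder):

import os

root = './videos'
# Each sub-folder of ./videos holds the clips for one action class.
for action in sorted(os.listdir(root)):
    action_dir = os.path.join(root, action)
    if not os.path.isdir(action_dir):
        continue
    clips = [f for f in os.listdir(action_dir) if f.endswith('.avi')]
    print(action, len(clips))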
              

Bounding-box

One video may have more than one target person. We use bounding boxes to identify them. A bounding box consists of two point coordinates, 4 numbers in total: x1, y1, x2, y2. Each line of the bounding-box file has the following structure:
videoname frame_number p1_x1 p1_y1 p1_x2 p1_y2 ... pN_x1 pN_y1 pN_x2 pN_y2
              
where "p1_x1" means the x coordinate of the first bounding-box point of the first person. "frame_number" index the bounding-box is at which frame. For example 1.avi 3 10 15 50 50 means the video 1.avi only have one target person the bounding-box is (x1,y1,x2,y2)=(10,15,50,50) and this boundingbox is at frame 3.

Labels

Labels are also split into 32 action folders. Each video is annotated by 3 annotators, so there are three groups of labels for each action. Labels are stored as pickle files with the following structure:
[[videoName_1,[adv_labels for person1],...,[adv_labels for personN]],...,[videoName_N,[adv_labels for person1],...,[adv_labels for personN]]]
              
If the adverb label for a target person is None or -1, none of the adverbs are appropriate for that person. The order of the persons in one video is the same as in the bounding-box file.
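
A minimal loading sketch (the file path below is a placeholder for wherever one annotator's label pickle lives; the rest follows the structure above):

import pickle

# Load one annotator's label file for an action (path is a placeholder).
with open('labels/brush_hair/annotator_1.pkl', 'rb') as f:
    labels = pickle.load(f)

# Each entry is [videoName, [adv labels for person 1], ..., [adv labels for person N]],
# with persons in the same order as in the bounding-box file.
for entry in labels:
    video_name, per_person = entry[0], entry[1:]
    for person_idx, adv_labels in enumerate(per_person):
        if adv_labels is None or adv_labels == -1:
            continue  # no adverb is appropriate for this person
        print(video_name, person_idx, adv_labels)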

Tracking Result

We provide tracking results for the target persons, initialized from the bounding boxes. The tracking results are also saved as pickle files with the following structure:
{'fps': val, 'type': 'rect', 'res': [[x1, y1, x2, y2], ..., [x1, y1, x2, y2]]}
              
The length of 'res' is the same as the number of frames in the video. Tracking starts from the frame of the initial bounding box, which means it may not start at frame 0; earlier frames are padded with 0.
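
A minimal reading sketch (the file path is a placeholder, and the padding check assumes the padded entries are zeros):

import pickle

# Load one tracking file (path is a placeholder).
with open('tracking/brush_hair/1.pkl', 'rb') as f:
    track = pickle.load(f)

fps = track['fps']
boxes = track['res']  # one [x1, y1, x2, y2] per video frame

# Frames before the tracker is initialized are padded with 0;
# find the first frame that has a real box.
first_tracked = next((i for i, b in enumerate(boxes) if b and any(b)), None)
print(fps, len(boxes), first_tracked)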