Action tube detection in a `biking' video taken from the UCF-101 dataset. The detection boxes in each frame are linked up to form space-time action tubes. (a) The video viewed as a 3D volume with selected image frames; note that we are able to detect multiple action instances in both space and time. (b) Top-down view. Our method can detect several (more than two) action instances concurrently, as the figure shows.
In this work we propose a new approach to the spatiotemporal localisation (detection) and classification of multiple concurrent actions within temporally untrimmed videos. Our framework is composed of three stages. In stage 1, two cascades of deep region proposal and detection networks, one appearance-based and one motion-based, are employed to classify regions of each video frame potentially containing an action of interest. In stage 2, appearance and motion cues are combined by merging the detection boxes and softmax classification scores generated by the two cascades. In stage 3, sequences of detection boxes most likely to be associated with a single action instance, called `action tubes', are constructed by solving two optimisation problems via dynamic programming. In the first pass, action paths spanning the whole video are built by linking detection boxes over time using their class-specific scores and their spatial overlap; in the second pass, temporal trimming is performed by ensuring label consistency for all the constituent detection boxes. We demonstrate the performance of our algorithm on the challenging UCF-101, J-HMDB-21 and LIRIS-HARL datasets, achieving new state-of-the-art results across the board and significantly lower detection latency at test time.
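For illustration, a minimal Python sketch of the first dynamic-programming pass is given below: it links one detection per frame into a class-specific action path by maximising the sum of class-specific scores plus a weighted spatial overlap between consecutive boxes. The pairwise reward and the weight `lam` are illustrative assumptions; in practice the recovered path would be removed and the procedure repeated to detect multiple concurrent instances.

```python
import numpy as np

def iou(a, b):
    """Intersection-over-union of two boxes given as [x1, y1, x2, y2]."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / (union + 1e-8)

def link_action_path(frame_boxes, frame_scores, lam=1.0):
    """First DP pass (sketch): select one detection per frame so as to maximise
    the sum of class-specific scores plus lam * spatial overlap between boxes
    in consecutive frames, solved exactly by Viterbi-style dynamic programming.

    frame_boxes[t]  : (N_t, 4) array of detection boxes in frame t
    frame_scores[t] : (N_t,)  array of class-specific fused scores
    Returns the list of selected box indices, one per frame.
    """
    T = len(frame_boxes)
    best = [np.asarray(frame_scores[0], dtype=float)]    # best[t][i]: best reward of a path ending at box i
    back = [np.zeros(len(frame_scores[0]), dtype=int)]   # back-pointers to the previous frame's box
    for t in range(1, T):
        # pairwise transition reward between every box at t-1 and every box at t
        trans = np.array([[lam * iou(prev, cur) for prev in frame_boxes[t - 1]]
                          for cur in frame_boxes[t]])     # shape (N_t, N_{t-1})
        total = trans + best[t - 1][None, :]
        back.append(total.argmax(axis=1))
        best.append(np.asarray(frame_scores[t], dtype=float) + total.max(axis=1))
    # backtrack the highest-scoring path
    path = [int(best[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1]
```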
At test time, (a) RGB and flow images are passed to (b) two separate region proposal networks (RPNs). (c) Each network outputs region proposals with associated actionness scores. (d) Each appearance/flow detection network takes as input the relevant image and RPN-generated region proposals, and (e) outputs detection boxes and softmax probability scores. (f) Appearance and flow detections are fused and (g) linked up to generate class-specific action paths spanning the whole video. (h) Finally, the action paths are temporally trimmed to form action tubes.
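A hedged sketch of the fusion step (f) follows, under the assumption that the class-specific score of each appearance detection is averaged with that of its best-overlapping flow detection; the exact fusion rule and the overlap threshold are illustrative, not prescriptive.

```python
import numpy as np

def iou(a, b):
    """IoU of two boxes [x1, y1, x2, y2] (same helper as in the linking sketch)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / (union + 1e-8)

def fuse_detections(app_boxes, app_scores, flow_boxes, flow_scores, iou_thresh=0.3):
    """Sketch of stage-2 fusion: boost each appearance detection with its
    best-overlapping flow detection (assumed mean-score fusion).

    app_boxes  : (N, 4) boxes from the appearance network, app_scores  : (N,)
    flow_boxes : (M, 4) boxes from the motion network,     flow_scores : (M,)
    Returns fused class-specific scores for the appearance boxes.
    """
    fused = np.asarray(app_scores, dtype=float).copy()
    if len(flow_boxes) == 0:
        return fused
    for i, box in enumerate(app_boxes):
        overlaps = np.array([iou(box, fb) for fb in flow_boxes])
        j = int(overlaps.argmax())
        if overlaps[j] > iou_thresh:
            fused[i] = 0.5 * (fused[i] + flow_scores[j])
    return fused
```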
Spatio-temporal overlap threshold δ | 0.05 | 0.1 | 0.2 | 0.3 | 0.4 | 0.5 | 0.6 |
---|---|---|---|---|---|---|---|
Yu et al. [2] | 42.80 | - | - | - | - | - | - |
Weinzaepfel et al. [1] | 54.28 | 51.68 | 46.77 | 37.82 | - | - | - |
Ours (appearance detection model) | 67.56 | 65.45 | 56.55 | 48.52 | 39.00 | 30.64 | 22.89 |
Ours (motion detection model) | 65.19 | 62.94 | 55.68 | 46.32 | 37.55 | 27.84 | 18.75 |
Ours (appearance + motion fusion) | 78.85 | 76.12 | 66.36 | 54.93 | 45.24 | 34.82 | 25.86 |
Spatio-temporal overlap threshold δ | 0.1 | 0.2 | 0.3 | 0.4 | 0.5 | 0.6 | 0.7 |
---|---|---|---|---|---|---|---|
Gkioxari and Malik [3] | - | - | - | - | 53.3 | - | - |
Wang et al. [4] | - | - | - | - | 56.4 | - | - |
Weinzaepfel et al. [1] | - | 63.1 | 63.5 | 62.2 | 60.7 | - | - |
Ours (appearance detection model) | 52.99 | 52.94 | 52.57 | 52.22 | 51.34 | 49.55 | 45.65 |
Ours (motion detection model) | 69.63 | 69.59 | 69.49 | 69.00 | 67.90 | 65.25 | 54.35 |
Ours (appearance + motion fusion) | 72.65 | 72.63 | 72.59 | 72.24 | 71.50 | 68.73 | 56.57 |
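For reference, the spatio-temporal overlap δ used in the two tables above can be computed as sketched below, under the common definition (assumed here) of the temporal IoU between the two tubes' frame spans multiplied by the mean per-frame spatial IoU over the shared frames; a detected tube counts as correct when this overlap exceeds δ and the predicted label matches the ground truth.

```python
import numpy as np

def iou(a, b):
    """IoU of two boxes [x1, y1, x2, y2]."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / (union + 1e-8)

def tube_overlap(tube_a, tube_b):
    """Spatio-temporal overlap of two action tubes, each given as a dict
    {frame_index: box}. Assumed definition: temporal IoU of the frame spans
    times the mean per-frame spatial IoU over the shared frames."""
    shared = set(tube_a) & set(tube_b)
    if not shared:
        return 0.0
    temporal_iou = len(shared) / float(len(set(tube_a) | set(tube_b)))
    spatial_iou = float(np.mean([iou(tube_a[t], tube_b[t]) for t in shared]))
    return temporal_iou * spatial_iou
```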
We report space-time detection results by fixing the threshold quality level to 10% for the four thresholds [5] and measuring temporal precision and recall along with spatial precision and recall, to produce an integrated score. We refer the reader to [5] for more details on LIRIS-HARL's evaluation metrics.
Abbreviations used in this table:
I-SR: Integrated spatial recall; I-SP: Integrated spatial precision;
I-TR: Integrated temporal recall; I-TP: Integrated temporal precision;
IQ: Integrated quality score.
Method | Recall-10 | Precision-10 | F1-Score-10 | I-SR | I-SP | I-TR | I-TP | IQ |
---|---|---|---|---|---|---|---|---|
VPULABUAM-13-IQ [6] | 0.04 | 0.08 | 0.05 | 0.02 | 0.03 | 0.03 | 0.03 | 0.03 |
IACAS-51-IQ [7] | 0.03 | 0.04 | 0.03 | 0.01 | 0.01 | 0.03 | 0.00 | 0.02 |
Ours (appearance + motion fusion) | 0.568 | 0.595 | 0.581 | 0.5383 | 0.3402 | 0.4802 | 0.4739 | 0.458 |
Spatio-temporal overlap threshold δ | 0.1 | 0.2 | 0.3 | 0.4 | 0.5 |
---|---|---|---|---|---|
Appearance detection model | 46.21 | 41.94 | 31.38 | 25.22 | 20.43 |
Motion detection model | 52.76 | 46.58 | 35.54 | 26.11 | 19.28 |
Appearance + motion fusion with one DP pass | 38.10 | 29.46 | 23.58 | 14.54 | 9.59 |
Appearance + motion fusion with two DP passes | 54.18 | 49.10 | 35.91 | 28.03 | 21.36 |
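The gain of the last row over the single-pass variant comes from the second dynamic-programming pass (temporal trimming). The sketch below casts it as a two-label Viterbi smoothing problem over the boxes of one action path; the unary rewards and the constant label-switching penalty `alpha` are illustrative assumptions rather than the exact energy used in the paper.

```python
import numpy as np

def temporally_trim(path_scores, alpha=3.0):
    """Second DP pass (sketch): label each box on an action path as action (1) or
    background (0), maximising the summed per-frame reward minus a constant
    penalty `alpha` per label change; frames labelled 1 form the trimmed tube.

    path_scores : (T,) class-specific scores of the boxes along one action path.
    """
    scores = np.asarray(path_scores, dtype=float)
    T = len(scores)
    unary = np.stack([1.0 - scores, scores], axis=1)      # columns: background, action
    best = unary[0].copy()
    back = np.zeros((T, 2), dtype=int)
    for t in range(1, T):
        new_best = np.zeros(2)
        for lbl in (0, 1):
            trans = best - alpha * (np.arange(2) != lbl)  # penalise a label switch
            back[t, lbl] = int(trans.argmax())
            new_best[lbl] = unary[t, lbl] + trans.max()
        best = new_best
    # backtrack the optimal labelling
    labels = [int(best.argmax())]
    for t in range(T - 1, 0, -1):
        labels.append(int(back[t, labels[-1]]))
    return np.array(labels[::-1])
```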
Ground-truth boxes are in green, detection boxes in red. The top row shows correct detections, while the bottom one contains examples with more mixed results. In the last frame, 3 out of 4 `Fencing' instances are nevertheless correctly detected.
Left-most three frames: accurate detection examples. Right-most three frames: mis-detection examples.
Frames from the space-time action detection results on LIRIS-HARL, some of which include single actions involving more than one person, such as `handshaking' and `discussion'. Left-most three frames: accurate detection examples. Right-most three frames: mis-detection examples.
Each row represents a UCF-101 test video clip. Ground-truth bounding boxes are in green, detection boxes in red.
Performance comparison between Selective Search (SS) and RPN-based region proposals on four groups of action classes (one group per column) in UCF-101. Top row: recall vs. IoU curves for SS. Bottom row: results for RPN-based region proposals.
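The recall-vs-IoU curves compared in this figure can be reproduced along the lines of the sketch below, assuming that recall at a given IoU threshold is the fraction of ground-truth boxes matched by at least one proposal with at least that overlap.

```python
import numpy as np

def iou(a, b):
    """IoU of two boxes [x1, y1, x2, y2]."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / (union + 1e-8)

def recall_vs_iou(gt_by_frame, proposals_by_frame, thresholds=np.arange(0.1, 1.0, 0.05)):
    """Recall of a proposal set (SS or RPN) as a function of the IoU threshold:
    the fraction of ground-truth boxes matched by at least one proposal.

    gt_by_frame, proposals_by_frame : lists of (K, 4) arrays, one entry per frame.
    Returns an array of recall values, one per threshold.
    """
    best = []   # best proposal IoU for every ground-truth box
    for gts, props in zip(gt_by_frame, proposals_by_frame):
        for g in gts:
            best.append(max((iou(g, p) for p in props), default=0.0))
    best = np.array(best)
    return np.array([(best >= t).mean() for t in thresholds])
```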
Note that the reference numbers are in line with our BMVC 2016 supplementary material.