Action recognition is important for understanding human behaviors in video, and the video representation is the basis of action recognition. This paper proposes a new video representation based on convolutional neural networks (CNN). To capture human motion information, one CNN takes both optical flow maps and gray images as input and combines multiple convolutional features by max pooling across frames. A second CNN takes a single color frame as input to capture context information. Finally, we take the top fully connected layer vectors as the video representation and train classifiers with a linear support vector machine. Experimental results show that the representation integrating optical flow maps and gray images is more discriminative than representations relying on only one of the two. On the most challenging datasets, HMDB51 and UCF101, this video representation obtains competitive performance.
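A minimal PyTorch sketch of this two-network scheme follows; the layer sizes, the class names (MotionStream, ContextStream), the number of stacked flow/gray channels, and the exact placement of the cross-frame max pooling are illustrative assumptions rather than the authors' architecture.

```python
# Sketch of the two-network video representation (sizes and names are assumptions).
import torch
import torch.nn as nn

class MotionStream(nn.Module):
    """CNN over stacked optical-flow maps + gray images; convolutional features
    are max-pooled across frames before the top fully connected layer."""
    def __init__(self, in_channels=11, feat_dim=4096):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_channels, 96, 7, stride=2, padding=3), nn.ReLU(),
            nn.MaxPool2d(3, 2),
            nn.Conv2d(96, 256, 5, stride=2, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d((6, 6)),
        )
        self.fc = nn.Linear(256 * 6 * 6, feat_dim)

    def forward(self, clips):                      # clips: (T, C, H, W)
        feats = self.conv(clips)                   # per-frame conv features
        pooled = torch.max(feats, dim=0).values    # max pooling across frames
        return torch.relu(self.fc(pooled.flatten()))

class ContextStream(nn.Module):
    """CNN over a single color frame for context information."""
    def __init__(self, feat_dim=4096):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 96, 7, stride=2, padding=3), nn.ReLU(),
            nn.AdaptiveAvgPool2d((6, 6)),
        )
        self.fc = nn.Linear(96 * 6 * 6, feat_dim)

    def forward(self, frame):                      # frame: (3, H, W)
        return torch.relu(self.fc(self.conv(frame.unsqueeze(0)).flatten()))

# The video descriptor concatenates the two top fully connected vectors; a linear
# SVM (e.g., sklearn.svm.LinearSVC) would then be trained on these descriptors.
motion, context = MotionStream(), ContextStream()
flow_and_gray = torch.randn(16, 11, 112, 112)   # T frames of stacked flow + gray (assumed layout)
color_frame = torch.randn(3, 112, 112)
video_vec = torch.cat([motion(flow_and_gray), context(color_frame)])
```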
A new method for estimating gain factors in an amplitude panning system is proposed. The method is based on particle velocity and a balanced sound energy formulation. A scale factor is introduced into the amplitude panning system, which yields an overdetermined system of equations from the particle velocity equation. To obtain the analytic solution of this overdetermined system, the sound energy identity is imposed, and the unique gain factors are then estimated. The proposed method can reproduce the sound source direction and control distance perception in a flexible two- or three-dimensional loudspeaker setup. Subjective evaluations show that the proposed technique maintains the sound direction and controls distance perception at the listening point in an aspherical loudspeaker setup.
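The following numpy sketch illustrates one way to realize velocity-based gain estimation with an energy constraint; the least-squares-plus-normalization solver and the role assigned to the scale factor are assumptions for illustration, not the paper's analytic derivation.

```python
# Illustrative sketch: velocity-matched panning gains with an energy constraint.
import numpy as np

def panning_gains(speaker_dirs, source_dir, scale=1.0):
    """speaker_dirs: (d, N) unit vectors toward the loudspeakers (d = 2 or 3).
    source_dir:   (d,) unit vector toward the virtual source.
    scale:        factor assumed here to control distance perception."""
    U = np.asarray(speaker_dirs, dtype=float)
    target = scale * np.asarray(source_dir, dtype=float)
    # Particle-velocity matching: sum_i g_i * u_i ≈ scale * u_source
    g, *_ = np.linalg.lstsq(U, target, rcond=None)
    # Energy identity: keep total sound energy fixed, sum_i g_i^2 = 1
    return g / np.linalg.norm(g)

# Example: a 2D stereo pair at ±30°, virtual source at 10°
deg = np.deg2rad([-30.0, 30.0])
speakers = np.stack([np.cos(deg), np.sin(deg)])        # shape (2, 2)
src = np.array([np.cos(np.deg2rad(10.0)), np.sin(np.deg2rad(10.0))])
print(panning_gains(speakers, src))
```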
Recognizing actions from video features is an important problem in a wide range of applications. In this paper, we propose a temporal scale-invariant deep learning framework for action recognition that is robust to changes in action speed. Specifically, a video is first split into several sub-action clips, and a keyframe is selected from each clip. The spatial and motion features of the keyframe are extracted separately by two convolutional neural networks (CNN) and combined in a convolutional fusion layer that learns the relationship between the features. Then, long short-term memory (LSTM) networks are applied to the fused features to model long-term temporal clues. Finally, the action prediction scores of the LSTM network are combined by linear weighted summation. Extensive experiments are conducted on two popular and challenging benchmarks, namely UCF-101 and HMDB51. On both benchmarks, our framework outperforms state-of-the-art methods, achieving 93.7% on UCF-101 and 69.5% on HMDB51.
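Below is a minimal PyTorch sketch of the clip-splitting, fusion, and LSTM pipeline; the backbone sizes, the keyframe handling, the 1x1 fusion layer, and the linear weighting of the per-step scores are assumptions for illustration, not the authors' exact configuration.

```python
# Sketch of keyframe feature fusion followed by an LSTM (sizes are assumptions).
import torch
import torch.nn as nn

class FusionLSTMRecognizer(nn.Module):
    def __init__(self, num_classes=101, feat=128):
        super().__init__()
        self.spatial = nn.Sequential(nn.Conv2d(3, feat, 3, padding=1), nn.ReLU(),
                                     nn.AdaptiveAvgPool2d(1))
        self.motion = nn.Sequential(nn.Conv2d(2, feat, 3, padding=1), nn.ReLU(),
                                    nn.AdaptiveAvgPool2d(1))
        # Convolutional fusion layer: learns the relation between the two features
        self.fuse = nn.Conv2d(2 * feat, feat, 1)
        self.lstm = nn.LSTM(feat, feat, batch_first=True)
        self.classify = nn.Linear(feat, num_classes)

    def forward(self, rgb_keyframes, flow_keyframes):
        # rgb_keyframes: (K, 3, H, W), flow_keyframes: (K, 2, H, W),
        # one keyframe per sub-action clip (K clips per video)
        s = self.spatial(rgb_keyframes)            # (K, feat, 1, 1)
        m = self.motion(flow_keyframes)            # (K, feat, 1, 1)
        fused = self.fuse(torch.cat([s, m], dim=1)).flatten(1)   # (K, feat)
        out, _ = self.lstm(fused.unsqueeze(0))     # long-term temporal clues
        scores = self.classify(out.squeeze(0))     # per-step class scores (K, C)
        # Linear weighted summation of the per-step prediction scores
        weights = torch.linspace(0.5, 1.0, scores.size(0)).unsqueeze(1)
        return (weights * scores).sum(dim=0)

model = FusionLSTMRecognizer()
video_score = model(torch.randn(8, 3, 64, 64), torch.randn(8, 2, 64, 64))
```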
Huafeng Chen, Jun Chen, Ruimin Hu, Chen Chen, Zhongyuan Wang
The 22.2 multichannel system and its simplified 10-channel and 8-channel systems have been proposed, bringing listeners a 3D listening experience. However, these systems can accurately reproduce the sound field only at a central listening point known as the sweet spot. To address this problem, this paper proposes PVMDZ (particle velocity matching between different zones), a non-central-zone sound field reproduction method based on the physical properties of sound. The proposed method matches the physical properties of sound in a non-central zone of the reconstructed sound field to those of the central zone in the original sound field, so the reproduced non-central zone provides the same listening experience as the central zone of the original system. Experiments comparing the proposed method with the traditional one show that the proposed method reduces the sound field error.
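A rough numpy sketch of the particle-velocity-matching idea is shown below; the unit-vector velocity model per loudspeaker, the control-point sampling of the non-central zone, and the least-squares solver are assumptions for illustration, not the PVMDZ formulation itself.

```python
# Sketch: choose reproduction gains so the particle velocity at control points
# in a non-central zone matches the velocity of the original central zone.
import numpy as np

def pvmdz_gains(spk_pos, control_pts, target_velocity):
    """spk_pos:         (N, d) loudspeaker positions.
    control_pts:     (M, d) sample points inside the non-central zone.
    target_velocity: (M, d) particle velocity of the original field at the
                     central zone, mapped onto the control points (assumed input).
    Returns N loudspeaker gains via a least-squares velocity match."""
    N, d = spk_pos.shape
    M = control_pts.shape[0]
    A = np.zeros((M * d, N))
    for m, p in enumerate(control_pts):
        for n, s in enumerate(spk_pos):
            r = s - p
            A[m * d:(m + 1) * d, n] = r / np.linalg.norm(r)  # assumed velocity direction per speaker
    b = target_velocity.reshape(-1)
    g, *_ = np.linalg.lstsq(A, b, rcond=None)
    return g

# Example: 4 loudspeakers, 3 control points in a zone off the sweet spot
spk = np.array([[1.5, 0.0], [0.0, 1.5], [-1.5, 0.0], [0.0, -1.5]])
pts = np.array([[0.4, 0.1], [0.5, 0.0], [0.4, -0.1]])
v_target = np.tile([0.7, 0.3], (3, 1))   # desired velocity copied from the central zone
print(pvmdz_gains(spk, pts, v_target))
```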