Detection of Violent Scenes using Affective Features




Esra Acar
Competence Center Information Retrieval and Machine Learning




               4. October 2012
Outline

 Motivation
 Background
 The Method
   Audio Features
   Visual Features
 Results & Discussion
 Conclusions & Future Work




          4. October 2012   Detection of Violent Scenes using Affective Features   2
Motivation

 The MediaEval 2012 Affect Task aims at detecting violent
  segments in movies.

 A recent work on horror scene recognition detects horror
  scenes using affect-related features.

 We investigate whether
   affect-related features provide a good representation of
    violence, and
   making abstractions from low-level features is better than
    directly using low-level data.



Background

 The affective content of a video corresponds to
   the intensity (i.e. arousal), and
   the type (i.e. valence) of emotion
  expected to arise in the user while watching that video.

 Recent research efforts propose methods to map low-level
  features to high-level emotions.

 Film-makers intend to elicit particular emotions (i.e.
  expected emotions) in the audience.
 When violence is treated as an expected emotion in
  videos, affect-related features become applicable to
  violence detection.

The Method

 The method uses affect-related audio and visual features to
  represent violence.

 Low-level audio and visual features are extracted.
 Mid-level audio features are generated based on the
  low-level ones.

 The audio and visual features are then fused at the
  feature level and a two-class SVM is trained.
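As a rough sketch of this pipeline: feature-level (early) fusion simply concatenates the per-shot audio and visual descriptors before a single two-class classifier is trained. The data and dimensions below are illustrative, and a tiny hinge-loss (Pegasos-style) linear trainer stands in for the SVM package actually used.

```python
import numpy as np

# Illustrative per-shot descriptors; names and dimensions are assumptions.
rng = np.random.default_rng(0)
n = 60
audio = rng.normal(size=(n, 8))          # e.g. a mid-level audio descriptor
visual = rng.normal(size=(n, 1))         # e.g. average motion per shot
y = np.where(audio[:, 0] + visual[:, 0] > 0, 1, -1)  # toy labels in {-1, +1}

# Feature-level (early) fusion: concatenate the modalities per shot.
X = np.hstack([audio, visual])           # shape (n, 9)

# Minimal linear SVM via stochastic subgradient descent on the hinge loss.
w, b, lam = np.zeros(X.shape[1]), 0.0, 0.01
for t in range(1, 2001):
    i = rng.integers(n)
    eta = 1.0 / (lam * t)
    if y[i] * (X[i] @ w + b) < 1:        # margin violated: hinge gradient
        w = (1 - eta * lam) * w + eta * y[i] * X[i]
        b += eta * y[i]
    else:                                # only the regularizer shrinks w
        w = (1 - eta * lam) * w

acc = float(np.mean(np.sign(X @ w + b) == y))
```

Early fusion keeps a single decision boundary over both modalities; the alternative, late fusion, would train one classifier per modality and merge their scores.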




Audio Features - 1

 Affect-related audio features used in the work are:
   Audio energy
          related to the arousal aspect.
          high/low energy corresponds to high/low emotion intensity.
          used for vocal emotion detection.
      Mel-Frequency Cepstral Coefficients (MFCC)
          related to the arousal aspect.
          works well for the detection of excitement/non-excitement.
      Pitch
          related to the valence aspect.
          significant for emotion detection in speech and music.
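Of these features, audio energy is the simplest to illustrate: short-time energy is the mean squared amplitude per frame, so louder passages yield higher values. The synthetic signal, sample rate, and frame/hop sizes below are illustrative assumptions.

```python
import numpy as np

# A 1-second 440 Hz tone whose loudness ramps up over time.
sr = 16000
t = np.arange(sr) / sr
signal = np.sin(2 * np.pi * 440 * t) * np.linspace(0.1, 1.0, sr)

# 25 ms frames with a 10 ms hop at 16 kHz (common, but an assumption here).
frame, hop = 400, 160
n_frames = 1 + (len(signal) - frame) // hop
energy = np.array([
    np.mean(signal[i * hop : i * hop + frame] ** 2) for i in range(n_frames)
])
# Energy rises with the amplitude ramp: high energy ~ high emotion intensity.
```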




Audio Features - 2

    Each video shot yields a different number of audio energy, pitch and
     MFCC feature vectors (due to varying shot durations).
   Audio representations are obtained by computing mean and
    standard deviation for these audio features.
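This mean/standard-deviation pooling can be sketched as follows; the frame counts and the 13-dimensional, MFCC-like vectors are illustrative.

```python
import numpy as np

# Three shots of different durations, each a (n_frames, 13) feature sequence.
rng = np.random.default_rng(1)
shots = [rng.normal(size=(n_frames, 13)) for n_frames in (87, 140, 52)]

# Pool each variable-length sequence into a fixed-length shot descriptor:
# per-dimension mean concatenated with per-dimension standard deviation.
descriptors = np.array([
    np.concatenate([frames.mean(axis=0), frames.std(axis=0)])
    for frames in shots
])
# Every shot now maps to the same 26-dim vector regardless of its duration.
```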

   Abstraction for MFCC:
     MFCC-based Bag of Audio Words (BoAW) approach is chosen to
       generate mid-level audio representations.
     Two different audio vocabularies are constructed: violence and
       non-violence vocabularies (by k-means clustering).
      MFCCs of violent/non-violent movie segments are used to
        construct the violence/non-violence words.
     Violence and non-violence word occurrences within a video shot
       are represented by a BoAW histogram.
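The BoAW idea can be sketched with a single vocabulary (the slide describes two, one per class, whose histograms would then be concatenated); the tiny k-means implementation, the vocabulary size, and the data are all illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
train = rng.normal(size=(500, 13))       # MFCC-like training frames

def kmeans(X, k, iters=20):
    """Toy k-means: random initial centers, alternating assign/update."""
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        assign = np.argmin(((X[:, None] - centers) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(assign == j):
                centers[j] = X[assign == j].mean(axis=0)
    return centers

vocab = kmeans(train, k=8)               # the "audio words"

# One video shot: quantize each frame to its nearest word, then histogram.
shot = rng.normal(size=(120, 13))
words = np.argmin(((shot[:, None] - vocab) ** 2).sum(-1), axis=1)
boaw = np.bincount(words, minlength=8) / len(shot)   # normalized BoAW histogram
```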



Visual Features

 Average motion
   related to the arousal aspect.
   Motion vectors are computed using block-based motion
     estimation.
    Average motion is computed as the average magnitude of all
      motion vectors.

 We compute average motion around the keyframe of video
  shots.
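A minimal sketch of the average-motion feature, assuming exhaustive block matching with a sum-of-absolute-differences cost; the frame content, block size, and search radius are illustrative, not the parameters used in the work.

```python
import numpy as np

# Two overlapping crops of one image simulate consecutive frames: the
# content of curr at (y, x) comes from prev at (y + 2, x + 1).
rng = np.random.default_rng(3)
base = rng.random((40, 40))
prev = base[0:32, 0:32]
curr = base[2:34, 1:33]

B, R = 8, 3                              # block size and search radius
mags = []
for y in range(0, 32 - B + 1, B):
    for x in range(0, 32 - B + 1, B):
        block = curr[y:y + B, x:x + B]
        best, best_v = np.inf, (0, 0)
        for dy in range(-R, R + 1):      # exhaustive search in prev
            for dx in range(-R, R + 1):
                yy, xx = y + dy, x + dx
                if 0 <= yy <= 32 - B and 0 <= xx <= 32 - B:
                    err = np.abs(block - prev[yy:yy + B, xx:xx + B]).sum()
                    if err < best:
                        best, best_v = err, (dy, dx)
        mags.append(np.hypot(*best_v))   # magnitude of this block's vector

avg_motion = float(np.mean(mags))        # the arousal-related scalar feature
```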




Results & Discussion - 1

 The performance of our method was assessed on 3
  Hollywood movies (evaluation criterion: MAP at 100).

 We submitted five runs:
   r1-low-level: low-level audio and visual features,
   Runs based on mid-level audio and low-level visual features
         r2-mid-level-100k: 100k samples for dictionary construction,
         r3-mid-level-300k: 300k samples for dictionary construction,
         r4-mid-level-300k-default: 300k samples for dictionary
          construction + SVM default parameters, and
         r5-mid-level-500k: 500k samples for dictionary construction.




Results & Discussion - 2
                   Table 1  Precision, Recall and F-measure at shot level
        Run                                    AED-P           AED-R             AED-F
        r1-low-level                             0.141           0.597           0.2287
        r2-mid-level-100k                        0.140           0.629           0.2285
        r3-mid-level-300k                        0.144           0.625           0.2337
        r4-mid-level-300k-default                0.190           0.627           0.2971
        r5-mid-level-500k                        0.154           0.603           0.2457

                   Table 2  Mean Average Precision (MAP) values at 20 and 100
        Run                                      MAP at 20               MAP at 100
        r1-low-level                                0.2132                   0.18502
        r2-mid-level-100k                           0.2037                   0.14492
        r3-mid-level-300k                           0.3593                   0.18538
        r4-mid-level-300k-default                   0.1547                   0.15083
        r5-mid-level-500k                            0.15                    0.11527
   Slightly better performance is achieved with mid-level representations compared
    to the low-level one.
    The affect-related description of violence still needs improvement
     (especially its visual part).
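For reference, average precision at k (the per-movie quantity whose mean over movies gives MAP@k) can be sketched as below. This is one common variant, normalizing by the number of relevant shots retrieved within the cutoff; the toy ranking is illustrative.

```python
def ap_at_k(ranked_relevance, k):
    """Average precision over the top-k ranked shots (1 = violent hit)."""
    hits, score = 0, 0.0
    for i, rel in enumerate(ranked_relevance[:k], start=1):
        if rel:
            hits += 1
            score += hits / i            # precision at each relevant rank
    return score / hits if hits else 0.0

# A toy ranked list of shots for one movie.
ranking = [1, 0, 1, 1, 0, 0, 1, 0]
ap = ap_at_k(ranking, k=100)
```

MAP@k would then be the mean of `ap_at_k` over the ranked lists of all test movies.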
Conclusions & Future Work

 The aim of this work was to investigate whether
  affect-related features are well-suited to describe violence.
 Affect-related audio and visual features are merged in a
  supervised manner using SVM.

 Our main finding is that more sophisticated affect-related
  features are necessary to describe the content of videos
  (especially the visual part).
 Our next step in this work is to use
   mid-level features such as human facial features, and
   more sophisticated motion descriptors such as Lagrangian
      measures
  for video content representation.


Thank you!


                  Questions?


