The document presents a method for detecting violent scenes in movies using affective audio and visual features. Low-level audio and visual features are extracted from movie segments, and the audio features are used to build a mid-level representation based on a Bag of Audio Words model. The audio and visual features are then fused and used to train an SVM classifier to detect violence. Experimental results on three movies showed that the mid-level representation achieved slightly better performance than the low-level features alone, but the affect-related features, especially the visual ones, need improvement. Future work will explore mid-level features such as facial features and more sophisticated motion descriptors.
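The pipeline described above (Bag of Audio Words over low-level frames, concatenation with visual features, SVM classification) can be sketched as follows. This is a minimal illustration on synthetic data, not the paper's implementation: the feature dimensions, vocabulary size, and use of scikit-learn are all assumptions, and real systems would extract the frame-level features (e.g. MFCCs) from the actual audio and video.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Synthetic stand-ins: per-frame low-level audio features (e.g. MFCC-like)
# and one visual feature vector per movie segment. Dimensions are illustrative.
n_segments, frames_per_seg, audio_dim, visual_dim, vocab = 40, 30, 12, 8, 16

audio_frames = [rng.normal(size=(frames_per_seg, audio_dim))
                for _ in range(n_segments)]
visual_feats = rng.normal(size=(n_segments, visual_dim))
labels = rng.integers(0, 2, size=n_segments)  # 1 = violent, 0 = non-violent

# Build the audio-word codebook by clustering all low-level audio frames.
kmeans = KMeans(n_clusters=vocab, n_init=10, random_state=0)
kmeans.fit(np.vstack(audio_frames))

def boaw_histogram(frames):
    """Quantize frames to their nearest audio word; return a normalized
    histogram, i.e. the segment's mid-level Bag-of-Audio-Words vector."""
    words = kmeans.predict(frames)
    hist = np.bincount(words, minlength=vocab).astype(float)
    return hist / hist.sum()

# Early fusion: concatenate the mid-level audio histogram with the
# segment-level visual features, then train the SVM on the fused vectors.
audio_mid = np.vstack([boaw_histogram(f) for f in audio_frames])
X = np.hstack([audio_mid, visual_feats])

clf = SVC(kernel="rbf").fit(X, labels)
preds = clf.predict(X)
```

In practice the codebook would be learned on training segments only, and the fused vectors evaluated on held-out movies, as in the paper's three-movie experiments.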