The document describes a method for detecting violent scenes in videos using audio and visual features. Video segments are represented by sparse coding over learned audio and visual dictionaries, complemented by low-level motion and color descriptors. Violence is modeled by clustering the video segments and learning a separate model for each violence sub-concept. Experimental results show that mid-level audio representations based on MFCC features and sparse coding perform well, outperforming the visual representations. Future work includes improving the visual representations and further investigating the partitioning of the feature space.
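The mid-level audio pipeline described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the random matrix stands in for real MFCC frames, and the dictionary size (32 atoms), the OMP transform, the sparsity level (5 nonzero coefficients), and the max-pooling step are all illustrative choices.

```python
import numpy as np
from sklearn.decomposition import DictionaryLearning

# Hypothetical stand-in for MFCC features: 200 frames x 13 coefficients.
rng = np.random.default_rng(0)
mfcc_frames = rng.standard_normal((200, 13))

# Learn a dictionary of 32 atoms; encode each frame with at most
# 5 nonzero coefficients via orthogonal matching pursuit (OMP).
dico = DictionaryLearning(
    n_components=32,
    transform_algorithm="omp",
    transform_n_nonzero_coefs=5,
    max_iter=20,
    random_state=0,
)
codes = dico.fit_transform(mfcc_frames)  # shape: (200, 32)

# Mid-level segment representation: pool the sparse codes over all
# frames of the segment (max-pooling of absolute activations).
segment_repr = np.abs(codes).max(axis=0)  # shape: (32,)
```

A segment-level vector like `segment_repr` could then be fed to a per-sub-concept classifier after clustering the segments.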