This document proposes representing videos with local feature clusters to capture the spatial and temporal information that bag-of-features representations discard. Local features are grouped into clusters according to their proximity in space and time, and each cluster is then represented independently with its own bag-of-features histogram, allowing actions to be localized. An experiment on classifying 7 actions from TRECVID videos found that the optimal number of clusters varies by action class, with performance generally improving up to 8-16 clusters and declining as more clusters are added.
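A minimal sketch of the pipeline described above, under stated assumptions: the document only says features are grouped by spatio-temporal proximity, so k-means over (x, y, t) coordinates is used here as an illustrative clustering choice, and the function name `cluster_bof_descriptor`, the per-cluster L1 normalization, and all parameter values are hypothetical rather than taken from the source.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_bof_descriptor(positions, word_ids, vocab_size, n_clusters=8):
    """Group local features into spatio-temporal clusters and build one
    bag-of-features histogram per cluster.

    positions : (N, 3) array of (x, y, t) feature locations
    word_ids  : (N,) array of visual-word indices, each in [0, vocab_size)
    Returns an (n_clusters, vocab_size) array of L1-normalized histograms.
    """
    # Cluster features by spatio-temporal proximity. K-means is an
    # assumption; the source only states proximity-based grouping.
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(positions)

    # Represent each cluster independently as a bag-of-features histogram.
    histograms = np.zeros((n_clusters, vocab_size))
    for cluster_id in range(n_clusters):
        words = word_ids[labels == cluster_id]
        counts = np.bincount(words, minlength=vocab_size)
        total = counts.sum()
        if total > 0:
            histograms[cluster_id] = counts / total  # L1 normalization
    return histograms

# Usage example with synthetic data: 500 features in a 320x240 video
# spanning 100 frames, quantized against a 100-word visual vocabulary.
rng = np.random.default_rng(0)
positions = rng.random((500, 3)) * [320, 240, 100]  # x, y, t scales
word_ids = rng.integers(0, 100, size=500)
descriptor = cluster_bof_descriptor(positions, word_ids, vocab_size=100)
print(descriptor.shape)  # (8, 100): one histogram per cluster
```

The resulting per-cluster histograms could then be fed to a classifier; varying `n_clusters` (e.g., over 1-32) would reproduce the kind of sweep the experiment describes, where 8-16 clusters worked best for most action classes.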