Abstract. We address the problem of articulated human pose estimation by learning a coarse-to-fine cascade of pictorial structure models. While the fine-level state space of poses of individual parts is too large to permit the use of rich appearance models, most possibilities can be ruled out by efficient structured models at a coarser scale. We propose to learn a sequence of structured models at different pose resolutions, where coarse models filter the pose space for the next level via their max-marginals. The cascade is trained to prune as much as possible while preserving true poses for the final-level pictorial structure model. The final level applies much more expensive segmentation, contour, and shape features to the remaining filtered set of candidates. We evaluate our framework on the challenging Buffy and PASCAL human pose datasets, improving on the state of the art.
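
To make the filtering step concrete, the following is a minimal sketch of max-marginal pruning, not the implementation evaluated here. It assumes a chain-structured model (a special case of the tree-structured pictorial structure) with hypothetical `unary` and `pairwise` score tables. The threshold rule, a convex combination of the best score and the mean max-marginal controlled by a hypothetical parameter `alpha`, is also an assumption made for illustration; the abstract does not specify the rule.

```python
def max_marginals(unary, pairwise):
    """Max-marginals for a chain model (max-sum forward/backward pass).

    unary[i][a]       : score of part i in state a
    pairwise[i][a][b] : score of part i in state a with part i+1 in state b
    Returns m, where m[i][a] is the best total score over all full
    assignments that place part i in state a.
    """
    n = len(unary)

    # Forward pass: best score of parts 0..i ending with part i in each state.
    fwd = [list(unary[0])]
    for i in range(1, n):
        prev = fwd[-1]
        fwd.append([
            unary[i][b]
            + max(prev[a] + pairwise[i - 1][a][b] for a in range(len(prev)))
            for b in range(len(unary[i]))
        ])

    # Backward pass: best score of parts i+1..n-1 given the state of part i.
    bwd = [[0.0] * len(u) for u in unary]
    for i in range(n - 2, -1, -1):
        nxt = bwd[i + 1]
        bwd[i] = [
            max(pairwise[i][a][b] + unary[i + 1][b] + nxt[b]
                for b in range(len(nxt)))
            for a in range(len(unary[i]))
        ]

    return [[f + b for f, b in zip(fi, bi)] for fi, bi in zip(fwd, bwd)]


def prune(unary, pairwise, alpha=0.5):
    """Keep the states whose max-marginal reaches the threshold.

    The threshold (an assumption of this sketch) interpolates between the
    mean max-marginal (alpha=0) and the best overall score (alpha=1).
    """
    m = max_marginals(unary, pairwise)
    flat = [v for row in m for v in row]
    threshold = alpha * max(flat) + (1 - alpha) * (sum(flat) / len(flat))
    return [[a for a, v in enumerate(row) if v >= threshold] for row in m]


# Toy example: two parts with two candidate states each.
unary = [[1.0, 0.0], [0.0, 2.0]]
pairwise = [[[0.0, -1.0], [-3.0, 0.0]]]
kept = prune(unary, pairwise)  # surviving state indices per part
```

In a cascade, the states in `kept` would be refined to a finer pose resolution and passed to the next model. For any `alpha` between 0 and 1, the states on the highest-scoring assignment always survive under this rule, because their max-marginal equals the best overall score.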