We address the problem of understanding an indoor scene from a single image in terms of recovering the layouts of the faces (ﬂoor, ceiling, walls) and furniture. A major challenge of this task arises from the fact that most indoor scenes are cluttered by furniture and decorations, whose appearances vary drastically across scenes, and can hardly be modeled (or even hand-labeled) consistently. In this paper we tackle this problem by introducing latent variables to account for clutters, so that the observed image is jointly explained by the face and clutter layouts. Model parameters are learned in the maximum margin formulation, which is constrained by extra prior energy terms that deﬁne the role of the latent variables. Our approach enables taking into account and inferring indoor clutter layouts without hand-labeling of the clutters in the training set. Yet it outperforms the state-of-the-art method of Hedau et al.  that requires clutter labels.