Learning Visual Clothing Style with Heterogeneous Dyads
‘What outfit goes well with this pair of shoes?’ To answer this type of question, one has to go beyond learning visual similarity and learn a visual notion of compatibility across categories. In this paper, we propose a novel learning framework to help answer such questions.
Paper
Learning Visual Clothing Style with Heterogeneous Dyadic Co-occurrences. International Conference on Computer Vision (ICCV), Santiago, Chile, 2015. (*Equal Contribution)
Abstract
With the rapid proliferation of smart mobile devices, users now take millions of photos every day. These include large numbers of clothing and accessory images. We would like to answer questions like ‘What outfit goes well with this pair of shoes?’ To answer these types of questions, one has to go beyond learning visual similarity and learn a visual notion of compatibility across categories. In this paper, we propose a novel learning framework to help answer these types of questions. The main idea of this framework is to learn a feature transformation from images of items into a latent space that expresses compatibility. For the feature transformation, we use a Siamese Convolutional Neural Network (CNN) architecture, where training examples are pairs of items that are either compatible or incompatible. We model compatibility based on co-occurrence in large-scale user behavior data; in particular, co-purchase data from Amazon.com. To learn cross-category fit, we introduce a strategic method to sample training data, where pairs of items are heterogeneous dyads, i.e., the two elements of a pair belong to different high-level categories. While this approach is applicable to a wide variety of settings, we focus on the representative problem of learning compatible clothing style. Our results indicate that the proposed framework is capable of learning semantic information about visual style and is able to generate outfits of clothes, with items from different categories, that go well together.
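To make the pairwise training idea concrete, here is a minimal sketch of a contrastive objective over item pairs. This is not the paper's Caffe implementation (the actual model and prototxt files are linked below); the function names, embedding size, and margin value are illustrative assumptions.

```python
# Minimal sketch of a pairwise (contrastive) training objective.
# NOT the released Caffe model; names, dimensions, and the margin are assumptions.
import numpy as np

def contrastive_loss(f_a, f_b, label, margin=1.0):
    """f_a, f_b: embeddings of the two items from the shared (Siamese) CNN.
    label: 1 if the pair co-occurred (compatible), 0 otherwise."""
    d = np.linalg.norm(f_a - f_b)            # distance in the learned style space
    if label == 1:
        return 0.5 * d ** 2                  # pull compatible items together
    return 0.5 * max(0.0, margin - d) ** 2   # push incompatible items apart

# Example with two hypothetical 256-d embeddings, e.g. shoes and a shirt:
f_shoes, f_shirt = np.random.randn(256), np.random.randn(256)
print(contrastive_loss(f_shoes, f_shirt, label=1))
```

Because both items pass through the same network, compatible items from different categories end up close together in the latent space, which is what allows cross-category outfit recommendations.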
60-second Video Spotlight
Supplementary Material
Model files and data
Model files for our Siamese CNNs:
[caffe model file]
[deploy prototxt]
[solver prototxt]
[train-val prototxt]
Dataset for our experiments:
Train / Test split:
We separate the data by nodes (products). The train, validation and test sets are disjoint lists of product ids (for example, the Amazon “asin”). We discard any links from the dataset where the two products of the link belong to different sets. This ensures that during training the network has never seen the images of products in the validation or test set. A small filtering sketch follows the file links below.
[test_ids], [train_ids], [val_ids]
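The sketch below shows the filtering step described above: a link is kept only if both of its products fall in the same id set. The file names and helper names are illustrative placeholders, not part of the released data.

```python
# Sketch of the node-based split: keep only links whose two products belong to
# the SAME id set, so validation/test product images never appear in training.
# File names and helper names are assumptions for illustration.
def load_ids(path):
    with open(path) as f:
        return {line.strip() for line in f if line.strip()}

train_ids = load_ids("train_ids.txt")

def keep_link(asin_a, asin_b, ids=train_ids):
    # Discard links that cross the train/val/test boundary.
    return asin_a in ids and asin_b in ids
```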
Training, val and test pairs from our approach:
We train and validate on pairs of products. The train, validation and test sets are lists of product pairs together with a label indicating whether they are a match. The products are identified by file name, and the label is “0” for not matching and “1” for matching. The file names are the same as in the Amazon dataset file. An example line looks like this (a small parsing sketch follows the file links below):
data/clothingstyle/images/I/4/1/p/41pdLoCP%2BBL._SY445_.jpg data/clothingstyle/images/I/4/1/0/410PFJMmkCL._SY395_.jpg 0
[test_pairs], [train_pairs], [val_pairs]
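As a reading aid, here is a small sketch that parses a pairs file in the format shown above: two image paths followed by a 0/1 label per line. The file name passed in is a placeholder, not one of the released files.

```python
# Sketch of reading a pairs file: each line is "<image_a> <image_b> <label>",
# with label 1 for matching (co-purchased) and 0 for non-matching pairs.
# The file name below is an illustrative placeholder.
def read_pairs(path):
    pairs = []
    with open(path) as f:
        for line in f:
            img_a, img_b, label = line.split()
            pairs.append((img_a, img_b, int(label)))
    return pairs

pairs = read_pairs("train_pairs.txt")
print(len(pairs), pairs[0])
```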
Acknowledgements
We would like to thank Xiying Wang for her assistance, which greatly improved the illustrations. We further thank Vlad Niculae, Michael Wilber and Sam Kwak for insightful feedback. This work is partly funded by the AOL Program for Connected Experiences and a Google Focused Research award.