Recent Readings
Data augmentation
affine transformation & elastic transformation
under- & over- diversity, due to text shapes
sample-aware augmentor to transform adaptively
Data augmentor
gated module, affine transformation module, elastic transformation module
affine module: keeping affinity
elastic module: improve diversity
gated module: choose transform type
Adversarial learning to optimize
Gated module: predict the type of transformation
$X = \arg\max_k(\alpha_k + G_k)$
$\hat{X} = \mathrm{softmax}((\alpha_k+G_k)/\tau)$, where $\alpha_k$ is the classification score, $G_k$ is drawn from a Gumbel distribution, and $\tau \to 0$ approaches the argmax
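A minimal sketch of the Gumbel-softmax relaxation behind the gated module; the function name and shapes are illustrative, not from the paper:

```python
import torch
import torch.nn.functional as F

def gumbel_softmax_select(alpha, tau=0.1):
    """Differentiable (soft) selection among K transform types.

    alpha: (batch, K) classification scores from the gated module.
    Returns soft one-hot weights approaching argmax(alpha + G) as tau -> 0.
    """
    # Sample Gumbel noise G_k = -log(-log(U)), U ~ Uniform(0, 1)
    u = torch.rand_like(alpha)
    gumbel = -torch.log(-torch.log(u + 1e-20) + 1e-20)
    return F.softmax((alpha + gumbel) / tau, dim=-1)

# Example: choose between an affine (k=0) and an elastic (k=1) transform
scores = torch.randn(4, 2)           # gated-module outputs for a batch of 4
weights = gumbel_softmax_select(scores, tau=0.1)
print(weights.sum(dim=-1))           # each row sums to 1
```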
Affine Transformation Module
Localization Network to generate the six parameters of the affine transform $A_\theta$
Grid Generator: generate sampling grid using affine parameters
Generate differentiable samples: $I_i' = \sum_h^H\sum_w^W I_{hw} \max(0, 1 - |x_i^s-w|)\max(0, 1-|y_i^s-h|)$
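The affine grid generation plus the bilinear kernel above map directly onto PyTorch's spatial-transformer utilities; a rough sketch (parameter shapes are assumptions, not from the paper):

```python
import torch
import torch.nn.functional as F

def affine_warp(images, theta):
    """Warp images with predicted affine parameters.

    images: (N, C, H, W); theta: (N, 6) from the localization network.
    grid_sample implements the bilinear kernel
    I'_i = sum_{h,w} I_hw * max(0, 1-|x_i^s-w|) * max(0, 1-|y_i^s-h|).
    """
    theta = theta.view(-1, 2, 3)                      # 2x3 affine matrix A_theta
    grid = F.affine_grid(theta, images.size(), align_corners=False)
    return F.grid_sample(images, grid, mode='bilinear', align_corners=False)

imgs = torch.rand(2, 1, 32, 100)                      # e.g. text-line crops
identity = torch.tensor([1., 0., 0., 0., 1., 0.]).repeat(2, 1)
out = affine_warp(imgs, identity)                     # identity transform returns the input
```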
Elastic transformation module
Localization Network generates a $2K$ vector of control-point coordinates of the input sample
Grid Generator
Sampler
Adversarial control loss
Recognizer loss: $L=-\sum_{t=1}^T \log p(y_t|I)$
Adversarial loss: $L_A=-L(P')$
Adversarial control loss: $l_{AC} = |1 - \exp(L(P') -\alpha L(P))|$, with $\alpha=\max(1, \exp(\sum_{k=1}^K \hat{y}_k\cdot y_k^G))$
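A minimal sketch of how the three losses could be wired together; the recognizer's per-step log-probabilities and the $\alpha$ weight are assumed to be computed elsewhere, and inputs are scalar tensors:

```python
import torch

def recognizer_loss(log_probs_per_step):
    """L = -sum_t log p(y_t | I), given per-step log-probabilities of the gt labels."""
    return -log_probs_per_step.sum()

def adversarial_control_loss(loss_augmented, loss_clean, alpha):
    """l_AC = |1 - exp(L(P') - alpha * L(P))|.

    loss_augmented: recognizer loss on the augmented sample P'
    loss_clean:     recognizer loss on the original sample P
    alpha:          max(1, exp(sum_k yhat_k * y_k^G)) from the gate outputs
    """
    return torch.abs(1 - torch.exp(loss_augmented - alpha * loss_clean))

# The augmentor is updated with the adversarial loss L_A = -L(P'),
# moderated by l_AC so the augmentation stays learnable rather than destructive.
```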
Networks generalize because of training examples & class diversity
When data is scarce, can additional labels make learning better?
Improving performance using seemingly uninformative labels to complement expert annotations: multiclass problem
Collecting new examples, expert annotations, non-expert annotations
Intuition: adding information to the learning target improves performance, e.g. distillation, multi-task learning
Example in this paper: mammography images; cancer annotations require experts, yet non-expert annotations of breast tissue such as skin and muscle improve performance; the scarcer the data, the larger the gain
Contributions
Experimental evidence: cheap complementary labels improve model performance in low-data regimes
The phenomenon is observed on one medical task and two public datasets
Insights: 1. the larger the dataset, the smaller the effect; 2. complementary labels provide robustness to annotator bias; 3. different labels differ in effectiveness; 4. some labels do not help; 5. performance improves as more labels are added; 6. low-quality labels work as well as high-quality ones; 7. complementary labels improve training stability; 8. complementary labels provide robustness to domain shift
Release of the Csaw-S dataset
Related work
Insufficient training data: pre-training; natural images vs. medical images, often not feasible
Pre-training helps more when downstream-task diversity is low
Data augmentation: learning augmentation during training, GAN-based augmentation
k-shot methods
Side information
seemingly uninformative complementary labels, used as additional learning targets, have a direct impact on the model’s generalization for image segmentation in low data regimes
For the task of locating tumors in mammograms, complementary labels might include the skin, pectoral muscle, and other parts of anatomy
Plausible explanations
complementary labels encourage learning of enriched representations: help to model background data by structuring into meaningful sub-classes
complementary labels help to model interactions between objects: KL benefits from interactions between labels
Sota SSL methods: image-based transformations & consistency regularization
Transformations in image space do not exploit other instances in the dataset for diverse transformations
This paper: learned feature-based refinement & augmentation
complex transformations
leverage information from other instances
Train students parameterized identically to their teachers
Born-Again Networks (BANs) outperform their teachers: sota on CIFAR-10 & CIFAR-100
Additional Experiments
Confidence-Weighted by Teacher Max
Dark Knowledge with Permuted Predictions
Leo Breiman:
different stochastic algorithms lead to diverse models with similar validation performances
a model ensemble achieves superior performance
given the ensemble, a simpler model can mimic it & achieve its performance
Model compression & KD: transfer knowledge of high-capacity high-performance teacher to more compact student
This paper: teachers & students with identical capacity
students outperforming teachers
re-training procedure (BANs): once the teacher converges, initialize a new student and train it with a dual goal: fitting the correct labels & matching the teacher's output
gradient of KD decomposes into dark knowledge on the wrong outputs & a ground-truth component corresponding to the original gradient from the real label
importance weight based on teacher's confidence
Related work
Hinton, dark knowledge, mimics full softmax distribution of teacher model
base
dataset $(x,y)\in \mathcal{X}\times \mathcal{Y}$, model $f(x,\theta_1): \mathcal{X}\to \mathcal{Y}$, $\theta_1 \in \Theta_1$
ERM: $\theta_1^* = \arg\min_{\theta_1} \mathcal{L}(y, f(x, \theta_1))$
New loss: CE loss $\mathcal{L}(f(x, \arg\min_{\theta_1} \mathcal{L}(y, f(x,\theta_1))), f(x,\theta_2))$
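A rough sketch of the born-again re-training step under the formulation above, with the teacher already converged; the model objects and the label-weighting factor are placeholders, and KL divergence stands in for the output-matching term:

```python
import torch
import torch.nn.functional as F

def ban_student_loss(student_logits, teacher_logits, labels, label_weight=1.0):
    """Dual goal: fit the correct labels and match the teacher's output distribution."""
    ce = F.cross_entropy(student_logits, labels)                 # ground-truth term
    kd = F.kl_div(F.log_softmax(student_logits, dim=-1),
                  F.softmax(teacher_logits, dim=-1),
                  reduction='batchmean')                         # dark-knowledge term
    return label_weight * ce + kd

# Training loop outline: freeze the converged teacher, initialize a fresh,
# identically parameterized student, minimize ban_student_loss over the data;
# repeat to obtain a sequence of teaching selves and ensemble the students.
```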
student & teacher identical arch; student & teacher similar capacity but different arch
Sequence of Teaching Selves Born-Again Networks Ensemble
Larsson, M., Stenborg, E., Toft, C., Hammarstrand, L., Sattler, T., & Kahl, F. (2019). Fine-grained segmentation networks: Self-supervised segmentation for improved long-term visual localization. Proceedings of the IEEE International Conference on Computer Vision, 2019, 31–41.
long-term visual localization: estimating camera pose of given query, appearance changes over time
semantic segmentation as invariant scene representation, semantic not be affected by seasonal and other changes
This paper:
Fine-Grained Segmentation Network, larger number of labels, trained in self-supervised fashion
output consistent labels across seasonal changes
k-means clustering on pixel-level CNN features to define k classes
Contribution
FGSN
more classes improve localization
Same structure as standard CNN, labels created in self-sup manner
2D-2D point correspondences ensure the predictions are stable under seasonal changes
extract features, k-means cluster
$\min_{C\in \mathbb{R}^{d\times m}}\frac{1}{N} \sum_{n=1}^N \min_{y_n\in \{0,1\}^m} \lVert d_n - Cy_n\rVert_2^2 \;\;\mathrm{s.t.}\;\; y_n^T 1_m=1$
avoid trivial solution: reassignment of empty clusters
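A minimal sketch of how the self-supervised classes could be produced from pixel-level CNN features; scikit-learn's KMeans stands in for the clustering step, and the feature sampling is an assumption:

```python
import numpy as np
from sklearn.cluster import KMeans

def make_fgsn_labels(features, k=100):
    """Cluster pixel-level CNN features into k self-supervised classes.

    features: (num_pixels, d) array of descriptors d_n sampled from training images.
    Returns cluster centers C (k, d) and per-pixel labels y_n in {0, ..., k-1}.
    """
    km = KMeans(n_clusters=k, n_init=10).fit(features)
    return km.cluster_centers_, km.labels_

feats = np.random.randn(5000, 64).astype(np.float32)   # stand-in for CNN descriptors
centers, labels = make_fgsn_labels(feats, k=20)
# The labels then supervise the segmentation head (L_class), while 2D-2D point
# correspondences across seasons supply the consistency term (L_corr).
```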
Loss
$\mathcal{L} = \mathcal{L}_{class} + \mathcal{L}_{corr}$
Semi-supervising learning, disagreement-based learning
co-training, tri-training
training multiple learners, exploit disagreements during learning process
This paper
three initial modules predict on a pool of unlabeled data; two modules label unlabeled data for the third module; refine the modules using the newly labeled examples
model initialization, diversity augmentation, pseudo-label editing
output smearing to generate initial modules
finetune modules to augment the diversity
data editing (DES): stable pseudo-labels are more reliable
labeled data $\mathcal{L} = \{(x_l, y_l)\mid l = 1, 2, \cdots, L\}$, labels one-hot
unlabeled data $\mathcal{U} = \{x_u \mid u = 1, 2, \cdots, U\}$
Initialization
Shared module $M_S$; different modules $M_1$, $M_2$, $M_3$
$M_S$ & $M_v$ trained on $\mathcal{L}_{OS}^v$
Training
Inference
$y = \arg\max_{c\in \{1,2,\cdots,C\}} \{ p(M_1(M_S(x))=c\mid x) + p(M_2(M_S(x))=c\mid x) + p(M_3(M_S(x))=c\mid x) \}$
For $\{x_l, y_l\}$, $y_l$ is one-hot; add noise into $y_l$
$\hat{y}_{lc} = y_{lc} + \mathrm{ReLU}(z_{lc} \times std)$; $z_{lc}$ sampled from a normal distribution, $std$ the standard deviation
$\hat{y}_l = (\hat{y}_{l1}, \hat{y}_{l2}, \cdots, \hat{y}_{lC})/ \sum_{c=1}^C \hat{y}_{lc}$
$\mathcal{L}_{os}^v = \{ (x_l, \hat{y}_l^v)\mid 1\leq l \leq L \}\;(v=1,2,3)$
$\mathrm{Loss} = \frac{1}{L}\{ L_y(M_1(M_S(x_l)), \hat{y}_l^1) + L_y(M_2(M_S(x_l)), \hat{y}_l^2) + L_y(M_3(M_S(x_l)), \hat{y}_l^3) \}$; $L_y$ softmax cross-entropy
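A small sketch of output smearing as described above, with `std` as a free noise parameter:

```python
import numpy as np

def output_smear(y_onehot, std=0.1):
    """Add truncated Gaussian noise to a one-hot label and renormalize.

    y_hat_lc = y_lc + ReLU(z_lc * std), then divide by the sum over classes.
    """
    z = np.random.randn(*y_onehot.shape)
    y_hat = y_onehot + np.maximum(z * std, 0.0)       # ReLU keeps the noise non-negative
    return y_hat / y_hat.sum(axis=-1, keepdims=True)

y = np.eye(5)[np.array([2, 0])]                       # two one-hot labels, C = 5
# three independently smeared copies give the three training sets L_os^v
smeared = [output_smear(y) for _ in range(3)]
```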
As the three modules augment one another, they become similar
Finetune to maintain diversity
deal with suspicious pseudo-labels
dropout at training & fixed at testing
during training, predict $x_i$ $K$ times with dropout active
if these predictions differ from the dropout-free (test-mode) prediction with frequency $k \geq K/3$, delete the pseudo-label
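A rough sketch of the stability check behind DES, assuming a classifier whose dropout is toggled via train/eval mode and a single example per call:

```python
import torch

@torch.no_grad()
def is_stable(model, x, K=9):
    """DES stability check: keep a pseudo-labeled example only if its dropout
    predictions rarely disagree with the dropout-free prediction.

    x: a single unlabeled example (batch of one).
    """
    model.eval()
    ref = model(x).argmax(dim=-1)                  # dropout fixed, as at test time
    model.train()                                  # dropout active
    disagreements = sum(int((model(x).argmax(dim=-1) != ref).item()) for _ in range(K))
    model.eval()
    return disagreements < K / 3                   # otherwise the pseudo-label is dropped
```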
in most triplets, the anchor is already closer to the positive than to the negative
triplet mining: finding useful triplets
optimizing with the hardest negative examples leads to bad local minima
the triplet loss optimizes the CNN weights that map images to feature vectors
vectors are normalized before computing similarity as a simple dot product; points are projected onto a hypersphere (see the sketch after the problem list)
Problem 1: gradients are lost if normalization is not taken into account
Problem 2: hard negative examples get mapped closer to the anchor
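For reference, a minimal triplet loss over L2-normalized embeddings with dot-product similarity; the margin value is an assumption:

```python
import torch
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Embeddings are L2-normalized (projected onto the hypersphere) before
    similarity is computed as a simple dot product."""
    a = F.normalize(anchor, dim=-1)
    p = F.normalize(positive, dim=-1)
    n = F.normalize(negative, dim=-1)
    sim_ap = (a * p).sum(dim=-1)                    # anchor-positive similarity
    sim_an = (a * n).sum(dim=-1)                    # anchor-negative similarity
    return F.relu(sim_an - sim_ap + margin).mean()  # zero once the positive wins by the margin
```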
Contribution
triplet diagram to help triplet selection
understand optimization failures
modification to fix bad optimization
Personalized retrieval
Typical retrieval task: high-level semantics & visual attributes
Different people want different result
Dual Purpose Hashing, DPH
Intuition: category & attributes, description at different level, share some common low-level visual features
use CNN to learn unified binary codes
Contribution
framework simultaneously preserves category & attribute similarities
by jointly preserving multiple types of similarities, the model can exploit the rich relationships between tasks to suppress redundancies and learn more compact codes
new training scheme using partially labelled data to improve performance & alleviate overfitting
Compared with conference version
more loss functions
attribute-related tasks, sample imbalance
more experiments to give insight: modified net arch, analysis of each bit, comparison with sota
implementation details
Multi-task learning
may improve one or more tasks
less computation & memory than using one for each task
large-scale image retrieval
new loss function to partially labelled data
Locality Sensitive Hashing
data-independent: need long codes
data-dependent
unsupervised
Spectral Hashing, Iterative Quantization
semi-supervised
semantic similarity
use a CNN to jointly learn image representations & hash functions
Use attributes as queries
Predict probability of attributes with SVM classifiers, joint probability of attributes to rank
attribute correlation, fusion strategy ...
learn cross-modal binary codes to align samples of different modalities
Goal: learn compact binary codes
same category have similar binary codes
similar attributes have similar binary codes
generalize well to new-coming images
Backbone
N conv-pool layers with fc layers
penultimate layer: binary-like output, fc layer with sigmoid activation
train: jointly trained on category-related task & attribute-related task
$\mathcal{F}:\Omega \to \{0,1\}^k$
$S_{tr} = \{(X_i^{tr}, y_i, A_i)\mid i=1,\cdots,N\}$
$X_i^{tr}\in \Omega$, $y_i\in\{1, \cdots, C, C+1\}$, where $C+1$ marks a missing category label
$A_i \in \{0,1,2\}^{1\times p}$ visual attribute labels; $A_{ij}=2$ marks a missing attribute
$B_i \in \{0,1\}^k$: $k$-bit binary codes
$B_i^r = \sigma(W^{hash}\phi(X_i^{tr}) + b^{hash})$: binary-like codes, used during training since they are differentiable for back-propagation
$\phi(X_i^{tr})$ extracted features, $W^{hash}\in \mathbb{R}^{k\times d}$, $b^{hash}\in\mathbb{R}^k$
Category Information Encoding
$L_i^{cls} = -\sum_{c=1}^C \mathbb{I}\{y_i=c\}\log \frac{\exp(W_c^{cls}B_i^r + b_c^{cls})}{\sum_{l=1}^C \exp(W_l^{cls}B_i^r + b_l^{cls})}$
$L_i^{ml} =\sum_{y_k\neq y_i}\sum_{y_j=y_i}\max(D_H(B_i,B_j) + m - D_H(B_i,B_k), 0)$; $D_H$ Hamming distance
$L_i^{mlr} =\sum_{y_k\neq y_i}\sum_{y_j=y_i}\max(D_E(B^r_i,B^r_j) + m - D_E(B^r_i,B^r_k), 0)$; $D_E$ Euclidean distance
Attribute Encoding
$L_{ij}^{wsce} = -\mathbb{I}\{A_{ij}\neq 2\} [w_j^{(p)} A_{ij}\log(p_{ij}) + w_j^{(n)}(1-A_{ij}) \log(1-p_{ij})]$
$p_{ij} = \sigma(W_j^{attr}B_i^r+b_j^{attr})$
$w_j^{(p)} = \frac{N_j^{(n)}}{N_j^{(n)}+N_j^{(p)}}$, $w_j^{(n)} = \frac{N_j^{(p)}}{N_j^{(n)}+N_j^{(p)}}$
$L_{ij}^{hinge} = \mathbb{I}\{A_{ij}\neq 2\} \max(1 - (2A_{ij}-1)(W_j^{attr}B_i^r+b_j^{attr}), 0)$
avoid predicting all samples as a single category
Overall loss: the number of missing labels differs per term, so each term must be scaled
$L = \frac{\sum_{i=1}^n L_i^{cls}}{\sum_{t=1}^n \mathbb{I}\{y_t\leq C\}} + \alpha\sum_{j=1}^p \frac{\sum_{i=1}^n L_{ij}^{wsce}}{\sum_{t=1}^n \mathbb{I}\{A_{tj} \neq 2\}}$
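A sketch of how the masked and rescaled overall loss could be assembled; tensor shapes, the per-attribute loop, and the `alpha` weight are illustrative, with labels in 1..C (C+1 missing) and attributes in {0,1} (2 missing) as above:

```python
import torch
import torch.nn.functional as F

def dph_loss(cls_logits, labels, attr_logits, attrs, alpha=1.0, C=10):
    """Category CE averaged over samples with labels (y <= C), plus, per attribute j,
    the weighted sigmoid CE averaged over samples where A_ij != 2 (not missing)."""
    has_label = labels <= C                                        # y = C+1 marks missing
    cls_term = F.cross_entropy(cls_logits[has_label], labels[has_label] - 1)

    attr_term = 0.0
    for j in range(attrs.size(1)):
        valid = attrs[:, j] != 2                                   # 2 marks a missing attribute
        if valid.any():
            a = attrs[valid, j].float()
            p = torch.sigmoid(attr_logits[valid, j])
            w_pos = (1 - a).mean()                                 # w_j^(p) = N_n / (N_n + N_p)
            w_neg = a.mean()                                       # w_j^(n) = N_p / (N_n + N_p)
            attr_term = attr_term - (w_pos * a * torch.log(p + 1e-8)
                                     + w_neg * (1 - a) * torch.log(1 - p + 1e-8)).mean()
    return cls_term + alpha * attr_term
```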
Retrieval
Category retrieval: hash code ranking
Attribute retrieval
$A=\sigma(W^{attr}B+b^{attr})$ and query attributes $a^q$
$P(A_{ij}=a_j^q, \forall j \in U_s) =\prod_{j\in U_s} (\mathbb{I}\{a_j^q>0.5\}A_{ij} + \mathbb{I}\{a_j^q\leq 0.5\}(1-A_{ij}))$, joint matching probability
Combined
Attribute retrieval, then category retrieval
Bit Functionality
Settings
Datasets: Train-category, train-both, train-attr, test
evaluation protocol
category: mAP
Attribute: average mAP over all valid attribute combinations
combined: recall@k
Module analysis
combination of loss
cls + wsce is best, but the others are competitive
compatible with different loss
feasible to use new loss
Partially labelled data
B, B+A, B+C, B+A+C, B+0.5A, B+0.5C
Advanced backbone
Bit Functionality
Comparison with Sota
Search with hash
hash table lookup
near items into same bucket
improve recall: visit more buckets; construct several hash tables
hash code ranking
exhaustive search, compute all distances
learning a (compound) hash function, y = h ( x ) y = h(x) y = h ( x ) , mapping an input item x x x to a compact code y y y , aiming that the nearest neighbor search result for a query q q q is as close as possible to the true nearest search result and the search in the coding space is also efficient
problems
hash function
similarity in coding space
similarity in input space
loss function
optimization technique
Hash function
linear function
$y = h(x) = \mathrm{sgn}(w^Tx + b)$
kernel function
$y = h(x) = \mathrm{sgn}(\sum_{t=1}^T \omega_t K(s_t, x)+b)$
quantization-based
$y = \arg\min_{k\in\{1,\cdots,K\}} \lVert x-c_k\rVert_2$
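Tiny numpy illustrations of the three hash-function families above; all parameters are random placeholders, and an RBF kernel is assumed for the kernel case:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=8)                                   # input feature vector

# linear: y = sgn(w^T x + b)
w, b = rng.normal(size=8), 0.1
y_linear = np.sign(w @ x + b)

# kernel: y = sgn(sum_t w_t K(s_t, x) + b), with anchors s_t
S, omega = rng.normal(size=(4, 8)), rng.normal(size=4)
K = np.exp(-np.sum((S - x) ** 2, axis=1) / 2.0)          # RBF kernel values K(s_t, x)
y_kernel = np.sign(omega @ K + b)

# quantization: y = argmin_k ||x - c_k||_2 (index of the nearest codeword)
C = rng.normal(size=(16, 8))
y_quant = np.argmin(np.linalg.norm(C - x, axis=1))
```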
Similarity
Input space
$s_{ij}^o = g(d_{ij}^o)=\exp\!\big(-\frac{(d_{ij}^o)^2}{2\sigma^2}\big)$
Cosine similarity: $\frac{x_i^Tx_j}{\lVert x_i\rVert_2\lVert x_j\rVert_2}$
Semantic sim: 0 or 1
Hash space
Hamming distance: $d_{ij}^h = \sum_{m=1}^M \delta[y_{im}\neq y_{jm}]$ (a bit-level sketch follows after this list)
$s_{ij}^h = M-d_{ij}^h$
weighted case
Euclidean distance for quantization
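A quick bit-level illustration of the Hamming distance and the derived similarity, with codes stored as plain Python ints:

```python
# Hamming distance between two M-bit codes via XOR + popcount,
# and the coding-space similarity s = M - d.
def hamming_distance(code_a: int, code_b: int) -> int:
    return bin(code_a ^ code_b).count("1")

M = 8
d = hamming_distance(0b10110010, 0b10011010)   # two differing bits -> d = 2
s = M - d                                      # similarity in the coding space
```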
Loss function
Rule: preserve similarity order
pairwise
multiwise
Optimization
the $\mathrm{sgn}$ function leads to a mixed binary-integer optimization problem
continuous relaxation: sigmoid, tanh, ...
too many data points
First work on the fine-grained hashing problem
large intra-class variances & small inter-class variances
Modules
representation learning
global features
part-level feature by attention, or local features
part-level cues are crucial for fine-grained tasks
spatial and channel constraints
local feature alignment
anchor based feature alignment
anchored features by averaging all the local features across images
feature exchanging operation
hash code learning
alternating hash-code learning & anchor feature updating
Contribution
large-scale fine-grained image retrieval, hash
end-to-end trainable network, ExchNet, attention constraints, local feature alignment & anchor-based learning
extensive experiments
Representation Learning
$E_i\in \mathbb{R}^{H\times W\times C}$: the whole deep feature map
$A_i\in \mathbb{R}^{M\times H\times W}$: attention pieces, where $A_i^j \in \mathbb{R}^{H\times W}$ is the $j$-th part for $x_i$
$\hat{E}_i^j = E_i \otimes A_i^j \in \mathbb{R}^{H\times W\times C}$; $\hat{\mathcal{E}}_i = \{\hat{E}_i^1, \cdots, \hat{E}_i^M\}$
Local Features Refinement
$\mathcal{F}_i = f_{\mathrm{LFR}}(\hat{\mathcal{E}}_i) = \{F_i^1,\cdots, F_i^M\}$, $F_i^j \in \mathbb{R}^{H'\times W'\times C'}$
$f_i^j = f_{GAP}(F_i^j)\in\mathbb{R}^{C'}$
$F_i^{global} = f_{GFR}(E_i)\in \mathbb{R}^{H'\times W'\times C'}$, $f_i^{global}=f_{GAP}(F_i^{global})\in\mathbb{R}^{C'}$
Spatial diversity
Hellinger distance
$H(p,q) = \frac{1}{\sqrt{2}}\sqrt{\sum_{i=1}^k (\sqrt{p_i} - \sqrt{q_i})^2}$
becomes 1 when $p_i=0$ wherever $q_i\neq 0$ (disjoint supports)
$F_i^j \to \hat{A}_i^j\in \mathbb{R}^{H'\times W'}$ by summing the $C'$ features along the channel dimension
then softmax to make it a distribution, flatten to a vector $\hat{a}_i^j$
$\mathcal{L}_{sp}(x_i) = 1-\frac{1}{\sqrt{2}\binom{M}{2}}\sum_{l,k=1}^M \big\lVert\sqrt{\hat{a}_i^l} - \sqrt{\hat{a}_i^k}\big\rVert_2$ (a sketch of both diversity losses follows the channel-diversity formula below)
this loss pushes the aggregated attention maps to be activated as diversely as possible
by the Hellinger distance, the loss becomes 0 when $\hat{a}_i^l$ and $\hat{a}_i^k$ are as different as possible
channel diversity
$p_i^j = \mathrm{softmax}(f_i^j),\; \forall j\in\{1,\cdots, M\}$
$\mathcal{L}_{cp}(x_i) = \Big[t - \frac{1}{\sqrt{2}\binom{M}{2}}\sum_{l,k=1}^M \big\lVert\sqrt{p_i^l} - \sqrt{p_i^k}\big\rVert_2\Big]_+$
$t\in [0,1]$ is a hyper-parameter, $[\cdot]_+$ means $\max(\cdot,0)$
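A sketch of the two diversity losses in PyTorch; the attention aggregation, pooled part features, and pairwise averaging are stand-ins with assumed shapes:

```python
import torch
import torch.nn.functional as F
from itertools import combinations

def hellinger_diversity(dists):
    """Mean pairwise Hellinger distance between M distributions, dists: (M, D)."""
    pairs = list(combinations(range(dists.size(0)), 2))
    d = [torch.norm(torch.sqrt(dists[l]) - torch.sqrt(dists[k]), p=2) for l, k in pairs]
    return torch.stack(d).mean() / (2 ** 0.5)

def spatial_diversity_loss(attn_maps):
    """attn_maps: (M, H', W') aggregation maps; L_sp = 1 - mean pairwise Hellinger."""
    a = F.softmax(attn_maps.flatten(1), dim=-1)          # each map becomes a distribution
    return 1.0 - hellinger_diversity(a)

def channel_diversity_loss(part_feats, t=0.5):
    """part_feats: (M, C') pooled part features f_i^j; L_cp = [t - mean pairwise Hellinger]_+."""
    p = F.softmax(part_feats, dim=-1)
    return F.relu(t - hellinger_diversity(p))
```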
Get discriminative local features
Learning to align
anchor-based local features alignment approach
if local features were well aligned, exchanging the features of identical parts for two input images belonging to the same sub-category should not change the generated hash codes
maintain dynamic anchored local features $\mathcal{C}_{y_i} = \{ c_{y_i}^1,\cdots,c_{y_i}^M \}$
$c_{y_i}^j$ is obtained by averaging all $j$-th part local features of class $y_i$
updated at the end of each training epoch
Exchange half of the learned local features of $x_i$ in $\mathcal{G}_i=\{f_i^1, \cdots, f_i^M\}$ with the anchored local features
$\forall j \in \{1,\cdots,M\},\; \hat{f}_i^j = \begin{cases} f_i^j, & \text{if}\; \xi_j \geq 0.5, \\ c_{y_i}^j, & \text{otherwise} \end{cases}$
$\xi_j\sim \mathcal{B}(0.5)$ (Bernoulli)
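A minimal sketch of the anchor-based exchange step; shapes are assumed ($M$ part features of dimension $C'$):

```python
import torch

def exchange_local_features(part_feats, anchor_feats, p=0.5):
    """Randomly replace each part feature f_i^j with the class anchor c_{y_i}^j.

    part_feats, anchor_feats: (M, C') tensors; xi_j ~ Bernoulli(p) decides,
    per part, whether the learned feature is kept or swapped for the anchor.
    """
    keep = (torch.rand(part_feats.size(0), 1) >= p).float()   # xi_j >= 0.5 -> keep
    return keep * part_feats + (1 - keep) * anchor_feats

# The anchors c_{y_i}^j are averages of the j-th part features of class y_i,
# refreshed at the end of every training epoch.
```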
Hash Code
fc layer & $\mathrm{sign}(\cdot)$ activation, asymmetric hashing
$u_i = g([\hat{G}_i;f_i^{global}]_{cat}) = \mathrm{sign}(W^{(g)}[\hat{G}_i;f_i^{global}]_{cat})$
$v_i = h([\hat{G}_i;f_i^{global}]_{cat}) = \mathrm{sign}(W^{(h)}[\hat{G}_i;f_i^{global}]_{cat})$
$U=\{u_i\}_{i=1}^n$, $V=\{v_i\}_{i=1}^n$
$\mathcal{L}_{sq}(u_i,v_j,\mathcal{C}) = (u_i^Tv_j - qS_{ij})^2$, intractable
relax $g(\cdot)=\mathrm{sign}(\cdot)$ to $\tilde{g}(\cdot)=\tanh(\cdot)$
$\tilde{\mathcal{L}}_{sq}(\tilde{u}_i,v_j,\mathcal{C}) = (\tilde{u}_i^Tv_j - qS_{ij})^2$
$\min_{V,\Theta,\mathcal{C}} \mathcal{L}(\mathcal{X}) = \sum_{i,j=1}^n \tilde{\mathcal{L}}_{sq}(\tilde{u}_i,v_j;S_{ij}) + \lambda \sum_{i=1}^n \mathcal{L}_{sp}(x_i) + \gamma \sum_{i=1}^n \mathcal{L}_{cp}(x_i)\;\;\mathrm{s.t.}\;\forall i\in\{1,\cdots,n\}$
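A sketch of the relaxed similarity-quantization term, where the query branch uses tanh in place of sign while the database codes stay binary; shapes and the similarity matrix are placeholders, and averaging stands in for the sum over pairs:

```python
import torch

def relaxed_sq_loss(query_feats, db_codes, S, q):
    """(tanh(u)^T v - q * S_ij)^2 averaged over all query/database pairs.

    query_feats: (n, q) pre-activation codes of the query branch (g relaxed to tanh)
    db_codes:    (n, q) binary codes v_j in {-1, +1}
    S:           (n, n) pairwise similarity labels S_ij
    q:           code length
    """
    u_tilde = torch.tanh(query_feats)
    inner = u_tilde @ db_codes.t()                 # (n, n) inner products
    return ((inner - q * S) ** 2).mean()
```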
Learning algorithm