Recent Readings
Data augmentation
affine transformation & elastic transformation
under- & over- diversity, due to text shapes
sample-aware augmentor to transform adaptively
Data augmentor
gated module, affine transformation module, elastic transformation module
affine module: keeping affinity
elastic module: improve diversity
gated module: choose transform type
Adversarial learning to optimize
Gated module: predict the type of transformation
$X = \arg\max_k(\alpha_k + G_k)$
$\hat{X} = \mathrm{softmax}((\alpha_k+G_k)/\tau)$, where $\alpha_k$ is the classification score, $G_k$ is drawn from a Gumbel distribution, and $\tau \to 0$ approaches the argmax
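A minimal sketch of the Gumbel-softmax relaxation behind the gated module; the function name and shapes are illustrative, not from the paper:

```python
import torch
import torch.nn.functional as F

def gumbel_softmax_select(alpha, tau=0.1):
    """Differentiable (soft) selection among K transform types.

    alpha: (batch, K) classification scores from the gated module.
    Returns soft one-hot weights approaching argmax(alpha + G) as tau -> 0.
    """
    # Sample Gumbel noise G_k = -log(-log(U)), U ~ Uniform(0, 1)
    u = torch.rand_like(alpha)
    gumbel = -torch.log(-torch.log(u + 1e-20) + 1e-20)
    return F.softmax((alpha + gumbel) / tau, dim=-1)

# Example: choose between an affine (k=0) and an elastic (k=1) transform
scores = torch.randn(4, 2)           # gated-module outputs for a batch of 4
weights = gumbel_softmax_select(scores, tau=0.1)
print(weights.sum(dim=-1))           # each row sums to 1
```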
Affine Transformation Module
Localization Network to generate the six parameters of the affine transform $A_\theta$
Grid Generator: generate sampling grid using affine parameters
Generate differentiable samples: $I_i' = \sum_h^H\sum_w^W I_{hw} \max(0, 1 - |x_i^s-w|)\max(0, 1-|y_i^s-h|)$
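The affine grid generation plus the bilinear kernel above map directly onto PyTorch's spatial-transformer utilities; a rough sketch (parameter shapes are assumptions, not from the paper):

```python
import torch
import torch.nn.functional as F

def affine_warp(images, theta):
    """Warp images with predicted affine parameters.

    images: (N, C, H, W); theta: (N, 6) from the localization network.
    grid_sample implements the bilinear kernel
    I'_i = sum_{h,w} I_hw * max(0, 1-|x_i^s-w|) * max(0, 1-|y_i^s-h|).
    """
    theta = theta.view(-1, 2, 3)                      # 2x3 affine matrix A_theta
    grid = F.affine_grid(theta, images.size(), align_corners=False)
    return F.grid_sample(images, grid, mode='bilinear', align_corners=False)

imgs = torch.rand(2, 1, 32, 100)                      # e.g. text-line crops
identity = torch.tensor([1., 0., 0., 0., 1., 0.]).repeat(2, 1)
out = affine_warp(imgs, identity)                     # identity transform returns the input
```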
Elastic transformation module
Localization Network generates a $2K$ vector of control-point coordinates of the input sample
Grid Generator
Sampler
Adversarial control loss
Recognizer loss: $L=-\sum_{t=1}^T \log p(y_t|I)$
Adversarial loss: $L_A=-L(P')$
Adversarial control loss: $l_{AC} = |1 - \exp(L(P') -\alpha L(P))|$, with $\alpha=\max(1, \exp(\sum_{k=1}^K \hat{y}_k\cdot y_k^G))$
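A minimal sketch of how the three losses could be wired together; the recognizer's per-step log-probabilities and the $\alpha$ weight are assumed to be computed elsewhere, and inputs are scalar tensors:

```python
import torch

def recognizer_loss(log_probs_per_step):
    """L = -sum_t log p(y_t | I), given per-step log-probabilities of the gt labels."""
    return -log_probs_per_step.sum()

def adversarial_control_loss(loss_augmented, loss_clean, alpha):
    """l_AC = |1 - exp(L(P') - alpha * L(P))|.

    loss_augmented: recognizer loss on the augmented sample P'
    loss_clean:     recognizer loss on the original sample P
    alpha:          max(1, exp(sum_k yhat_k * y_k^G)) from the gate outputs
    """
    return torch.abs(1 - torch.exp(loss_augmented - alpha * loss_clean))

# The augmentor is updated with the adversarial loss L_A = -L(P'),
# moderated by l_AC so the augmentation stays learnable rather than destructive.
```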
Networks generalize because of training examples & class diversity
When data is scarce, can additional labels make learning better?
Improving performance using seemingly uninformative labels to complement expert annotations: multiclass problem
Collecting new examples, expert annotations, non-expert annotations
Intuition: adding information to the learning target improves performance, e.g. distillation, multi-task learning
Example in this paper: mammography images; cancer annotations require experts, yet non-expert annotations of breast tissue such as skin and muscle improve performance; the scarcer the data, the larger the gain
Contributions
Experimental evidence: cheap complementary labels improve model performance in low-data regimes
The phenomenon is observed on one medical task and two public datasets
Insights: 1. the larger the dataset, the smaller the effect; 2. complementary labels provide robustness to annotator bias; 3. different labels differ in effectiveness; 4. some labels do not help; 5. performance improves as more labels are added; 6. low-quality labels work as well as high-quality ones; 7. complementary labels improve training stability; 8. complementary labels provide robustness to domain shift
Release of the Csaw-S dataset
Related work
Insufficient training data: pre-training; natural images vs. medical images, often not feasible
Pre-training helps more when downstream-task diversity is low
Data augmentation: learning augmentation during training, GAN-based augmentation
k-shot methods
Side information
seemingly uninformative complementary labels, used as additional learning targets, have a direct impact on the model’s generalization for image segmentation in low data regimes
For the task of locating tumors in mammograms, complementary labels might include the skin, pectoral muscle, and other parts of anatomy
Plausible explanations
complementary labels encourage learning of enriched representations: help to model background data by structuring into meaningful sub-classes
complementary labels help to model interactions between objects: KL benefits from interactions between labels
Sota SSL methods: image-based transformations & consistency regularization
Transformations in image space do not exploit other instances in the dataset for diverse transformations
This paper: learned feature-based refinement & augmentation
complex transformations
leverage information from other instances
Train students parameterized identically to their teachers
Born-Again Networks (BANs) outperform their teachers: sota on CIFAR-10 & CIFAR-100
Additional Experiments
Confidence-Weighted by Teacher Max
Dark Knowledge with Permuted Predictions
Leo Breiman:
different stochastic algorithms lead to diverse models with similar validation performances
a model ensemble achieves superior performance
given the ensemble, a simpler model can mimic it & achieve its performance
Model compression & KD: transfer knowledge of high-capacity high-performance teacher to more compact student
This paper: teachers & students with identical capacity
students outperforming teachers
re-training procedure (BANs): once the teacher converges, initialize a new student and train it with a dual goal: fitting the correct labels & matching the teacher's output
gradient of KD decomposes into dark knowledge on the wrong outputs & a ground-truth component corresponding to the original gradient from the real label
importance weight based on teacher's confidence
Related work
Hinton, dark knowledge, mimics full softmax distribution of teacher model
base
dataset $(x,y)\in \mathcal{X}\times \mathcal{Y}$, model $f(x,\theta_1): \mathcal{X}\to \mathcal{Y}$, $\theta_1 \in \Theta_1$
ERM: $\theta_1^* = \arg\min_{\theta_1} \mathcal{L}(y, f(x, \theta_1))$
New loss: CE loss $\mathcal{L}(f(x, \arg\min_{\theta_1} \mathcal{L}(y, f(x,\theta_1))), f(x,\theta_2))$
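A rough sketch of the born-again re-training step under the formulation above, with the teacher already converged; the model objects and the label-weighting factor are placeholders, and KL divergence stands in for the output-matching term:

```python
import torch
import torch.nn.functional as F

def ban_student_loss(student_logits, teacher_logits, labels, label_weight=1.0):
    """Dual goal: fit the correct labels and match the teacher's output distribution."""
    ce = F.cross_entropy(student_logits, labels)                 # ground-truth term
    kd = F.kl_div(F.log_softmax(student_logits, dim=-1),
                  F.softmax(teacher_logits, dim=-1),
                  reduction='batchmean')                         # dark-knowledge term
    return label_weight * ce + kd

# Training loop outline: freeze the converged teacher, initialize a fresh,
# identically parameterized student, minimize ban_student_loss over the data;
# repeat to obtain a sequence of teaching selves and ensemble the students.
```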
student & teacher identical arch; student & teacher similar capacity but different arch
Sequence of Teaching Selves Born-Again Networks Ensemble
Larsson, M., Stenborg, E., Toft, C., Hammarstrand, L., Sattler, T., & Kahl, F. (2019). Fine-grained segmentation networks: Self-supervised segmentation for improved long-term visual localization. Proceedings of the IEEE International Conference on Computer Vision, 2019, 31–41.
long-term visual localization: estimating camera pose of given query, appearance changes over time
semantic segmentation as invariant scene representation, semantic not be affected by seasonal and other changes
This paper:
Fine-Grained Segmentation Network, larger number of labels, trained in self-supervised fashion
output consistent labels across seasonal changes
k-means clustering on pixel-level CNN features to define k classes
Contribution
FGSN
more classes improve localization
Same structure as standard CNN, labels created in self-sup manner
2D-2D point correspondences ensure the predictions are stable under seasonal changes
extract features, k-means cluster
$\min_{C\in \mathbb{R}^{d\times m}}\frac{1}{N} \sum_{n=1}^N \min_{y_n\in \{0,1\}^m} \lVert d_n - Cy_n\rVert_2^2 \;\;\mathrm{s.t.}\;\; y_n^T 1_m=1$
avoid trivial solution: reassignment of empty clusters
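A minimal sketch of how the self-supervised classes could be produced from pixel-level CNN features; scikit-learn's KMeans stands in for the clustering step, and the feature sampling is an assumption:

```python
import numpy as np
from sklearn.cluster import KMeans

def make_fgsn_labels(features, k=100):
    """Cluster pixel-level CNN features into k self-supervised classes.

    features: (num_pixels, d) array of descriptors d_n sampled from training images.
    Returns cluster centers C (k, d) and per-pixel labels y_n in {0, ..., k-1}.
    """
    km = KMeans(n_clusters=k, n_init=10).fit(features)
    return km.cluster_centers_, km.labels_

feats = np.random.randn(5000, 64).astype(np.float32)   # stand-in for CNN descriptors
centers, labels = make_fgsn_labels(feats, k=20)
# The labels then supervise the segmentation head (L_class), while 2D-2D point
# correspondences across seasons supply the consistency term (L_corr).
```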
Loss
$\mathcal{L} = \mathcal{L}_{class} + \mathcal{L}_{corr}$
Semi-supervising learning, disagreement-based learning
co-training, tri-training
training multiple learners, exploit disagreements during learning process
This paper
three initial modules predict on a pool of unlabeled data; two modules label unlabeled data for the third module; refine the modules using the newly labeled examples
model initialization, diversity augmentation, pseudo-label editing
output smearing to generate initial modules
finetune modules to augment the diversity
data editing (DES): stable pseudo-labels are more reliable
labeled data $\mathcal{L} = \{(x_l, y_l)\mid l = 1, 2, \cdots, L\}$, labels one-hot
unlabeled data $\mathcal{U} = \{x_u \mid u = 1, 2, \cdots, U\}$
Initialization
Shared module $M_S$; different modules $M_1$, $M_2$, $M_3$
$M_S$ & $M_v$ trained on $\mathcal{L}_{OS}^v$
Training
Inference
$y = \arg\max_{c\in \{1,2,\cdots,C\}} \{ p(M_1(M_S(x))=c\mid x) + p(M_2(M_S(x))=c\mid x) + p(M_3(M_S(x))=c\mid x) \}$
For $\{x_l, y_l\}$, $y_l$ is one-hot; add noise into $y_l$
$\hat{y}_{lc} = y_{lc} + \mathrm{ReLU}(z_{lc} \times std)$; $z_{lc}$ sampled from a normal distribution, $std$ the standard deviation
$\hat{y}_l = (\hat{y}_{l1}, \hat{y}_{l2}, \cdots, \hat{y}_{lC})/ \sum_{c=1}^C \hat{y}_{lc}$
$\mathcal{L}_{os}^v = \{ (x_l, \hat{y}_l^v)\mid 1\leq l \leq L \}\;(v=1,2,3)$
$\mathrm{Loss} = \frac{1}{L}\{ L_y(M_1(M_S(x_l)), \hat{y}_l^1) + L_y(M_2(M_S(x_l)), \hat{y}_l^2) + L_y(M_3(M_S(x_l)), \hat{y}_l^3) \}$; $L_y$ softmax cross-entropy
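A small sketch of output smearing as described above, with `std` as a free noise parameter:

```python
import numpy as np

def output_smear(y_onehot, std=0.1):
    """Add truncated Gaussian noise to a one-hot label and renormalize.

    y_hat_lc = y_lc + ReLU(z_lc * std), then divide by the sum over classes.
    """
    z = np.random.randn(*y_onehot.shape)
    y_hat = y_onehot + np.maximum(z * std, 0.0)       # ReLU keeps the noise non-negative
    return y_hat / y_hat.sum(axis=-1, keepdims=True)

y = np.eye(5)[np.array([2, 0])]                       # two one-hot labels, C = 5
# three independently smeared copies give the three training sets L_os^v
smeared = [output_smear(y) for _ in range(3)]
```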
As the three modules augment one another, they become similar
Finetune to maintain diversity
deal with suspicious pseudo-labels
dropout at training & fixed at testing
during training, predict $x_i$ $K$ times with dropout active
if these predictions differ from the dropout-free (test-mode) prediction with frequency $k \geq K/3$, delete the pseudo-label
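A rough sketch of the stability check behind DES, assuming a classifier whose dropout is toggled via train/eval mode and a single example per call:

```python
import torch

@torch.no_grad()
def is_stable(model, x, K=9):
    """DES stability check: keep a pseudo-labeled example only if its dropout
    predictions rarely disagree with the dropout-free prediction.

    x: a single unlabeled example (batch of one).
    """
    model.eval()
    ref = model(x).argmax(dim=-1)                  # dropout fixed, as at test time
    model.train()                                  # dropout active
    disagreements = sum(int((model(x).argmax(dim=-1) != ref).item()) for _ in range(K))
    model.eval()
    return disagreements < K / 3                   # otherwise the pseudo-label is dropped
```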
in most triplets, the anchor is already closer to the positive than to the negative
triplet mining: finding useful triplets
optimizing with the hardest negative examples leads to bad local minima
the triplet loss optimizes the CNN weights that map images to feature vectors
vectors are normalized before computing similarity as a simple dot product; points are projected onto a hypersphere (see the sketch after the problem list)
Problem 1: gradients are lost if normalization is not taken into account
Problem 2: hard negative examples get mapped closer to the anchor
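For reference, a minimal triplet loss over L2-normalized embeddings with dot-product similarity; the margin value is an assumption:

```python
import torch
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Embeddings are L2-normalized (projected onto the hypersphere) before
    similarity is computed as a simple dot product."""
    a = F.normalize(anchor, dim=-1)
    p = F.normalize(positive, dim=-1)
    n = F.normalize(negative, dim=-1)
    sim_ap = (a * p).sum(dim=-1)                    # anchor-positive similarity
    sim_an = (a * n).sum(dim=-1)                    # anchor-negative similarity
    return F.relu(sim_an - sim_ap + margin).mean()  # zero once the positive wins by the margin
```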
Contribution
triplet diagram to help triplet selection
understand optimization failures
modification to fix bad optimization
Personalized retrieval
Typical retrieval task: high-level semantics & visual attributes
Different people want different result
Dual Purpose Hashing, DPH
Intuition: category & attributes, description at different level, share some common low-level visual features
use CNN to learn unified binary codes
Contribution
framework simultaneously preserves category & attribute similarities
by jointly preserving multiple types of similarities, the model can exploit the rich relationships between tasks to suppress redundancies and learn more compact codes
new training scheme using partially labelled data to improve performance & alleviate overfitting
Compared with conference version
more loss functions
attribute-related tasks, sample imbalance
more experiments to give insight: modified net arch, analysis of each bit, comparison with sota
implementation details
Multi-task learning
may improve one or more tasks
less computation & memory than using one for each task
large-scale image retrieval
new loss function to partially labelled data
Locality Sensitive Hashing
data-independent: need long codes
data-dependent
unsupervised
Spectral Hashing, Iterative Quantization
semi-supervised
semantic similarity
use a CNN to jointly learn image representations & hash functions
Use attributes as queries
Predict probability of attributes with SVM classifiers, joint probability of attributes to rank
attribute correlation, fusion strategy ...
learn cross-modal binary codes to align samples of different modalities
Goal: learn compact binary codes
same category have similar binary codes
similar attributes have similar binary codes
generalize well to new-coming images
Backbone
N conv-pool layers with fc layers
penultimate layer: binary-like output, fc layer with sigmoid activation
train: jointly trained on category-related task & attribute-related task
$\mathcal{F}:\Omega \to \{0,1\}^k$
$S_{tr} = \{(X_i^{tr}, y_i, A_i)\mid i=1,\cdots,N\}$
$X_i^{tr}\in \Omega$, $y_i\in\{1, \cdots, C, C+1\}$, where $C+1$ marks a missing category label
$A_i \in \{0,1,2\}^{1\times p}$ visual attribute labels; $A_{ij}=2$ marks a missing attribute
$B_i \in \{0,1\}^k$: $k$-bit binary codes
$B_i^r = \sigma(W^{hash}\phi(X_i^{tr}) + b^{hash})$: binary-like codes, used during training since they are differentiable for back-propagation
$\phi(X_i^{tr})$ extracted features, $W^{hash}\in \mathbb{R}^{k\times d}$, $b^{hash}\in\mathbb{R}^k$
Category Information Encoding
$L_i^{cls} = -\sum_{c=1}^C \mathbb{I}\{y_i=c\}\log \frac{\exp(W_c^{cls}B_i^r + b_c^{cls})}{\sum_{l=1}^C \exp(W_l^{cls}B_i^r + b_l^{cls})}$
$L_i^{ml} =\sum_{y_k\neq y_i}\sum_{y_j=y_i}\max(D_H(B_i,B_j) + m - D_H(B_i,B_k), 0)$; $D_H$ Hamming distance
$L_i^{mlr} =\sum_{y_k\neq y_i}\sum_{y_j=y_i}\max(D_E(B^r_i,B^r_j) + m - D_E(B^r_i,B^r_k), 0)$; $D_E$ Euclidean distance
Attribute Encoding
$L_{ij}^{wsce} = -\mathbb{I}\{A_{ij}\neq 2\} [w_j^{(p)} A_{ij}\log(p_{ij}) + w_j^{(n)}(1-A_{ij}) \log(1-p_{ij})]$
$p_{ij} = \sigma(W_j^{attr}B_i^r+b_j^{attr})$
$w_j^{(p)} = \frac{N_j^{(n)}}{N_j^{(n)}+N_j^{(p)}}$, $w_j^{(n)} = \frac{N_j^{(p)}}{N_j^{(n)}+N_j^{(p)}}$
$L_{ij}^{hinge} = \mathbb{I}\{A_{ij}\neq 2\} \max(1 - (2A_{ij}-1)(W_j^{attr}B_i^r+b_j^{attr}), 0)$
avoid predicting all samples as a single category
Overall loss: the number of missing labels differs per term, so each term must be scaled
$L = \frac{\sum_{i=1}^n L_i^{cls}}{\sum_{t=1}^n \mathbb{I}\{y_t\leq C\}} + \alpha\sum_{j=1}^p \frac{\sum_{i=1}^n L_{ij}^{wsce}}{\sum_{t=1}^n \mathbb{I}\{A_{tj} \neq 2\}}$
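A sketch of how the masked and rescaled overall loss could be assembled; tensor shapes, the per-attribute loop, and the `alpha` weight are illustrative, with labels in 1..C (C+1 missing) and attributes in {0,1} (2 missing) as above:

```python
import torch
import torch.nn.functional as F

def dph_loss(cls_logits, labels, attr_logits, attrs, alpha=1.0, C=10):
    """Category CE averaged over samples with labels (y <= C), plus, per attribute j,
    the weighted sigmoid CE averaged over samples where A_ij != 2 (not missing)."""
    has_label = labels <= C                                        # y = C+1 marks missing
    cls_term = F.cross_entropy(cls_logits[has_label], labels[has_label] - 1)

    attr_term = 0.0
    for j in range(attrs.size(1)):
        valid = attrs[:, j] != 2                                   # 2 marks a missing attribute
        if valid.any():
            a = attrs[valid, j].float()
            p = torch.sigmoid(attr_logits[valid, j])
            w_pos = (1 - a).mean()                                 # w_j^(p) = N_n / (N_n + N_p)
            w_neg = a.mean()                                       # w_j^(n) = N_p / (N_n + N_p)
            attr_term = attr_term - (w_pos * a * torch.log(p + 1e-8)
                                     + w_neg * (1 - a) * torch.log(1 - p + 1e-8)).mean()
    return cls_term + alpha * attr_term
```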
Retrieval
Category retrieval: hash code ranking
Attribute retrieval
$A=\sigma(W^{attr}B+b^{attr})$ and query attributes $a^q$
$P(A_{ij}=a_j^q, \forall j \in U_s) =\prod_{j\in U_s} (\mathbb{I}\{a_j^q>0.5\}A_{ij} + \mathbb{I}\{a_j^q\leq 0.5\}(1-A_{ij}))$, joint matching probability
Combined
Attribute retrieval, then category retrieval
Bit Functionality
Settings
Datasets: Train-category, train-both, train-attr, test
evaluation protocol
category: mAP
Attribute: average mAP over all valid attribute combinations
combined: recall@k
Module analysis
combination of loss
cls + wsce is best, but the others are competitive
compatible with different loss
feasible to use new loss
Partially labelled data
B, B+A, B+C, B+A+C, B+0.5A, B+0.5C
Advanced backbone
Bit Functionality
Comparison with Sota
Search with hash
hash table lookup
near items into same bucket
improve recall: visit more buckets; construct several hash tables
hash code ranking
exhaustive search, compute all distances
learning a (compound) hash function, y = h ( x ) y = h(x) y = h ( x ) , mapping an input item x x x to a compact code y y y , aiming that the nearest neighbor search result for a query q q q is as close as possible to the true nearest search result and the search in the coding space is also efficient
problems
hash function
similarity in coding space
similarity in input space
loss function
optimization technique
Hash function
linear function
$y = h(x) = \mathrm{sgn}(w^Tx + b)$
kernel function
$y = h(x) = \mathrm{sgn}(\sum_{t=1}^T \omega_t K(s_t, x)+b)$
quantization-based
$y = \arg\min_{k\in\{1,\cdots,K\}} \lVert x-c_k\rVert_2$
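Tiny numpy illustrations of the three hash-function families above; all parameters are random placeholders, and an RBF kernel is assumed for the kernel case:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=8)                                   # input feature vector

# linear: y = sgn(w^T x + b)
w, b = rng.normal(size=8), 0.1
y_linear = np.sign(w @ x + b)

# kernel: y = sgn(sum_t w_t K(s_t, x) + b), with anchors s_t
S, omega = rng.normal(size=(4, 8)), rng.normal(size=4)
K = np.exp(-np.sum((S - x) ** 2, axis=1) / 2.0)          # RBF kernel values K(s_t, x)
y_kernel = np.sign(omega @ K + b)

# quantization: y = argmin_k ||x - c_k||_2 (index of the nearest codeword)
C = rng.normal(size=(16, 8))
y_quant = np.argmin(np.linalg.norm(C - x, axis=1))
```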
Similarity
Input space
$s_{ij}^o = g(d_{ij}^o)=\exp\!\big(-\frac{(d_{ij}^o)^2}{2\sigma^2}\big)$
Cosine similarity: $\frac{x_i^Tx_j}{\lVert x_i\rVert_2\lVert x_j\rVert_2}$
Semantic sim: 0 or 1
Hash space
Hamming distance: $d_{ij}^h = \sum_{m=1}^M \delta[y_{im}\neq y_{jm}]$ (a bit-level sketch follows after this list)
$s_{ij}^h = M-d_{ij}^h$
weighted case
Euclidean distance for quantization
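A quick bit-level illustration of the Hamming distance and the derived similarity, with codes stored as plain Python ints:

```python
# Hamming distance between two M-bit codes via XOR + popcount,
# and the coding-space similarity s = M - d.
def hamming_distance(code_a: int, code_b: int) -> int:
    return bin(code_a ^ code_b).count("1")

M = 8
d = hamming_distance(0b10110010, 0b10011010)   # two differing bits -> d = 2
s = M - d                                      # similarity in the coding space
```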
Loss function
Rule: preserve similarity order
pairwise
multiwise
Optimization
the $\mathrm{sgn}$ function leads to a mixed binary-integer optimization problem
continuous relaxation: sigmoid, tanh, ...
too many data points
First work on the fine-grained hashing problem
large intra-class variances & small inter-class variances
Modules
representation learning
global features
part-level feature by attention, or local features
part-level cues are crucial for fine-grained tasks
spatial and channel constraints
local feature alignment
anchor based feature alignment
anchored features by averaging all the local features across images
feature exchanging operation
hash code learning
alternating hash-code learning & anchor feature updating
Contribution
large-scale fine-grained image retrieval, hash
end-to-end trainable network, ExchNet, attention constraints, local feature alignment & anchor-based learning
extensive experiments
Representation Learning
$E_i\in \mathbb{R}^{H\times W\times C}$: the whole deep feature map
$A_i\in \mathbb{R}^{M\times H\times W}$: attention pieces, where $A_i^j \in \mathbb{R}^{H\times W}$ is the $j$-th part for $x_i$
$\hat{E}_i^j = E_i \otimes A_i^j \in \mathbb{R}^{H\times W\times C}$; $\hat{\mathcal{E}}_i = \{\hat{E}_i^1, \cdots, \hat{E}_i^M\}$
Local Features Refinement
$\mathcal{F}_i = f_{\mathrm{LFR}}(\hat{\mathcal{E}}_i) = \{F_i^1,\cdots, F_i^M\}$, $F_i^j \in \mathbb{R}^{H'\times W'\times C'}$
$f_i^j = f_{GAP}(F_i^j)\in\mathbb{R}^{C'}$
$F_i^{global} = f_{GFR}(E_i)\in \mathbb{R}^{H'\times W'\times C'}$, $f_i^{global}=f_{GAP}(F_i^{global})\in\mathbb{R}^{C'}$
Spatial diversity
Hellinger distance
$H(p,q) = \frac{1}{\sqrt{2}}\sqrt{\sum_{i=1}^k (\sqrt{p_i} - \sqrt{q_i})^2}$
becomes 1 when $p_i=0$ wherever $q_i\neq 0$ (disjoint supports)
$F_i^j \to \hat{A}_i^j\in \mathbb{R}^{H'\times W'}$ by summing the $C'$ features along the channel dimension
then softmax to make it a distribution, flatten to a vector $\hat{a}_i^j$
$\mathcal{L}_{sp}(x_i) = 1-\frac{1}{\sqrt{2}\binom{M}{2}}\sum_{l,k=1}^M \big\lVert\sqrt{\hat{a}_i^l} - \sqrt{\hat{a}_i^k}\big\rVert_2$ (a sketch of both diversity losses follows the channel-diversity formula below)
this loss pushes the aggregated attention maps to be activated as diversely as possible
by the Hellinger distance, the loss becomes 0 when $\hat{a}_i^l$ and $\hat{a}_i^k$ are as different as possible
channel diversity
$p_i^j = \mathrm{softmax}(f_i^j),\; \forall j\in\{1,\cdots, M\}$
$\mathcal{L}_{cp}(x_i) = \Big[t - \frac{1}{\sqrt{2}\binom{M}{2}}\sum_{l,k=1}^M \big\lVert\sqrt{p_i^l} - \sqrt{p_i^k}\big\rVert_2\Big]_+$
$t\in [0,1]$ is a hyper-parameter, $[\cdot]_+$ means $\max(\cdot,0)$
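A sketch of the two diversity losses in PyTorch; the attention aggregation, pooled part features, and pairwise averaging are stand-ins with assumed shapes:

```python
import torch
import torch.nn.functional as F
from itertools import combinations

def hellinger_diversity(dists):
    """Mean pairwise Hellinger distance between M distributions, dists: (M, D)."""
    pairs = list(combinations(range(dists.size(0)), 2))
    d = [torch.norm(torch.sqrt(dists[l]) - torch.sqrt(dists[k]), p=2) for l, k in pairs]
    return torch.stack(d).mean() / (2 ** 0.5)

def spatial_diversity_loss(attn_maps):
    """attn_maps: (M, H', W') aggregation maps; L_sp = 1 - mean pairwise Hellinger."""
    a = F.softmax(attn_maps.flatten(1), dim=-1)          # each map becomes a distribution
    return 1.0 - hellinger_diversity(a)

def channel_diversity_loss(part_feats, t=0.5):
    """part_feats: (M, C') pooled part features f_i^j; L_cp = [t - mean pairwise Hellinger]_+."""
    p = F.softmax(part_feats, dim=-1)
    return F.relu(t - hellinger_diversity(p))
```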
Get discriminative local features
Learning to align
anchor-based local features alignment approach
if local features were well aligned, exchanging the features of identical parts for two input images belonging to the same sub-category should not change the generated hash codes
maintain dynamic anchored local features $\mathcal{C}_{y_i} = \{ c_{y_i}^1,\cdots,c_{y_i}^M \}$
$c_{y_i}^j$ is obtained by averaging all $j$-th part local features of class $y_i$
updated at the end of each training epoch
Exchange half of the learned local features of $x_i$ in $\mathcal{G}_i=\{f_i^1, \cdots, f_i^M\}$ with the anchored local features
$\forall j \in \{1,\cdots,M\},\; \hat{f}_i^j = \begin{cases} f_i^j, & \text{if}\; \xi_j \geq 0.5, \\ c_{y_i}^j, & \text{otherwise} \end{cases}$
$\xi_j\sim \mathcal{B}(0.5)$ (Bernoulli)
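A minimal sketch of the anchor-based exchange step; shapes are assumed ($M$ part features of dimension $C'$):

```python
import torch

def exchange_local_features(part_feats, anchor_feats, p=0.5):
    """Randomly replace each part feature f_i^j with the class anchor c_{y_i}^j.

    part_feats, anchor_feats: (M, C') tensors; xi_j ~ Bernoulli(p) decides,
    per part, whether the learned feature is kept or swapped for the anchor.
    """
    keep = (torch.rand(part_feats.size(0), 1) >= p).float()   # xi_j >= 0.5 -> keep
    return keep * part_feats + (1 - keep) * anchor_feats

# The anchors c_{y_i}^j are averages of the j-th part features of class y_i,
# refreshed at the end of every training epoch.
```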
Hash Code
fc layer & $\mathrm{sign}(\cdot)$ activation, asymmetric hashing
$u_i = g([\hat{G}_i;f_i^{global}]_{cat}) = \mathrm{sign}(W^{(g)}[\hat{G}_i;f_i^{global}]_{cat})$
$v_i = h([\hat{G}_i;f_i^{global}]_{cat}) = \mathrm{sign}(W^{(h)}[\hat{G}_i;f_i^{global}]_{cat})$
$U=\{u_i\}_{i=1}^n$, $V=\{v_i\}_{i=1}^n$
$\mathcal{L}_{sq}(u_i,v_j,\mathcal{C}) = (u_i^Tv_j - qS_{ij})^2$, intractable
relax $g(\cdot)=\mathrm{sign}(\cdot)$ to $\tilde{g}(\cdot)=\tanh(\cdot)$
$\tilde{\mathcal{L}}_{sq}(\tilde{u}_i,v_j,\mathcal{C}) = (\tilde{u}_i^Tv_j - qS_{ij})^2$
$\min_{V,\Theta,\mathcal{C}} \mathcal{L}(\mathcal{X}) = \sum_{i,j=1}^n \tilde{\mathcal{L}}_{sq}(\tilde{u}_i,v_j;S_{ij}) + \lambda \sum_{i=1}^n \mathcal{L}_{sp}(x_i) + \gamma \sum_{i=1}^n \mathcal{L}_{cp}(x_i)\;\;\mathrm{s.t.}\;\forall i\in\{1,\cdots,n\}$
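A sketch of the relaxed similarity-quantization term, where the query branch uses tanh in place of sign while the database codes stay binary; shapes and the similarity matrix are placeholders, and averaging stands in for the sum over pairs:

```python
import torch

def relaxed_sq_loss(query_feats, db_codes, S, q):
    """(tanh(u)^T v - q * S_ij)^2 averaged over all query/database pairs.

    query_feats: (n, q) pre-activation codes of the query branch (g relaxed to tanh)
    db_codes:    (n, q) binary codes v_j in {-1, +1}
    S:           (n, n) pairwise similarity labels S_ij
    q:           code length
    """
    u_tilde = torch.tanh(query_feats)
    inner = u_tilde @ db_codes.t()                 # (n, n) inner products
    return ((inner - q * S) ** 2).mean()
```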
Learning algorithm