Tradeoff: storage space, compute, and performance
This paper:
Ultra compact place representation
near sub-linear storage scaling
extremely lightweight compute requirements
Coarse scalar-quantization hashing produces more collisions, but these are resolved through sequence-based matching
Exploits the inherently sequential nature of spatial data in robotics
Inverts the typical optimization criteria
Effective place recognition on a 10M-place dataset with only 8 bytes of storage per place
Conventional methods need 1300× the compute, yet still fail badly
Analyzes the effectiveness of this hash-overloading scheme for different quantized vector lengths
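To make the idea concrete, here is a minimal sketch (not the paper's exact pipeline) of coarse scalar quantization into an 8-byte place code plus sequence-based disambiguation; the number of dimensions kept, the byte budget, and all names are illustrative assumptions.

```python
import numpy as np

def quantize_place(desc: np.ndarray) -> bytes:
    """Coarse scalar quantization: keep 8 dims at 1 byte each (assumed layout)."""
    v = desc[:8].astype(float)                       # hypothetical: first 8 dims
    v = (v - v.min()) / (v.max() - v.min() + 1e-9)   # normalize to [0, 1]
    return np.floor(v * 255).astype(np.uint8).tobytes()

def seq_cost(query_codes, db_codes, start):
    """Single-code collisions are tolerated; a whole sequence must agree."""
    return sum(q != db_codes[start + i] for i, q in enumerate(query_codes))

def localize(query_codes, db_codes):
    """Best-matching database offset for a sequence of quantized place codes."""
    n = len(db_codes) - len(query_codes) + 1
    return min(range(n), key=lambda s: seq_cost(query_codes, db_codes, s))
```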
Hays, J., & Efros, A. A. (2008). IM2GPS: Estimating geographic information from a single image. IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
A probability distribution over the Earth's surface
human ability: extract rich information from a single image
semantic reasoning: people's faces & clothes, the language of signs, types of trees & plants, topography of terrain, etc.
data association: what is this place like; have we seen somewhere similar before; even if not, similar images help define the type of scene
gigantic image collections make data association feasible
Background
visual localization is possible when enough data is available
Jacobs et al. geolocate a webcam by correlating its video stream with satellite weather maps
availability of GPS-tagged images
advances in multi-view geometry & efficient feature matching
also used for co-registering online photographs of landmarks for browsing & summarization, and for image retrieval in location-labeled collections
No datasets large enough to sample the world
geometric constraints require an exact match, so retrieval may return nothing
Many scenes look similar, e.g. forests, deserts, mountains, cities
Proposed methods:
for famous landmarks with enough matches, return a precise GPS location
for generic scenes like deserts, return a location probability distribution
intersection: keep images that carry both GPS coordinates & geographic keywords
users geo-tag images of cats, but are less likely to label them with "New York City"
Methods
collect images with keywords of all countries, continents, popular cities, etc
exclude keywords such as birthday, concert, etc
6,472,304 images collected, downsized to 1024 px, JPEG compressed, about 1 TB of data in total
Test set: 237 images
Generic scenes are hard to locate, but visual features must be strongly correlated with geography
what features best exploit the correlation between image & location?
Tiny images: directly in color image space; reducing dimensions drastically
Color histograms: CIE L*a*b* color space; 4, 14, 14 bins respectively, 784 dimensions in total; $\chi^2$ distance
Texton histograms: texture features, 512 dimensions, $\chi^2$ distance
Line features: Canny edges, two histograms over line angles & lengths
Gist Descriptor + Color
Geometric Context: compute geometric class probabilities for ground, sky & vertical surfaces
precomputed all features: 15 seconds per image, 3 days in total on a cluster of 400 processors
brute-force matching to find the nearest image
k-NN forms a probability map over the globe
mean-shift clustering of 3D points; disregard clusters with fewer than 4 matches
incorporate geographic information: population density, land cover estimates
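A minimal sketch of this estimation step, assuming precomputed histogram features and lat/lon coordinates (the paper clusters 3D points on the globe; `bandwidth` and `k` here are illustrative):

```python
import numpy as np
from sklearn.cluster import MeanShift

def chi2_dist(a, b, eps=1e-10):
    """Chi-squared distance between histograms (rows of b vs. a)."""
    return 0.5 * np.sum((a - b) ** 2 / (a + b + eps), axis=-1)

def geolocate(query_hist, gallery_hists, gallery_gps, k=120, min_votes=4):
    d = chi2_dist(query_hist[None, :], gallery_hists)
    knn_gps = gallery_gps[np.argsort(d)[:k]]        # (k, 2) lat/lon of the k-NN
    ms = MeanShift(bandwidth=5.0).fit(knn_gps)      # bandwidth in degrees (assumed)
    labels, counts = np.unique(ms.labels_, return_counts=True)
    counts[counts < min_votes] = 0                  # disregard small clusters
    return ms.cluster_centers_[labels[np.argmax(counts)]]
```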
Proposes a method for building the dataset
Uses a combination of multiple features
Demonstrates that image features can indeed improve localization accuracy
visual place recognition when the scene undergoes appearance change, e.g. illumination, seasons, aging, structural modifications
Contributions
when query & gallery share the same viewpoint, matching across large appearance changes is easier
develop new approach with efficient synthesis of novel views & compact indexable image representation
new Tokyo 24/7 dataset
Recent progress
obtain an accurate camera position within a city from a dataset of 1M images or a reconstructed 3D point cloud
representation: SIFT
efficiency: inverted file or product quantization
Observation: matching across large changes in scene appearance is easier when the query & gallery images depict the scene from the same viewpoint
Idea: synthesizing virtual views
Problems:
how to efficiently synthesize virtual viewpoints for an entire city;
how to deal with the increased database size;
how to represent the synthetic views so that matching is robust to large changes in appearance
Solutions:
develop view synthesis method from Google street-view panoramas
use compact VLAD encoding, efficient compression, storage & indexing
represent using SIFT, which is robust to large changes in appearance
Related work
Place recognition with local-invariant features
Virtual views for instance-level matching
Modelling scene illumination for place recognition
Best matches are obtained with densely sampled descriptors instead of a DoG feature detector, and improve further when matching to a virtual view
Tentative matches by finding mutually nearest descriptors
Geometric verification by repeatedly finding several homographies with RANSAC
Use panoramic imagery together with corresponding approximate piece-wise planar depth maps
Steps
Repeated structures may degrade retrieval performance because of over-counting in BoW
But with an appropriate representation, they can form a distinguishing feature
Robust detection of repeated structures, then modify weights in BoW
TF-IDF
$V$ visual words; image $d$ is represented as $v_d = (t_1, t_2, \ldots, t_V)^T$
$t_i = \frac{n_{id}}{n_d}\log \frac{N}{N_i}$
$n_{id}$: occurrences of word $i$ in image $d$; $n_d$: total words in image $d$; $N_i$: number of images containing word $i$; $N$: total number of images
Soft-assignment weighting
soft-assign each descriptor to several closest cluster centers with weights $\exp(-\frac{d^2}{2\sigma^2})$
Burstiness weighting
burstiness: a visual word that has appeared once in an image is much more likely to appear again in that image
downweight by a factor $\frac{1}{\sqrt{n_{id}}}$
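A small sketch combining the two weightings above (tf-idf plus the burstiness discount); `word_ids` is the quantized visual-word index of each descriptor in one image:

```python
import numpy as np

def bow_vector(word_ids, num_words, doc_freq, num_docs, burstiness=True):
    counts = np.bincount(word_ids, minlength=num_words).astype(float)
    if burstiness:
        nz = counts > 0
        counts[nz] /= np.sqrt(counts[nz])              # downweight by 1/sqrt(n_id)
    tf = counts / max(len(word_ids), 1)                # n_id / n_d
    idf = np.log(num_docs / np.maximum(doc_freq, 1))   # log(N / N_i)
    return tf * idf
```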
Proposed method
explicitly detect localized image areas with repetitive structures
use the detected repetitions to adaptively adjust weights
Operate on the extracted local features
feature segmentation: find connected components in a graph
$G=(V,E)$, $V = \{(x_i, s_i, d_i)\}_{i=1}^N$
$V$ is the set of local features with locations $x_i$, scales $s_i$, SIFT descriptors $d_i$
two vertices are connected when their positions are close and their scales & appearances are similar
Group vertices into disjoint groups
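A sketch of this grouping step under assumed thresholds (the paper's exact edge criteria differ); connected components of the feature graph give the repetitive groups:

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

def group_repeated_features(xy, scales, sifts, r=50.0, s_ratio=1.5, d_max=200.0):
    """xy: (N, 2) positions, scales: (N,), sifts: (N, 128). Thresholds assumed."""
    n = len(xy)
    rows, cols = [], []
    for i in range(n):
        for j in range(i + 1, n):
            close = np.linalg.norm(xy[i] - xy[j]) < r
            sim_scale = max(scales[i], scales[j]) / min(scales[i], scales[j]) < s_ratio
            sim_app = np.linalg.norm(sifts[i] - sifts[j]) < d_max
            if close and sim_scale and sim_app:
                rows.append(i); cols.append(j)
    adj = csr_matrix((np.ones(len(rows)), (rows, cols)), shape=(n, n))
    _, labels = connected_components(adj, directed=False)
    return labels   # labels[i] = group id of feature i
```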
Detects repeated structures with a graph-based method, then refines the feature representation to improve recognition
Classification by subdividing the surface of the earth into thousands of multiscale geographic cells
Train a CNN on geotagged images; inference outputs a discrete probability distribution over the earth
Train an LSTM to exploit temporal coherence in photo albums
a croissant & the Eiffel Tower in the same album can be located to Paris
Related work
IM2GPS
Matching aerial images
use CNN to learn joint embedding for ground & aerial images, localize by matching against aerial images
use a CNN to transform ground-level features into the feature space of aerial images
Image retrieval
more accurate at matching buildings, but not good at natural scenes or articulated objects
local-feature-based approaches focus on localization within cities, based on web images or street-view images
Skyline2GPS segments out the skyline and matches it against 3D models
Exact 6DoF pose
Pose estimation: 3D models reconstructed using SfM
localize by establishing correspondences between the query & points in the 3D model, then solving a PnP problem to obtain the camera parameters
matching is expensive; sped up using efficient image retrieval
Landmark recognition
Scene recognition
CNN approach outperforms others
LSTM
exploit potential of temporal coherence to geolocate images
HMM on photo albums to learn tourist routes
structured SVM with additional features
Images2GPS: HMM, similar to this paper
Attributes
population, elevation, household income
Classification is better than regression, as it can express uncertainty by assigning confidences
Adaptive Partition using S2 Cells
Recursively subdivide cells until no cell contains more than a fixed number $t_1$ of photos
Discard all cells with fewer than $t_2$ photos, and remove their images from the training set
Classes are balanced, parameter space is used effectively, and cells are small in city areas, giving street-level accuracy there
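A minimal sketch of the adaptive partition, substituting a plain lat/lon quadtree for S2 cells; `t1`, `t2` mirror the thresholds above:

```python
def partition(photos, box, t1=10000, t2=50, depth=0, max_depth=30):
    """photos: list of (lat, lon); box: (lat0, lat1, lon0, lon1)."""
    if len(photos) <= t1 or depth == max_depth:
        return [(box, photos)] if len(photos) >= t2 else []   # drop small cells
    lat0, lat1, lon0, lon1 = box
    latm, lonm = (lat0 + lat1) / 2, (lon0 + lon1) / 2
    cells = []
    for b in [(lat0, latm, lon0, lonm), (lat0, latm, lonm, lon1),
              (latm, lat1, lon0, lonm), (latm, lat1, lonm, lon1)]:
        sub = [p for p in photos if b[0] <= p[0] < b[1] and b[2] <= p[1] < b[3]]
        cells += partition(sub, b, t1, t2, depth + 1, max_depth)
    return cells

# classes = partition(all_photos, (-90.0, 90.0, -180.0, 180.0))
```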
CNN training
Inception architecture with BN; one-hot labels; random initialization; AdaGrad with lr 0.045
126M photos from all over the web, very noisy; 91M for training & 34M for validation
2.3M geotagged Flickr photos for testing
Assigning locations to the photos of an album is a seq2seq problem, a good fit for an LSTM
Collect 29.7M albums, training 23.5M, testing 6.2M
the final layer before the softmax serves as the embedding, fed into the LSTM unit
keep Inception fixed; train the LSTM & softmax layer
estimate the geographic location of a query by kernel density estimation
Error threshold
street 1km
city 25km
region 200km
country 750km
continent 2500km
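These thresholds can be applied to great-circle errors; a small sketch, assuming predictions & ground truth in degrees:

```python
import numpy as np

def haversine_km(lat1, lon1, lat2, lon2, R=6371.0):
    """Great-circle distance in km between (lat1, lon1) and (lat2, lon2)."""
    p1, p2, dp, dl = map(np.radians, [lat1, lat2, lat2 - lat1, lon2 - lon1])
    a = np.sin(dp / 2) ** 2 + np.cos(p1) * np.cos(p2) * np.sin(dl / 2) ** 2
    return 2 * R * np.arcsin(np.sqrt(a))

THRESHOLDS_KM = {"street": 1, "city": 25, "region": 200,
                 "country": 750, "continent": 2500}

def accuracy_at(errors_km, level):
    """Fraction of queries localized within the given threshold."""
    return float(np.mean(errors_km <= THRESHOLDS_KM[level]))
```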
Instance retrieval works well if
images in the gallery whose field of view overlaps the query
the content of the query is well suited for local feature matching (e.g. a distinctive geological feature)
More like scene classification
higher-level understanding of image
This paper
retrieval approach: the visual world is too complex for a deep model to memorize, while a retrieval approach trivially does so
deep feature learning
the classification method works better than a Siamese approach
for classification, the choice of discretization influences accuracy
Prior work
limited spatial scale, city
special class of images, landmarks, street-views
aerial imagery
combinatorial partitioning: generates fine-grained output classes by intersecting multiple coarse-grained partitionings
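A sketch of the combinatorial idea: the fine class of an image is the tuple of its coarse cell ids, i.e. the intersection of its coarse cells (names are illustrative):

```python
def combinatorial_classes(cell_ids_per_partitioning):
    """cell_ids_per_partitioning: list of sequences, one per coarse partitioning,
    each giving the cell id of every image. Returns a fine class id per image."""
    combos = list(zip(*cell_ids_per_partitioning))   # tuple of coarse ids per image
    index = {c: i for i, c in enumerate(sorted(set(combos)))}
    return [index[c] for c in combos]                # intersection cells as classes
```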
Context-aware mask
regions of interest
features from time-varying objects introduce misleading cues into geo-localization
not only local features, but also context in the scene
This paper
end-to-end CNN; image representations adaptively reweight features based on image context
Contextual Reweighting Network: takes feature maps, estimates weight for features based on its surrounding region
retrieval task, triplet ranking loss
geometric verification to generate positive images
hard negative mining
context-adaptive feature weighting
original VLAD: $v_k = \sum_{l\in R} a_l^k(d_l-c_k)$
$f = [f_1, f_2, \cdots, f_K]$
$f_k = \sum_{l\in R} m_l\cdot a_l^k(d_l-c_k)$
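A numpy sketch of this reweighted aggregation; `M` holds the contextual weights $m_l$ estimated by the CRN, `A` the soft assignments $a_l^k$:

```python
import numpy as np

def crn_vlad(D, A, M, C):
    """D: (L, dim) descriptors, A: (L, K) assignments,
    M: (L,) contextual weights, C: (K, dim) centroids."""
    resid = D[:, None, :] - C[None, :, :]         # (L, K, dim) residuals d_l - c_k
    f = np.einsum('l,lk,lkd->kd', M, A, resid)    # sum_l m_l * a_l^k (d_l - c_k)
    return f.reshape(-1)                          # concatenated [f_1 ... f_K]
```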
Cross-view image-based ground-to-aerial geo-localization
Siamese architecture, metric learning
fully convolutional layers extract local features
NetVLAD encoding to global descriptors
ground views do not cover every area
bird's-eye views densely cover the earth
challenging because of the change of viewpoint; SIFT/SURF fail
CNN
prior work adds a branch to estimate orientation & uses multiple orientations to find the best matching angle
This work
NetVLAD, invariant to large viewpoint changes
NetVLAD aggregates local features into global representations independent of feature locations
Siamese network
weighted soft margin ranking loss
speeds up training convergence & improves retrieval accuracy
embedded in triplet & quadruplet losses
Related work
hand-crafted features in cross-view matching
warp the ground image to a top-down view
aerial images are oblique; facade matching
Learnable feature
Image retrieval loss
max-margin triplet loss
soft-margin triplet loss
triplet loss places no constraint on irrelevant pairs, so inter-class variation can remain small
quadruplet & angular loss
Local feature extraction
fully convolutional network $f^L$
satellite image: $U_s=f^L(I_s;\Theta_s^L)$
ground image: $U_g=f^L(I_g;\Theta_g^L)$
Global Descriptor Generation
CVM-Net-I: two independent NetVLADs
$v_i=f^G(U_i; \Theta_i^G),\ i\in\{s,g\}$
For the NetVLAD parameters, the two sets of centroids $C_i=\{c_{i,1},\cdots, c_{i,K}\}$ have the same number $K$
$v_s, v_g$ lie in the same space and can be compared directly for similarity
local features $U = \{u_1,\cdots,u_N\}$; the $k^{th}$ VLAD component is $V(k)=\sum_{j=1}^N \bar{a}_k(u_j)(u_j-c_k)$
residual space
CVM-Net-II: NetVLADs with shared weights
add two fc layers between $f^L$ and NetVLAD
first layer: $\Theta_s^{T_1}, \Theta_g^{T_1}$ (separate weights)
second layer: $\Theta^{T_2}$ (shared)
NetVLAD share weights
weight sharing improves metric learning
max-margin loss $L_{max} = \max(0,m+d_{pos}-d_{neg})$; the margin $m$ must be carefully selected
soft-margin loss $L_{soft}=\ln(1+e^d),\ d=d_{pos}-d_{neg}$; slow convergence
weighted loss $L_{weighted}=\ln(1+e^{\alpha d})$; increasing $\alpha$ improves the rate of convergence
quadruplet loss $L_{quad} = \max(0, m_1+d_{pos}-d_{neg}) + \max(0,m_2+d_{pos}-d_{neg}^*)$, with two negative examples
weighted soft-margin quadruplet loss $L_{q,w}=\ln(1+e^{\alpha(d_{pos}-d_{neg})}) + \ln(1+e^{\alpha (d_{pos}-d_{neg}^*)})$
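A small sketch of these losses on precomputed embedding distances; `alpha` and the second negative `d_neg_star` follow the formulas above:

```python
import numpy as np

def weighted_soft_margin(d_pos, d_neg, alpha=10.0):
    """ln(1 + exp(alpha * (d_pos - d_neg))); larger alpha sharpens the slope."""
    return np.log1p(np.exp(alpha * (d_pos - d_neg)))

def weighted_quadruplet(d_pos, d_neg, d_neg_star, alpha=10.0):
    """Weighted soft-margin embedded in the quadruplet loss (two negatives)."""
    return (weighted_soft_margin(d_pos, d_neg, alpha)
            + weighted_soft_margin(d_pos, d_neg_star, alpha))
```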
use orientation information
azimuth & altitude $(\theta, \phi)$
add two channels (U-V maps)
ground-level panorama: $(\theta, \phi)$
aerial image: $(\theta, r)$
Two schemes
U-V maps injected into the input layer only
U-V maps injected into all layers
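A sketch of such orientation maps, assuming equirectangular ground panoramas and square aerial crops; the normalization choices are illustrative:

```python
import numpy as np

def ground_uv(h, w):
    """Per-pixel (azimuth theta, altitude phi) maps for a panorama, in [-1, 1]."""
    u = np.tile(np.linspace(-1, 1, w), (h, 1))            # azimuth theta
    v = np.tile(np.linspace(-1, 1, h)[:, None], (1, w))   # altitude phi
    return np.stack([u, v], axis=-1)                      # (h, w, 2)

def aerial_uv(s):
    """Per-pixel (azimuth theta, radius r) maps for an s x s aerial image."""
    ys, xs = np.mgrid[0:s, 0:s] - (s - 1) / 2
    u = np.arctan2(ys, xs) / np.pi                        # azimuth theta
    v = np.hypot(xs, ys) / ((s - 1) / 2)                  # radius r
    return np.stack([u, v], axis=-1)                      # (s, s, 2)

# img_uv = np.concatenate([img, ground_uv(*img.shape[:2])], axis=-1)
```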
Feature embedding
features $\mathcal{X}$ of dimension $W\times H\times D$
$f = [f^1,\cdots, f^D]^T$, $f^k = \left(\frac{1}{WH}\sum_{w=1}^W\sum_{h=1}^H x_{w,h,k}^p\right)^{1/p}$
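The pooling above reads as a generalized mean over the spatial grid; a one-function sketch (the $1/p$ exponent is reconstructed from the truncated formula):

```python
import numpy as np

def gem_pool(x, p=3.0, eps=1e-6):
    """x: (W, H, D) feature maps -> (D,) embedding; p = 1 gives average pooling."""
    return np.mean(np.clip(x, eps, None) ** p, axis=(0, 1)) ** (1.0 / p)
```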
Weighted soft-margin ranking loss
Hard exemplar mining
automatically allocates weights to triplets according to their difficulty level
This paper
new triplet loss to improve quality of network training; online hard exemplar mining, end-to-end
lightweight attention module FCAM
Siamese network achieves state of the art
FCAM
attention both on channel & spatial dimensions
Channel attention: CBAM; spatial attention: context-aware feature reweighting
feature map $U\in\mathbb{R}^{W\times H\times C}$, channel attention descriptor $Z^C(U)\in\mathbb{R}^{1\times 1\times C}$, spatial attention mask $Z^S(U')\in\mathbb{R}^{W\times H\times 1}$
$U'=Z^C(U)\otimes U$, $U''=Z^S(U')\otimes U'$
Channel attention
$Z^C(U) = \delta(f_{ext}(f_{max}(U)) + f_{ext}(f_{avg}(U))) = \delta(W_2^e\sigma(W_1^e v^1) + W_2^e\sigma(W_1^e v^2))$
Spatial attention
$Z^S(U') = \delta(f^{1\times 1}([f^{3\times 3}(s);f^{5\times 5}(s);f^{7\times 7}(s)]))$
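A PyTorch sketch of an FCAM-like block following the two equations above (CBAM-style channel attention, then multi-kernel spatial attention); the layer sizes are illustrative, not the paper's:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FCAM(nn.Module):
    def __init__(self, c, reduction=16):
        super().__init__()
        # Shared excitation MLP (W_1^e, W_2^e) applied to max- & avg-pooled vectors
        self.mlp = nn.Sequential(nn.Conv2d(c, c // reduction, 1), nn.ReLU(),
                                 nn.Conv2d(c // reduction, c, 1))
        # Multi-scale spatial convs f^{3x3}, f^{5x5}, f^{7x7}, fused by f^{1x1}
        self.convs = nn.ModuleList(
            [nn.Conv2d(c, 1, k, padding=k // 2) for k in (3, 5, 7)])
        self.fuse = nn.Conv2d(3, 1, 1)

    def forward(self, u):                       # u: (B, C, H, W)
        zc = torch.sigmoid(self.mlp(F.adaptive_max_pool2d(u, 1))
                           + self.mlp(F.adaptive_avg_pool2d(u, 1)))
        u1 = zc * u                             # U' = Z^C(U) (x) U
        zs = torch.sigmoid(self.fuse(torch.cat([c(u1) for c in self.convs], 1)))
        return zs * u1                          # U'' = Z^S(U') (x) U'
```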
Hard exemplar reweighting triplet loss
Anchor $A_i$, positive $P_i$, $k$-th negative $N_{i,k}$
Original: $L_{tri}(A_i,P_i,N_{i,k}) = \max(0, m+d_P(i)-d_n(i,k))$
Soft-margin: $L_{soft}(A_i,P_i,N_{i,k}) = \log(1 + \exp(d_P(i)-d_n(i,k)))$
Weight allocated according to difficulty level: $L_{hard} = w_{hard}(A_i,P_i, N_{i,k}) \cdot \log(1 + \exp(d_P(i)-d_n(i,k)))$
Difficulty level
Most difficult: negative with smallest distance
$gap(i,k) = d_n(i,k)-d_p(i)$; extremely hard: $C_h: gap(i,k)\leq 0$
less informative: $C_s: gap(i,k)\geq m$
$w_{hard}(A_i,P_i,N_{i,k}) = \begin{cases}\epsilon/B, & gap(i,k)\geq m\\\log_2(1+\exp(m/2)), & gap(i,k)\leq 0\\\log_2(1+\exp(-gap(i,k) + m/2)), & \text{otherwise}\end{cases}$
$B$ is the number of anchors in a mini-batch, $m=\frac{\gamma}{2B}\sum_{i=1}^B \left(|f(A_i)|^2 +|f(P_i)|^2\right)$
Orientation regression
angles generated by random rotation serve as labels
$L_{OR}(A_i,P_i,N_{i,k}) = w_{hard}(A_i,P_i,N_{i,k})\cdot(d_R^1(i)+d_R^2(i))$
$L_{HER}(A_i,P_i,N_{i,k}) = \lambda_1 L_{hard}(A_i,P_i,N_{i,k}) + \lambda_2 L_{OR}(A_i,P_i,N_{i,k})$
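A numpy sketch of the reweighting above for a single triplet; `m` and `B` follow the definitions given:

```python
import numpy as np

def w_hard(d_p, d_n, m, B, eps=1e-3):
    """Difficulty weight via gap = d_n - d_p, with eps/B for uninformative triplets."""
    gap = d_n - d_p
    if gap >= m:                                # too easy: nearly ignore
        return eps / B
    if gap <= 0:                                # extremely hard: maximum weight
        return np.log2(1 + np.exp(m / 2))
    return np.log2(1 + np.exp(-gap + m / 2))

def l_hard(d_p, d_n, m, B):
    """Hard-exemplar-reweighted soft-margin triplet loss."""
    return w_hard(d_p, d_n, m, B) * np.log1p(np.exp(d_p - d_n))
```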
Observation: pixels lying on the same azimuth direction in an aerial image approximately correspond to a vertical image column in the ground view image
This paper
a regular polar transform warps aerial images closer to the ground-image domain
a spatial attention mechanism brings corresponding deep features closer in the embedding space
feature aggregation via learning multiple spatial embeddings
aligning the two domains based on geometric correspondences reduces the learning burden of domain alignment
Polar transform on aerial images
some objects may be distorted
spatial-attention-based feature embedding to extract position-aware features
retains image content & encodes layout information
Polar transform
Dynamic Similarity Matching
estimate cross-view orientation alignment
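A minimal sketch of a polar transform that warps a square aerial image toward the panorama domain (rows index radius, columns index azimuth); the exact mapping & interpolation in the paper may differ:

```python
import numpy as np

def polar_transform(aerial, out_h, out_w):
    """aerial: (S, S[, C]) image centered on the query location."""
    s = aerial.shape[0]
    c = (s - 1) / 2.0
    rows = np.arange(out_h)[:, None]            # radius bins (top = far away)
    cols = np.arange(out_w)[None, :]            # azimuth bins
    r = (out_h - rows) / out_h * c
    theta = 2 * np.pi * cols / out_w
    x = np.clip(np.round(c + r * np.sin(theta)).astype(int), 0, s - 1)
    y = np.clip(np.round(c - r * np.cos(theta)).astype(int), 0, s - 1)
    return aerial[y, x]                         # nearest-neighbor sampling
```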
Automatic geo-localization of images & videos: challenging
Existing solutions are limited to heavily visited regions and scale badly to large, ordinary regions
Overview of major research themes in visual geo-localization
challenges & areas that will benefit from these research themes
In particular
availability of web-scale geo-referenced data affects VGL
semantic information
textured RGB & untextured non-RGB 3D models
Real-world applications, emerging trends
Geo-localization: discovering the location where an image or video was captured
Consumers: when, where, who, what, how
Local government: geographic & geological features in a region of interest
Local businesses: marketing by extracting where, what and when
Law enforcement: find the location of incident
Relationship between visual content & location
Still difficult
identifying, extracting & indexing geo-informative features
discover geo-location cues from imagery
searching in massive databases e.g. GIS
Need tech advancements
data-driven geo-informative features, geometric modeling, ...
viewpoints and techniques from diverse areas
Questions
general principles for geo-localization & which features are geo-spatially informative
math models of visual analysis, proper search techniques for matching
enhance vision tasks e.g. object recognition, alignment, ...
Earlier: mostly aerial & satellite imagery
In recent years: ground-level images & videos
Google Street View, plus consumer-captured images
ground-level images are often not level, and a whole image carries only a single camera GPS tag
Large-scale data processing
Urgent need for precise localization: accuracy comparable to or better than handheld GPS devices
Peculiarity & excessive similarity of visual features: man-made structures look very much alike
Non-ideal photography: poor lighting, occlusion, distortion
Non-uniform data formats
Explore web-scale datasets, extract geo-informative features
Image retrieval strategy: match the query content against geo-tagged gallery images, and estimate the query location from the locations of the retrieved images
The two problems share many similarities: massive data, non-ideal imaging, non-level views, frequent occlusion
Differences:
retrieval tries to find as many images as possible across similarity levels; localization only needs to find the best location, not necessarily many matched images; a few highly similar images make localization easier than many less similar ones
image retrieval is interested in all forms of similarity, whereas image geo-localization aims to find images of the same location, not merely similar images
hence image retrieval alone is insufficient; it is important to design localization methods inspired by image retrieval
Broaden the application scope of geo-localization, identify geo-informative features
Use cross-view imagery and user-shared imagery to broaden localization coverage
The size of the covered area matters: the larger the applicable area, the more useful; systematically collected imagery such as Google Street View has limitations at specific places
Exploit aerial imagery: satellites, aircraft; requires cross-view matching, or the use of cross-modal data
Use ground-level user-shared images
Designing geo-informative features
Features that are rich & distinctive in capturing geo-spatial information are essential for successful & efficient geo-localization
Generic content-matching features are poor at capturing geographic information
Architectural style, which encodes much location information, is lost during feature extraction & quantization
Large-scale geo-tagged data can address this problem through bottom-up approaches
Leverage massive data for geo-spatially discriminative, data-driven mid-level features that are also invariant to non-ideal conditions such as lighting & viewpoint changes
Based on high-level & semantic cues
Human-related characteristics: text, architectural style, modes of transport, urban structure
The language on signs pins down the country or region
Architectural style can restrict the search to a country or even a city
Natural characteristics: vegetation types, weather
Aggregate these cues and couple them with their geometric arrangement
Humans perform this kind of semantic recognition all the time
Aerial geo-registration uses similar approaches
Ground-level methods remain to be explored
GIS, Wikipedia, Wikimapia, hashtags with GPS-tagged images
Challenges
which features to use
how to match
how to merge scattered cues into a unified system
Geometric alignment of the query against a geo-referenced 3D model
First build a geo-referenced 3D scene model via SfM
Then perform 2D-3D matching using the extracted features
By-products: 6DoF camera parameter estimation (camera rotation & position) and an estimate of the location of the image content
Two paths: textured 3D point clouds vs. untextured models; the latter suits regions without dense image coverage
Challenges
building high-fidelity textured 3D models and matching against them
huge datasets, highly variable imaging parameters, difficult wide-area reconstruction
current SfM methods can build city-scale 3D models: first build a coarse solution via hybrid discrete-continuous optimization, then refine it iteratively; exploit various information such as camera-point relations, geotags & vanishing-point estimates; the resulting models are fairly robust
2D-3D matching: the huge size of 3D models, effective use of the model's geometric information, and effective matching under large imaging variation
Untextured 3D models
SfM needs many images for reconstruction
satellite imagery enables 3D modeling at sparsely photographed locations
challenges: which features permit cross-modal matching, and how to search at scale
mountain silhouette matching, terrain contour matching
What is geo-location good for?
Cameras now have GPS, and the photos they produce carry GPS tags
Geo-location helps in understanding image content
New geo-referenced data sources
past: satellite imagery, aerial imagery
present: user ground-level images, street-view imagery
future: consumer drone imagery
The temporal & spatial resolution of satellite imagery keeps improving
Temporal geo-localization & new applications
Deep-learning-based geo-localization
cross-view image geo-localization, semantic feature learning
cross-view/cross-modality matching
end-to-end geo-spatial feature learning
temporal geo-localization using RNN
Data-driven
Discovering spatio-temporal mid-level visual connections:
weakly supervised data mining discovers connections between recurring spatio-temporal mid-level elements, capturing latent visual style;
the goal is to discover visual elements whose appearance changes with time or location, yet varies consistently across the label space;
first identify style-sensitive patch groups, then incrementally build correspondences to find the same elements, and finally train style-aware regressors to model each element's style variation
Where was this photo taken? Learning location prediction from Flickr images
Uses location-annotated Internet photos to study localization from visual content
Geographic clusters cover general geographic regions, not just landmarks
Geographic clusters provide more training samples and higher recognition accuracy
Solves the estimation problem by introducing latent variables
Cross-view image geo-localization
A cross-view feature translation approach extends image geo-localization to the vast majority of regions that lack ground-level photos, localizing queries with no ground-level reference imagery
Learns a mapping from ground-level appearance to overhead appearance & land-cover attributes, trained from sparse geo-tagged ground images with corresponding aerial & land-cover data
Ultra-wide baseline facade matching for geo-localization
Matching street-level images is hard due to extreme viewpoint & illumination changes; color, gradient distributions & local descriptors fail, leaving only the self-similarity structure of patterns
The authors propose a "scale-selective self-similarity" descriptor to capture this structure; a new scale-selection method for facade extraction & segmentation; and a new geometric method that aligns satellite & bird's-eye-view images to extract facades in a stereo graph-cut framework
Given labeled descriptors from a bird's-eye-view database, query matching is performed via Bayesian classification
Discusses camera pose estimation from building corners & matched aerial imagery
Semantic Reasoning
Semantics-guided geo-localization & modeling in urban environments
Semantic labeling via semantic segmentation helps retrieve views similar to the query and guides recognition of common scene layouts in the reference dataset; effective for urban localization, semantic concept discovery & intersection recognition
Landmark recognition in large-scale social image collections
2M images, a 500-class classification problem for landmark recognition
Dataset & classes are gathered automatically from Flickr by finding peaks in the spatial geotag distribution corresponding to frequently photographed landmarks
Modeled with multiclass SVMs over conventionally vector-quantized interest-point descriptors
Incorporating non-visual semantic metadata shows that text tags & temporal constraints improve classification accuracy
Applying CNNs turns out to outperform the traditional methods
Geometric Matching
World-wide pose estimation using 3D point clouds
Determines where a photo was taken by estimating the 6DoF pose & intrinsics of the camera against a large geo-registered 3D point cloud, combining image localization, landmark recognition & 3D pose estimation
Two techniques scale the approach to datasets with tens of thousands of images & tens of millions of 3D points: a RANSAC co-occurrence prior, and bidirectional matching between image & 3D features
Exploiting spatial & co-visibility relations for image-based localization
Improves the effectiveness of 2D-3D-matching-based methods by exploiting spatial & co-visibility relations among 3D points
Geometric-matching-based localization: estimate the position & orientation of a given query relative to a 3D model of the scene
SfM can rapidly reconstruct large-scale scenes; methods are needed that quickly handle models with millions of 3D points; earlier feature-matching-based methods are slow
Combines 2D-to-3D & 3D-to-2D matching
3D point cloud reduction via mixed-integer quadratic programming
Analyzes per-point view statistics in a training corpus to speed up matching & reduce the memory footprint
Given a set of training images, the proposed method identifies a compact subset of the 3D point cloud for efficient localization, with performance comparable to the full cloud
The problem can be cast as a mixed-integer quadratic program; per-point descriptor calibration improves matching ability
Image-based geo-localization in mountainous regions
Extracts representations from digital elevation models (DEMs) for fast visual database lookup
The proposed method effectively exploits visual information (contours) & geometric constraints (consistent orientation)
Large-scale rendering for skyline representation & matching
Given an image, automatically extract the skyline and match it against a database of reference skylines extracted from DEM-rendered images
The sampling density of rendered locations determines accuracy & speed; the proposed method combines global planning with local greedy search to incrementally select new rendering locations
Human-assisted geo-localization of unlabeled desert images
Generates synthetic skyline views from DEMs, then extracts stable concavity-based features
The user manually traces the skyline on the input photo; the skyline is refined from this estimate and concavity-based features are extracted
Applies geometrically constrained matching techniques for efficient matching, and hence localization
Visual geo-localization of non-photographic depictions via 2D-3D alignment
Localizes using arbitrary 2D depictions of architecture, including drawings, paintings & historical photographs
Aligns the input depiction with a 3D model; the task is hard because the appearance & structure of a 2D depiction can differ greatly from the 3D model's appearance & geometry, and the search space is huge
To address this, a compact representation of complex 3D scenes is proposed
Real-world
Localization-aided, memory-efficient recognition
Fast recognition of city landmarks on GPS-enabled devices
Uploads the GPS position to a server, downloads a compact location-specific classifier, and runs the query locally
Trains compact random-forest classifiers on GPS-tagged image datasets
Densely searches the image for local features extracted from the training images to compute the feature vector for the RDF; rectifies using vanishing points
Real-world systems for image geo-localization
Photo Recall: searching photos using Internet annotations
Mid-level representation
weakly supervised, mining connections between recurring mid-level visual elements in temporal & spatial image collections
underlying visual style: not visually consistent throughout the dataset, but changing with time or location, with consistent variation across the label space
First identify patches that are style-sensitive; then build correspondences to find the same element across the dataset; finally train style-aware regressors to model each element's changes
Discover patterns correlated with temporal & spatial information
low-level: unsupervised clustering
mid-level: termed visual style; the central idea is to create generic, style-agnostic visual element detectors while modeling style-specific variations with weak supervision
first discover recurring visual elements regardless of their stylistic variation, then model those stylistic variations
first cluster style-sensitive patches; for each cluster, train a generic detector to find the same visual element across the whole dataset; for each correspondence set, train style-specific regressors to distinguish the subtle style variations between instances of similar elements
Mining style-sensitive visual elements
match image patches with HOG features; cluster with N nearest neighbors
compute the entropy over time, $E(c) = -\sum_{i=1}^n H(i)\log H(i)$; higher entropy indicates greater style sensitivity
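A tiny sketch of this score, assuming each cluster member carries a discrete label (e.g. a date bin):

```python
import numpy as np

def temporal_entropy(labels, n_bins):
    """Entropy of a cluster's label histogram H, per the formula above."""
    h = np.bincount(labels, minlength=n_bins).astype(float)
    h /= h.sum()
    nz = h > 0
    return float(-np.sum(h[nz] * np.log(h[nz])))
```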
Building correspondences
each cluster now represents a set of style-sensitive patches; several such clusters may depict similar parts, so the clusters are merged; we assume these visual elements change gradually along the label dimension
use one cluster as the initial positive set and incrementally revise it along neighboring labels; concretely, train an SVM with natural-world images as negatives, progressively expanding the label space and adding new data
Train style-aware regressors