Tradeoff: storage space, compute, and performance
This paper:
Ultra compact place representation
near sub-linear storage scaling
extremely lightweight compute requirements
Coarse scalar-quantization hashing produces more collisions, but these are resolved through sequence-based matching
Exploits the inherently sequential nature of spatial data in robotics
Inverts the typical optimization criteria
Effective place recognition on a 10M-place dataset with only 8 bytes of storage per place
Conventional methods need 1300× the compute, yet still fail badly
Analyzes the effectiveness of this hash-overloading scheme for different quantized vector lengths
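To make the idea concrete, here is a minimal sketch (not the paper's exact pipeline) of coarse scalar quantization into an 8-byte place code plus sequence-based disambiguation; the number of dimensions kept, the byte budget, and all names are illustrative assumptions.

```python
import numpy as np

def quantize_place(desc: np.ndarray) -> bytes:
    """Coarse scalar quantization: keep 8 dims at 1 byte each (assumed layout)."""
    v = desc[:8].astype(float)                       # hypothetical: first 8 dims
    v = (v - v.min()) / (v.max() - v.min() + 1e-9)   # normalize to [0, 1]
    return np.floor(v * 255).astype(np.uint8).tobytes()

def seq_cost(query_codes, db_codes, start):
    """Single-code collisions are tolerated; a whole sequence must agree."""
    return sum(q != db_codes[start + i] for i, q in enumerate(query_codes))

def localize(query_codes, db_codes):
    """Best-matching database offset for a sequence of quantized place codes."""
    n = len(db_codes) - len(query_codes) + 1
    return min(range(n), key=lambda s: seq_cost(query_codes, db_codes, s))
```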
Hays, J., & Efros, A. A. (2008). IM2GPS: Estimating geographic information from a single image. IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
A probability distribution over the Earth's surface
human ability: extract rich information from a single image
semantic reasoning: people's faces & clothes, the language of signs, types of trees & plants, topography of terrain, etc.
data association: what is this place like; have we seen somewhere similar before; even if not, similar images help define the type of scene
gigantic image collections make data association feasible
Background
visual localization is possible when enough data is available
Jacobs et al. geolocate a webcam by correlating its video stream with satellite weather maps
availability of GPS-tagged images
advances in multi-view geometry & efficient feature matching
also used for co-registering online photographs of landmarks for browsing & summarization, and for image retrieval in location-labeled collections
No datasets large enough to sample the world
geometric constraints require an exact match, so retrieval may return nothing
Many scenes look similar, e.g. forests, deserts, mountains, cities
Proposed methods:
for famous landmarks with enough matches, return a precise GPS location
for generic scenes like deserts, return a location probability distribution
intersection: keep images that carry both GPS coordinates & geographic keywords
users geo-tag images of cats, but are less likely to label them with "New York City"
Methods
collect images with keywords of all countries, continents, popular cities, etc
exclude keywords such as birthday, concert, etc
6,472,304 images collected, downsized to 1024 px, JPEG compressed, about 1 TB of data in total
Test set: 237 images
Generic scenes are hard to locate, but visual features must be strongly correlated with geography
what features best exploit the correlation between image & location?
Tiny images: directly in color image space; reducing dimensions drastically
Color histograms: CIE L*a*b* color space; 4, 14, 14 bins respectively, 784 dimensions in total; $\chi^2$ distance
Texton histograms: texture features, 512 dimensions, $\chi^2$ distance
Line features: Canny edges, two histograms over line angles & lengths
Gist Descriptor + Color
Geometric Context: compute geometric class probabilities for ground, sky & vertical surfaces
precomputed all features: 15 seconds per image, 3 days in total on a cluster of 400 processors
brute-force matching to find the nearest image
k-NN forms a probability map over the globe
mean-shift clustering of 3D points; disregard clusters with fewer than 4 matches
incorporate geographic information: population density, land cover estimates
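A minimal sketch of this estimation step, assuming precomputed histogram features and lat/lon coordinates (the paper clusters 3D points on the globe; `bandwidth` and `k` here are illustrative):

```python
import numpy as np
from sklearn.cluster import MeanShift

def chi2_dist(a, b, eps=1e-10):
    """Chi-squared distance between histograms (rows of b vs. a)."""
    return 0.5 * np.sum((a - b) ** 2 / (a + b + eps), axis=-1)

def geolocate(query_hist, gallery_hists, gallery_gps, k=120, min_votes=4):
    d = chi2_dist(query_hist[None, :], gallery_hists)
    knn_gps = gallery_gps[np.argsort(d)[:k]]        # (k, 2) lat/lon of the k-NN
    ms = MeanShift(bandwidth=5.0).fit(knn_gps)      # bandwidth in degrees (assumed)
    labels, counts = np.unique(ms.labels_, return_counts=True)
    counts[counts < min_votes] = 0                  # disregard small clusters
    return ms.cluster_centers_[labels[np.argmax(counts)]]
```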
Proposes a method for building the dataset
Uses a combination of multiple features
Demonstrates that image features can indeed improve localization accuracy
visual place recognition when the scene undergoes appearance change, e.g. illumination, seasons, aging, structural modifications
Contributions
when query & gallery share the same viewpoint, matching across large appearance changes is easier
develop new approach with efficient synthesis of novel views & compact indexable image representation
new Tokyo 24/7 dataset
Recent progress
obtain an accurate camera position within a city from a dataset of 1M images or a reconstructed 3D point cloud
representation: SIFT
efficiency: inverted file or product quantization
Observation: matching across large changes in scene appearance is easier when the query & gallery images depict the scene from the same viewpoint
Idea: synthesizing virtual views
Problems:
how to efficiently synthesize virtual viewpoints for an entire city;
how to deal with the increased database size;
how to represent the synthetic views so that matching is robust to large changes in appearance
Solutions:
develop view synthesis method from Google street-view panoramas
use compact VLAD encoding, efficient compression, storage & indexing
represent using SIFT, which is robust to large changes in appearance
Related work
Place recognition with local-invariant features
Virtual views for instance-level matching
Modelling scene illumination for place recognition
Best matches are obtained with densely sampled descriptors instead of a DoG feature detector, and improve further when matching to a virtual view
Tentative matches by finding mutually nearest descriptors
Geometric verification by repeatedly finding several homographies with RANSAC
Use panoramic imagery together with corresponding approximate piece-wise planar depth maps
Steps
Repeated structures may degrade retrieval performance because of over-counting in BoW
But with an appropriate representation, they can form a distinguishing feature
Robust detection of repeated structures, then modify weights in BoW
TF-IDF
$V$ visual words; image $d$ is represented as $v_d = (t_1, t_2, \ldots, t_V)^T$
$t_i = \frac{n_{id}}{n_d}\log \frac{N}{N_i}$
$n_{id}$: occurrences of word $i$ in image $d$; $n_d$: total words in image $d$; $N_i$: number of images containing word $i$; $N$: total number of images
Soft-assignment weighting
soft-assign each descriptor to several closest cluster centers with weights $\exp(-\frac{d^2}{2\sigma^2})$
Burstiness weighting
burstiness: a visual word that has appeared once in an image is much more likely to appear again in that image
downweight by a factor $\frac{1}{\sqrt{n_{id}}}$
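A small sketch combining the two weightings above (tf-idf plus the burstiness discount); `word_ids` is the quantized visual-word index of each descriptor in one image:

```python
import numpy as np

def bow_vector(word_ids, num_words, doc_freq, num_docs, burstiness=True):
    counts = np.bincount(word_ids, minlength=num_words).astype(float)
    if burstiness:
        nz = counts > 0
        counts[nz] /= np.sqrt(counts[nz])              # downweight by 1/sqrt(n_id)
    tf = counts / max(len(word_ids), 1)                # n_id / n_d
    idf = np.log(num_docs / np.maximum(doc_freq, 1))   # log(N / N_i)
    return tf * idf
```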
Proposed method
explicitly detect localized image areas with repetitive structures
use the detected repetitions to adaptively adjust weights
Operate on the extracted local features
feature segmentation: find connected components in a graph
$G=(V,E)$, $V = \{(x_i, s_i, d_i)\}_{i=1}^N$
$V$ is the set of local features with locations $x_i$, scales $s_i$, SIFT descriptors $d_i$
two vertices are connected when their positions are close and their scales & appearances are similar
Group vertices into disjoint groups
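A sketch of this grouping step under assumed thresholds (the paper's exact edge criteria differ); connected components of the feature graph give the repetitive groups:

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

def group_repeated_features(xy, scales, sifts, r=50.0, s_ratio=1.5, d_max=200.0):
    """xy: (N, 2) positions, scales: (N,), sifts: (N, 128). Thresholds assumed."""
    n = len(xy)
    rows, cols = [], []
    for i in range(n):
        for j in range(i + 1, n):
            close = np.linalg.norm(xy[i] - xy[j]) < r
            sim_scale = max(scales[i], scales[j]) / min(scales[i], scales[j]) < s_ratio
            sim_app = np.linalg.norm(sifts[i] - sifts[j]) < d_max
            if close and sim_scale and sim_app:
                rows.append(i); cols.append(j)
    adj = csr_matrix((np.ones(len(rows)), (rows, cols)), shape=(n, n))
    _, labels = connected_components(adj, directed=False)
    return labels   # labels[i] = group id of feature i
```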
Detects repeated structures with a graph-based method, then refines the feature representation to improve recognition
Classification by subdividing the surface of the earth into thousands of multiscale geographic cells
Train a CNN on geotagged images; inference outputs a discrete probability distribution over the earth
Train an LSTM to exploit temporal coherence in photo albums
a croissant & the Eiffel Tower in the same album can be located to Paris
Related work
IM2GPS
Matching aerial images
use CNN to learn joint embedding for ground & aerial images, localize by matching against aerial images
use a CNN to transform ground-level features into the feature space of aerial images
Image retrieval
more accurate at matching buildings, but not good at natural scenes or articulated objects
local-feature-based approaches focus on localization within cities, based on web images or street-view images
Skyline2GPS segments out the skyline and matches it against 3D models
Exact 6DoF pose
Pose estimation: 3D models reconstructed using SfM
localize by establishing correspondences between the query & points in the 3D model, then solving a PnP problem to obtain the camera parameters
matching is expensive; sped up using efficient image retrieval
Landmark recognition
Scene recognition
CNN approach outperforms others
LSTM
exploit potential of temporal coherence to geolocate images
HMM on photo albums to learn tourist routes
structured SVM with additional features
Images2GPS: HMM, similar to this paper
Attributes
population, elevation, household income
Classification is better than regression, as it can express uncertainty by assigning confidences
Adaptive Partition using S2 Cells
Recursively subdivide cells until no cell contains more than a fixed number $t_1$ of photos
Discard all cells with fewer than $t_2$ photos, and remove their images from the training set
Classes are balanced, parameter space is used effectively, and cells are small in city areas, giving street-level accuracy there
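A minimal sketch of the adaptive partition, substituting a plain lat/lon quadtree for S2 cells; `t1`, `t2` mirror the thresholds above:

```python
def partition(photos, box, t1=10000, t2=50, depth=0, max_depth=30):
    """photos: list of (lat, lon); box: (lat0, lat1, lon0, lon1)."""
    if len(photos) <= t1 or depth == max_depth:
        return [(box, photos)] if len(photos) >= t2 else []   # drop small cells
    lat0, lat1, lon0, lon1 = box
    latm, lonm = (lat0 + lat1) / 2, (lon0 + lon1) / 2
    cells = []
    for b in [(lat0, latm, lon0, lonm), (lat0, latm, lonm, lon1),
              (latm, lat1, lon0, lonm), (latm, lat1, lonm, lon1)]:
        sub = [p for p in photos if b[0] <= p[0] < b[1] and b[2] <= p[1] < b[3]]
        cells += partition(sub, b, t1, t2, depth + 1, max_depth)
    return cells

# classes = partition(all_photos, (-90.0, 90.0, -180.0, 180.0))
```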
CNN training
Inception architecture with BN; one-hot labels; random initialization; AdaGrad with lr 0.045
126M photos from all over the web, very noisy; 91M for training & 34M for validation
2.3M geotagged Flickr photos for testing
Assigning locations to the photos of an album is a seq2seq problem, a good fit for an LSTM
Collect 29.7M albums, training 23.5M, testing 6.2M
the final layer before the softmax serves as the embedding, fed into the LSTM unit
keep Inception fixed; train the LSTM & softmax layer
estimate the geographic location of a query by kernel density estimation
Error threshold
street 1km
city 25km
region 200km
country 750km
continent 2500km
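These thresholds can be applied to great-circle errors; a small sketch, assuming predictions & ground truth in degrees:

```python
import numpy as np

def haversine_km(lat1, lon1, lat2, lon2, R=6371.0):
    """Great-circle distance in km between (lat1, lon1) and (lat2, lon2)."""
    p1, p2, dp, dl = map(np.radians, [lat1, lat2, lat2 - lat1, lon2 - lon1])
    a = np.sin(dp / 2) ** 2 + np.cos(p1) * np.cos(p2) * np.sin(dl / 2) ** 2
    return 2 * R * np.arcsin(np.sqrt(a))

THRESHOLDS_KM = {"street": 1, "city": 25, "region": 200,
                 "country": 750, "continent": 2500}

def accuracy_at(errors_km, level):
    """Fraction of queries localized within the given threshold."""
    return float(np.mean(errors_km <= THRESHOLDS_KM[level]))
```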
Instance retrieval works well if
images in the gallery whose field of view overlaps the query
the content of the query is well suited for local feature matching (e.g. a distinctive geological feature)
More like scene classification
higher-level understanding of image
This paper
retrieval approach: the visual world is too complex for a deep model to memorize, while a retrieval approach trivially does so
deep feature learning
the classification method works better than a Siamese approach
for classification, the choice of discretization influences accuracy
Prior work
limited spatial scale, city
special class of images, landmarks, street-views
aerial imagery
combinatorial partitioning: generates fine-grained output classes by intersecting multiple coarse-grained partitionings
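A sketch of the combinatorial idea: the fine class of an image is the tuple of its coarse cell ids, i.e. the intersection of its coarse cells (names are illustrative):

```python
def combinatorial_classes(cell_ids_per_partitioning):
    """cell_ids_per_partitioning: list of sequences, one per coarse partitioning,
    each giving the cell id of every image. Returns a fine class id per image."""
    combos = list(zip(*cell_ids_per_partitioning))   # tuple of coarse ids per image
    index = {c: i for i, c in enumerate(sorted(set(combos)))}
    return [index[c] for c in combos]                # intersection cells as classes
```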
Context-aware mask
regions of interest
features from time-varying objects introduce misleading cues into geo-localization
not only local features, but also context in the scene
This paper
end-to-end CNN; image representations adaptively reweight features based on image context
Contextual Reweighting Network: takes feature maps, estimates weight for features based on its surrounding region
retrieval task, triplet ranking loss
geometric verification to generate positive images
hard negative mining
context-adaptive feature weighting
original VLAD: $v_k = \sum_{l\in R} a_l^k(d_l-c_k)$
$f = [f_1, f_2, \cdots, f_K]$
$f_k = \sum_{l\in R} m_l\cdot a_l^k(d_l-c_k)$
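A numpy sketch of this reweighted aggregation; `M` holds the contextual weights $m_l$ estimated by the CRN, `A` the soft assignments $a_l^k$:

```python
import numpy as np

def crn_vlad(D, A, M, C):
    """D: (L, dim) descriptors, A: (L, K) assignments,
    M: (L,) contextual weights, C: (K, dim) centroids."""
    resid = D[:, None, :] - C[None, :, :]         # (L, K, dim) residuals d_l - c_k
    f = np.einsum('l,lk,lkd->kd', M, A, resid)    # sum_l m_l * a_l^k (d_l - c_k)
    return f.reshape(-1)                          # concatenated [f_1 ... f_K]
```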
Cross-view image-based ground-to-aerial geo-localization
Siamese architecture, metric learning
fully convolutional layers extract local features
NetVLAD encoding to global descriptors
ground views do not cover every area
bird's-eye views densely cover the earth
challenging because of the change of viewpoint; SIFT/SURF fail
CNN
prior work adds a branch to estimate orientation & uses multiple orientations to find the best matching angle
This work
NetVLAD, invariant to large viewpoint changes
NetVLAD aggregates local features into global representations independent of feature locations
Siamese network
weighted soft margin ranking loss
speeds up training convergence & improves retrieval accuracy
embedded in triplet & quadruplet losses
Related work
hand-crafted features in cross-view matching
warp the ground image to a top-down view
aerial images are oblique; facade matching
Learnable feature
Image retrieval loss
max-margin triplet loss
soft-margin triplet loss
triplet loss places no constraint on irrelevant pairs, so inter-class variation can remain small
quadruplet & angular loss
Local feature extraction
fully convolutional network $f^L$
satellite image: $U_s=f^L(I_s;\Theta_s^L)$
ground image: $U_g=f^L(I_g;\Theta_g^L)$
Global Descriptor Generation
CVM-Net-I: two independent NetVLADs
$v_i=f^G(U_i; \Theta_i^G),\ i\in\{s,g\}$
For the NetVLAD parameters, the two sets of centroids $C_i=\{c_{i,1},\cdots, c_{i,K}\}$ have the same number $K$
$v_s, v_g$ lie in the same space and can be compared directly for similarity
local features $U = \{u_1,\cdots,u_N\}$; the $k^{th}$ VLAD component is $V(k)=\sum_{j=1}^N \bar{a}_k(u_j)(u_j-c_k)$
residual space
CVM-Net-II: NetVLADs with shared weights
add two fc layers between $f^L$ and NetVLAD
first layer: $\Theta_s^{T_1}, \Theta_g^{T_1}$ (separate weights)
second layer: $\Theta^{T_2}$ (shared)
NetVLAD share weights
weight sharing improves metric learning
max-margin loss $L_{max} = \max(0,m+d_{pos}-d_{neg})$; the margin $m$ must be carefully selected
soft-margin loss $L_{soft}=\ln(1+e^d),\ d=d_{pos}-d_{neg}$; slow convergence
weighted loss $L_{weighted}=\ln(1+e^{\alpha d})$; increasing $\alpha$ improves the rate of convergence
quadruplet loss $L_{quad} = \max(0, m_1+d_{pos}-d_{neg}) + \max(0,m_2+d_{pos}-d_{neg}^*)$, with two negative examples
weighted soft-margin quadruplet loss $L_{q,w}=\ln(1+e^{\alpha(d_{pos}-d_{neg})}) + \ln(1+e^{\alpha (d_{pos}-d_{neg}^*)})$
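A small sketch of these losses on precomputed embedding distances; `alpha` and the second negative `d_neg_star` follow the formulas above:

```python
import numpy as np

def weighted_soft_margin(d_pos, d_neg, alpha=10.0):
    """ln(1 + exp(alpha * (d_pos - d_neg))); larger alpha sharpens the slope."""
    return np.log1p(np.exp(alpha * (d_pos - d_neg)))

def weighted_quadruplet(d_pos, d_neg, d_neg_star, alpha=10.0):
    """Weighted soft-margin embedded in the quadruplet loss (two negatives)."""
    return (weighted_soft_margin(d_pos, d_neg, alpha)
            + weighted_soft_margin(d_pos, d_neg_star, alpha))
```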
use orientation information
azimuth & altitude $(\theta, \phi)$
add two channels (U-V maps)
ground-level panorama: $(\theta, \phi)$
aerial image: $(\theta, r)$
Two schemes
U-V maps injected into the input layer only
U-V maps injected into all layers
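A sketch of such orientation maps, assuming equirectangular ground panoramas and square aerial crops; the normalization choices are illustrative:

```python
import numpy as np

def ground_uv(h, w):
    """Per-pixel (azimuth theta, altitude phi) maps for a panorama, in [-1, 1]."""
    u = np.tile(np.linspace(-1, 1, w), (h, 1))            # azimuth theta
    v = np.tile(np.linspace(-1, 1, h)[:, None], (1, w))   # altitude phi
    return np.stack([u, v], axis=-1)                      # (h, w, 2)

def aerial_uv(s):
    """Per-pixel (azimuth theta, radius r) maps for an s x s aerial image."""
    ys, xs = np.mgrid[0:s, 0:s] - (s - 1) / 2
    u = np.arctan2(ys, xs) / np.pi                        # azimuth theta
    v = np.hypot(xs, ys) / ((s - 1) / 2)                  # radius r
    return np.stack([u, v], axis=-1)                      # (s, s, 2)

# img_uv = np.concatenate([img, ground_uv(*img.shape[:2])], axis=-1)
```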
Feature embedding
features $\mathcal{X}$ of dimension $W\times H\times D$
$f = [f^1,\cdots, f^D]^T$, $f^k = \left(\frac{1}{WH}\sum_{w=1}^W\sum_{h=1}^H x_{w,h,k}^p\right)^{1/p}$
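The pooling above reads as a generalized mean over the spatial grid; a one-function sketch (the $1/p$ exponent is reconstructed from the truncated formula):

```python
import numpy as np

def gem_pool(x, p=3.0, eps=1e-6):
    """x: (W, H, D) feature maps -> (D,) embedding; p = 1 gives average pooling."""
    return np.mean(np.clip(x, eps, None) ** p, axis=(0, 1)) ** (1.0 / p)
```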
Weighted soft-margin ranking loss
Hard exemplar mining
automatically allocates weights to triplets according to their difficulty level
This paper
new triplet loss to improve quality of network training; online hard exemplar mining, end-to-end
lightweight attention module FCAM
Siamese network achieves state of the art
FCAM
attention both on channel & spatial dimensions
Channel attention: CBAM; spatial attention: context-aware feature reweighting
feature map $U\in\mathbb{R}^{W\times H\times C}$, channel attention descriptor $Z^C(U)\in\mathbb{R}^{1\times 1\times C}$, spatial attention mask $Z^S(U')\in\mathbb{R}^{W\times H\times 1}$
$U'=Z^C(U)\otimes U$, $U''=Z^S(U')\otimes U'$
Channel attention
$Z^C(U) = \delta(f_{ext}(f_{max}(U)) + f_{ext}(f_{avg}(U))) = \delta(W_2^e\sigma(W_1^e v^1) + W_2^e\sigma(W_1^e v^2))$
Spatial attention
$Z^S(U') = \delta(f^{1\times 1}([f^{3\times 3}(s);f^{5\times 5}(s);f^{7\times 7}(s)]))$
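A PyTorch sketch of an FCAM-like block following the two equations above (CBAM-style channel attention, then multi-kernel spatial attention); the layer sizes are illustrative, not the paper's:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FCAM(nn.Module):
    def __init__(self, c, reduction=16):
        super().__init__()
        # Shared excitation MLP (W_1^e, W_2^e) applied to max- & avg-pooled vectors
        self.mlp = nn.Sequential(nn.Conv2d(c, c // reduction, 1), nn.ReLU(),
                                 nn.Conv2d(c // reduction, c, 1))
        # Multi-scale spatial convs f^{3x3}, f^{5x5}, f^{7x7}, fused by f^{1x1}
        self.convs = nn.ModuleList(
            [nn.Conv2d(c, 1, k, padding=k // 2) for k in (3, 5, 7)])
        self.fuse = nn.Conv2d(3, 1, 1)

    def forward(self, u):                       # u: (B, C, H, W)
        zc = torch.sigmoid(self.mlp(F.adaptive_max_pool2d(u, 1))
                           + self.mlp(F.adaptive_avg_pool2d(u, 1)))
        u1 = zc * u                             # U' = Z^C(U) (x) U
        zs = torch.sigmoid(self.fuse(torch.cat([c(u1) for c in self.convs], 1)))
        return zs * u1                          # U'' = Z^S(U') (x) U'
```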
Hard exemplar reweighting triplet loss
Anchor $A_i$, positive $P_i$, $k$-th negative $N_{i,k}$
Original: $L_{tri}(A_i,P_i,N_{i,k}) = \max(0, m+d_P(i)-d_n(i,k))$
Soft-margin: $L_{soft}(A_i,P_i,N_{i,k}) = \log(1 + \exp(d_P(i)-d_n(i,k)))$
Weight allocated according to difficulty level: $L_{hard} = w_{hard}(A_i,P_i, N_{i,k}) \cdot \log(1 + \exp(d_P(i)-d_n(i,k)))$
Difficulty level
Most difficult: negative with smallest distance
$gap(i,k) = d_n(i,k)-d_p(i)$; extremely hard: $C_h: gap(i,k)\leq 0$
less informative: $C_s: gap(i,k)\geq m$
$w_{hard}(A_i,P_i,N_{i,k}) = \begin{cases}\epsilon/B, & gap(i,k)\geq m\\\log_2(1+\exp(m/2)), & gap(i,k)\leq 0\\\log_2(1+\exp(-gap(i,k) + m/2)), & \text{otherwise}\end{cases}$
$B$ is the number of anchors in a mini-batch, $m=\frac{\gamma}{2B}\sum_{i=1}^B \left(|f(A_i)|^2 +|f(P_i)|^2\right)$
Orientation regression
angles generated by random rotation serve as labels
$L_{OR}(A_i,P_i,N_{i,k}) = w_{hard}(A_i,P_i,N_{i,k})\cdot(d_R^1(i)+d_R^2(i))$
$L_{HER}(A_i,P_i,N_{i,k}) = \lambda_1 L_{hard}(A_i,P_i,N_{i,k}) + \lambda_2 L_{OR}(A_i,P_i,N_{i,k})$
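A numpy sketch of the reweighting above for a single triplet; `m` and `B` follow the definitions given:

```python
import numpy as np

def w_hard(d_p, d_n, m, B, eps=1e-3):
    """Difficulty weight via gap = d_n - d_p, with eps/B for uninformative triplets."""
    gap = d_n - d_p
    if gap >= m:                                # too easy: nearly ignore
        return eps / B
    if gap <= 0:                                # extremely hard: maximum weight
        return np.log2(1 + np.exp(m / 2))
    return np.log2(1 + np.exp(-gap + m / 2))

def l_hard(d_p, d_n, m, B):
    """Hard-exemplar-reweighted soft-margin triplet loss."""
    return w_hard(d_p, d_n, m, B) * np.log1p(np.exp(d_p - d_n))
```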
Observation: pixels lying on the same azimuth direction in an aerial image approximately correspond to a vertical image column in the ground view image
This paper
a regular polar transform warps aerial images closer to the ground-image domain
a spatial attention mechanism brings corresponding deep features closer in the embedding space
feature aggregation via learning multiple spatial embeddings
aligning the two domains based on geometric correspondences reduces the learning burden of domain alignment
Polar transform on aerial images
some objects may be distorted
spatial-attention-based feature embedding to extract position-aware features
retains image content & encodes layout information
Polar transform
Dynamic Similarity Matching
estimate cross-view orientation alignment
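A minimal sketch of a polar transform that warps a square aerial image toward the panorama domain (rows index radius, columns index azimuth); the exact mapping & interpolation in the paper may differ:

```python
import numpy as np

def polar_transform(aerial, out_h, out_w):
    """aerial: (S, S[, C]) image centered on the query location."""
    s = aerial.shape[0]
    c = (s - 1) / 2.0
    rows = np.arange(out_h)[:, None]            # radius bins (top = far away)
    cols = np.arange(out_w)[None, :]            # azimuth bins
    r = (out_h - rows) / out_h * c
    theta = 2 * np.pi * cols / out_w
    x = np.clip(np.round(c + r * np.sin(theta)).astype(int), 0, s - 1)
    y = np.clip(np.round(c - r * np.cos(theta)).astype(int), 0, s - 1)
    return aerial[y, x]                         # nearest-neighbor sampling
```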
Automatic geo-localization of images & videos: challenging
Existing solutions are limited to heavily visited regions and scale badly to large, ordinary regions
Overview of major research themes in visual geo-localization
challenges & areas that will benefit from these research themes
In particular
availability of web-scale geo-referenced data affects VGL
semantic information
textured RGB & untextured non-RGB 3D models
Real-world applications, emerging trends
Geo-localization: discovering the location where an image or video was captured
Consumers: when, where, who, what, how
Local government: geographic & geological features in a region of interest
Local businesses: marketing by extracting where, what and when
Law enforcement: find the location of incident
Relationship between visual content & location
Still difficult
identifying, extracting & indexing geo-informative features
discover geo-location cues from imagery
searching in massive databases e.g. GIS
Need tech advancements
data-driven geo-informative features, geometric modeling, ...
viewpoints and techniques from diverse areas
Questions
general principles for geo-localization & which features are geo-spatially informative
math models of visual analysis, proper search techniques for matching
enhance vision tasks e.g. object recognition, alignment, ...
Earlier: mostly aerial & satellite imagery
In recent years: ground-level images & videos
Google Street View, plus consumer-captured images
ground-level images are often not level, and a whole image carries only a single camera GPS tag
Large-scale data processing
Urgent need for precise localization: accuracy comparable to or better than handheld GPS devices
Peculiarity & excessive similarity of visual features: man-made structures look very much alike
Non-ideal photography: poor lighting, occlusion, distortion
Non-uniform data formats
Explore web-scale datasets, extract geo-informative features
Image retrieval strategy: match the query content against geo-tagged gallery images, and estimate the query location from the locations of the retrieved images
The two problems share many similarities: massive data, non-ideal imaging, non-level views, frequent occlusion
Differences:
retrieval tries to find as many images as possible across similarity levels; localization only needs to find the best location, not necessarily many matched images; a few highly similar images make localization easier than many less similar ones
image retrieval is interested in all forms of similarity, whereas image geo-localization aims to find images of the same location, not merely similar images
hence image retrieval alone is insufficient; it is important to design localization methods inspired by image retrieval
Broaden the application scope of geo-localization, identify geo-informative features
Use cross-view imagery and user-shared imagery to broaden localization coverage
The size of the covered area matters: the larger the applicable area, the more useful; systematically collected imagery such as Google Street View has limitations at specific places
Exploit aerial imagery: satellites, aircraft; requires cross-view matching, or the use of cross-modal data
Use ground-level user-shared images
Designing geo-informative features
Features that are rich & distinctive in capturing geo-spatial information are essential for successful & efficient geo-localization
Generic content-matching features are poor at capturing geographic information
Architectural style, which encodes much location information, is lost during feature extraction & quantization
Large-scale geo-tagged data can address this problem through bottom-up approaches
Leverage massive data for geo-spatially discriminative, data-driven mid-level features that are also invariant to non-ideal conditions such as lighting & viewpoint changes
Based on high-level & semantic cues
Human-related characteristics: text, architectural style, modes of transport, urban structure
The language on signs pins down the country or region
Architectural style can restrict the search to a country or even a city
Natural characteristics: vegetation types, weather
Aggregate these cues and couple them with their geometric arrangement
Humans perform this kind of semantic recognition all the time
Aerial geo-registration uses similar approaches
Ground-level methods remain to be explored
GIS, Wikipedia, Wikimapia, hashtags with GPS-tagged images
Challenges
which features to use
how to match
how to merge scattered cues into a unified system
Geometric alignment of the query against a geo-referenced 3D model
First build a geo-referenced 3D scene model via SfM
Then perform 2D-3D matching using the extracted features
By-products: 6DoF camera parameter estimation (camera rotation & position) and an estimate of the location of the image content
Two paths: textured 3D point clouds vs. untextured models; the latter suits regions without dense image coverage
Challenges
building high-fidelity textured 3D models and matching against them
huge datasets, highly variable imaging parameters, difficult wide-area reconstruction
current SfM methods can build city-scale 3D models: first build a coarse solution via hybrid discrete-continuous optimization, then refine it iteratively; exploit various information such as camera-point relations, geotags & vanishing-point estimates; the resulting models are fairly robust
2D-3D matching: the huge size of 3D models, effective use of the model's geometric information, and effective matching under large imaging variation
Untextured 3D models
SfM needs many images for reconstruction
satellite imagery enables 3D modeling at sparsely photographed locations
challenges: which features permit cross-modal matching, and how to search at scale
mountain silhouette matching, terrain contour matching
What is geo-location good for?
Cameras now have GPS, and the photos they produce carry GPS tags
Geo-location helps in understanding image content
New geo-referenced data sources
past: satellite imagery, aerial imagery
present: user ground-level images, street-view imagery
future: consumer drone imagery
The temporal & spatial resolution of satellite imagery keeps improving
Temporal geo-localization & new applications
Deep-learning-based geo-localization
cross-view image geo-localization, semantic feature learning
cross-view/cross-modality matching
end-to-end geo-spatial feature learning
temporal geo-localization using RNN
Data-driven
Discovering spatio-temporal mid-level visual connections:
weakly supervised data mining discovers connections between recurring spatio-temporal mid-level elements, capturing latent visual style;
the goal is to discover visual elements whose appearance changes with time or location, yet varies consistently across the label space;
first identify style-sensitive patch groups, then incrementally build correspondences to find the same elements, and finally train style-aware regressors to model each element's style variation
Where was this photo taken? Learning location prediction from Flickr images
Uses location-annotated Internet photos to study localization from visual content
Geographic clusters cover general geographic regions, not just landmarks
Geographic clusters provide more training samples and higher recognition accuracy
Solves the estimation problem by introducing latent variables
Cross-view image geo-localization
A cross-view feature translation approach extends image geo-localization to the vast majority of regions that lack ground-level photos, localizing queries with no ground-level reference imagery
Learns a mapping from ground-level appearance to overhead appearance & land-cover attributes, trained from sparse geo-tagged ground images with corresponding aerial & land-cover data
Ultra-wide baseline facade matching for geo-localization
Matching street-level images is hard due to extreme viewpoint & illumination changes; color, gradient distributions & local descriptors fail, leaving only the self-similarity structure of patterns
The authors propose a "scale-selective self-similarity" descriptor to capture this structure; a new scale-selection method for facade extraction & segmentation; and a new geometric method that aligns satellite & bird's-eye-view images to extract facades in a stereo graph-cut framework
Given labeled descriptors from a bird's-eye-view database, query matching is performed via Bayesian classification
Discusses camera pose estimation from building corners & matched aerial imagery
Semantic Reasoning
Semantics-guided geo-localization & modeling in urban environments
Semantic labeling via semantic segmentation helps retrieve views similar to the query and guides recognition of common scene layouts in the reference dataset; effective for urban localization, semantic concept discovery & intersection recognition
Landmark recognition in large-scale social image collections
2M images, a 500-class classification problem for landmark recognition
Dataset & classes are gathered automatically from Flickr by finding peaks in the spatial geotag distribution corresponding to frequently photographed landmarks
Modeled with multiclass SVMs over conventionally vector-quantized interest-point descriptors
Incorporating non-visual semantic metadata shows that text tags & temporal constraints improve classification accuracy
Applying CNNs turns out to outperform the traditional methods
Geometric Matching
World-wide pose estimation using 3D point clouds
Determines where a photo was taken by estimating the 6DoF pose & intrinsics of the camera against a large geo-registered 3D point cloud, combining image localization, landmark recognition & 3D pose estimation
Two techniques scale the approach to datasets with tens of thousands of images & tens of millions of 3D points: a RANSAC co-occurrence prior, and bidirectional matching between image & 3D features
Exploiting spatial & co-visibility relations for image-based localization
Improves the effectiveness of 2D-3D-matching-based methods by exploiting spatial & co-visibility relations among 3D points
Geometric-matching-based localization: estimate the position & orientation of a given query relative to a 3D model of the scene
SfM can rapidly reconstruct large-scale scenes; methods are needed that quickly handle models with millions of 3D points; earlier feature-matching-based methods are slow
Combines 2D-to-3D & 3D-to-2D matching
3D point cloud reduction via mixed-integer quadratic programming
Analyzes per-point view statistics in a training corpus to speed up matching & reduce the memory footprint
Given a set of training images, the proposed method identifies a compact subset of the 3D point cloud for efficient localization, with performance comparable to the full cloud
The problem can be cast as a mixed-integer quadratic program; per-point descriptor calibration improves matching ability
Image-based geo-localization in mountainous regions
Extracts representations from digital elevation models (DEMs) for fast visual database lookup
The proposed method effectively exploits visual information (contours) & geometric constraints (consistent orientation)
Large-scale rendering for skyline representation & matching
Given an image, automatically extract the skyline and match it against a database of reference skylines extracted from DEM-rendered images
The sampling density of rendered locations determines accuracy & speed; the proposed method combines global planning with local greedy search to incrementally select new rendering locations
Human-assisted geo-localization of unlabeled desert images
Generates synthetic skyline views from DEMs, then extracts stable concavity-based features
The user manually traces the skyline on the input photo; the skyline is refined from this estimate and concavity-based features are extracted
Applies geometrically constrained matching techniques for efficient matching, and hence localization
Visual geo-localization of non-photographic depictions via 2D-3D alignment
Localizes using arbitrary 2D depictions of architecture, including drawings, paintings & historical photographs
Aligns the input depiction with a 3D model; the task is hard because the appearance & structure of a 2D depiction can differ greatly from the 3D model's appearance & geometry, and the search space is huge
To address this, a compact representation of complex 3D scenes is proposed
Real-world
Localization-aided, memory-efficient recognition
Fast recognition of city landmarks on GPS-enabled devices
Uploads the GPS position to a server, downloads a compact location-specific classifier, and runs the query locally
Trains compact random-forest classifiers on GPS-tagged image datasets
Densely searches the image for local features extracted from the training images to compute the feature vector for the RDF; rectifies using vanishing points
Real-world systems for image geo-localization
Photo Recall: searching photos using Internet annotations
Mid-level representation
weakly supervised, mining connections between recurring mid-level visual elements in temporal & spatial image collections
underlying visual style: not visually consistent throughout the dataset, but changing with time or location, with consistent variation across the label space
First identify patches that are style-sensitive; then build correspondences to find the same element across the dataset; finally train style-aware regressors to model each element's changes
Discover patterns correlated with temporal & spatial information
low-level: unsupervised clustering
mid-level: termed visual style; the central idea is to create generic, style-agnostic visual element detectors while modeling style-specific variations with weak supervision
first discover recurring visual elements regardless of their stylistic variation, then model those stylistic variations
first cluster style-sensitive patches; for each cluster, train a generic detector to find the same visual element across the whole dataset; for each correspondence set, train style-specific regressors to distinguish the subtle style variations between instances of similar elements
Mining style-sensitive visual elements
match image patches with HOG features; cluster with N nearest neighbors
compute the entropy over time, $E(c) = -\sum_{i=1}^n H(i)\log H(i)$; higher entropy indicates greater style sensitivity
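A tiny sketch of this score, assuming each cluster member carries a discrete label (e.g. a date bin):

```python
import numpy as np

def temporal_entropy(labels, n_bins):
    """Entropy of a cluster's label histogram H, per the formula above."""
    h = np.bincount(labels, minlength=n_bins).astype(float)
    h /= h.sum()
    nz = h > 0
    return float(-np.sum(h[nz] * np.log(h[nz])))
```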
Building correspondences
each cluster now represents a set of style-sensitive patches; several such clusters may depict similar parts, so the clusters are merged; we assume these visual elements change gradually along the label dimension
use one cluster as the initial positive set and incrementally revise it along neighboring labels; concretely, train an SVM with natural-world images as negatives, progressively expanding the label space and adding new data
Train style-aware regressors