This paper:
Ultra compact place representation
near sub-linear storage scaling
extremely lightweight compute requirements
Hays, J., & Efros, A. A. (2008). IM2GPS: Estimating geographic information from a single image. 26th IEEE Conference on Computer Vision and Pattern Recognition, CVPR, 05.
A probability distribution over the Earth's surface
human ability: from an image get rich information
semantic reasoning: people's faces & clothes, language of signs, types of trees & plants, topography of terrain, etc
data association: what is it like? have I seen something similar before? even if not, it helps to define the type of place
gigantic image collections make data association feasible
visual localization is possible when data available
Jacobs geolocate a webcam based on correlating its video-stream with satellite weather maps
availability of GPS-tagged images
advances in multi-view geometry & efficient feature matching
also used in co-registering online photographs of landmarks for browsing & summary; image retrieval in location-labeled collections
No datasets large enough to sample the world
geometric constraints require exact match, so will retrieve nothing
Many scenes look similar, e.g. forests, deserts, mountains, cities
Proposed methods:
for famous landmarks with enough matches, return a precise GPS location
for generic scenes like desert, return a location probability
intersection images with both GPS coordinates & geographic keywords
geo-tag images of cats, but less likely to label with "New York City"
collect images with keywords of all countries, continents, popular cities, etc
exclude keywords such as birthday, concert, etc
yields 6,472,304 images; downsized to a maximum dimension of 1024 pixels and JPEG-compressed, about 1 TB of data
Test set: 237 images
Generic scenes are hard to locate, but visual features must be strongly correlated with geography
what features? best exploit the correlation between image & location
Tiny images: directly in color image space; reducing dimensions drastically
Color histograms: CIE L*a*b* color space; 4, 14, 14 bins respectively, 784 dimensions in total; $\chi^2$ distance
Texton histograms: texture features, 512 dimensions, $\chi^2$ distance
Line features: Canny edges, two histograms with line angles & length
Gist Descriptor + Color
Geometric Context: compute geometric class probability of ground, sky & vertical
precomputed all features, per image 15 seconds, all 3 days with a cluster of 400 processors
brute-force matching to find nearest image
k-NN, forms a probability map over the globe
mean-shift clustering on 3D points, discard clusters with fewer than 4 matches
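The retrieval and clustering steps above can be sketched roughly as follows. This is a minimal illustration, not the IM2GPS implementation: the flat-kernel mean shift, the `bandwidth` value, and the function names are assumptions made for the example; only the $\chi^2$ distance, the k-NN step, the 3D points, and the 4-match cutoff come from the notes.

```python
import math

def chi2_distance(h1, h2, eps=1e-12):
    """Chi-squared distance between two non-negative histograms."""
    return 0.5 * sum((a - b) ** 2 / (a + b + eps) for a, b in zip(h1, h2))

def latlon_to_xyz(lat_deg, lon_deg):
    """Map a GPS tag to a 3D point on the unit sphere."""
    lat, lon = math.radians(lat_deg), math.radians(lon_deg)
    return (math.cos(lat) * math.cos(lon), math.cos(lat) * math.sin(lon), math.sin(lat))

def knn(query_hist, gallery, k):
    """gallery: list of (histogram, (lat, lon)). Returns the GPS tags of the k nearest images."""
    ranked = sorted(gallery, key=lambda item: chi2_distance(query_hist, item[0]))
    return [gps for _, gps in ranked[:k]]

def mean_shift_modes(points, bandwidth=0.1, iters=20, min_size=4):
    """Flat-kernel mean shift on 3D points (bandwidth is an assumed value).

    Returns (mode, member_count) for every cluster with at least min_size members.
    """
    shifted = list(points)
    for _ in range(iters):
        moved = []
        for p in shifted:
            nbrs = [q for q in points if math.dist(p, q) <= bandwidth]
            if not nbrs:  # no support under the kernel: leave the point in place
                moved.append(p)
                continue
            moved.append(tuple(sum(c) / len(nbrs) for c in zip(*nbrs)))
        shifted = moved
    clusters = []  # each entry is [mode, member_count]
    for p in shifted:
        for c in clusters:
            if math.dist(p, c[0]) <= bandwidth / 2:
                c[1] += 1
                break
        else:
            clusters.append([p, 1])
    return [(mode, n) for mode, n in clusters if n >= min_size]
```

The surviving modes, weighted by their member counts, give a coarse probability map; converting a mode back to latitude and longitude is left out for brevity.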
geographic information, population density, land cover estimates
visual place recognition when the scene undergoes appearance change, e.g. illumination, seasons, aging, structural modifications
when query & gallery same viewpoint, matching across large changes is easier
develop new approach with efficient synthesis of novel views & compact indexable image representation
new Tokyo 24/7 dataset
Recent progress
obtain accurate camera position within a city by a dataset with 1M images or a reconstructed 3D point cloud
representation: SIFT
efficiency: inverted file or product quantization
Observation: matching across large changes in scene appearance is easier when both the query & gallery image depict from the same viewpoint
Idea: synthesizing virtual views
how to efficiently synthesize entire city virtual viewpoints;
how to deal with the increased database size;
how to represent the synthetic views in a way that is robust to large changes in appearance
develop view synthesis method from Google street-view panoramas
use compact VLAD encoding, efficient compression, storage & indexing
represent using SIFT, which is robust to large changes in appearance
Related work
Place recognition with local-invariant features
Virtual views for instance-level matching
Modelling scene illumination for place recognition
Best match when using densely sampled descriptors instead of DoG feature detector, and improved when match to a virtual view
Tentative matches by finding mutually nearest descriptors
Geometrically verify by repeatedly finding several homographies using RANSAC
Use panoramic imagery together with the corresponding approximate piecewise-planar depth map
Repeated structures may degrade retrieval performance because of over-counting in BoW
But with appropriate representation, they could form a distinguishing feature
Robust detection of repeated structures, then modify weights in BoW
$V$ visual words; image $d$ is represented as $v_d = (t_1, t_2, \dots, t_V)^T$
$t_i = \frac{n_{id}}{n_d}\log \frac{N}{N_i}$
$n_{id}$: number of occurrences of word $i$ in image $d$; $n_d$: number of all words in image $d$; $N_i$: number of images containing word $i$; $N$: number of all images
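As a small worked sketch of the tf-idf weighting above, the snippet below computes $t_i$ for every image from lists of visual-word ids. The function name and the input layout (one list of word ids per image) are assumptions for illustration.

```python
import math
from collections import Counter

def tfidf_vectors(docs, vocab_size):
    """docs: one list of visual-word ids per image. Returns one tf-idf vector per image."""
    n_images = len(docs)                      # N
    doc_freq = Counter()                      # N_i: number of images containing word i
    for words in docs:
        doc_freq.update(set(words))
    vectors = []
    for words in docs:
        counts = Counter(words)               # n_id
        n_d = len(words)                      # number of all words in image d
        vec = [0.0] * vocab_size
        for i, n_id in counts.items():
            vec[i] = (n_id / n_d) * math.log(n_images / doc_freq[i])
        vectors.append(vec)
    return vectors
```

A word that appears in every image gets weight $\log 1 = 0$, so it contributes nothing to the comparison.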
Soft-assignment weighting
soft-assign each descriptor to several closest cluster centers with weights $\exp(-\frac{d^2}{2\sigma^2})$
Burstiness weighting
a visual word that occurs once in an image is much more likely to occur again in the same image than independence would predict
downweight by factor $\frac{1}{\sqrt{n_{id}}}$
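The two weightings above reduce to very small functions. The sketch below is only illustrative: the function names are made up here, and the soft-assignment weights are returned unnormalised (whether and how they are normalised afterwards is not stated in the notes).

```python
import math

def soft_assign_weights(distances, sigma):
    """Weights exp(-d^2 / (2 sigma^2)) for the distances to the closest cluster centers."""
    return [math.exp(-(d * d) / (2.0 * sigma * sigma)) for d in distances]

def burstiness_weighted_count(n_id):
    """Scale each of the n_id occurrences by 1/sqrt(n_id), so the total becomes sqrt(n_id)."""
    return n_id / math.sqrt(n_id) if n_id > 0 else 0.0
```

With burstiness weighting, a word seen 100 times counts as much as 10 independent occurrences rather than 100.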
Proposed method
explicitly detect localized image areas with repetitive structures
use the detected repetitions to adaptively adjust weights
Operate on the extracted local features
feature segmentation, finding connected components in a graph
$G=(V,E)$, $V = \{(x_i, s_i, d_i)\}_{i=1}^N$
$V$ is the set of local features at locations $x_i$, with scales $s_i$ and SIFT descriptors $d_i$
two vertices connected when position close & similar scale & appearance
Group vertices into disjoint groups
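The grouping step above amounts to finding connected components. A minimal union-find sketch follows; the three thresholds (`pos_tol`, `scale_ratio`, `desc_tol`) and the exact form of the closeness tests are hypothetical, since the notes only say that position, scale, and appearance must be similar.

```python
import math

def group_features(features, pos_tol, scale_ratio, desc_tol):
    """features: list of (x, s, d) with position x, scale s > 0, descriptor d.

    Returns disjoint groups as lists of feature indices.
    """
    n = len(features)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    def connected(a, b):
        (xa, sa, da), (xb, sb, db) = a, b
        close = math.dist(xa, xb) <= pos_tol * (sa + sb)       # assumed test
        similar_scale = max(sa, sb) / min(sa, sb) <= scale_ratio
        similar_look = math.dist(da, db) <= desc_tol
        return close and similar_scale and similar_look

    for i in range(n):
        for j in range(i + 1, n):
            if connected(features[i], features[j]):
                parent[find(i)] = find(j)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())
```

Groups with many members would then be treated as repeated structures, and singletons as ordinary features.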
Classification by subdividing the surface of the earth into thousands of multiscale geographic cells
Train a CNN using geotagged images; inference outputs a discrete probability distribution over the Earth
Train an LSTM to exploit temporal coherence in albums
a croissant & the Eiffel Tower in the same album: can be located to Paris
Related work
Matching aerial images
use CNN to learn joint embedding for ground & aerial images, localize by matching against aerial images
use CNN to transform ground-level features to the features space of aerial images
Image retrieval
more accurate at matching buildings, but not good at natural scenes or articulated objects
local-feature-based approaches focus on localization within cities, based on web images or street-view images
Skyline2GPS segments skyline out and matches it against 3D models
Exact 6DoF pose
Pose estimation: 3D models reconstructed using SfM
localized by making correspondences between query & points in 3D model, PnP problem to obtain parameters
Matching is expensive, using efficient image retrieval
Landmark recognition
Scene recognition
CNN approach outperforms others
exploit potential of temporal coherence to geolocate images
HMM on photo albums to learn tourist routes
structured SVM with additional features
Images2GPS, HMM, like this paper
population, elevation, household income
Classification better than regression, as it can express uncertainty by assigning confidence
Adaptive Partition using S2 Cells
Recursively subdivide cells until no cell contains more than a certain fixed number $t_1$ of photos
Discard all cells with fewer than $t_2$ photos, and remove the images in these cells from the training set
Classes are balanced, effective use of parameter space, street-level accuracy in city areas where cells are small
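The adaptive subdivision can be sketched as a simple recursion. Note the substitution: this example uses a latitude/longitude quadtree as a stand-in for S2 cells, purely to keep it self-contained; `max_depth` and the function name are also assumptions.

```python
def adaptive_partition(photos, t1, t2, bounds=(-90.0, 90.0, -180.0, 180.0), max_depth=20):
    """photos: list of (lat, lon). Returns leaf cells as (bounds, photos_in_cell).

    A cell is split while it holds more than t1 photos; leaves with fewer
    than t2 photos are discarded together with their photos.
    """
    lat0, lat1, lon0, lon1 = bounds
    if len(photos) <= t1 or max_depth == 0:
        return [(bounds, photos)] if len(photos) >= t2 else []
    lat_mid, lon_mid = (lat0 + lat1) / 2, (lon0 + lon1) / 2
    quadrants = {}
    for lat, lon in photos:
        quadrants.setdefault((lat >= lat_mid, lon >= lon_mid), []).append((lat, lon))
    cells = []
    for (north, east), subset in quadrants.items():
        sub_bounds = (lat_mid if north else lat0, lat1 if north else lat_mid,
                      lon_mid if east else lon0, lon1 if east else lon_mid)
        cells += adaptive_partition(subset, t1, t2, sub_bounds, max_depth - 1)
    return cells
```

Each surviving leaf becomes one output class, so dense areas end up with many small cells and sparse areas with a few large ones.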
CNN training
Inception architecture with BN, label one-hot, random initialize, AdaGrad with lr 0.045
126M photos all over the web, very noisy, 91M training & 34M validation
2.3M geotagged Flickr photos to test
Assigning locations to the photos in an album is a seq2seq problem; an LSTM is a good fit
Collect 29.7M albums, training 23.5M, testing 6.2M
final layer before softmax as embedding, fed into LSTM unit
keep Inception fixed, train LSTM & SoftMax Layer
estimate the geographic location of query by kernel density estimation
Error threshold
street 1km
city 25km
region 200km
country 750km
continent 2500km
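To make the thresholds above concrete, here is a small sketch that scores predictions by great-circle error. The haversine formula and the Earth radius of 6371 km are choices made for this example; the notes do not say how distance is measured.

```python
import math

THRESHOLDS_KM = {"street": 1, "city": 25, "region": 200, "country": 750, "continent": 2500}

def haversine_km(a, b, radius_km=6371.0):
    """Great-circle distance in km between two (lat, lon) pairs given in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * radius_km * math.asin(min(1.0, math.sqrt(h)))

def accuracy_at_thresholds(predictions, truths):
    """Fraction of predictions whose error falls within each named threshold."""
    errors = [haversine_km(p, t) for p, t in zip(predictions, truths)]
    return {name: sum(e <= km for e in errors) / len(errors)
            for name, km in THRESHOLDS_KM.items()}
```

For example, a photo taken in Paris but predicted in London (roughly 340 km away) counts as correct at the country and continent levels only.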
Instance retrieval works well if
images in gallery field of view overlaps query
content of query is well suited for local feature matching (distinctive geological feature)
More like scene classification
higher-level understanding of image
This paper
retrieval approach: visual world is too complex for deep model to memorize, retrieval approach trivially does so
deep feature learning
classification method is better than Siamese
for classification, different discretization influence
Previous work
limited spatial scale, city
special class of images, landmarks, street-views
aerial imagery
combinatorial partitioning, generates fine-grained output classes by intersecting multiple coarse-grained partitionings
Contextual Aware mask
regions of interest
features from time-varying objects introduce misleading cues into geo-localization
not only local features, but also context in the scene
This paper
E2E CNN, image representations adaptively reweight features based on image context
Contextual Reweighting Network: takes feature maps, estimates weight for features based on its surrounding region
retrieval task, triplet ranking loss
geometric verification to generate positive images
hard negative mining
context-adaptive feature preponderance
original VLAD: $v_k = \sum_{l\in R} a_l^k(d_l-c_k)$
$f=[f_1, f_2, \cdots, f_K]$
$f_k = \sum_{l\in R} m_l\cdot a_l^k(d_l-c_k)$
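The difference between plain VLAD and the context-weighted version is just the per-location factor $m_l$. A minimal sketch with plain lists, assuming the assignments $a_l^k$ and weights $m_l$ are already given (in the paper they come from the network); the function name is made up for this example.

```python
def weighted_vlad(descriptors, centers, assignments, weights=None):
    """Compute f_k = sum_l m_l * a_l^k * (d_l - c_k) for every cluster k.

    descriptors: list of D-dim local descriptors d_l
    centers:     list of K cluster centers c_k
    assignments: list of K-dim assignment vectors a_l (hard or soft)
    weights:     per-descriptor context weights m_l; None gives original VLAD
    """
    n_clusters, dim = len(centers), len(centers[0])
    if weights is None:
        weights = [1.0] * len(descriptors)
    f = [[0.0] * dim for _ in range(n_clusters)]
    for d_l, a_l, m_l in zip(descriptors, assignments, weights):
        for k in range(n_clusters):
            for j in range(dim):
                f[k][j] += m_l * a_l[k] * (d_l[j] - centers[k][j])
    return f
```

Setting a weight to zero removes that location's residual entirely, which is how features on misleading regions can be suppressed.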
Cross-view image-based ground-to-aerial geo-localization
Siamese architecture, metric learning
fc layers, extract local features
NetVLAD encoding to global descriptors
Ground-view images do not cover all areas
bird's eye view densely covers earth
challenging because of change of viewpoint, SIFT/SURF fail
old work add branch to estimate orientation & use multiple orientations to find best angle for matching
This work
NetVLAD, invariant against large viewpoint change
NetVLAD aggregates local features to form global representations independent of locations
Siamese network
weighted soft margin ranking loss
speeds up training convergence & improves retrieval accuracy
embedded in triplet & quadruplet losses
Related work
hand-crafted features in cross-view matching
warped ground image to top-down view
aerial image oblique, facade matching
Learnable feature
Image retrieval loss
max-margin triplet loss
soft-margin triplet loss
triplet loss on constraint on irrelevant pairs, make inter-class variation small
quadruplet & angular loss
Local feature extraction
fc net, $f^L$
satellite image, $U_s=f^L(I_s;\Theta_s^L)$
ground image, $U_g=f^L(I_g;\Theta_g^L)$
Global Descriptor Generation
CVM-Net-I: two independent NetVLADs
$v_i=f^G(U_i; \Theta_i^G)$, $i\in\{s,g\}$
NetVLAD parameters: centroids $C_i=\{c_{i,1},\cdots, c_{i,K}\}$; the number of centroids $K$ is the same for both branches
$v_s, v_g$ are in the same space and are used for direct similarity comparison
local features $U=\{u_1,\cdots,u_N\}$; the $k^{th}$ VLAD component is $V(k)=\sum_{j=1}^N \bar{a}_k(u_j)(u_j-c_k)$
residual space
CVM-Net-II: NetVLADs with shared weights
add two fc layers between f L f^L f L and NetVLAD
first layer (independent weights) $\Theta_s^{T_1},\Theta_g^{T_1}$
second layer (shared weights) $\Theta^{T_2}$
NetVLAD share weights
weight sharing improves metric learning
max-margin loss $L_{max} = \max(0,m+d_{pos}-d_{neg})$; the margin $m$ must be carefully selected
soft-margin loss $L_{soft}=\ln(1+e^d)$, $d=d_{pos}-d_{neg}$; slow convergence
weighted loss $L_{weighted}=\ln(1+e^{\alpha d})$; increasing $\alpha$ improves the rate of convergence
quadruplet loss $L_{quad} = \max(0, m_1+d_{pos}-d_{neg}) + \max(0,m_2+d_{pos}-d_{neg}^*)$; two negative examples
weighted soft-margin quadruplet loss $L_{q,w}=\ln(1+e^{\alpha(d_{pos}-d_{neg})}) + \ln(1+e^{\alpha (d_{pos}-d_{neg}^*)})$
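The losses above are easy to compare numerically. The sketch below writes them as scalar functions of precomputed distances; the gradient helper is an addition for illustration, showing one way to see why a larger $\alpha$ can speed up convergence (the slope with respect to $d$ is scaled by $\alpha$). Function names are assumptions, and `d_neg_star` stands for the second negative distance $d_{neg}^*$.

```python
import math

def max_margin_loss(d_pos, d_neg, m):
    return max(0.0, m + d_pos - d_neg)

def soft_margin_loss(d_pos, d_neg):
    return math.log1p(math.exp(d_pos - d_neg))

def weighted_soft_margin_loss(d_pos, d_neg, alpha):
    return math.log1p(math.exp(alpha * (d_pos - d_neg)))

def weighted_soft_margin_grad(d, alpha):
    """Derivative of ln(1 + exp(alpha * d)) with respect to d."""
    return alpha / (1.0 + math.exp(-alpha * d))

def weighted_quadruplet_loss(d_pos, d_neg, d_neg_star, alpha):
    """Weighted soft-margin form of the quadruplet loss with two negatives."""
    return (weighted_soft_margin_loss(d_pos, d_neg, alpha)
            + weighted_soft_margin_loss(d_pos, d_neg_star, alpha))
```

With $\alpha = 1$ the weighted loss reduces to the plain soft-margin loss, and unlike the max-margin form neither needs a hand-picked margin.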
use orientation information
azimuth & altitude $\theta, \phi$
add a two-channel U-V map
ground-level panorama: $\theta, \phi$
aerial image: $\theta, r$
Two schemes
U-V map injected to the input layer only
U-V map is injected into all layers
Feature embedding
features $\mathcal{X}$, dimensions $W\times H\times D$
$f = [f^1,\cdots, f^D]^T$, $f^k = (\frac{1}{WH}\sum_{w=1}^W\sum_{h=1}^H x_{w,h,k}^p)$
Weighted soft-margin ranking loss
Hard exemplar mining
auto allocates weights to triplets according to difficulty levels
This paper
new triplet loss to improve quality of network training; online hard exemplar mining, end-to-end
lightweight attention module FCAM
Siamese network achieves state-of-the-art results
attention both on channel & spatial dimensions
Channel attention: CBAM; spatial attention: context-aware feature reweighting
feature map $U\in\mathbb{R}^{W\times H\times C}$, channel attention descriptor $Z^C(U)\in\mathbb{R}^{1\times 1\times C}$, spatial attention mask $Z^S(U')\in\mathbb{R}^{W\times H\times 1}$
$U'=Z^C(U)\otimes U$, $U''=Z^S(U')\otimes U'$
Channel attention
$Z^C(U) = \delta(f_{ext}(f_{max}(U)) + f_{ext}(f_{avg}(U))) = \delta(W_2^e\sigma(W_1^e v^1) + W_2^e\sigma(W_1^e v^2))$
Spatial attention
$Z^S(U') = \delta(f^{1\times 1}(f^{3\times 3}(s);f^{5\times 5}(s);f^{7\times 7}(s)))$
Hard exemplar reweighting triplet loss
Anchor $A_i$, positive $P_i$, $k$-th negative $N_{i,k}$
Original: $L_{tri}(A_i,P_i,N_{i,k}) = \max(0, m+d_p(i)-d_n(i,k))$
Soft-margin: $L_{soft}(A_i,P_i,N_{i,k}) = \log(1 + \exp(d_p(i)-d_n(i,k)))$
Weight allocated according to its difficulty level: $L_{hard} = w_{hard}(A_i,P_i, N_{i,k}) * \log(1 + \exp(d_p(i)-d_n(i,k)))$
Difficulty level
Most difficult: negative with smallest distance
$gap(i,k) = d_n(i,k)-d_p(i)$; extremely hard: $C_h: gap(i,k)\leq 0$
less informative: $C_s: gap(i,k)\geq m$
$w_{hard}(A_i,P_i,N_{i,k}) = \begin{cases}\epsilon/B, & gap(i,k)\geq m\\ \log_2(1+\exp(m/2)), & gap(i,k)\leq 0\\ \log_2(1+\exp(-gap(i,k) + m/2)), & \text{otherwise}\end{cases}$
$B$ is the number of anchors in the mini-batch, $m=\frac{\gamma}{2B}\sum_{i=1}^B (|f(A_i)|^2 +|f(P_i)|^2)$
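The piecewise weight is easier to read as code. This is a sketch under assumptions: distances are taken as precomputed scalars, the margin `m` is passed in rather than derived from the batch, and the values used for `eps` and `batch_size` in any call are placeholders.

```python
import math

def hard_weight(gap, m, eps, batch_size):
    """Difficulty weight for one triplet, where gap = d_n - d_p."""
    if gap >= m:                      # less informative: almost ignored
        return eps / batch_size
    if gap <= 0:                      # extremely hard: weight is capped
        return math.log2(1 + math.exp(m / 2))
    return math.log2(1 + math.exp(-gap + m / 2))

def hard_reweighted_loss(d_p, d_n, m, eps, batch_size):
    """Soft-margin triplet loss scaled by the difficulty weight."""
    gap = d_n - d_p
    return hard_weight(gap, m, eps, batch_size) * math.log(1 + math.exp(d_p - d_n))
```

The weight is continuous at $gap = 0$ and equals exactly 1 at $gap = m/2$, so triplets harder than the half-margin point are amplified and easier ones are damped.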
Orientation regression
angles generated by random rotation as label
$L_{OR}(A_i,P_i,N_{i,k}) = w_{hard}(A_i,P_i,N_{i,k})*(d_R^1(i)+d_R^2(i))$
$L_{HER}(A_i,P_i,N_{i,k}) = \lambda_1*L_{hard}(A_i,P_i,N_{i,k}) + \lambda_2*L_{OR}(A_i,P_i,N_{i,k})$
Observation: pixels lying on the same azimuth direction in an aerial image approximately correspond to a vertical image column in the ground view image
This paper
regular polar transform to warp an aerial image closer to ground image
Spatial attention mechanism to bring corresponding deep features closer in the embedding space
feature aggregation via learning multiple spatial embeddings
aligning two domains based on geometric correspondences will reduce the burden of the learning process for domain alignment
Polar transform on aerial images
some objects may have distortions
spatial attention based feature embedding to extract position-aware features
retains image content & encodes layout information
Polar transform
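A minimal sketch of such a polar warp is given below. The notes do not record the exact mapping, so the conventions here are assumptions: azimuth 0 points to the top of the aerial image, the top output row samples the outer ring, the bottom rows approach the centre, and sampling is nearest-neighbour on a square image stored as nested lists.

```python
import math

def polar_transform(aerial, out_h, out_w):
    """Warp a square aerial image so that each output column is one azimuth direction."""
    size = len(aerial)
    centre = (size - 1) / 2.0
    out = []
    for i in range(out_h):
        radius = (out_h - i) / out_h * centre       # top row = far from the camera
        row = []
        for j in range(out_w):
            theta = 2 * math.pi * j / out_w          # azimuth of this column
            src_r = centre - radius * math.cos(theta)
            src_c = centre + radius * math.sin(theta)
            r = min(size - 1, max(0, round(src_r)))  # clamp to the image
            c = min(size - 1, max(0, round(src_c)))
            row.append(aerial[r][c])
        out.append(row)
    return out
```

After the warp, pixels along one azimuth ray of the aerial image sit in one column, which roughly matches the layout of a ground-level panorama.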
Dynamic Similarity Matching
estimate cross-view orientation alignment
Automatic geo-localization of images & videos: challenging
Existing solutions are limited to highly-visited regions and scale badly to large & ordinary regions
Overview of major research themes in visual geo-localization
challenges & areas that will benefit from these research themes
availability of web-scale geo-referenced data affects VGL
semantic information
textured RGB & untextured non-RGB 3D models
Realworld applications, emerging trends
Geo-localization: discovering the location where an image or video was captured
Consumers: when, where, who, what, how
Local government: geographic & geological features in region of interest
Local businesses: marketing by extracting where, what and when
Law enforcement: find the location of incident
Visual content & location: Relationship
Still difficult
identifying, extracting & indexing geo-informative features
discover geo-location cues from imagery
searching in massive databases e.g. GIS
Need tech advancements
data-driven geo-info features, geometric modeling, ...
viewpoints and techniques from diverse areas
general principles to geo-localize & what features are geo-spatially informative
math models of visual analysis, proper search techniques for matching
enhance vision tasks e.g. object recognition, alignment, ...
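In recent years: ground-level images & videos
ground-level images are often not level, and the whole image carries only one GPS tag, that of the camera
ground-level methods remain to be explored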
GIS、Wikipedia、Wikimapia、hashtags with GPS-tagged images
By-products: 6DoF camera parameter estimation (camera rotation and position), estimating the location of the image content
cross-view image geo-localization, semantic feature learning
cross-view/cross-modality matching
end-to-end geo-spatial feature learning
temporal geo-localization using RNN
Semantic Reasoning
Geometric Matching
Mid-level representation
weakly supervised, mining connections between recurring mid-level visual elements in temporal & spatial image collections
underlying visual style, not visually consistent throughout dataset, but changes due to change in time or location, but consistent variations across label space
First identify patches that are style-sensitive; then build correspondences to find the same element across dataset; finally train style-aware regressors to model element's changes
low-level unsupervised clustering
compute the entropy over time, $E(c) = -\sum_{i=1}^n H(i)\log H(i)$; the higher the entropy, the more style-sensitive
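The entropy of a cluster's temporal histogram can be computed as below. This sketch assumes $H$ is the cluster's histogram over $n$ time bins, normalised to sum to 1, and that empty bins contribute zero; the function name is made up for the example.

```python
import math

def temporal_entropy(bin_counts):
    """E(c) = -sum_i H(i) log H(i), with H the normalised histogram over time bins."""
    total = sum(bin_counts)
    if total == 0:
        return 0.0
    entropy = 0.0
    for count in bin_counts:
        if count > 0:                  # treat 0 * log 0 as 0
            h = count / total
            entropy -= h * math.log(h)
    return entropy
```

A cluster whose members all fall in one time bin has entropy 0, while one spread evenly over $n$ bins reaches the maximum $\log n$.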