Assessing Urban Environments with Vision-Language Models: A Comparative Analysis of AI-Generated Ratings and Human Volunteer Evaluations

Felipe A. Moreno, Jorge Poco
IEEE International Joint Conference on Neural Networks · 2025 · July 5, 2025
Our proposed UrbanVLM multimodal model comprises four main components: (i) image description generation using a vision-language model such as LLaVA or BLIP-2, with all layers frozen except the visual projections; (ii) dual-modality encoding using a vision-text encoder such as CLIP or SigLIP; (iii) contrastive learning to align each image with its text description; and (iv) classification and regression heads to infer the urban perception labels and scores.
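Components (iii) and (iv) can be illustrated with a minimal, dependency-free sketch. It assumes a CLIP-style symmetric softmax contrastive loss over a batch of paired image and description embeddings, followed by plain linear heads; the source does not specify the loss formulation (a SigLIP-style sigmoid loss would differ), and the function names, the temperature value, and the head shapes are hypothetical choices made for illustration only.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cross_entropy_row(logits, target):
    """Numerically stable cross-entropy of one row of logits against a target index."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[target]

def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss: pair k of images and texts is the positive,
    every other pairing in the batch is a negative. Temperature is an assumed value."""
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)
    # Similarity matrix: rows index images, columns index text descriptions.
    sim = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts] for i in imgs]
    image_to_text = sum(cross_entropy_row(sim[k], k) for k in range(n)) / n
    text_to_image = sum(
        cross_entropy_row([sim[r][k] for r in range(n)], k) for k in range(n)
    ) / n
    return 0.5 * (image_to_text + text_to_image)

def linear_head(features, weights, bias):
    """Hypothetical linear head: one output per weight row. With several rows it
    yields classification logits; with a single row it yields a regression score."""
    return [sum(w * x for w, x in zip(row, features)) + b for row, b in zip(weights, bias)]
```

In this sketch, correctly paired embeddings produce a loss near zero while mismatched pairs are penalized heavily, which is the behavior the alignment step relies on. The same `linear_head` helper can serve as either the classification head or the regression head, depending on how many output rows it is given.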

Publication Details

Venue
IEEE International Joint Conference on Neural Networks
Year
2025
Publication Date
July 5, 2025
DOI
https://ieeexplore.ieee.org/

Abstract

This research investigates the application of vision-language models to automatically assess and rate street view images from the Place Pulse 2.0 dataset, with a focus on comparing AI-generated ratings with human evaluations. The study introduces a context-sensitive rating system that scores each image on a 0-10 scale in six key urban perception categories: safety, liveliness, wealth, beauty, boredom, and depression. By comparing these AI-generated ratings with those of human volunteers, the research explores how effectively vision-language models can replicate human judgment in assessing urban environments. The findings provide insights into the potential of vision-language models to scale urban perception analysis, offering an objective methodology that complements and enhances human evaluation. This approach contributes to urban planning by enabling more efficient, data-driven decision-making, and it enriches the Place Pulse 2.0 dataset by integrating machine-generated ratings, paving the way for future advancements in urban perception studies.
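One way to quantify agreement between AI-generated ratings and human scores is a rank correlation, sketched below. The abstract does not state which agreement metric the study uses, so Spearman correlation is an assumption here, chosen because it depends only on the ordering of images and therefore tolerates a scale mismatch between the 0-10 AI ratings and the human-derived scores. All function names are hypothetical.

```python
import math

def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the window over all entries tied with the current value.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks

def pearson(x, y):
    """Pearson correlation of two equal-length sequences with non-zero variance."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)

def spearman(ai_ratings, human_scores):
    """Spearman rank correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(ai_ratings), average_ranks(human_scores))
```

Calling `spearman(ai_ratings, human_scores)` once per perception category, with one entry per image in each list, would give six agreement values in the range -1 to 1, where values near 1 indicate that the model orders images much as the volunteers do.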

Cite this publication (BIBTEX)

@inproceedings{2025-UrbanVLM,
  title={Assessing Urban Environments with Vision-Language Models: A Comparative Analysis of AI-Generated Ratings and Human Volunteer Evaluations},
  author={Felipe A. Moreno and Jorge Poco},
  booktitle={IEEE International Joint Conference on Neural Networks},
  year={2025},
  url={https://ieeexplore.ieee.org/},
  date={2025-07-05}
}