Assessing Urban Environments with Vision-Language Models: A Comparative Analysis of AI-Generated Ratings and Human Volunteer Evaluations

Our proposed UrbanVLM multimodal model comprises four main components: (i) image description generation using a vision-language model such as LLaVA or BLIP-2, with all layers frozen except the visual projection layers, (ii) dual-modality encoding using a vision-text encoder such as CLIP or SigLIP, (iii) contrastive learning to align images with their text descriptions, and (iv) classification and regression heads to infer the urban perception labels and scores.
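As a rough illustration of component (iii), the sketch below computes a symmetric, CLIP-style contrastive loss over a small batch of paired image and text embeddings. This is an assumption about the general form of such an objective, not the exact loss used in UrbanVLM; the function names, the temperature value of 0.07, and the plain-list embeddings are all hypothetical and chosen only to keep the example self-contained.

```python
import math


def l2_normalize(vector):
    """Scale a non-zero vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def diagonal_cross_entropy(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_sum_exp = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def symmetric_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Average of the image-to-text and text-to-image losses for paired embeddings.

    image_embs[i] and text_embs[i] are assumed to describe the same street view image.
    The temperature is a hypothetical default, not a value taken from the paper.
    """
    images = [l2_normalize(v) for v in image_embs]
    texts = [l2_normalize(v) for v in text_embs]
    # Similarity matrix: rows are images, columns are text descriptions.
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
        for img in images
    ]
    logits_transposed = [list(column) for column in zip(*logits)]
    return 0.5 * (
        diagonal_cross_entropy(logits) + diagonal_cross_entropy(logits_transposed)
    )
```

The loss is small when each image embedding is closest to its own description and large when the pairing is scrambled, which is the behavior a contrastive alignment stage relies on.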
Publication Details
- Venue: IEEE International Joint Conference on Neural Networks
- Year: 2025
- Publication Date: July 5, 2025
- DOI: https://ieeexplore.ieee.org/
Abstract
This research investigates the application of vision-language models to automatically assess and rate street view images based on the Place Pulse 2.0 dataset, with a focus on comparing AI-generated ratings with human evaluations. The study introduces a context-sensitive rating system that assigns a 0-10 scale to six key urban perception categories: safety, liveliness, wealth, beauty, boredom, and depression. By comparing these AI-generated ratings with those of human volunteers, the research explores how effectively vision-language models can replicate human judgment in assessing urban environments. The findings provide valuable insights into the potential of vision-language models to scale urban perception analysis, offering an objective methodology that complements and enhances human evaluation. This approach not only contributes to urban planning by enabling more efficient, data-driven decision-making but also enriches the Place Pulse 2.0 dataset by integrating machine-generated ratings, paving the way for future advancements in urban perception studies.
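One simple way to compare model ratings with human scores for a single perception category is to measure rank agreement and average deviation. The sketch below uses Spearman rank correlation and mean absolute error; these metrics are an assumption made for illustration, since the abstract does not state which comparison measures the study uses. The function names are hypothetical, and both inputs are assumed to be on the same 0-10 scale.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences with non-zero variance."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)


def spearman(model_ratings, human_scores):
    """Spearman rank correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(model_ratings), average_ranks(human_scores))


def mean_absolute_error(model_ratings, human_scores):
    """Average absolute gap between model and human scores on the same scale."""
    pairs = list(zip(model_ratings, human_scores))
    return sum(abs(a - b) for a, b in pairs) / len(pairs)
```

Rank correlation shows whether the model orders images the way people do, while the absolute error shows how far the two scales drift apart; reporting both per category (safety, liveliness, and so on) avoids conflating the two.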
Cite this publication (BIBTEX)
@article{2025-UrbanVLM,
  title={Assessing Urban Environments with Vision-Language Models: A Comparative Analysis of AI-Generated Ratings and Human Volunteer Evaluations},
  author={Felipe A. Moreno and Jorge Poco},
  journal={IEEE International Joint Conference on Neural Networks},
  year={2025},
  url={https://ieeexplore.ieee.org/},
  date={2025-07-05}
}