
PhD Defence Wen Zhou | Deep learning methods for multiple building use and urban livability evaluation from multimodal geospatial data


The PhD defence of Wen Zhou will take place in the Waaier Building of the University of Twente and can be followed via a live stream.

Wen Zhou is a PhD student in the Department of Earth Observation Science. Promotors are prof.dr.ir. A. Stein and prof.dr.ir. C. Persello from the Faculty ITC.

In the era of big data, geospatial data have become widely available in abundance. Remote sensing (RS) images, digital surface models (DSM), night light remote sensing (NLRS) images, street view images (SVI), and point of interest (POI) data provide valuable insights into the spatial and social characteristics of cities, making them important for urban research. The advancement of deep learning methods offers a powerful means to extract and analyze information from these diverse data sources. This thesis explores the application of deep learning algorithms in urban research by leveraging multiple geospatial datasets.

Building use information, as a subset of land use mapping, is necessary for urban planning, city digital twins, and informed policy formulation. Unlike mainstream research on land use classification, chapter 2 takes mixed-use scenarios into account. To enhance classification accuracy, the proposed strategy focuses on extracting and leveraging the features contained in the various data modalities more effectively. The chapter proposes a multimodal transformer-based deep learning method for building use classification. Whereas mainstream research typically employs decision fusion, the proposed method adopts a feature fusion strategy to integrate multiple modalities. Specifically, a pretrained DenseNet is used as the backbone for extracting features from images, and Bidirectional Encoder Representations from Transformers (BERT) for extracting features from text. An attention mechanism is employed during the classification phase to assign appropriate weights to the different features. The proposed multimodal transformer-based feature fusion network is tested across four Chinese cities. The results demonstrate that it effectively predicts both broad and mixed building use, significantly improving classification accuracy. This research highlights the potential of feature fusion strategies for integrating RS images and POI data in urban building use classification.
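To make the feature fusion idea concrete, the following minimal PyTorch sketch pairs a DenseNet image encoder with a BERT text encoder and fuses the two feature vectors with an attention layer before classification. It illustrates the general strategy, not the thesis's exact architecture: the projection dimension, attention configuration, and pooling choices are assumptions.

```python
import torch
import torch.nn as nn
from torchvision import models
from transformers import BertModel

class FeatureFusionClassifier(nn.Module):
    """Illustrative feature-level fusion of image (DenseNet) and text (BERT) features."""

    def __init__(self, num_classes: int, dim: int = 256):
        super().__init__()
        densenet = models.densenet121(weights="DEFAULT")
        self.image_backbone = densenet.features           # CNN feature extractor
        self.image_proj = nn.Linear(1024, dim)            # DenseNet-121 feature size
        self.text_backbone = BertModel.from_pretrained("bert-base-uncased")
        self.text_proj = nn.Linear(768, dim)              # BERT hidden size
        # Attention over the two modality tokens, so the classifier can
        # weight image vs. text evidence per sample.
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, image, input_ids, attention_mask):
        img = self.image_backbone(image).mean(dim=(2, 3))          # global average pool
        img = self.image_proj(img)                                 # (B, dim)
        txt = self.text_backbone(input_ids, attention_mask=attention_mask)
        txt = self.text_proj(txt.pooler_output)                    # (B, dim)
        tokens = torch.stack([img, txt], dim=1)                    # (B, 2, dim)
        fused, _ = self.attn(tokens, tokens, tokens)               # modality attention
        return self.classifier(fused.mean(dim=1))                  # (B, num_classes)
```

Decision fusion, by contrast, would train a separate classifier per modality and combine their predicted probabilities afterwards; here the modalities interact at the feature level before any prediction is made.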

Existing building use classification methods often focus on broad categories, leaving a significant gap in the classification of buildings’ detailed uses. To address this gap, and to test the performance of the feature fusion-based method in different regions of the world, chapter 3 expands the input data to include DSM and SVI alongside the RS images and POI data used in the previous study. I employed a multi-label classification strategy to deal with the large number of labels caused by such combinations. An ablation study investigated the synergy between the different modalities and examined the attention given to each modality. A novel multi-label multimodal transformer-based feature fusion network was used to effectively extract and integrate features from the various modalities, enabling the simultaneous prediction of hierarchical building uses covering both detailed categories and their corresponding broad categories. The model effectively learns the relationships between broad and detailed use categories, including hierarchical consistency, supplementation, and exclusivity. The proposed method’s performance was evaluated in three Dutch cities. On the test dataset, it achieved weighted average F1 scores (WAF) of 91% for broad categories, 77% for detailed categories, and 84% for all hierarchical categories, and macro average F1 scores (MAF) of 81%, 48%, and 56%, respectively. This research thus demonstrates that RS data serve as the cornerstone for hierarchical building use classification, while DSM and POI data provide valuable supplementary information. SVI data, however, may introduce noise.
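The abstract names hierarchical consistency among the relationships the model learns. One common way to encode such a constraint in a multi-label setting is a soft penalty added to the binary cross-entropy loss; the sketch below is an assumption about how this could look, not the thesis's exact loss, and the label layout, parent mapping, and penalty weight are illustrative.

```python
import torch
import torch.nn.functional as F

def hierarchical_loss(logits, targets, parent, num_broad, weight=0.1):
    """Multi-label BCE plus a soft hierarchy penalty (illustrative weight).

    logits, targets: (batch, num_broad + num_detailed); the first num_broad
        columns are broad categories, the rest are detailed categories.
    parent: LongTensor of length num_detailed mapping each detailed
        category to the index of its broad parent category.
    """
    bce = F.binary_cross_entropy_with_logits(logits, targets)
    probs = torch.sigmoid(logits)
    detailed = probs[:, num_broad:]                   # detailed-use probabilities
    broad_of_detailed = probs[:, :num_broad][:, parent]
    # A detailed use should not be more probable than its broad parent.
    consistency = torch.relu(detailed - broad_of_detailed).mean()
    return bce + weight * consistency
```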

Understanding how building characteristics influence urban livability is important for architects and urban planners in designing urban spaces that promote functionality, sustainability, and community well-being. This includes creating spaces that optimize natural light, energy efficiency, and accessibility, while also considering the social and environmental impact on the surrounding urban fabric. To address this question, chapter 4 employs random forest regression to model urban livability from buildings’ spatial attributes (e.g., area, perimeter) and functional attributes (e.g., use). The experimental results indicate that urban livability can indeed be predicted from a building’s spatial and functional characteristics. Specifically, higher-density categories such as stacked residential, industrial, and business areas contribute positively to livability, whereas single-family residences, detached residential areas, row residential zones, and the presence of certain public services impact livability negatively. These findings highlight the significant role building characteristics play in shaping urban livability, offering valuable insights for urban planning and policy-making in alignment with Sustainable Development Goal 11.
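As an illustration of this modelling setup, the scikit-learn sketch below fits a random forest regressor on a hypothetical table of aggregated building attributes; the file name and column names are invented for the example and are not the thesis's data.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Hypothetical feature table: one row per area, with aggregated building
# attributes; 'livability' is the target score. File and columns are assumed.
df = pd.read_csv("building_attributes.csv")
X = df[["area", "perimeter", "frac_stacked_residential",
        "frac_industrial", "frac_single_family"]]      # illustrative columns
y = df["livability"]

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
model = RandomForestRegressor(n_estimators=500, random_state=0)
model.fit(X_tr, y_tr)
print("R^2 on held-out data:", model.score(X_te, y_te))
# Feature importances indicate which building characteristics drive the
# predicted livability, supporting the kind of analysis described above.
print(dict(zip(X.columns, model.feature_importances_)))
```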

Traditional methods for evaluating urban livability rely on surveys and statistical data, which are often time-consuming, costly, and updated irregularly. While chapter 4 demonstrated that building characteristics can partially assess urban livability, the information these attributes provide is limited. Additionally, errors in building use classification can reduce the accuracy of the livability regression. To enhance the accuracy of urban livability evaluations, chapter 5 explores the use of multiple data sources and deep learning methods. A Transformer-based multi-task multimodal regression (TMTMR) model was proposed to estimate livability scores for five associated domains and their overall score using RS images, NLRS images, DSM, and POI data. The experiments covered 13 Dutch research areas, and the results indicate that geospatial data can effectively predict urban livability conditions with this method, outperforming models based solely on building characteristics. Among the four modalities, the contributions to livability assessment rank as follows: RS images, NLRS images, DSM, and POI data.
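The abstract does not detail the TMTMR architecture, so the PyTorch sketch below only illustrates the general shape of a transformer-based multi-task multimodal regressor; the class name, per-modality encoders, token dimension, and head layout are all assumptions.

```python
import torch
import torch.nn as nn

class MultiTaskMultimodalRegressor(nn.Module):
    """Illustrative multi-task fusion: one encoder per modality, a transformer
    over the modality tokens, and one regression head per livability domain
    plus an overall-score head."""

    def __init__(self, encoders: nn.ModuleDict, dim: int = 256, num_domains: int = 5):
        super().__init__()
        self.encoders = encoders  # e.g. keys "rs", "nlrs", "dsm", "poi"
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=2)
        self.domain_heads = nn.ModuleList([nn.Linear(dim, 1) for _ in range(num_domains)])
        self.overall_head = nn.Linear(dim, 1)

    def forward(self, inputs: dict):
        # Each encoder is assumed to map its modality to a (batch, dim) vector;
        # the vectors become tokens that attend to one another during fusion.
        tokens = torch.stack([self.encoders[k](v) for k, v in inputs.items()], dim=1)
        fused = self.fusion(tokens).mean(dim=1)
        domain_scores = torch.cat([h(fused) for h in self.domain_heads], dim=1)
        overall_score = self.overall_head(fused)
        return domain_scores, overall_score
```

Training would typically sum a regression loss (e.g., MSE) over the five domain scores and the overall score, so the shared fusion trunk benefits from all tasks at once.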

In summary, this thesis investigates the effectiveness of deep learning methods that use multiple geospatial datasets to analyze complex urban spatial structure, with a focus on building use classification and urban livability evaluation. By employing multimodal deep learning methods, this research demonstrates how information from diverse data modalities can be effectively extracted and integrated.