| Student: | K. Li |
|---|---|
| Timeline: | September 2021 - 1 September 2025 |
In the past ten years, several downstream tasks have been proposed to connect Computer Vision (CV) and Natural Language Processing (NLP) and to push the boundaries of both fields. Visual question answering (VQA) is one of them, as it requires computers to understand both visual and textual information. Given an image and a question in natural language, the task is to reason over the visual elements of the image and the textual concepts of the question to predict the correct answer. The questions cover various aspects that humans are interested in, such as presence, counting, location, and distribution. With such an answering system, computers could read and interpret the world automatically and autonomously.
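To make this input-output setup concrete, the sketch below shows a deliberately minimal VQA model (the architecture, layer sizes, and vocabulary sizes are illustrative assumptions, not taken from any published system, including ours): it encodes the image and the question separately, fuses the two representations, and scores a fixed set of candidate answers.

```python
# Minimal, illustrative VQA sketch: CNN image encoder + LSTM question encoder,
# element-wise fusion, and a classifier over a fixed answer vocabulary.
import torch
import torch.nn as nn

class ToyVQAModel(nn.Module):
    def __init__(self, vocab_size=10000, num_answers=1000, hidden_dim=512):
        super().__init__()
        # Image branch: a few conv layers followed by global average pooling.
        self.visual_encoder = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, hidden_dim, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # Question branch: word embeddings followed by an LSTM.
        self.word_embedding = nn.Embedding(vocab_size, 300)
        self.question_encoder = nn.LSTM(300, hidden_dim, batch_first=True)
        # Fusion by element-wise product, then an answer classifier.
        self.classifier = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, num_answers),
        )

    def forward(self, image, question_tokens):
        v = self.visual_encoder(image)                # (B, hidden_dim)
        q_emb = self.word_embedding(question_tokens)  # (B, T, 300)
        _, (q, _) = self.question_encoder(q_emb)      # (1, B, hidden_dim)
        fused = v * q.squeeze(0)                      # joint image-question feature
        return self.classifier(fused)                 # scores over candidate answers

# Example: one 224x224 image and one tokenized question of length 8.
model = ToyVQAModel()
logits = model(torch.randn(1, 3, 224, 224), torch.randint(0, 10000, (1, 8)))
print(logits.shape)  # torch.Size([1, 1000]); the argmax is the predicted answer
```

Real systems replace each block with far stronger components (pretrained visual backbones, transformer-based language encoders, attention-based fusion), but the overall image-plus-question-to-answer structure stays the same.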
Visual question answering is not only a multi-modal task that requires information from two branches, but also a complex, high-level intelligent task requiring multi-aspect understanding, which distinguishes it from classical computer vision sub-tasks such as object detection. It subsumes the goals of many individual sub-tasks, including detection, classification, segmentation, and retrieval. However, visual question answering is significantly more complex than those tasks, as it frequently requires information that is not present in the image. This extra information can range from common sense to specific knowledge about an important element of the text or image. Moreover, in some cases questions may even refer to objects that are not visible in the image, which makes the task even harder to handle.
Visual question answering is also an important task in the field of earth observation, where it can serve as the foundation of higher-level applications. For instance, hazard detection is a potential direction for visual question answering: once satellite or aerial images are collected, the system could quickly identify flooded regions and help save lives. Visual question answering could also serve as a tool for urban surveys: given images of a city, it becomes convenient to determine the number and locations of infrastructure elements and to inform possible planning decisions.
My research focuses on visual question answering for very high-resolution aerial images. This task lies at the intersection of computer vision and remote sensing, and aims to use the high-level interpretation and learning abilities of computers to answer the questions that humans are interested in. It can therefore serve as an intelligent tool that provides humans with insightful information, knowledge, and guidance for better decision making. This Ph.D. project is under the supervision of Prof. George Vosselman and Prof. Michael Ying Yang. We have proposed a method called HRVQA [1] for generic visual question answering on high-resolution aerial images, which takes positional restrictions into account for detailed attribute-related questions. Furthermore, we proposed an interactive segmentation method [2] to acquire the semantics of target objects for further reasoning with explanations in VQA [3]. We are currently working on visual dialog answering, which considers context information together with point-indication guidance. We believe our work can bridge the gap between users and machines towards more intelligent answering systems.
Figure 1: Example question on a high-resolution aerial image: "How many options can you select if you want to travel through this city?"
References
[1] Li, K., Vosselman, G. and Yang, M.Y., 2023. HRVQA: A Visual Question Answering Benchmark for High-Resolution Aerial Images. arXiv preprint arXiv:2301.09460.
[2] Li, K., Vosselman, G. and Yang, M.Y., 2023. Interactive Image Segmentation with Cross-Modality Vision Transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 762-772).
[3] Li, K., Vosselman, G. and Yang, M.Y., 2024. Convincing Rationales for Visual Question Answering Reasoning. arXiv preprint arXiv:2402.03896.