COMMITTEE CHAIR: Dr. Xishuang Dong
TITLE: LLMS-BASED TEXT-TO-SQL FOR GEOSPATIAL INFORMATION RETRIEVAL
ABSTRACT: Text-to-SQL aims to translate natural language questions into SQL queries that can be executed on databases, enabling non-expert users to retrieve information without learning formal query languages. Early Text-to-SQL systems relied on rule-based methods and semantic parsers, while more recent deep learning approaches have achieved strong performance by jointly encoding user questions and database schemas. However, these approaches typically require large annotated datasets and specific model architectures. With the emergence of large language models (LLMs) such as GPT-4, Llama, and Gemma, Text-to-SQL systems can leverage powerful natural language understanding capabilities to generate SQL queries using zero-shot or few-shot prompting. Despite these advancements, existing research has largely focused on conventional relational databases, with limited attention to geospatial databases that involve specialized spatial data types and functions. This thesis addresses this gap by investigating LLM-based Text-to-SQL for geospatial information retrieval. We construct a new benchmark dataset with a PostGIS spatial database, containing natural language questions paired with SQL queries that incorporate spatial operations such as distance calculations, spatial joins, and geometric predicates. To expand the dataset and improve its diversity, additional question-query pairs are generated through LLM-based data augmentation. Building on this benchmark, we develop a Text-to-SQL pipeline that integrates multiple state-of-the-art LLMs to translate natural language queries into executable spatial SQL statements. The system incorporates database schema information within prompts to improve query generation. Experimental results demonstrate that the proposed pipeline can effectively retrieve geospatial information using natural language queries, achieving competitive performance in terms of Execution Accuracy and Valid Efficiency Score.
Keywords: Text-to-SQL, Large Language Models, Information Retrieval, Geospatial Database
Room Location: Electrical Engineering Conference Room 315D