COMMITTEE CHAIR: Dr. Xishuang Dong

TITLE: LARGE LANGUAGE MODELS FOR INFORMATION EXTRACTION

ABSTRACT: Over the past few years, Large Language Models (LLMs) have emerged as powerful tools for natural language processing (NLP) tasks. These LLMs are based on transformer architectures and are pre-trained through self-supervised learning on massive text corpora to capture the intricate features and patterns of human language. Following pre-training, LLMs can be fine-tuned on task-specific labeled datasets to perform particular downstream tasks. Recently, LLMs have been shown to contribute significantly to information extraction (IE) by leveraging their ability to understand, generate, and contextualize natural language. This dissertation proposal focuses on two IE tasks: named entity recognition (NER) and text-to-SQL (Text2SQL), which extract useful information from unstructured and structured data, respectively, for downstream applications. For the NER task, this proposal explores IE from electronic health records (EHRs). Identifying key factors such as medications, diseases, and their relationships within EHRs and clinical notes has a wide range of clinical applications. The N2C2 2022 competitions presented various tasks to promote the extraction of these key factors using the Contextualized Medication Event Dataset (CMED), on which pre-trained LLMs demonstrated exceptional performance. This proposal investigates the use of LLMs, specifically ChatGPT, for data augmentation to address the limited availability of annotated EHR data for LLM-based NER. In addition, various pre-trained BERT models, originally trained on large corpora such as Wikipedia and MIMIC, are fine-tuned on the augmented datasets to develop models capable of accurately identifying key variables in EHRs. For the Text2SQL task, this proposal investigates methods that enable non-expert users to extract information from relational databases using natural language queries.
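The data augmentation idea above can be sketched roughly as follows. This is a hypothetical illustration, not the proposal's actual pipeline: the paraphrase step is stubbed where a real system would call ChatGPT, and the function names and filtering rule are assumptions.

```python
# Hypothetical sketch of LLM-based data augmentation for clinical NER.
# The paraphrase step is stubbed; in practice it would be a ChatGPT call that
# rewrites the sentence while keeping annotated entity mentions intact.

def paraphrase_keeping_entities(sentence: str, entities: list[str]) -> str:
    """Stub for an LLM paraphrase call (assumed prompt: rewrite the sentence
    but keep the listed entity terms unchanged)."""
    return sentence.replace("was started on", "began taking")

def augment(examples: list[tuple[str, list[str]]]) -> list[tuple[str, list[str]]]:
    """Return the original examples plus paraphrased copies whose entity
    annotations still apply."""
    augmented = list(examples)
    for sentence, entities in examples:
        candidate = paraphrase_keeping_entities(sentence, entities)
        # Keep the paraphrase only if every annotated entity survives verbatim,
        # so the original labels can be transferred to the new sentence.
        if all(e in candidate for e in entities):
            augmented.append((candidate, entities))
    return augmented

data = [("Patient was started on metformin for diabetes.", ["metformin", "diabetes"])]
augmented = augment(data)
```

The entity-preservation check is the key design choice: it lets existing span labels carry over to the generated sentences without re-annotation.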
LLMs such as GPT and T5 have shown impressive performance on benchmarks like BIRD, though many systems rely on auxiliary tools or external knowledge to achieve high performance. First, this proposal presents a novel approach that leverages a SQL quality evaluation mechanism, providing feedback on syntactic correctness and semantic accuracy to iteratively refine generated SQL queries. It achieves competitive Execution Accuracy (EX) and Valid Efficiency Score (VES) compared to state-of-the-art models, demonstrating the potential of LLMs for cost-effective, high-quality Text2SQL generation. Second, it proposes a novel LLM-based Text2SQL method that requires no external knowledge, integrating non-parametric attention, which dynamically weights database schema elements, with a prompt refinement loop guided by SQL confidence scores. Experimental results indicate that this approach improves EX by 6.5% over a GPT-4o baseline, with ablation studies confirming the importance of both components. Overall, the research in this dissertation proposal highlights the potential of LLMs to enhance both unstructured and structured information extraction efficiently and effectively.

Room Location: EE Conference Room 315D