Building a Retrieval-Augmented Generation (RAG) System for Academic Papers

88 Citations, 2023
Anna Grigoryan, Habet Madoyan
journal unavailable

The RAG system uses a two-step vector search, with the cosine similarity metric on an HNSW index, over the papers' abstracts and the papers themselves, so that only relevant information is passed to the LLM; this enables enhanced data retrieval and contextually aware text generation.

Abstract

This report presents the final results of our capstone project, which focuses on developing a Retrieval-Augmented Generation (RAG) system for navigating the vast body of academic papers. The RAG system enhances search by combining retrieval strategies with large language models (LLMs) for text generation, addressing the limitations of traditional search engines such as Google, which may struggle to interpret complex scholarly queries and to provide contextually relevant academic insights. Our proposed system addresses these challenges by leveraging advanced techniques in document retrieval and natural language processing to offer precise, contextually relevant excerpts in response to user queries. It uses a two-step vector search, with the cosine similarity metric on an HNSW index, over the papers' abstracts and the papers themselves, so that only relevant information is passed to the LLM; this enables enhanced data retrieval and contextually aware text generation. This report describes our work on the system components, including document retrieval, search methods, text generation, and initial performance evaluation. We experimented with several search strategies for knowledge retrieval, identified the best-performing one, compared several LLMs, and assembled the final RAG system. We also discuss the limitations encountered, insights gained, and potential avenues for further improvement.
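The two-step search described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the bag-of-words `embed` function stands in for a real embedding model, the exhaustive `top_k` scan stands in for the HNSW index (which would return approximate nearest neighbours rather than exact ones), and the `papers` layout and the parameters `k_papers` and `k_chunks` are hypothetical names chosen for illustration. It also assumes one plausible reading of "two-step": abstracts are searched first, then the text of the shortlisted papers.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words embedding (assumption: a real system would use a neural model)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def top_k(query_vec, items, k):
    """Exhaustive nearest-neighbour scan over (key, vector) pairs.

    Stand-in for an HNSW index, which would approximate this ranking.
    """
    ranked = sorted(items, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return ranked[:k]


def two_step_search(query, papers, k_papers=1, k_chunks=2):
    """Step 1: shortlist papers by abstract. Step 2: rank chunks of those papers."""
    q = embed(query)

    # Step 1: compare the query against every abstract.
    abstract_items = [(pid, embed(p["abstract"])) for pid, p in papers.items()]
    shortlisted = [pid for pid, _ in top_k(q, abstract_items, k_papers)]

    # Step 2: compare the query against chunks of the shortlisted papers only.
    chunk_items = [
        ((pid, chunk), embed(chunk))
        for pid in shortlisted
        for chunk in papers[pid]["chunks"]
    ]
    return [key for key, _ in top_k(q, chunk_items, k_chunks)]


# Hypothetical toy corpus, for illustration only.
papers = {
    "p1": {
        "abstract": "graph index for approximate nearest neighbour vector search",
        "chunks": [
            "the hnsw graph index links each vector to near neighbours",
            "we thank our funding agency",
            "search cost grows slowly with index size",
        ],
    },
    "p2": {
        "abstract": "protein folding with deep learning",
        "chunks": [
            "protein structures are predicted from sequences",
            "the model uses attention layers",
        ],
    },
}

results = two_step_search("vector search index", papers)
for paper_id, chunk in results:
    print(paper_id, "|", chunk)
```

Only the chunks returned by the second step would be placed in the LLM prompt, which is how the system passes only relevant information to the model.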