
L-LLM: Large Language LEGO Models

88 Citations · 2023
Stanford CS224N Custom Project, Alex Wang, Calvin Laughlin

Abstract

Our project introduces a multifaceted approach to generating novel LEGO instruction manuals in a text-based format. We leverage the vision capabilities of GPT-4o and fine-tune models such as GPT-3.5-turbo, Llama-2-7B-chat-hf, and Mistral-7B using a corpus of 90 existing text-based LEGO manuals. We detail our methodology, which includes fine-tuning these models on both existing manuals and manuals synthetically generated through GPT-4o vision prompt engineering. Our contributions include a novel vision-to-text agent and the generation of new, small-scale LEGO instructions. Using our custom dataset of instructions, most human-created for Bricks for the Blind and some translated from PDFs, we fine-tune our models to generate instructions for simple LEGO builds such as cars, castles, houses, boats, and spaceships. Additionally, we parse visual instruction sets native to the LEGO website and translate them into the text-based format, enhancing the Bricks for the Blind dataset with synthetic data. For evaluation, we use two grading rubrics to score each generated build and instruction manual out of 100: GPT-4o evaluates the quality of the instructions, while human scoring assesses the actual builds. We aim to highlight the creative potential of LLMs as well as their limitations in planning, creativity, and instruction. Results show that GPT-4o and fine-tuned Llama-2-7B show the most promise in novel instruction generation, but much work remains to be done in planning and data gathering.
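The abstract does not specify what the text-based manual format looks like. As a rough illustration only, here is a hypothetical sketch of rendering build steps as text-only, screen-reader-friendly instructions in the spirit of Bricks for the Blind; the `Step` fields and the phrasing template are assumptions, not the authors' actual format.

```python
from dataclasses import dataclass

@dataclass
class Step:
    piece: str       # e.g. "2x4 red brick" (hypothetical description format)
    placement: str   # e.g. "on top of the previous brick, aligned left"

def render_manual(title: str, steps: list[Step]) -> str:
    """Render build steps as numbered, text-only instructions."""
    lines = [f"Build: {title}", ""]
    for i, s in enumerate(steps, 1):
        lines.append(f"Step {i}: Take one {s.piece}. Place it {s.placement}.")
    return "\n".join(lines)

# Example: a two-step toy build
manual = render_manual("Small Car", [
    Step("2x4 red brick", "flat on the table"),
    Step("2x2 blue brick", "on top of the red brick, aligned left"),
])
print(manual)
```

A fine-tuning corpus in this style would pair a build description (or an image translated by the vision-to-text agent) with the rendered step list as the target text.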