Content moderation is critical for maintaining healthy online spaces, yet it remains a predominantly manual task, and moderators are often overwhelmed by the low moderator-to-post ratio. Researchers have therefore been exploring computational tools to assist human moderators. The natural language understanding capabilities of large language models (LLMs) open up the possibility of using LLMs for online moderation. This work explores the feasibility of using LLMs to identify rule violations on Reddit. We examine how an LLM-based moderator (LLM-Mod) reasons about 744 posts across 9 subreddits that violate different types of rules. We find that while LLM-Mod has a good true-negative rate (92.3%), it has a poor true-positive rate (43.1%), performing badly when flagging rule-violating posts. LLM-Mod tends to flag violations that can be detected through keyword matching, but fails to reason about posts of higher complexity. We discuss considerations for integrating LLMs into content moderation workflows and for designing platforms that support both AI-driven and human-in-the-loop moderation.
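To make the evaluation setup concrete, the sketch below illustrates the kind of pipeline this study implies: prompting an LLM to judge whether a post violates a subreddit's rules, then scoring true-positive and true-negative rates against moderator labels. This is a minimal illustration only; the prompt wording, model choice, client library (OpenAI's Python SDK), and data format are assumptions for exposition, not the paper's exact setup.

```python
# Hypothetical sketch of an LLM-Mod-style check: ask an LLM whether a post
# violates subreddit rules, then compute TPR/TNR against moderator labels.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def violates_rules(post_text: str, rules: list[str]) -> bool:
    """Ask the model whether the post violates any of the listed rules."""
    rule_list = "\n".join(f"- {r}" for r in rules)
    prompt = (
        "You are a subreddit moderator. Given the rules below, answer only "
        "'YES' if the post violates any rule, otherwise 'NO'.\n\n"
        f"Rules:\n{rule_list}\n\nPost:\n{post_text}"
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # illustrative model choice
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content.strip().upper().startswith("YES")


def tpr_tnr(posts: list[dict], rules: list[str]) -> tuple[float, float]:
    """posts: [{'text': ..., 'violating': True/False}, ...] with ground-truth labels."""
    tp = fn = tn = fp = 0
    for post in posts:
        flagged = violates_rules(post["text"], rules)
        if post["violating"]:
            tp += flagged
            fn += not flagged
        else:
            tn += not flagged
            fp += flagged
    return tp / (tp + fn), tn / (tn + fp)
```

In this framing, the paper's reported 43.1% true-positive rate corresponds to the fraction of rule-violating posts that such a check actually flags, and the 92.3% true-negative rate to the fraction of compliant posts it correctly leaves alone.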