Anthropic’s Groundbreaking Research on Interpretable Features

From the blog of Jamie Lord
In a new paper, researchers at Anthropic have made significant strides in understanding the inner workings of large language models like Claude 3 Sonnet. By applying a technique called sparse dictionary learning, they extracted millions of interpretable “features” that shed light on how these AI systems represent...
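To give a flavour of what sparse dictionary learning means in practice, here is a minimal toy sketch: a small autoencoder with an overcomplete dictionary and an L1 sparsity penalty, trained on synthetic activations. This is an illustrative simplification under my own assumptions (toy dimensions, hand-rolled gradients, synthetic data), not Anthropic's implementation or scale.

```python
import numpy as np

# Toy sparse-autoencoder sketch of dictionary learning (illustrative only).
# We learn an overcomplete dictionary whose activations are pushed toward
# sparsity by an L1 penalty, so each dictionary row can act like a "feature".

rng = np.random.default_rng(0)

d_model = 16     # width of a toy "activation" vector
d_dict = 64      # overcomplete dictionary of candidate features
l1_coeff = 1e-3  # sparsity pressure on activations
lr = 0.02

# Synthetic data: sparse combinations of a few ground-truth directions.
true_feats = rng.normal(size=(8, d_model))
codes = (rng.random((1024, 8)) < 0.1) * rng.random((1024, 8))
X = codes @ true_feats

W_enc = rng.normal(scale=0.1, size=(d_model, d_dict))
W_dec = rng.normal(scale=0.1, size=(d_dict, d_model))
b_enc = np.zeros(d_dict)

def forward(X):
    acts = np.maximum(X @ W_enc + b_enc, 0.0)  # ReLU encoder
    recon = acts @ W_dec                       # linear decoder
    return acts, recon

def loss_fn(X, acts, recon):
    err = recon - X
    # Reconstruction error plus L1 sparsity penalty, averaged per sample.
    return (err ** 2).sum(axis=1).mean() + l1_coeff * np.abs(acts).sum(axis=1).mean()

acts, recon = forward(X)
init_loss = loss_fn(X, acts, recon)

n = X.shape[0]
for step in range(300):
    acts, recon = forward(X)
    err = recon - X
    # Manual gradients for the ReLU-encoder / linear-decoder autoencoder.
    d_recon = 2 * err / n
    grad_W_dec = acts.T @ d_recon
    d_acts = d_recon @ W_dec.T + l1_coeff * np.sign(acts) / n
    d_pre = d_acts * (acts > 0)
    grad_W_enc = X.T @ d_pre
    grad_b_enc = d_pre.sum(axis=0)
    W_enc -= lr * grad_W_enc
    W_dec -= lr * grad_W_dec
    b_enc -= lr * grad_b_enc

acts, recon = forward(X)
final_loss = loss_fn(X, acts, recon)
sparsity = (acts > 0).mean()  # fraction of dictionary features active
print(f"loss {init_loss:.4f} -> {final_loss:.4f}, active fraction {sparsity:.2f}")
```

The key design choice is the L1 term: it drives most dictionary activations to zero, so any feature that does fire tends to correspond to a recognisable, reusable direction in activation space.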