Mechanistically analyzing the effects of fine-tuning on procedurally defined tasks

Jain, Samyak; Kirk, Robert; Lubana, Ekdeep Singh; Dick, Robert P.; Tanaka, Hidenori; Grefenstette, Edward; Rocktäschel, Tim; Krueger, David Scott


arXiv:2311.12786 (cs)

[Submitted on 21 Nov 2023 (v1), last revised 21 Aug 2024 (this version, v2)]



Abstract: Fine-tuning large pre-trained models has become the de facto strategy for developing both task-specific and general-purpose machine learning systems, including developing models that are safe to deploy. Despite its clear importance, there has been minimal work that explains how fine-tuning alters the underlying capabilities learned by a model during pretraining: does fine-tuning yield entirely novel capabilities or does it just modulate existing ones? We address this question empirically in synthetic, controlled settings where we can use mechanistic interpretability tools (e.g., network pruning and probing) to understand how the model's underlying capabilities are changing. We perform an extensive analysis of the effects of fine-tuning in these settings, and show that: (i) fine-tuning rarely alters the underlying model capabilities; (ii) a minimal transformation, which we call a 'wrapper', is typically learned on top of the underlying model capabilities, creating the illusion that they have been modified; and (iii) further fine-tuning on a task where such hidden capabilities are relevant leads to sample-efficient 'revival' of the capability, i.e., the model begins reusing this capability after only a few gradient steps. This indicates that practitioners can unintentionally remove a model's safety wrapper merely by fine-tuning it on a downstream task, e.g., a superficially unrelated one. We additionally perform analysis on language models trained on the TinyStories dataset to support our claims in a more realistic setup.

Subjects:	Machine Learning (cs.LG)
Cite as:	arXiv:2311.12786 [cs.LG]
	(or arXiv:2311.12786v2 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2311.12786

Submission history

From: Samyak Jain
[v1] Tue, 21 Nov 2023 18:51:04 UTC (10,736 KB)
[v2] Wed, 21 Aug 2024 16:37:20 UTC (17,373 KB)
