Exploring the Potential of LLMs for Code Deobfuscation

Beste, David; Menguy, Grégoire; Hajipour, Hossein; Fritz, Mario; Cinà, Antonio Emanuele; Bardin, Sébastien; Holz, Thorsten; Eisenhofer, Thorsten; Schönherr, Lea

doi:10.1007/978-3-031-97620-9_15

Code obfuscation alters software code to conceal its logic while retaining functionality, aiding intellectual property protection but hindering security audits and malware analysis. To address this, automated deobfuscation techniques have been developed, though existing approaches remain constrained by limited scope and specificity. Motivated by these challenges, this paper explores a novel approach for code deobfuscation based on Large Language Models (LLMs). First, we investigate the general capabilities of LLMs in reducing code complexity by choosing five different source-to-source obfuscation methods. Despite challenges regarding semantical correctness, our findings indicate that LLMs can be very effective in this task. Building on this, we fine-tune two versatile models capable of simplifying code obfuscated through up to seven different chained obfuscation transformations while consistently outperforming deobfuscation based on compiler optimizations and general-purpose LLMs. Our best model demonstrates an average Halstead metric program length reduction of 89.21% for our most challenging scenario. Finally, we conduct a memorization test to assess if performance stems from memorized code rather than true deobfuscation capabilities, which our models pass.
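
For readers unfamiliar with the metric quoted in the abstract, the following is a minimal sketch of how a Halstead program length reduction could be computed. It uses the standard definition of program length, N = N1 + N2 (total operator occurrences plus total operand occurrences), and reports the relative decrease between an obfuscated and a simplified program. The tokenizer, the keyword set, and the function names `halstead_length` and `length_reduction` are assumptions made for illustration; the record does not state which tooling the authors used.

```python
import re

# Simplified token pattern for a C-like language. This is an assumption for
# the sketch, not the tokenizer used in the paper.
TOKEN_RE = re.compile(
    r"""
    (?P<operand>[A-Za-z_]\w*|\d+(?:\.\d+)?)   # identifiers, keywords, numbers
  | (?P<operator>==|!=|<=|>=|&&|\|\||<<|>>|[-+*/%=<>!&|^~?:;,.(){}\[\]])
    """,
    re.VERBOSE,
)

# Keywords are counted as operators here (a common convention, assumed).
KEYWORDS = {"if", "else", "while", "for", "return", "int"}


def halstead_length(source: str) -> int:
    """Return N = N1 + N2: total operator plus total operand occurrences."""
    n1 = n2 = 0
    for match in TOKEN_RE.finditer(source):
        if match.lastgroup == "operator" or match.group() in KEYWORDS:
            n1 += 1
        else:
            n2 += 1
    return n1 + n2


def length_reduction(obfuscated: str, simplified: str) -> float:
    """Percentage by which program length shrinks after simplification."""
    before = halstead_length(obfuscated)
    after = halstead_length(simplified)
    return 100.0 * (before - after) / before


if __name__ == "__main__":
    # Hypothetical toy inputs, not taken from the paper's benchmark.
    obfuscated = "int f(int x) { int t = (x ^ 0) + 0; return t * 1; }"
    simplified = "int f(int x) { return x; }"
    print(halstead_length(obfuscated), halstead_length(simplified))
    print(f"{length_reduction(obfuscated, simplified):.2f}% reduction")
```

On this toy pair the lengths are 24 and 11 tokens, a reduction of roughly 54%. A length reduction alone says nothing about whether the simplified code still behaves like the original, which is why the abstract discusses semantic correctness separately.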

2025-01-01

Year: 2025
ISBN: 978-3-031-97620-9
Appears in types: 04.01 - Contribution in conference proceedings

Files in this item:

File	Size	Format
2025_LLM_supported_Deobfuscator.pdf (closed access; type: post-print document)	661.15 kB	Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11567/1258096
