
Multi-feature based Function Embedding Network for Binary Code Similarity
  • XIANGYU LI,
  • GUOHAO WU,
  • ZIHUI GUO,
  • HONGLIANG LIANG
XIANGYU LI
Beijing University of Posts and Telecommunications
GUOHAO WU
Beijing University of Posts and Telecommunications
ZIHUI GUO
Beijing University of Posts and Telecommunications
HONGLIANG LIANG
Beijing University of Posts and Telecommunications

Corresponding Author: [email protected]


Abstract

Binary similarity detection determines whether two given binary code snippets are similar, usually at function granularity. The task is challenging because of differing compilation optimizations and CPU architectures. Deep-learning methods have recently made great progress in this field, but most of them rely on manually selected features or ignore important semantic information, such as code literals or function signatures, during feature processing. In addition, similarity training typically uses random sampling and a pairwise loss function, which together cover only limited similarity relations between functions. In this paper, a new framework, MFEN-Sim, is proposed to detect similar binary functions. The framework contains three stages: feature extraction and normalization, a multi-feature based function feature embedding network (MFEN), and a similarity learning network. Multiple features, including assembly instructions, CFG structures and function code literals, are extracted from binary functions. These features are then fed into MFEN, which is composed of three modules: a function semantic and structure embedding module, a function signature prediction module, and a function code literal embedding module. The three modules generate embeddings representing the function's semantic and structural features, its signature prediction features and its code literal features, respectively. Finally, MFEN-Sim uses a similarity training network based on contrastive learning so that MFEN recognizes more similarity relations between functions. MFEN-Sim is evaluated on 281,601 functions in 144 binaries and on 17 CVEs in real-world software. Experimental results show that our work outperforms state-of-the-art systems (i.e., Gemini, FIT and SAFE) by 7.1%, 9.9% and 8.2%, respectively, on the AUC metric in similarity detection across architectures and optimization levels, and achieves higher recall than the baselines when searching for vulnerabilities in real-world applications.
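To illustrate why contrastive training can cover more similarity relations than a pairwise loss, the following is a minimal sketch of an InfoNCE-style loss over a batch of function embeddings. It is an assumption made for illustration only: the abstract does not state the exact loss, similarity measure or temperature used by MFEN-Sim, and the names `cosine`, `info_nce_loss` and `temperature` are hypothetical. Each anchor is compared with every embedding in the batch, so a single step contrasts one positive against many negatives instead of scoring one pair.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce_loss(anchors, positives, temperature=0.1):
    """Illustrative InfoNCE-style loss (assumed, not the paper's exact loss).

    anchors[i] and positives[i] are embeddings of the same source function
    compiled for a different architecture or optimization level. Every
    positives[j] with j != i serves as a negative for anchors[i].
    """
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        # Numerically stable log-sum-exp over all candidates in the batch.
        top = max(logits)
        log_denominator = top + math.log(sum(math.exp(x - top) for x in logits))
        # Negative log-probability of picking the true positive.
        total += log_denominator - logits[i]
    return total / len(anchors)


# Toy batch: two functions, each with two compiled variants.
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]   # variants embedded close to their anchors
swapped = [[0.1, 0.9], [0.9, 0.1]]   # variants embedded close to the wrong anchor

loss_aligned = info_nce_loss(anchors, aligned)
loss_swapped = info_nce_loss(anchors, swapped)
```

In this toy batch, `loss_aligned` is much smaller than `loss_swapped`, which is the behaviour a contrastive objective is meant to reward: variants of the same function are pulled together while all other functions in the batch are pushed apart.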
24 Apr 2023 Submitted to Journal of Software: Evolution and Process
24 Apr 2023 Assigned to Editor
24 Apr 2023 Submission Checks Completed
15 May 2023 Reviewer(s) Assigned
10 Jul 2023 Review(s) Completed, Editorial Evaluation Pending
13 Jul 2023 Editorial Decision: Revise Major