Identifying string decoding functions in IDA Pro

Motivation and background

When triaging malicious executable files, I always try the FireEye Labs Obfuscated String Solver (FLOSS) to quickly decode obfuscated strings. In short, FLOSS uses heuristics to identify decoding routine candidates and emulates them using vivisect’s disassembly and emulation modules.

While vivisect is an awesome tool, it is sometimes not as robust as IDA Pro at parsing and disassembling binaries. In addition, IDA Pro provides the Fast Library Identification and Recognition Technology (FLIRT), which helps to distinguish standard library functions from functions written by the program’s author.

To help automatically identify string decoding routines in IDA Pro, I have ported some of the heuristics FLOSS uses to IDAPython. You can find the script on my GitHub page at https://github.com/mr-tz/idapython/blob/master/identify_string_decoders.py.

Usage and example output

Simply run the IDAPython script in IDA Pro via File – Script File… (ALT + F7 on Windows). Here is example output from the script:

  n   Score     Function VA
  1   1.16667   0x0040166C 
  2   0.83333   0x0040261E 
  3   0.18667   0x00402647 
  4   0.17333   0x00403BC1 
  5   0.13000   0x0040229D 
  6   0.10667   0x0040172F 
  7   0.09000   0x00402ECB 
  8   0.06667   0x0040499B 
  9   0.04000   0x0040185B 
 10   0.04000   0x00404F63 
 11   0.03333   0x004031D6 
 12   0.03333   0x00404662 
 13   0.02667   0x0040430C 
 14   0.02000   0x00403393 
 15   0.02000   0x00403163 
 16   0.02000   0x004034EA 
 17   0.01333   0x004026D7 
 18   0.01333   0x00404491 
 19   0.01333   0x00402D15 
 20   0.01333   0x00401698

How it works

The script distinguishes functions written by the program’s author from library and thunk functions. To identify potential string decoding routines, the heuristics are run only on these “user” functions.
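The filtering step can be sketched in a few lines. This is a minimal illustration, not the code from identify_string_decoders.py: the function flags are passed in as plain integers so that the sketch runs outside of IDA, and the two constants are assumed to mirror IDA's FUNC_LIB and FUNC_THUNK bits (verify them against idaapi in your IDA version). The helper names are hypothetical.

```python
# Assumed to mirror IDA's function flag bits; check idaapi.FUNC_LIB and
# idaapi.FUNC_THUNK in your IDA version before relying on these values.
FUNC_LIB = 0x04
FUNC_THUNK = 0x80


def is_user_function(flags):
    """Return True if neither the library bit nor the thunk bit is set."""
    return not (flags & (FUNC_LIB | FUNC_THUNK))


def select_user_functions(functions):
    """Keep the addresses of user functions.

    functions: iterable of (address, flags) pairs. Inside IDA the flags
    would typically be read per function, for example with
    idc.get_func_attr(ea, idc.FUNCATTR_FLAGS).
    """
    return [ea for ea, flags in functions if is_user_function(flags)]


# Made-up sample: one plain function, one library function, one thunk.
sample = [(0x401000, 0x00), (0x402000, FUNC_LIB), (0x403000, FUNC_THUNK)]
user_functions = select_user_functions(sample)
```

Only the addresses left in user_functions would then be passed on to the heuristics.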

The identification of string decoding functions happens in two steps. First, different heuristics are used to identify candidate functions. Second, a weight is applied for each heuristic that matched a function. The individual weights added together result in the function’s final score.
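To make the two steps concrete, here is a rough sketch of the weighting stage. The heuristic names, the weights, and the per-heuristic values are all made up for illustration; the real script and FLOSS use their own values, so do not expect this to reproduce the scores shown above.

```python
# Hypothetical weights per heuristic; not the values used by the script.
HYPOTHETICAL_WEIGHTS = {
    "xrefs": 0.5,
    "nzxor": 0.3,
    "shift": 0.1,
    "mov_loop": 0.1,
}


def score_functions(hits, weights):
    """Combine heuristic results into one score per function.

    hits: {function_va: {heuristic_name: value}} as produced by step one.
    Returns (function_va, score) pairs sorted by descending score.
    """
    scores = {}
    for va, per_heuristic in hits.items():
        scores[va] = sum(
            weights[name] * value for name, value in per_heuristic.items()
        )
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Made-up results of step one for two functions.
hits = {
    0x401000: {"shift": 0.5},
    0x402000: {"xrefs": 1.0, "nzxor": 1.0},
}
ranked = score_functions(hits, HYPOTHETICAL_WEIGHTS)
for rank, (va, score) in enumerate(ranked, start=1):
    print("%3d   %.5f   0x%08X" % (rank, score, va))
```

A function that triggers several heuristics accumulates a higher score and therefore ends up near the top of the list.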

The current heuristics identify functions based on:

  • the number of cross-references to a function;
  • non-zeroing XOR instructions;
  • shift (SHL, SHR, SAL, SAR, ROL, ROR) instructions; and
  • suspicious MOV instructions in tight loops.
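Two of these heuristics are easy to illustrate on their own. The sketch below counts non-zeroing XOR and shift instructions in a list of already disassembled instructions. It assumes each instruction is given as a (mnemonic, operands) pair of strings, which is a simplification chosen so the example runs outside of IDA; the function names and the sample instructions are made up.

```python
SHIFT_MNEMONICS = {"shl", "shr", "sal", "sar", "rol", "ror"}


def is_nonzeroing_xor(mnemonic, operands):
    """True for XOR instructions whose two operands differ.

    'xor eax, eax' merely zeroes a register and is ignored, while
    'xor al, 0x5a' actually transforms data.
    """
    if mnemonic.lower() != "xor" or len(operands) != 2:
        return False
    return operands[0].strip().lower() != operands[1].strip().lower()


def count_features(instructions):
    """Count non-zeroing XORs and shifts in (mnemonic, operands) pairs."""
    nzxor = sum(1 for m, ops in instructions if is_nonzeroing_xor(m, ops))
    shift = sum(1 for m, _ in instructions if m.lower() in SHIFT_MNEMONICS)
    return {"nzxor": nzxor, "shift": shift}


# Made-up instruction sequence, not taken from a real sample.
sample = [
    ("xor", ("eax", "eax")),
    ("xor", ("al", "0x5a")),
    ("shl", ("eax", "4")),
    ("mov", ("ecx", "eax")),
    ("ror", ("edx", "1")),
]
features = count_features(sample)
```

Comparing operand text is a deliberate shortcut. Inside IDA one would more likely compare operand types and values, which is more reliable than matching strings.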

Here is the disassembly of a string decoding function that identify_string_decoders.py correctly identified. Note the tight loop (0x401675 – 0x401694), the suspicious MOV instruction (0x401692), the non-zeroing XOR instruction (0x401681), and the shift instructions (0x401685 and 0x401688). Additionally, this function is called more than 20 times in this binary. This screams string decoder!

[Figure: Disassembly of a string decoding function]
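The tight loop and suspicious MOV checks can be approximated as follows. This is a simplified sketch under my own assumptions, not the logic of the script or of FLOSS: a tight loop is taken to be any backward jump spanning at most max_span bytes, and a suspicious MOV is taken to be any MOV inside that loop whose destination is a memory operand. The instruction tuples, the threshold, and the sample listing are made up and do not reproduce the disassembly shown above.

```python
def find_tight_loops(instructions, max_span=0x40):
    """Return (loop_start, loop_end) for short backward jumps.

    instructions: list of (address, mnemonic, operands, branch_target),
    where branch_target is None for non-branching instructions.
    """
    loops = []
    for address, mnemonic, _operands, target in instructions:
        if target is None or not mnemonic.lower().startswith("j"):
            continue
        if target <= address and address - target <= max_span:
            loops.append((target, address))
    return loops


def movs_to_memory(instructions, loop):
    """Addresses of MOVs inside the loop that write to a memory operand."""
    start, end = loop
    return [
        address
        for address, mnemonic, operands, _target in instructions
        if start <= address <= end
        and mnemonic.lower() == "mov"
        and operands
        and operands[0].startswith("[")
    ]


# Made-up single-byte XOR decoding loop.
sample = [
    (0x1000, "mov", ("ecx", "[ebp+8]"), None),
    (0x1003, "mov", ("al", "[ecx]"), None),
    (0x1005, "xor", ("al", "0x5a"), None),
    (0x1007, "mov", ("[ecx]", "al"), None),
    (0x1009, "inc", ("ecx",), None),
    (0x100A, "cmp", ("ecx", "edx"), None),
    (0x100C, "jb", ("0x1003",), 0x1003),
    (0x100E, "ret", (), None),
]
loops = find_tight_loops(sample)
suspicious = [movs_to_memory(sample, loop) for loop in loops]
```

In this sample the loop runs from 0x1003 to 0x100C, and the only MOV that writes to memory inside it is the one at 0x1007, which stores the decoded byte.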

Conclusion

Note that the exact function rankings and scores will likely differ from FLOSS’s results. Nonetheless, the script has been very useful to me when debugging and tweaking FLOSS, and I hope it will assist you as well. This IDA Pro implementation is also a great fallback option if vivisect fails to generate a workspace or does not analyze a binary correctly.

Who knows, maybe I will further integrate IDA Pro and vivisect to leverage the advantages of both tools. Obviously, FLOSS will continue to be a stand-alone tool, but the combination could provide enhanced analysis results for reverse engineers using IDA Pro.
