This post is a general presentation about obfuscation. It is mostly a summary of:
It addresses the following points:
- What is obfuscation?
- Why use obfuscation?
- How to evaluate obfuscation?
- What are the main transformations?
What is obfuscation?
Obfuscation is the process of transforming a program code into a form that’s more difficult to understand or change. It must not be mistaken with minification which is simply the fact to decrease the size of a file. Although minification makes the file unreadable, it’s extremly easy to give it a normal look, and then to understand or modify the code at ease.
More formally, obfuscation can be defined as the process of transforming a program P into a program P’ such as P and P’ have the same observable behavior. In particular, if the original program P doesn’t terminate or terminate with an error, then its obfuscated version P’ may or may not terminate. However, in case of an original program P that terminates, it’s obfuscated version P’ must returns the same value as P.
The notion of observable behavior is important. Indeed, strictly speaking, P and P’ won’t have the same exact behavior since as we will see in the next part, some obfuscation techniques consist in changing the execution flow of the program by reordering variables or functions for example. This notion of observable behavior means that P and P’ may have different behaviors, but these differences should not be perceived by the end user.
Why use obfuscation?
Although encryption may seem more effective than obfuscation, it is of no interest if the code is decrypted on the client’s machine since there will still be a moment where the client will be able to see the unencrypted version of the code. An other solution to protect the code is to move the critical logic that needs to be protected to the server side. By providing an API that makes remote calls to the server containing the critical logic, the client has no longer access to the code. This solutions adds overhead both for the client since it is dependent on network conditions, and also for the company that provides the API since it needs to provides reliable servers.
This part summarizes how to characterize the efficiency of obfuscation, as well as its impact on the program that has been obfuscated. The evaluation process is split in three components that respectively answer to the following questions:
- How much obscurity has been added to the program? (human)
- How difficult is it to automatize the deobfuscation process? (machine)
- How much computational overhead is being added by the obfuscation
Potency aims at measuring how the obfuscation process makes it more difficult for a human to understand the obfuscated program. In order to measure if a new obfuscated programm P’ is less readable than the original program P, we use metrics from software engineering. These metrics reflect the quality, readability and maintainability of a program. Traditionally, the goal is to optimize these metrics so that a program becomes easier to maintain and read. In case of obfuscation, the goal is to worsen these metrics so that the program becomes more obscure. For example, when usually one would try to minimize the program’s size or variable dependencies, in case of obfuscation the goal is to maximize them.
Resilience aims at measuring how well an obfuscated program can resist to a deobfuscator. In contrary to potency which measures the complexity for a human, resilience measures the complexity added for a machine. It can be seen as the time required for a deobfuscator to reduce the potency of the obfuscated program, i.e. the time needed to make it more readable by a human.
Thus, the global quality of an obfuscation can be seen as a function of the potency (confusing a human), resilience (confusing a deobfuscator), and the overhead it introduces (execution time and size of the files).
In order to obfuscate a program, an obfuscator applies different kinds of transformations on the original program. We present these transformations in three categories depending on the targets they are applied on.
Layout transformations change the visual representation of the program. They are one way since once the transformation has been made, the previous state can’t be recovered. This category contains simple transformations such as renaming variables, functions, or removing comments. By doing this kind of transformation, it removes the semantic contained in the name of variables, or the indications present in the comments. It also contains operations such as changing the code formatting, which generally consist in removing spaces and lines.
We give an example on a simple program:
The original program could be transformed in this new program below:
Control transformations aims at obsuring the control flow of the application. These transformations are separated in three subcategories presented hereafter.
Aggregation transformations separate computations that are logically together, or merge code that are not. It is based on the idea that code that has been aggregated in a function or a class by a programmer must probably be linked. By separating it into different functions or classes, it makes it more difficult to understand. On the opposite, merging unrelated blocks of code together makes it look like they have a semantic link.
Ordering transformations randomize the order of instruction execution. It relies on the idea that spatial locality in the code, i.e. functions or statements close in the code, plays an important role in making the code more understandeable.
Computation transformations insert dead, redundant code, or make algorithmic changes. Their goal is to hide the real control flow by polluting it with irrelevant statements. In contrary to transformations that target the visual representation of the program, this set of transformations may have an impact on the execution time of the program (cost metric).
The example below presents how we could transform the previous sum function by adding dead code and altering the control flow, without changing the value it returns.
Besides the visual representation of a program, transformations can also be applied to data structures. These transformations are classified in three categories:
Storage and encoding transformations
This strategy relies on using a non “natural” way to encode or store data. For example, we can replace a boolean variable by two variables and a mapping function used to reconstruct the original value. If a variable v = false, we can represent it with the tuple (false, false, AND), (false, true, AND), or (true, false, AND) where AND is the logical AND operator. By using this tuple of 3 elements, we can compute the final value. Besides boolean variables, this technique can be generalized to other type of variables.
The example below illustrates how we can split a boolean variable into a tuple of 3 elements:
An other technique often used on strings is to not use the raw string directly, but instead convert the string into a program that produces the string.
Aggregation transformation aim at aggregating data structures to hide their original representation. For example, it may operate on arrays by restructuring them. An array can be split into multiple sub arrays, multiple arrays can be merged into one. A one dimensional array can be folded into a higher dimensional array, or the opposite, called flattening, which is the process of decreasing the dimension of an array. For example, by representing a 2D grid using a 1D array, it makes it more difficult to understand the purpose of the variable.
Besides arrays, these transformations may also focus on inheritance relationships. It may increase the level of inheritance, or on the opposite decrease it by merging subclasses.
The example below presents how to flatten a 2D array to a 1D array:
Ordering transformations randomize the order of declarations in the source code. In particular, the order of methods and instances variables of a class, as well as parameters in functions. It is also possible to randomize the order in which elements are stored in a data by providing a function that given an index i, maps it back to its original position in the array.
The table below, extracted from “A taxonomy of obfuscation transformations”, provides an overview of the different obfuscating transformations as well as their impact on the potency, the resilience and the cost. A + indicates that the value of the metric depends on the context.
|Layout||Scramble identifiers||medium||one way||free|
|Change Formatting||low||one way||free|
|Remove Comments||high||one way||free|
|Control||Computations||Insert dead code||Depends on the quality of the opaque predicate and
the nesting depth at which the construct is inserted.
|Extend loop condition|
|Reducible to non reducible|
|Add redundant operands|
|Remove programming idioms||medium||strong||+|
|Aggregation||Inline method||medium||one way||free|
|Interleave methods||Depends on the quality of the opaque predicate.|
|Ordering||Reorder statements||low||one way||free|
|Reorder loops||low||one way||free|
|Reorder expressions||low||one way||free|
|Data||Storage and encoding||Change encoding||Depends on the complexity of the encoding function.|
|Promote scalar to object||low||strong||free|
|Change variable lifetime||low||strong||free|
|Split variable||Depends on the number of variables the
original variable is split into.
|Convert static to procedural data||Depends on the complexity of the generated function.|
|Aggregation||Merge scalar variables||low||weak||free|
|Insert bogus class||medium||+||free|
|Ordering||Reorder methods and instance variables||low||one way||free|
Some of the transformations present in the table have not been presented in this post, but their name is rather explicit about their purpose. This post also didn’t tackle the complexity of creating efficient opaque predicates capable of resisting to deobfuscators. These different points may be treated in future posts on obfuscation. Meanwhile, you can refer to this article for more details.