This post is a general presentation about obfuscation. It is mostly a summary of:

It addresses the following points:

  1. What is obfuscation?
  2. Why use obfuscation?
  3. How to evaluate obfuscation?
  4. What are the main transformations?

I wrote this post because I regularly have to look at obfuscated JS code and wanted to know more about the different obfuscation strategies. The second reason is that I am also interested in protecting sensitive JS code executed in the browser, and as we’ll see in this post, obfuscation is often considered as a good candidate for these kinds of problems. Thus, even though this article is about obfuscation in general, some parts may be more related to Javascript obfuscation in particular.

What is obfuscation?

Obfuscation is the process of transforming a program code into a form that’s more difficult to understand or change. It must not be mistaken with minification which is simply the fact to decrease the size of a file. Although minification makes the file unreadable, it’s extremly easy to give it a normal look, and then to understand or modify the code at ease.

More formally, obfuscation can be defined as the process of transforming a program P into a program P’ such as P and P’ have the same observable behavior. In particular, if the original program P doesn’t terminate or terminate with an error, then its obfuscated version P’ may or may not terminate. However, in case of an original program P that terminates, it’s obfuscated version P’ must returns the same value as P.

The notion of observable behavior is important. Indeed, strictly speaking, P and P’ won’t have the same exact behavior since as we will see in the next part, some obfuscation techniques consist in changing the execution flow of the program by reordering variables or functions for example. This notion of observable behavior means that P and P’ may have different behaviors, but these differences should not be perceived by the end user.

Why use obfuscation?

Obfuscation is mostly used to protect code. Usually it is because you are distributing your code, for example in case of Javascript you send it to browsers that request a web page, but you don’t necessarily want to expose your business logic contained in it. By obfuscating your program, you make it harder for someone to understand its behavior. Moreover, in case where someone would simply steal the code to use it on its website, it would make it more difficult to maintain and add new features.

Although encryption may seem more effective than obfuscation, it is of no interest if the code is decrypted on the client’s machine since there will still be a moment where the client will be able to see the unencrypted version of the code. An other solution to protect the code is to move the critical logic that needs to be protected to the server side. By providing an API that makes remote calls to the server containing the critical logic, the client has no longer access to the code. This solutions adds overhead both for the client since it is dependent on network conditions, and also for the company that provides the API since it needs to provides reliable servers.

Evaluate obfuscation

This part summarizes how to characterize the efficiency of obfuscation, as well as its impact on the program that has been obfuscated. The evaluation process is split in three components that respectively answer to the following questions:

  • How much obscurity has been added to the program? (human)
  • How difficult is it to automatize the deobfuscation process? (machine)
  • How much computational overhead is being added by the obfuscation

Potency

Potency aims at measuring how the obfuscation process makes it more difficult for a human to understand the obfuscated program. In order to measure if a new obfuscated programm P’ is less readable than the original program P, we use metrics from software engineering. These metrics reflect the quality, readability and maintainability of a program. Traditionally, the goal is to optimize these metrics so that a program becomes easier to maintain and read. In case of obfuscation, the goal is to worsen these metrics so that the program becomes more obscure. For example, when usually one would try to minimize the program’s size or variable dependencies, in case of obfuscation the goal is to maximize them.

Resilience

Resilience aims at measuring how well an obfuscated program can resist to a deobfuscator. In contrary to potency which measures the complexity for a human, resilience measures the complexity added for a machine. It can be seen as the time required for a deobfuscator to reduce the potency of the obfuscated program, i.e. the time needed to make it more readable by a human.

Cost

Finally, the cost measures the overhead added by the obfuscation to the application. The overhead may be both in term of size of the file and in term of execution time. Indeed, in case of Javascript where files must be sent through the internet to browsers when they request a page, it might be a problem if the obfuscated program becomes too big.

Thus, the global quality of an obfuscation can be seen as a function of the potency (confusing a human), resilience (confusing a deobfuscator), and the overhead it introduces (execution time and size of the files).

Main transformations

In order to obfuscate a program, an obfuscator applies different kinds of transformations on the original program. We present these transformations in three categories depending on the targets they are applied on.

Layout

Layout transformations change the visual representation of the program. They are one way since once the transformation has been made, the previous state can’t be recovered. This category contains simple transformations such as renaming variables, functions, or removing comments. By doing this kind of transformation, it removes the semantic contained in the name of variables, or the indications present in the comments. It also contains operations such as changing the code formatting, which generally consist in removing spaces and lines.

We give an example on a simple program:

// Compute the sum of 2 numbers
function sum(number1, number2) {
	return number1 + number2
}

The original program could be transformed in this new program below:

function sUSNO0(qsqzu_Pmj, azsd_hFZh){return qsqzu_Pmj+azsd_hFZh;}

Control transformations

Control transformations aims at obsuring the control flow of the application. These transformations are separated in three subcategories presented hereafter.

Aggregation transformations

Aggregation transformations separate computations that are logically together, or merge code that are not. It is based on the idea that code that has been aggregated in a function or a class by a programmer must probably be linked. By separating it into different functions or classes, it makes it more difficult to understand. On the opposite, merging unrelated blocks of code together makes it look like they have a semantic link.

Ordering transformations

Ordering transformations randomize the order of instruction execution. It relies on the idea that spatial locality in the code, i.e. functions or statements close in the code, plays an important role in making the code more understandeable.

Computation transformations

Computation transformations insert dead, redundant code, or make algorithmic changes. Their goal is to hide the real control flow by polluting it with irrelevant statements. In contrary to transformations that target the visual representation of the program, this set of transformations may have an impact on the execution time of the program (cost metric).

The example below presents how we could transform the previous sum function by adding dead code and altering the control flow, without changing the value it returns.

function sum(number1, number2) {
	var a = 42;
	var z ;
	var res;
	if(number1 < 753) {
		z = 890;
		res = number1 + z;
	} else {
		z = 56 + a;
		res = number1 + z;
	}
	return res+number2 - z;
}

Data

Besides the visual representation of a program, transformations can also be applied to data structures. These transformations are classified in three categories:

Storage and encoding transformations

This strategy relies on using a non “natural” way to encode or store data. For example, we can replace a boolean variable by two variables and a mapping function used to reconstruct the original value. If a variable v = false, we can represent it with the tuple (false, false, AND), (false, true, AND), or (true, false, AND) where AND is the logical AND operator. By using this tuple of 3 elements, we can compute the final value. Besides boolean variables, this technique can be generalized to other type of variables.

The example below illustrates how we can split a boolean variable into a tuple of 3 elements:

function evalBool(v1, v2) {
	return v1 && v2;
}

// originalValue = false;
// Instead of directly storing false, we store it as (v1: true, v2: false)
var encodedValue = {
    v1: true,
    v2: false
};

// Instead of storing directly the value in a boolean, we split it in 2 values and a function
if(evalBool(encodedValue.v1, encodedValue.v2)) {
	// do things
}

An other technique often used on strings is to not use the raw string directly, but instead convert the string into a program that produces the string.

Aggregation transformations

Aggregation transformation aim at aggregating data structures to hide their original representation. For example, it may operate on arrays by restructuring them. An array can be split into multiple sub arrays, multiple arrays can be merged into one. A one dimensional array can be folded into a higher dimensional array, or the opposite, called flattening, which is the process of decreasing the dimension of an array. For example, by representing a 2D grid using a 1D array, it makes it more difficult to understand the purpose of the variable.

Besides arrays, these transformations may also focus on inheritance relationships. It may increase the level of inheritance, or on the opposite decrease it by merging subclasses.

The example below presents how to flatten a 2D array to a 1D array:

function getNewIndex(i, j) {
	return i + (i+1)*j;
}

var grid1D = new Array(m*n);

for(var i = 0; i < m; i++) {
	for(var j = 0; j < n; j++) {
		grid1D[getNewIndex(i, j)] = grid2D[i, j];
	}
}

Ordering transformations

Ordering transformations randomize the order of declarations in the source code. In particular, the order of methods and instances variables of a class, as well as parameters in functions. It is also possible to randomize the order in which elements are stored in a data by providing a function that given an index i, maps it back to its original position in the array.

Summary

The table below, extracted from “A taxonomy of obfuscation transformations”, provides an overview of the different obfuscating transformations as well as their impact on the potency, the resilience and the cost. A + indicates that the value of the metric depends on the context.

Obfuscation Quality
Target Operation Transformation Potency Resilience Cost
Layout Scramble identifiers medium one way free
Change Formatting low one way free
Remove Comments high one way free
Control Computations Insert dead code Depends on the quality of the opaque predicate and
the nesting depth at which the construct is inserted.
Extend loop condition
Reducible to non reducible
Add redundant operands
Remove programming idioms medium strong +
Table interpretation high strong costly
Parallelize code high strong costly
Aggregation Inline method medium one way free
Outline statements medium strong free
Interleave methods Depends on the quality of the opaque predicate.
Clone methods
Block loop low weak free
Unroll loop low weak cheap
Loop fission low weak free
Ordering Reorder statements low one way free
Reorder loops low one way free
Reorder expressions low one way free
Data Storage and encoding Change encoding Depends on the complexity of the encoding function.
Promote scalar to object low strong free
Change variable lifetime low strong free
Split variable Depends on the number of variables the
original variable is split into.
Convert static to procedural data Depends on the complexity of the generated function.
Aggregation Merge scalar variables low weak free
Factor class medium + free
Insert bogus class medium + free
Refactor class medium + free
Split array + weak free
Merge arrays + weak free
Fold array + weak cheap
Flatten array + weak free
Ordering Reorder methods and instance variables low one way free
Reorder arrays low weak free


Some of the transformations present in the table have not been presented in this post, but their name is rather explicit about their purpose. This post also didn’t tackle the complexity of creating efficient opaque predicates capable of resisting to deobfuscators. These different points may be treated in future posts on obfuscation. Meanwhile, you can refer to this article for more details.


Antoine Vastel

Head of research at Datadome.