Polymorphic Compilation (just some crazy thoughts)

Generally, we speak about "mobile" code (or just "code"). This means a particular view of a snippet of executable code, in which the code is considered CPU/OS/file-format independent; the only thing that remains is the code itself. It can be a virus, worm, trojan, shellcode, or anything that 1) can be changed and executed, 2) must exist in many instances and forms, and 3) (probably) should "move", or replicate, changing its physical location. Other properties are not significant, and we concentrate on the polymorphism property, because it helps us to better understand some relations between the code and the external factors of the environment this code lives in. Among these factors is the distribution in time of the probability of the code's algorithm being understood/detected. In other words, we try to find the relation between 1) the form of the code and the changes which could be applied to it, and 2) how much time 'they' could spend to understand a) what it is and how it works, and b) to write code detection algorithms and/or implement countermeasures to reduce the code's effectiveness.

There exists a set of algorithms allowing one to write code that looks more complex, to make it harder to understand (programming tricks), and a set of algorithms allowing one to write code which can change itself on each replication step (polymorphic and metamorphic engines). However, linking such engines into the code leads to algorithm disclosure, which leads to the possibility of full understanding and detection. This problem can be solved by an external _private_ system which generates all of the code variants in all possible forms. The simplest example is code which changes itself but removes the ability for subsequent changes from the generated child variants; the parent variant remains secure and hidden, while the generated variants continue to exist in constant form. The universal model is code which generates code which generates code and so on, losing some abilities at each step: at the 1st step there is completely private knowledge, at the last step there is completely public code without any ability to change itself.

If we compare the ability to change itself, and complexity, with physical speed, then a good analogy is a bullet and a rocket: a rocket can change its speed and direction depending on some algorithms, while a bullet is launched once and moves forward until its speed is lost. The analogy of the universal model shown above is then a rocket with a separating warhead, i.e. a big carrier rocket with smaller rockets inside, and so on, until something stupid but destructive enough is launched ;-)

To detect mobile code one should collect a large enough set of samples. If these samples are all generated by some parent code, it is required to find that parent, or to collect enough samples to reconstruct the parent algorithm and make detection possible. However, if the parent code itself exists in multiple (also polymorphic) forms, it is required to do the same step again, and so on (this is because parent generators can differ in their algorithms); but if the 1st, main generator is really private, it cannot be found. (hint: the main generator is man ;-) As such, it all looks like a tree. An interesting feature of such a model is that if you have a set of Nth-generation samples which are (probably) produced by different generators, they 1) could vary significantly and 2) cannot be classified, because the information is always incomplete.
Moreover, you never have a set of only Nth-generation samples; the samples collected always belong to different generation steps, i.e. some of them are parent samples and some are child samples, and each one differs from another; also, any set of samples can relate to only some node of the whole tree, so during analysis something will always be missed. Another positive feature is that you only need to write a single program, which will generate everything; new samples can be generated in one second, while analysis nowadays is handmade. The negative feature is that such a program has to somehow contain all the complexity of the future sub-generators.

Let's say you have some C/C++ sources. First, you need to convert these sources into a template.

1. The template should contain variations of some blocks/functions of the code, including possible compiler directives, variations of HLL "trash" constructions, etc.; the more variants, the better.
2. Using this template, generate a different source each time.
3. Use a randomly chosen compiler (for example, choose from bcc/msvc/intel) to generate .asm sources.
4. Inject some random stuff into each .asm source, i.e. randomize it.
5. Compile the .asm source into an .obj; use different standard libs to link the .objs into an .exe.
6. Pack/encrypt the .exe with a randomly chosen (from the set) packer/compressor.
7. Generate some thousands of such executables and upload 'em to a server; take them one by one and distribute.
8. Each .exe could perform additional self-changes with subsequent replication.
9. Periodically, go to step 1 and add/modify some variants of some subroutines.

In such a situation, each .exe will be unique at both the assembly and the source level, except (maybe) for data strings. To modify all the resulting samples significantly, your effort consists only of modifying a single template file; these changes will then propagate (and multiply) down to the lower levels, and then to the target machines.

Some features
-------------

All the randomizing things should use a "slow" strategy: never use all the possibilities at once, but distribute feature usage over time.

Counteract sample collecting: try to use in-memory execution; encrypt samples with machine-dependent info.

Keep a complete sample database locally. Periodically check it with all possible updated antiviruses/mail filters; if something is detected, try to automatically update the corresponding machine with the latest .exe version. The local sample database should contain the binary sample plus all the properties which were used to generate it. When many samples are detected at once, dump the corresponding properties and manually analyse the properties with identical values, then try to fix the problem at step 1. In other words, when N .exe's are detected, query your db for the properties of these files and find which things were used in all of them at the generation step. Note that this can be automatic.

Since av products' distribution is localized, it is possible to introduce some relation between sample properties and the target's country: the country can be simply detected using the IP address. Then each localized av company will first get some special samples, with higher probability.

Try to run multiple different binaries on a single machine: when one of them is somehow killed, the others can counteract and/or report the signatures of all the active processes.

eof