ÄÄÄÄÄÄÄÄÂÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÂÄÄÄÄÄÄÄÄ
ÍÍÍÍËÍÍÍØÍÍÍÍÍÍÍÍÍÍËÍÍËÍÍËÍÍËÍÍËÍÍËÍÍËÍÍËÍÍËÍÍËÍÍËÍÍËÍÍËÍÍËÍÍÍÍÍÍÍÍÍÍØÍÍÍËÍÍÍÍ
    º   ³       ÚÄÄÐÄÄÐÄÄÐÄÄÐÄÄÐÄÄÐÄÄÐÄÄÐÄÄÐÄÄÐÄÄÐÄÄÐÄÄÐÄÄÐÄÄ¿       ³   º
    º   ÃÄÄÄÄÄÄÄ´  Advanced polymorphic engine construction  ÃÄÄÄÄÄÄÄ´   º
    º   ³       ÀÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÙ       ³   º
    ÈÍÍÍØÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍËÍÍËÍÍËÍÍËÍÍËÍÍËÍÍËÍÍËÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍØÍÍÍ¼
        ³                ÚÄÄÐÄÄÐÄÄÐÄÄÐÄÄÐÄÄÐÄÄÐÄÄÐÄÄ¿                ³
        ÀÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄ´            by            ÃÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÙ
                         ³ The Mental Driller / 29A ³
                         ÀÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÙ


This article is assumed upon a basis on polymorphic engines construction, so
you need an adquired good knowledge about decryptor generators and its
construction (it's not for newbies! ;)

I wrote this for win32 engines. I'm not very versated in Linux/Unix virusing,
but modifying some words on this article (and some points in the index) it
can be extrapolated to engines under these systems.

Ú---úú  .
| Index |
`  úú--ÄÙ
0. Some comments
1. Making more complex polymorphic engines
  1.1 Size of decryptors
  1.2 Algorithmical applications
     1.2.1 PRIDE technology
     1.2.2 Branching technology
  1.3 Internal recursivity
2. Don't give a chance to AVs
  2.1 Coherent decryptor structures
  2.2 Opcodes to avoid
3. Advanced garbage generation
  3.1 Memory accesses
  3.2 API calls
  3.3 Recursive garbage functions
4. Last words

ÚÄÄÄÄÄÄ---úú       .
| 0. Some Comments |
`       úú--ÄÄÄÄÄÄÄÙ
This article is made for those who made its polymorphic engine and they want
to improve their knowledge and their techniques, making a better polymorphic
engine. I have to advice that the techniques I'm going to explain are very
time consumming (an error in the coding, being little or not, can generate
huge errors that aren't easy to trace back, or little errors in the code
generation that can pop up in the least expected moment).

Well, so let's on. I've tried to make both organized article and clear
explanations, but sometimes it can be a little hard to understand. Well,
considering that I'm not an expert on article writing, I've done what I
could ;).


ÚÄÄÄÄÄÄ---úú                                 .
| 1. Making more complex polymorphic engines |
`                                 úú--ÄÄÄÄÄÄÄÙ

 1.1 Size of decryptors
 ----------------------
Many people believe nowadays in the (old) advise of virus coding: you have to
code them little. This advise was valid while there were 40 Mb harddisks and
no Pentiums+ (or similars), but now this fact is obsolete. An average user
has now a huge harddisk (2+ Gb) and s/he don't know the real free space of
his/her disk due to the Windblows swap file. Then, we can make a huge virus
(10 Kb or more) without being noticed at all. We can apply this to decryptors,
then. Why we can't generate a 4 Kb decryptor? We can, and moreover, we are
nearly obligated to do it (a 100 bytes decryptor isn't a challenge to any
nowadays AV emulator). But we have to have a GOOD garbage generator, since
a single instruction size is 3 or 4 bytes as an average, which are about some
thousands of instructions in a big decryptor, which can make easier the AVers
task if they decide to detect our virus by algorithmical approaches or
heuristical techniques. So, we have to code a good garbage generator with
quite a lot of chances.

Other thing that we have in favour of big decryptors is the impossibility for
an emulator to determine in a few instructions if it's a decryptor or not,
forcing it to emulate deeply. 1 Kb of garbage before starting the decryption
should be enough, but you have to thing that every time processors are better
(faster, cheaper, etc. etc.) so emulators too. Garbage is executed very fast
upon normal execution, but it can take a good while upon an emulator. The more
garbage you put, the more time an emulator needs and the less possibilities
the emulator reach the decrypted virus, always you put coherent garbage to
avoid the heuristic detections of "strange" instruction using, and also you
have to mantain a good balance between quantity and quality of code generation
(putting 20 Kb of very complex garbage can slow the initial execution of the
application, noticing the user there is something unusual on his/her system).

 1.2 Algorithmical applications
 ------------------------------
When possible, avoid linear decryption. This is valid for both decryption and
looping. Although we have a 10 Kb decryptor, if we make a main loop (easily
detectable by an emulator) and we access consecutively to the encrypted data
when decrypting it, it's stupid to make very complex coding, because we have
an algorithmical clue for detection that many emulators use to defeat complex
decryptors (they only put a breakpoint after the big loop and wait until that
part is decrypted). I've developed two techniques that are useful to avoid
this situation: PRIDE and Branching.

  1.2.1 PRIDE Technology
  ----------------------
The name comes from Pseudo-Random Index DEcryption, and it was an idea that
I had from the beginning, when I began to code poly engines. Due to the lack
of information about this, I had to research this subject by myself, and now
that's what I bring to you.

The idea is that a "normal" program doesn't make normally a sequential
read/write of a memory zone, which is being made by the decryptors of all
polymorphic virus. There are some techniques that try to avoid that in a
certain way (look Zhengxi's engine(29A#1) or MeDriPolEn in Squatter(29A#3)),
like adding some bytes to leave "holes" and then do several decryptions of
the same code but adding each time a different number, leaving the code full
decrypted.

This particularity is also detected by AVs emulators. The only way to hide
this is to make an "in-appearance" random accessing to that memory, to cheat
the emulator and force it to determine that they are part of a normal memory
access of an application, and that's what I researched a lot until I found
this formula, very easy to do polymorphically. It's adapted to byte-to-byte
decryption, but I explained how to adapt it to others down.

ÚÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄ¿
³Random(Number) symbolizes a random number between 0 and Number-1 (just like³
³ the C function)                                                           ³
³                                                                           ³
³Encrypted_Data_Size = The size of encrypted part, rounded to the next power³
³                      of 2 (I explain this later)                          ³
³InitialValue = Random(Encrypted_Data_Size)                                 ³
³                                                                           ³
³ The formula                                                               ³
³ -----------                                                               ³
³      Register1 = Random(Encrypted_Data_Size)                              ³
³      Register2 = InitialValue                                             ³
³ Loop_Label:                                                               ³
³      Decrypt [(Register1 XOR Register2)+Begin_Address_Of_Encrypted_Data]  ³
³      Register1 += Random (Encrypted_Data_Size) AND -2                     ³
³                                     ÀÄÄÄÄÄ>  Take care with this one!     ³
³      Register1 = Register1 MOD Encrypted_Data_Size                        ³
³      Register2++                                                          ³
³      Register2 = Register2 MOD Encrypted_Data_Size                        ³
³      if Register2!=InitialValue GOTO Loop_Label                           ³
³      GOTO Begin_Address_Of_Encrypted_Data                                 ³
ÀÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÙ

That's it! Very short, very easy to code, and very randomized. Let's see it
by parts, and I'll explain the mathematical aspects of the formula (why it's
like this and no like other):

The first thing to consider is the fact that the encrypted part must be a pow
of 2. If you look at the formula, you can see that the generated decryption
address came from a XOR between two random numbers. The fact is that a XOR
(unlike ADD, SUB, etc.) never modifies a bit higher than the highest bit of
the two numbers, which allows us to know always the top limit of the
resulting number (always power of 2).

Now, the used registers: Register1 is used as a modifier for the Register2,
and it's a pseudo-random number every time, due to the fact that we generate
its initial value randomly and we add to it a random number every loop. The
work of this formula is done by the Register2, and if you look at it, you can
see that Register2 is no other thing than a counter, so you can increase it
or decrease it, it's up to you (or up to the engine :). Just keep it inside
the limits (between 0 and Encrypted_Data_Size).

Now the real revolution of this formula: I find out, after many tests, that
when you have a counter (Register2) and you XOR a random number to it (always
inside the limits and that, I'm not going to repeat this anymore :) you get a
different number, and if you add to the XORing value a little special random
number and increase the counter, the next time you do the XOR to the counter
you will get another different number. When you have completed all the count
sequence (from 0 till NumberPowerOf2), you get a sequence of random numbers
which touch all the numbers from 0 till NumberPowerOf2, but without anyone
repeated! It's like making a permutation of a sequence, but you don't have to
store any vector nor generate any data. Since the formula randomizes all its
numbers, it doesn't vary very much from the "standard" decryptor.

I referred sometimes to a "special random number", which is the one that you
have to XOR to the counter. This one is special because you have to keep it
in a "level" down than the other random numbers. I explain this, and please
keep attention to it, because is very important:

 ÉÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍ»
 º   When you use this technology, most random numbers are from 0 until the º
 º size of the data to decrypt  (pow of 2). There's a  particularity in the º
 º formula that  it's necessary for the  correct development  and returning º
 º reliable  values:  you  must  "align" the  numbers  (so, the  result  of º
 º Random(Encrypted_Data_Size)  must be multiple of 1 if we decrypt by byte º
 º ptr (nothing special here, then), be multiple of 2 if we decrypt by word º
 º ptr, and multiple of 4 if we decrypt by dword ptr). But the  number that º
 º we add to the  XORing value is slightly  special  because it  has to  be º
 º aligned in an upper  grade, I mean, if you use  byte ptr for decryption, º
 º that number  must be multiple of 2, multiple of 4 for  word ptr and of 8 º
 º for dword ptr.  That is  easily achieved by  getting, before  coding the º
 º opcode of the instruction,  the random number to add,  and then doing an º
 º instruction:                                                             º
 º AND Value,Encrypted_Data_Size-2 (for bytes), or                          º
 º AND Value,Encrypted_Data_Size-4 (for words), etc.                        º
 º (in the engine, not in the decryptor!).                                  º
 ÈÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍ¼

Just take this into account because, if not, the decrypted part will be
touched two times, leaving it corrupted (I found out this after becoming mad
and furious after several hours of seeing a correct engine and an incorrect
decryption, and after coding thousands of little programs to test that :).

This method, although powerful, can be defeated with the detection of the
code loops, so we must do anything to break the linearity of the decryptor
execution. The easiest way is to put some conditional jumps in the middle,
but it seems that the emulators detect which zone of code is more frequently
executed (or something like that), so I thought about it and created the next
technique:

 1.2.1 Branching Technology
  --------------------------
This one, combined with PRIDE (it seems a joke :P ) will allow us to defeat
normal emulators (at least until this date), and of course with the help of
the normal polymorphism techniques and complex garbage generation.

When you look at a legitimal application, you can see that it has many
conditional jumps, followed by code, and the normal thing is that a portion
of code isn't executed an ununderstandable number of times as a decryption
loop does. We must break this, and the way can be this:

ÚÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄ¿
³First we have this arrays and values:                                       ³
³                                                                            ³
³ArrayOfJumps       dd N dup (0)                                             ³
³ArrayOfJumpsNdx    dd 0                                                     ³
³JumpsToComplete    dd N dup (0)                                             ³
³JumpsToCompleteNdx dd 0                                                     ³
³                                                                            ³
³                                                                            ³
³        ³                                                                   ³
³        ³ This is the beginning of the decryptor. This is the part when the ³
³        ³ registers are setted to their starting value, and all things that ³
³        ³ that we must put at the beginning.                                ³
³        ³                                                                   ³
³        ³                                                                   ³
³        x First address stored into ArrayOfJumps                            ³
³        ³                                                                   ³
³        ³ Garbage                                                           ³
³        ³                                                                   ³
³      .ù*ù. Random conditional jump with a very random probability of jump  ³
³      ³   ³ Garbage                                                         ³
³      x   x 2nd and 3rd address stored into ArrayOfJumps                    ³
³      ³   ³ Garbage                                                         ³
³     .*. .*. Random conditional jumps                                       ³
³     ³ ³ ³ ³                                                                ³
³     ³ ³ ³ ³ Four decryption algorithms that perform the same op. but with  ³
³     ³ ³ ³ ³  different code.                                               ³
³     ³ ³ ³ ³                                                                ³
³     ³ ³ ³ ³                                                                ³
³     | | | | Final-of-decryption check                                      ³
³     R R R R Loop to continue decrypting (jump randomly to one of the addr. ³
³     | | | |  stored in ArrayOfJumps)                                       ³
³     ³ ³ ³ ³ Garbage                                                        ³
³     | | | |                                                                ³
³     V V V V Jump to decrypted virus                                        ³
³                                                                            ³
³(This would be generated with 3 levels of recursivity. Just look down to see³
³the explanation)                                                            ³
ÀÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÙ

I think the technique is quite clear looking at the diagram, but I'll explain
it in words:

   1. Â First step Â
      ÀÄÄÄÄÄÄÄÄÄÄÄÄÙ
      You must code a recursive function that I'm going to call "DoBranch".
     This function has to manage the coding as if it were a tree. Once in
     the engine, when you begin to construct a decryptor, you insert first
     the instructions that set our going-to-be-used registers to their
     operative value. Once you have this, and after generating some garbage,
     you call to "DoBranch". You must sure that the function, since it's
     recursive, is going to execute several times, so don't put fixed memory
     variables. Use stack or indexed variables, instead.

   2. Â Recursivity r0x! Â
      ÀÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÙ
      DoBranch takes the control, and the function doesn't return completely
     until the whole decryptor is finished. The function must know which
     level of recursivity is, so you have to put a variable that increases
     every time you enter into "DoBranch" (INC [RecursivityLevel] at the
     very beginning). Every time you return from the function you must
     decrease that variable.

   3. Â Save this address Â
      ÀÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÙ
      Look if the recursivity level is the last one that we allow. If it is
     not, we store the actual instruction insertion address into our prepared
     set of variables: ArrayOfJumps+ArrayOfJumpsNdx, and we increase
     ArrayOfJumpsNdx.

   4. Â Interjumping code Â
      ÀÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÙ
      If you didn't arrive to the desired level of recursivity (yet), and
     after saving the actual address (point 3), generate garbage (a good
     amount of). When you decide that you have enough, then generate a random
     conditional jump. It must be very random, just like CMP Reg1,Reg2/JA xxx
     or similar, begin Reg1 and Reg2 garbage registers, if possible
     (the ones you put on garbage instructions). There is a huge set of
     possibilities (another very good one is TEST Reg,Value / J(N)Z xxx,
     being Value a number power of 2 - only one bit set).
      We must store then the address of the conditional jump we made, because
     we don't know yet to which address we have to jump, so we save
     this and later we'll calculate and complete this jumps. It's enough to
     push the address onto the stack.
      After saving this, we code this leave of the binary tree: you call
     "DoBranch" again, and when it returns... voil…! We have a complete
     branch coded, and of course the index of instruction insertion points
     to the place where the next branch will be. So, we pop the address
     that we pushed before, we calculate the distance between the actual
     insertion index and the saved address, and then we complete the
     conditional jump that the save address point at with that calculated
     value. After this little operation, call "DoBranch" again, decrease
     [LevelOfRecursivity] and RET.

   5. Â Control the number of branches Â
      ÀÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÙ
      The number of branches created by this function is going to be
     2^MaxRecursivityLevelAllowed, so take care, since this function can
     generate a huge decryptor without many coding (I recommend 3 or 4 levels
     of recursion, which would generate 8 or 16 branches).

   6. Â The last recursivity level allowed Â
      ÀÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÙ
      When you arrive to the last level of recursivity, then you code the
     decryption algorithm (just the decryption instruction, the index
     modification, the counter inc/decrementation, etc (all mixed with
     garbage) and all that. Make sure you are doing this very randomly,
     because you are coding this part 2^MaxRecursivityLevelAllowed times
     (usually 8 or 16), and if you code this very similar to the others, it's
     not polymorphism at all.
      When you arrive to the comparision/cond_jmp pair (the one which loops
     to continue decryption if the encrypted part isn't decrypted yet), you
     have to code the comparision and leave the jump uncompleted, since we
     must wait to the complete return of "DoBranch" to assure that all
     branches are coded. So, we store this address into the other prepared
     array, JumpsToComplete, just like we made with the ArrayOfJumps.
      Then insert some garbage, code the jump to the decrypted part and
     RET.

   7. Â The complete return of "DoBranch" Â
      ÀÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÙ
      After the return of the first call to DoBranch, we have to complete
     the jumps we stored in JumpsToComplete. This time we are going to use
     the two arrays that we constructed while coding all branches. All the
     power of the technique is based on this:

      1) Take the first address in JumpsToComplete
      2) Select randomly an address in ArrayOfJumps
      3) Complete the address from JumpsToComplete with the other address,
        so we have finally a conditional jump to one place of the tree in a
        random way.
      4) Do it with all the addresses in JumpsToComplete


When we finish, we'll have a decryptor that its behaviour is exactly the same
as a normal one, but that you never know which branch of code is going to be
executed every time, since when you jump to loop, you perform a random number
of random comparisions and conditional jumps that will drive you to a random
decryption part. Due to the fact that every final part of the branch does the
same than the others, we don't care which one becomes executed every time, so
we broke the linearity of execution and now, "from the outside", the decryptor
resembles a normal application following its conditions, not a decryption
loop.

 1.3 Internal recursivity
 ------------------------
We have seen that Branching requires recursivity for an easy coding of the
technique, but since we have done it, we can make the entire engine oriented
to recursive functions, specially to generate indirect register/memory
modifications. We are going to code some usual functions recursively, adding
a variable that I call "recursivity level", which is a variable that is
increased every time the function is called, and it's used to control the
active instances of this function (so, when we arrive to a pre-decided number
of callings, then we avoid calling this function recursively again). Let's see
the MOV Reg,Value instruction and what happens if we code the function that
generates this type of instruction in a recursive way:

     1. Decide the MOV types
     -----------------------
     Normally I use MOV Reg,Value, PUSH Value/POP Reg and LEA Reg,[Value],
     but it's up to you. There is another opcode for MOV Reg,Value (C7 C?),
     but try to avoid it since no compiler will generate it (although is
     completely correct for the processor, since it's the opcode for
     MOV DWORD PTR [Address],Value with the register-selection bits activated)
     since there are more optimized ways of doing it (concretely, the B?
     one-byte opcodes for direct value MOVs).
     Now that we have this MOVs, we select them as a "last chance": we use
     them, for example, in a 25% of callings, and when we are in a quite deep
     recursivity level (5 or 6 is enough, I think). We'll call this function
     "DoMOVRegValue". So, at the beginning, we can put:
                        inc     byte ptr [RecursivityLevel]
                        cmp     byte ptr [RecursivityLevel], 5
                        jae     MakeNoRecursive
                        ...

     And, to finish, we jump here instead of making RET:

          Return:       dec     byte ptr [RecursivityLevel]
                        ret


     2. Not only this function, but more!
     ------------------------------------
     To increase complexity of generated code, we have to make other functions
     that follow the same method as DoMOVRegValue. We can code DoMOVRegReg,
     DoMOVRegMem (this will make a MOV Reg,[Address] or similar), DoMOVMemReg,
     and the modifications (DoADDRegValue, *SUB*, *XOR*, etc.). We have to
     be careful with this ones, and we have to be sure that there's no error
     in its code, as we are going to code it all together at the same time
     (well, I prefer this method, but you can make it easier at first and
     increase the function's internal options when you have tested it). Then,
     DoMOVRegValue won't generate only the direct movings that we decided
     before, but we have more chances to make a value moving, for example:

       ; DL=Register to use
       ; EAX=Value to move to the register

                        call    GiveMeABufferAddress
                        ; Now, EBX=Buffer address where we can store a dword
                        call    DoMOVMemValue ; Using EBX as address and EAX
                                              ; as value
                        call    DoMOVRegMem   ; Using EBX as address and DL
                                              ; as register
                        ret

     This would be a direct case, but what if we do this?:
                        call    GiveMeABufferAddress
                        call    AdjustMemToValue
                        call    DoMOVRegMem

                AdjustMemToValue:
                        mov     ecx, eax
                        call    Random ; EAX=Random number
                        sub     ecx, eax
                        xchg    ecx, eax
                        call    DoMOVMemValue ; Move EAX to [EBX]
                        call    MakeGarbage
                        mov     eax, ecx
                        call    DoADDMemValue ; Add EAX to [EBX]
                        ret

     Of course, we don't use only ADD, but XOR, SUB, etc. And we don't use
     this only in memory addresses, so we can use it in another chance:

     (this is inside DoMOVRegValue and we arrive here randomly, since there
     are other options, of course):
                        call    AdjustRegToValue
                        ret

               AdjustRegToValue:
                        mov     ecx, eax
                        xchg    ecx, eax
                        call    DoMOVRegValue ; Recursive call
                        call    MakeGarbage
                        mov     eax, ecx
                        call    DoADDRegValue
                        ret

     Then, possibilities are infinite. We can combine all the functions we
     made before to generate a quite complex MOVing. We can make the same
     with DoMOVRegReg (taking as not-recursives, for example, MOV Reg1,Reg2,
     LEA Reg1,[Reg2] or PUSH Reg2/POP Reg1). We can use also other functions
     to increase this: DoPUSHReg, DoPUSHMem, etc. Let's see an example of deep
     recursivity:

          1 - We call DoMOVRegValue, and we arrive to:
                                call    GiveMeABufferAddress
                                call    AdjustMemToValue
                                call    DoMOVRegMem
                                jmp     Return

          2 - So, we execute AdjustMemToValue, which internally calls to
             DoMOVMemValue and later to DoADDMemValue. Inside DoMOVMemValue
             we arrive to:
                                call    DoPUSHValue
                                call    DoPOPMem
                                jmp     Return

          3 - Executing DoPUSHValue, we arrive here:
                                call    GiveMeABufferAddress
                                call    AdjustMemToValue
                                call    DoPUSHMem
                                jmp     Return

              Buf! We are very deep now in recursive instances of that
              functions. After a while, it'll exit from AdjustMemToValue, and
              maybe it called other recursive functions that again called
              AdjustMemToValue, so that's why we have to control the number
              of recursive calls, since it can generate an enormous amount
              of code without a great effort. After doing DoPUSHValue, it
              executes DoPOPMem which strategically isn't recursive (and
              logically too, since there isn't a DoMOVMemMem instruction).

          4 - After returning completely from AdjustMemToValue, the function
             DoMOVRegMem will be executed. This function can also be very deep
             in recursivity calls, for example:
                                call    DoPUSHMem
                                call    DoPOPReg
                                jmp     Return

             We know that DoPUSHMem doesn't make recursive calls, but the next
             function, DoPOPReg, can make them, for example:
                                call    GiveMeABufferAddress
                                call    DoPOPMem
                                call    DoMOVRegMem
                                jmp     Return

             Again, more recursive calls :).

After this, you can see how powerful is the recursive code generation, and how
a simple MOVing can derive in a quite complex set of assignations from to
memory/registers, giving as a final result the desired value in the desired
register. Many functions can be done in this way, and later we'll see its
application to make garbage.


ÚÄÄÄÄÄÄ---úú                    .
| 2. Don't give a chance to AVs |
`                    úú--ÄÄÄÄÄÄÄÙ
But even the most recursive engine in the world can make all the work
worthless if it's detected heuristically because it put a strange instruction
or a quite common polymorphic structure, like:
                        JMP Next
                  Subroutine:
                        ...
                        ret
                  Next: call    Subroutine
or similars, because no normal application does that.

 2.1 Coherent decryptor structures
 ---------------------------------
What to do, then? It's easy: just have an array of "inconcluded calls", I
mean: you code the CALL instruction, but you don't code the subroutine yet.
Then, you save the address of this instruction in an array and upon a random
decision, or at the end of the decryptor, the engine generates the subroutines
and completes the CALLs that are in the array to make them point to the new
generated subroutines. Also a good way is pre-generate some subroutines before
the entrypoint of the decryptor, and make calls to them (combining this type
of CALLing with the "inconcluded call" type).

With this, the decryptor looks more like a compiler-generated application,
at least using CALL instructions. Other very polwerful thing is using a stack
frame inside the generated subroutines: you make PUSH EBP / MOV EBP,ESP at
the beginning and POP EBP at the end of the subroutine and then you can't
determine in a first sight if the decryptor is a product of a compiler or not.
Even better if you use that stack frame to retrieve values! :)

Other thing: avoid this:
                JMP Label
                x Random bytes
             Label:

Do you think it's normal to find this in a normal application? Then, the
emulator thinks the same :). Avoid this and avoid inserting direct random
data without being linked to any direct instruction.

 2.2 Opcodes to avoid
 --------------------
Have you ever scanned a virus that, although extremely polymorphic, inserts
one-byte instructions like CMC, STI, etc.? If you have tried it with AVP, for
example, you maybe noticed that automatically the antivirus enters in deep
scan. Why? Because it's HIGHLY suspicious to use these instructions. Who uses
CMC nowadays? And not only that, since the random generator trends to insert
quite a lot of this dummy instructions, and the antivirus emulator knows that
fact, so when it founds a more or less big number of these instructions (and
some instructions directly, even if there is only one), it decides that the
file is so suspicious that maybe worth the time to enter in deep scan. Maybe
it doesn't found anything, but an average user can think that that file has
anything more than the original application.

This advide is also for some 16 bits instructions under win32 applications.
While coding the TUAREG engine, I put nearly all the instructions the garbage
generator could generate to use 8, 16 or 32 bits, and then, when scanning with
AVP, the emulator switched always to deep scan. After thinking about it, I
removed the generation of some 16 bits instructions and AVP didn't make that
again. I don't know which instructions make AVP to activate that heuristic
flag, but invariably I recommend to use as less as possible 16 bits
instructions.


ÚÄÄÄÄÄÄ---úú                     .
| 3. Advanced garbage generation |
`                     úú--ÄÄÄÄÄÄÄÙ
Now, we are going to enter in one of my favorite subjects: garbage generation.
My opinion is that the main power of a polymorphic engine is the function to
generate garbage, since the garbage is the code that makes the emulators to
give up tracing or help them to determine the nature of the program. So, the
more normal the garbage seems, the less suspicious the decryptor seems, and
the more complex the garbage is, the less an emulator can enter into the
decryptor to emulate. Let's see some types, although there aren't all
(obviously I'm not going to explain the easy ones). Imagination also counts!

  3.1 Memory accesses
  -------------------
Nowadays is a must to have it in an engine, if one wants to make it complex.
Which type of application doesn't make any random-access memory operation in
the first 300 bytes? Only a very weird program or an infected program with an
attached polymorphic virus that doesn't implement memory writing instructions
can do this (besides decryption process).

But the fact is that, while in MS-DOS we had all the memory accesible, and we
could read from everywhere, this isn't true for win32, and an attempt to read
from "everywhere" will cause, in the great majority of times, an exception.
Writing is much more restrictive under win32, because we can only write on
the sections we defined as WRITABLE on the PE header (although we want to
generate an exception, of course :). So, we can use some tricks to have frames
of memory to read and/or write indiscriminatedly.

In win32 executables, we have a section that exists in nearly all them: the
".bss" section. This section has a physical size (in file size) of 0, but
virtually can be quite big (its size is normally 1000h bytes at least, but in
huge programs can arrive to 64K or more). We can use that section to read
and/or write anything we want, but always if we make our virus to execute at
first, not doing Entry-Point Obscuring and such things, since the application
would set all the void data in the section to whatever it needs. There is
another solution, and is to use the void holes in the virus that we use to
retrieve, for example, the current directory with GetCurrentDirectory or
similar functions. Since we don't need that fields until the virus is
executing, we can use that holes, if they are big enough, as frames of memory
in the same way as .bss, where we can read and write things.

So, once we have that frames, and we are sure that at least 256 or 512 bytes
are free to do animalities :), we can code a function to retrieve a random
memory address, for example:

                        call    Random
                        and     eax, 0FCh
                        add     eax, [AddressOfMemoryFrame]
                        ret

That will return in EAX a memory address, which moreover is aligned in a
dword boundary.

 3.2 API calls
 -------------
OK, so after coding a good decryptor structure and memory accesses, to fuck
up completely any initial suspiciousness about the "undesired traveller", we
insert API calls.

It's not easy to put them, and we only can call those ones that we know how
to call, so we have to have information about all them (there isn't a
"generic" way of calling them), and moreover we need to find and scan the
import directory of the victim host while infecting it, so we have to deduce,
from the virtual address (since we only know that about the import directory
from the PE header), which physical address is, and then from the physical
address convert to virtual address to have the imported API calling address.

The method I follow is the next, assuming we have the host mapped in memory:

        1) We get the virtual address of the import directory, which is
          located at PE_Header+80h.

        2) Now, we scan all sections of the file to find in which section
          that virtual address will be.

        3) After finding the section, we subtract the virtual address of the
          section to the virtual address of the import, to get the relative
          position of the import directory in that section, and we add the
          result to the physical address of the section to get the physical
          address of the import.

        4) Now, we save the values we have obtained in the process, and we
          begin to scan the mapped import directory as if it were virtual,
          since later the program will be in that way in memory.

        5) We scan the imported modules and we look for known functions, but
          taking in account that every time we get an RVA first we have to
          convert it to physical (this applies while getting values from the
          array of RVAs to the names of the functions), so, having the RVA,
          we subtract the RVA of the section and we add the physical address
          of that section, so we get the physical address of the name of the
          function.
        6) Then, when we find the desired functions, we get the order in the
          array of imported addresses. We add to that number the virtual
          address of that array in order to get the virtual address where the
          imported address will be stored.
        7) We save that number, and we continue searching for more functions.

After that, we get the addresses to the import where the virtual addresses to
the functions will be stored. Now, a call like CALL [Obtained_Address] will
make a call to the API. Just be careful with the parameters and with the
functions were a buffer is required.

Another thing: as Micro$oft programmers are dumb or worse, there are functions
that can hang the application, like GetModuleHandleA. I've tried to pass to it
a random pointer as the module name to get the handle of, but then an
exception occurs, instead of returning an error telling "no valid string" or
"module not found" or something like that, so be careful with some functions.

 3.3 Recursive garbage functions
 -------------------------------
Before we saw the potential of recursive calls to make things, and now we
apply that to garbage. There are some types of garbage that we can make (and
we must make) in a recursive way, like CALLs, random loops and some more. Here
I'll explain CALLs and random loops.

Before I explained a method of how to make CALLs without making suspicious
structures, and I told that, later, at the end of the decryptor (for example),
you can construct the subroutines that are called by that CALLs. To make the
subroutines, we should use recursivity, so we have to code the DoGarbage
function in a recursive way, I mean, the function can be called by itself.
This has to be made to make better garbage inside the CALLs, and moreover in
that way in that subroutines we can have other CALLs, also. But be careful,
because something like this can happen:

                Subroutine1:
                        ...
                        call    Subroutine2
                        ...
                        ret

                Subroutine2:
                        ...
                        call    Subroutine1
                        ...
                        ret

That situation can produce a hang of the decryptor, and thus the application
never executes. To avoid that, we can use arrays to store "levels" of calls,
in this way:

CallsLevel1     db      x dup (?)
CallsLevel2     db      x dup (?)
CallsLevel3     db      x dup (?)
...

When we enter in the CALL generation part of the engine, we must increase a
variable, like an internal recursivity level, to control the level of call
that we are generating now, in a way that makes that Level 1 calls never will
generate a call to Level 1 or upper, and the same for Level 2 and so on. In
thas way we avoid situations like the one above, since a subroutine won't
call other subroutine which will drive execution to the same subroutine again.

Other recursive garbage generation is the random loop generation. We can put
little, annoying loops that do nothing but that, loop (just seven or eight
times, nothing that last very long). Since a random loop inside a random loop
is equal to a geometrical increment of the loop duration, better if we avoid
them, having a variable telling "I'm in a loop now, so don't make another".
The same applies for CALLs, since we code them later, not in that moment, and
maybe we put another loop inside the generated subroutine, which would produce
the same problem. So, when generating looping code, avoid making other loops
or calls. To make the loop, of course, we call to DoGarbage to fill some
looping space (a void loop isn't normal, you know).

And, as you could deduct, maybe, we can use DoMOVRegValue and all those
functions that we coded to generate more garbage: just take a garbage register
and a random number and use that functions.

ÚÄÄÄÄÄÄ---úú    .
| 4. Last words |
`    úú--ÄÄÄÄÄÄÄÙ
Well, this article is shorter than I expected, but I hope it'll be useful for
you to bring you ideas while coding your new polymorphic engine. The great
majority of ideas I expose here have been used in the TUAREG engine, and
sometimes in the source code of the TUAREG I refer to this article to get the
explanation of some techniques. Laterz!

ÄÄÄÄÄ---úú
The Mental Driller / 29A
ÄÄÄÄÄ---úú