ЪДДДДДДДДДДДДДДДДДДДДДї
 і 32 bit optimization і                                    Billy BelcebЈ/DDT
 АДДДДДДДДДДДДДДДДДДДДДЩ

 Ehrm... Super should do this instead me, anyway, as i'm his pupil, i'm gonna
 write here  what i have learnt  in the  time while i am  inside Win32 coding
 world. I will guide  this tutorial through  local  optimization  rather than 
 structural optimization, because this is up to you and your style (for exam-
 ple, personally i'm *VERY* paranoid  about the stack and delta offset calcu-
 lations, as you  could  see in my codes, specially in Win95.Garaipena). This
 article is full of my own  ideas and  of  advices that  Super  gave to me in
 Valencian meetings. He's  probably the best  optimizer in  VX world ever. No
 lie. I won't discuss here how  to optimize to the max as he does. No. I only
 wan't to make you see the most obvious optimizations that could be done when
 coding for Win32, for example. I won't comment the VERY obvious optimization
 tricks, already explained in my Virus Writing Guide for MS-DOS.

 % Check if a register is zero %
 ДДДДДДДДДДДДДДДДДДДДДДДДДДДДДДД

 I'm sick of see the same  always, specially in Win32 coders, and this is re-
 ally killing me  slowly and very painfully. No, no, my mind can't assimilate
 the idea of a CMP EAX,0 for example. Ok, let's see why:

        cmp     eax,00000000h                   ; 6 bytes
        jz      bribriblibli                    ; 2 bytes (if jz is short)

 Heh, i know life's a shit, and  you are wasting many code in shitty compari-
 sons. Ok, let't see how  to solve  this situation, with a code that does the
 same, but with less bytes.

        or      eax,eax                         ; 2 bytes
        jz      bribriblibli                    ; 2 bytes (if jz is short)

 And there is a  way to do  this even more  optimized, anyway it's okay if it
 doesn't matter where  should be the content of EAX (after what i am going to
 put here, EAX content will finish in ECX). Here you have:

        xchg    eax,ecx                         ; 1 byte
        jecxz   bribriblibli                    ; 2 bytes (if it is short)

 Do you see? No  excuses  about "i don' t optimize because i lose stability",
 because with this  tips you will  optimize  without  losing anything besides
 bytes of code ;) Heh, we  passed from a  8  bytes routine to 3 bytes... Heh?
 what do you say about it? Hahahaha.

 % Check if a register is -1 %
 ДДДДДДДДДДДДДДДДДДДДДДДДДДДДД

 As many APIs in Ring-3 return you a value of -1 (0FFFFFFFFh) if the function
 failed, and as  you should compare  if it  failed, you must compare for that
 value. But there is  the same  problem  as before, many many people do it by
 using CMP EAX,0FFFFFFFFh and it could be done more optimized...

        cmp     eax,0FFFFFFFFh                  ; 6 bytes
        jz      insumision                      ; 2 bytes (if short)

 Let's do it as it could be more optimized:

        inc     eax                             ; 1 byte
        xchg    eax,ecx                         ; 1 byte
        jecxz   insumision                      ; 2 bytes (if short)
        dec     ecx                             ; 1 byte 

 And another thingy could be this:

        inc     eax                             ; 1 byte
        jz      insumision                      ; 2 bytes
        dec     eax                             ; 1 byte

 Heh, maybe it  occupies more lines, but occupies less  bytes so far (4 bytes
 against 8).

 % Clear a 32 bit register and move something to its LSW %
 ДДДДДДДДДДДДДДДДДДДДДДДДДДДДДДДДДДДДДДДДДДДДДДДДДДДДДДДДД

 The most  clear example is  what all  viruses  do when loading the number of
 sections  of PE file in AX (as this value occupies 1 word in the PE header).
 Well, let's see what do the majority of VX:

        xor     eax,eax                         ; 2 bytes
        mov     ax,word ptr [esi+6]             ; 4 bytes

 I'm still  wondering why all  VX use  this "old" formula, specially when you
 have a 386+ instruction  that avoids us to  make register  to be zero before
 putting the word in AX. This instruction is MOVZX.

        movzx   eax,word ptr [esi+6]            ; 4 bytes

 Heh, we avoided 1 instruction of 2 bytes. Cool, huh?

 % Calling to an address stored in a variable %
 ДДДДДДДДДДДДДДДДДДДДДДДДДДДДДДДДДДДДДДДДДДДДДД

 Heh, this is  another thing  that  some  VX do, and makes me to go crazy and
 scream. Let me remember it to you:

        mov     eax,dword ptr [ebp+ApiAddress]  ; 6 bytes
        call    eax                             ; 2 bytes

 We can  call to an  address directly guys... It  saves bytes and doesn't use
 any register that could be useful for another things.

        call    dword ptr [ebp+ApiAddress]      ; 6 bytes

 Another time  again, we are saving  an unuseful, and not needed instruction,
 that occupies 2 bytes, and we are making exactly the same.

 % Fun with push %
 ДДДДДДДДДДДДДДДДД

 Almost the same as above, but with push. Let's see what to don't do and what
 to do:

        mov     eax,dword ptr [ebp+variable]    ; 6 bytes
        push    eax                             ; 1 byte

 We could do the same with 1 byte less. See.

        push    dword ptr [ebp+variable]        ; 6 bytes

 Cool, huh? ;) Well, if we  need  to push many times (if the value is big, is 
 more optimized if you push that value 2+ times, and if the value is small is
 more optimized to push it when you need to push the value 3+ times) the same
 variable is more optimized  to put  it in a register, and push the register.
 For  example, if we  need to push zero 3  times, is more optimized  to xor a
 register with itself and later push the register. Let's see:

        push    00000000h                       ; 2 bytes
        push    00000000h                       ; 2 bytes
        push    00000000h                       ; 2 bytes

 And let's see how to optimize that:

        xor     eax,eax                         ; 2 bytes
        push    eax                             ; 1 byte
        push    eax                             ; 1 byte
        push    eax                             ; 1 byte

 Another thing passes  while using  SEH, as we  need to push fs:[0] and such
 like. Let's see how to optimize that:

        push    dword ptr fs:[00000000h]        ; 6 bytes
        mov     fs:[0],esp                      ; 6 bytes
        [...]
        pop     dword ptr fs:[00000000h]        ; 6 bytes

 Instead that we should do this:

        xor     eax,eax                         ; 2 bytes
        push    dword ptr fs:[eax]              ; 3 bytes
        mov     fs:[eax],esp                    ; 3 bytes
        [...]
        pop     dword ptr fs:[eax]              ; 3 bytes

 Heh, seems a silly thing, but we have 7 bytes less! Whoa!!!

 % Get the end of an ASCIIz string %
 ДДДДДДДДДДДДДДДДДДДДДДДДДДДДДДДДДДД

 This is very useful, specially  in our API search engines. And of course, it
 could  be done more  optimized rather  than the  typical way in all viruses.
 Let's see:

        lea     edi,[ebp+ASCIIz_variable]       ; 6 bytes
 @@1:   cmp     byte ptr [edi],00h              ; 3 bytes
        inc     edi                             ; 1 byte
        jz      @@2                             ; 2 bytes
        jmp     @@1                             ; 2 bytes
 @@2:   inc     edi                             ; 1 byte

 This same code could be very reduced, if you code it in this way:

        lea     edi,[ebp+ASCIIz_variable]       ; 6 bytes
        xor     al,al                           ; 2 bytes
 @@1:   scasb                                   ; 1 byte
        jnz     @@1                             ; 2 bytes

 Hehehe. Useful, short and good looking. What else do you need? ;)

 % Multiply shitz %
 ДДДДДДДДДДДДДДДДДД

 For example, while  seeing the code for  get the last section, the code most
 used includes this (we have in EAX the number of sections - 1):
       
        mov     ecx,28h                         ; 5 bytes
        mul     ecx                             ; 2 bytes

 And this  saves the result in EAX, right? Well, we have a much better way to
 do this, with an only one instruction:

        imul    eax,eax,28h                     ; 3 bytes

 IMUL stores in the first register indicated the result, result that is given
 to us multiplying the  second register indicated  with the third operand, in
 this case, it's an  immediate. Heh, we saved  4 bytes of  substituing only 2
 instructions of code!

 % Infection mark %
 ДДДДДДДДДДДДДДДДДД

 It should work, anyway i'm not sure, because it doesn't in my computer. Pff,
 maybe an intel bug, or my system is crazy or something. Not sure, but anyway
 try it, as it is very interesting. Look how it should be unoptimized:

        cmp     dword ptr [esi+44h],"MARK"      ; 7 bytes
        jz      oro_y_grana                     ; 2 bytes
        mov     dword ptr [esi+44h],"MARK"      ; 7 bytes

 Optimized, this should  be  in this way (i already said that it SHOULD work,
 but it doesn't in my PC):

        mov     eax,"MARK"                      ; 5 bytes
        cmpxchg dword ptr [esi+44h],eax         ; 4 bytes
        jz      oro_y_grana                     ; 2 bytes

 Pfff, a really good optimization, 16 bytes reduced to 11 bytes ;)

 % UNICODE to ASCIIz %
 ДДДДДДДДДДДДДДДДДДДДД

 There are many to do here. Specially done for Ring-0 viruses, there is a VxD
 service for  do that, firstly i'm gonna  explain how to do  the optimization
 based in the use of this service, and finally i'll show Super's method, that
 saves TONS of  bytes. Let's see  the typical  code (assumming EBP  as ptr to 
 ioreq structure and EDI pointing to file name:

        xor     eax,eax                         ; 2 bytes
        push    eax                             ; 1 byte
        mov     eax,100h                        ; 5 bytes
        push    eax                             ; 1 byte
        mov     eax,[ebp+1Ch]                   ; 3 bytes
        mov     eax,[eax+0Ch]                   ; 3 bytes
        add     eax,4                           ; 3 bytes
        push    eax                             ; 1 byte
        push    edi                             ; 1 byte
@@3:    int     20h                             ; 2 bytes
        dd      00400041h                       ; 4 bytes

 Well, particulary only 1 improve could  be done to that code, substitute the
 third line with this:

        mov     ah,1                            ; 2 bytes

 Heh, but i  said that Super improved  this to the max. I  haven't copied his
 code to get the ptr to the unicode name of file, because is almost ununders-
 tandable, but i catched the  concept. Assumptions are  EBP  as ptr  to ioreq
 structure and buffer as a 100h bytes buffer. Here goes some code:

        mov     esi,[ebp+1Ch]                   ; 3 bytes
        mov     esi,[esi+0Ch]                   ; 3 bytes
        lea     edi,[ebp+buffer]                ; 6 bytes
 @@l:   movsb                                   ; 1 byte  Дї
        dec     edi                             ; 1 byte   і This loop was 
        cmpsb                                   ; 1 byte   і made by Super ;)
        jnz     @@l                             ; 2 bytes ДЩ

 Heh, the first of all routines (without local optimization) is 26 bytes, the
 same with that local  optimization is  23 bytes, and  the last  routine, the
 structural optimization is 17 bytes. Whoaaaa!!! 

 % VirtualSize calculation %
 ДДДДДДДДДДДДДДДДДДДДДДДДДДД

 This title is an excuse for show you another strange opcode, very useful for
 VirtualSize calculations, as we have to add to it a value, and get the value
 that was there before our addition. Of course, the opcode i am talking about
 is XADD. Ok, ok, let's see the unoptimized VirtualSize calculation (i assume
 ESI as a ptr to last section header):

        mov     eax,[esi+8]                     ; 3 bytes
        push    eax                             ; 1 byte
        add     dword ptr [esi+8],virus_size    ; 7 bytes
        pop     eax                             ; 1 byte

 And let's see how it should be with XADD:

        mov     eax,virus_size                  ; 5 bytes
        xadd    dword ptr [esi+8],eax           ; 4 bytes

 With XADD we saved 3 bytes ;) Btw, XADD is a 486+ instruction.

 % Setting STACK frames %
 ДДДДДДДДДДДДДДДДДДДДДДДД

 Another Ring-0 thingy. Let's see it unoptimized:

        push    ebp                             ; 1 byte
        mov     ebp,esp                         ; 2 bytes
        sub     esp,20h                         ; 3 bytes

 And if we optimize...

        enter   20h,00h                         ; 4 bytes

 Charming, isn't it? ;)

 % Tips & tricks %
 ДДДДДДДДДДДДДДДДД

 Here i will  put unclassificalble tricks for  optimize, or if i assumed that
 you know them while making this article ;)

 - Never use JUMPS directive in your code. 
 - Use string operations (MOVS, SCAS, CMPS, STOS, LODS)
 - Use LEA reg,[ebp+imm32] rather than MOV reg,offset imm32 / add reg,ebp
 - Make your assembler pass many times over the code (in TASM, /m5 is good)
 - Use the STACK, and avoid as much as possible to use variables
 - Try to avoid use AX,BX,CX,DX,SP,SI,DI and BP, as they occupy 1 byte more
 - Many operations (logical ones specially) are optimized for EAX/AL register
 - Use CDQ for clean EDX if EAX is lower than 80000000h
 - Use XOR reg,reg or SUB reg,reg for make a register to be zero
 - For bit operations use the "family" of BT (BT,BSR,BSF,BTR,BTF,BTS)
 - Use XCHG instead MOV if the register order doesn't matter
 - While pushing all values of IOREQ structure, use a loop
 - Use the HEAP as much as possible (API addresses, temp infection vars, etc)
 - If you like, use conditional MOVs (CMOVs), but they are 586+
 - If you know how to, use the coprocessor (its stack, for example)
 - Use SET family of opcodes for use kinda semaphores in yer code

 % Final words %
 ДДДДДДДДДДДДДДД

 I expect you understood at least the first optimizations put in this article
 because they are the  ones that make me go mad. I know  i am not the best at
 optimization, neither one of  them. For me, the size doesn't matter. Anyway,
 the obvious optimizations must be done, at least for demonstrate you know to
 something in your life. Less  unuseful bytes means  a better  virus, believe
 me. And don't come to me using the same words that QuantumG used in his Next
 Step virus. The  optimizations i showed  here WON'T make your  virus to lose
 stability. Just try to use them, ok? It's very logic, guyz. 

 Billy BelcebЈ,
 mass killer and ass kicker.