                                                      ▄█████▄ ▄█████▄ ▄█████▄
      ┌─ Optimization of 32bit code ─┐                ███ ███ ███ ███ ███ ███
      │              by              │                 ▄▄▄██▀ ▀██████ ███████
      └──────── Benny / 29A ─────────┘                ███▄▄▄▄ ▄▄▄▄███ ███ ███
                                                      ███████ ██████▀ ███ ███


 ┌───────────────┐
 │ 1. Disclaimer │
 └───────────────┘

 The followin' document is for educational purposes only. The author isn't
 responsible for any misuse of the things written in this document.


 ┌─────────────┐
 │ 2. Foreword │
 └─────────────┘

 Eeeh, why da fuck did I write this article ? There r many documents about
 optimization. Yes, that's true, and there r many very gewd and kewl tutes
 [* Billy, your tute rox! *]. But as u can see, not every tute keeps in
 mind that the term "optimize" doesn't only mean makin' your code
 smaller. There r many aspects of optimization and I wanna discuss them here
 and give u a complete view of the thing.
 When I started to write this article, I was really drunk and totally under
 the drugs (hehe, no lie X-D), so if u feel I made any mistake, or u think
 things written here aren't true, or u simply wanna give me some credits (do
 it please X-D), u can find me on IRC UnderNet, channels #vir and/or #virus,
 or mail benny@post.cz. Thanx for all positive (and also negative) comments.

 
 ┌─────────────────┐
 │ 3. Introduction │
 └─────────────────┘

 As I said a few seconds ago, optimization has many aspects.
 Generally, we can optimize our code so that:
        -       code will be smaller
        -       code will be faster
        -       code will be smaller and faster


 Well, it gives us some new space for thinkin'. If we optimize our code:
        -       code will be smaller, but also slower
        -       code will be bigger, but faster
        -       code will be smaller and faster


 We should find a compromise between the first and second point (if we can't
 reach the third one). I'm sure u don't wanna alert the user by slowin' down
 system performance due to:
        -       huge and unoptimized code
        -       small, but slow code

 or alert user by rapidly decreasin' space on the disk.


 It's up to us which way we choose. Here we have a clue:
        -       if our code (or block of code, e.g. thread procedure) is
                small, we should optimize it for speed
        -       if our code (or block of code) is big, we should optimize
                it for both size and speed (find a compromise, prefer speed)

 Ideally, we should optimize our code by decreasin' its size and increasin'
 its speed at once, but u know how difficult that is.


 Is it clear ? I think u already knew this. But there r still many more
 aspects of optimization. For example, we can have two instructions that do
 the same thing, but:
        -       one instruction is bigger
        -       one instruction is slower
        -       one instruction changes other registers
        -       one instruction writes to memory
        -       one instruction changes flags
        -       one instruction is faster on one processor, but slower on
                another one


 Example:       LODSB             MOV AL, [ESI] + INC ESI
                -----------------------------------------
     size:      smaller           bigger
     speed:     faster on 80386   faster on 80486, on Pentium only 1 cycle
     flags:     preserved         changed


 And why is LODSB faster on 80386, and why does MOV + INC take only 1 cycle
 on Pentium ? Pentium is a superscalar processor with two integer pipelines,
 so it can execute a pair of simple integer instructions simultaneously, one
 in each pipe. Two instructions that can be executed simultaneously r
 called "pairable instructions".

 Hehe, don't worry, this article won't be about Pentium processor
 architecture, so u can forget what I said about pipes. Maybe l8r, if I
 write another article about Pentium processor optimization, I will explain
 in more detail terms such as pipes, V-pipe, U-pipe, pairin' and so on. For
 now, u can forget them. Just remember what the word "pairin'" means.


 Now I will discuss every optimization technique, step by step.


 ┌─────────────────────────┐
 │ 4. Optimizin' our code  │
 └─────────────────────────┘

 Well, let's go optimize. I will start from the easiest operation.
 Beginners, hold on...


 4.1. Zero register
────────────────────

        I don't wanna see this anymore:

        1)      mov eax, 00000000h                    ;5 bytes

        This is the worst instruction I've ever seen. Well, it seems
        logical to move zero into the register, but u can do it in a
        more optimized way, like this:

        2)      sub eax, eax                          ;2 bytes

                     or

        3)      xor eax, eax                          ;2 bytes

        3 bytes saved on one instruction, great ! X-D But what's better
        to use, SUB or XOR ? I prefer XOR, coz Micro$oft prefers SUB and I
        know that Windozes r slooooow, hehe. Noo, that's not the true reason.
        What do u think is easier (for u): to subtract two numbers, or to say
        "where both bits r 1, write 0" ? So u know why I prefer XOR (as I hate
        mathematix X-D).


 4.2. Test if register is zero
───────────────────────────────

        Hmmm, let's see the brightest solution:

        1)      cmp eax, 00000000h                    ;5 bytes
                je _label_                            ;2/6 bytes (short/near)

        [* NOTE: Many arithmetic instructions r optimized for register EAX,
        so code usin' the EAX register will be faster and smaller.
        Example: CMP EAX, 12345678h (5 bytes). If I used another register
        instead of EAX, the CMP instruction would have 6 bytes *]

        Argh! Who normal can do this ? That's 7 or 11(!) bytes for a simple
        comparison. No, no, no, don't do it and try this:

        2)      or eax, eax                           ;2 bytes
                je _label_                            ;2/6 (short/near)

                    or

        3)      test eax, eax                         ;2 bytes
                je _label_                            ;2/6 (short/near)

        Hmm, much better, 4/8 bytes is really better than 7/11 bytes. So,
        again, what's better, OR or TEST ? Micro$oft prefers OR, so again, I
        prefer TEST |-). Now seriously, TEST doesn't write to the register (OR
        does), so there will be better pairin' => faster code. I hope u still
        remember what the word "pairin'" means... If not, read the
        Introduction section again.

        Now, the biggest magic. If u don't care about the ECX register, or u
        don't care which of the two registers (EAX and ECX) ends up holdin'
        which value, u can do it this way:

        4)      xchg eax, ecx                         ;1 byte
                jecxz _label_                         ;2 bytes

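        [* NOTE: XCHG has a short form for the EAX register, so if one of
        its operands is EAX, it will be 1 byte long, otherwise 2 bytes.
        JECXZ exists only in the SHORT form, so _label_ must be within
        -128..+127 bytes *]

        Great! We optimized our code and saved 4 bytes compared to the first
        version.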


 4.3. Test if register is 0FFFFFFFFh
─────────────────────────────────────

        Many APIs return -1 when the function fails, so it is important to
        test for this value. I'm always astonished when I see how some
        coders test for this value, like this:

        1)      cmp eax, 0ffffffffh                   ;5 bytes
                je _label_                            ;2/6 bytes

        I hate this. And now look, how can it be optimized:

        2)      inc eax                               ;1 byte
                je _label_                            ;2/6 bytes
                dec eax                               ;1 byte

        Yes, yes, yes, we saved 3 bytes and made code faster ;)
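
        Why does this work ? INC wraps 0FFFFFFFFh around to 0 and sets the
        zero flag, and no other 32bit value does that. If u wanna convince
        yourself without a debugger, here is a minimal Python sketch of that
        reasonin'. It is only a model of the register, not real CPU code, and
        the names inc32 and is_minus_one r made up for this example:

```python
MASK32 = 0xFFFFFFFF  # a 32-bit register wraps around at 2**32


def inc32(value):
    """Model INC on a 32-bit register: return (result, zero_flag)."""
    result = (value + 1) & MASK32
    return result, result == 0


def is_minus_one(eax):
    """INC EAX / JE is taken exactly when EAX held 0FFFFFFFFh before."""
    _, zero_flag = inc32(eax)
    return zero_flag


if __name__ == "__main__":
    for eax in (0x00000000, 0x7FFFFFFF, 0xFFFFFFFE, 0xFFFFFFFF):
        print(f"{eax:08X}h -> JE taken: {is_minus_one(eax)}")
```

        The DEC EAX after the jump only restores the original value on the
        path where the test failed.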


 4.4. Move 0FFFFFFFFh to register
──────────────────────────────────
        
        Some APIs need as parameter -1 value. Let's see, how can we set it:

        Least optimized:

        1)      mov eax, 0ffffffffh                   ;5 bytes

        More optimized:

        2)      xor eax, eax / sub eax, eax           ;2 bytes
                dec eax                               ;1 byte

        Or this with same result (by Super/29A):

        3)      stc                                   ;1 byte
                sbb eax, eax                          ;2 bytes

        This code is very useful in some cases, such as:
                jnc _label_
                sbb eax, eax                          ;2 bytes only!
       _label_: ...
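
        The SBB trick relies on SBB EAX, EAX computin' EAX - EAX - CF, which
        is 0 when the carry flag is clear and -1 when it is set, whatever was
        in EAX before. A small Python model of that (an illustration only;
        sbb32 is a hypothetical helper name, not an instruction emulator):

```python
MASK32 = 0xFFFFFFFF  # keep results inside 32 bits


def sbb32(dst, src, carry):
    """Model SBB dst, src: dst - src - CF, wrapped to 32 bits."""
    return (dst - src - int(carry)) & MASK32


if __name__ == "__main__":
    for eax in (0, 1, 0x12345678, 0xFFFFFFFF):
        with_carry = sbb32(eax, eax, True)
        without_carry = sbb32(eax, eax, False)
        print(f"EAX={eax:08X}h  CF=1 -> {with_carry:08X}h  "
              f"CF=0 -> {without_carry:08X}h")
```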


 4.5. Zero register and move something to LSW
──────────────────────────────────────────────

        Example of unoptimized code:

        1)      xor eax, eax                          ;2 bytes
                mov ax, word ptr [esi+xx]             ;4 bytes

        386+ supports a new instruction called MOVZX (MOVe with Zero
        eXtension). [* NOTE: MOVZX is faster on 386, but slower on 486+ *]
        Example of optimized code, where we can save 2 bytes:

        2)      movzx eax, word ptr [esi+xx]          ;4 bytes

        Next example of "ugly code":

        3)      xor eax, eax                          ;2 bytes
                mov al, byte ptr [esi+xx]             ;3 bytes

        Now we can save valuable 1 byte X-D:

        4)      movzx eax, byte ptr [esi+xx]          ;4 bytes

        This is very effective, when u r readin' bytes/words from PE header.
        Becoz u need to work with bytes/words/dwords altogether, MOVZX is
        the best for this case.

        And last example:

        5)      xor eax, eax                          ;2 bytes
                mov ax, bx                            ;3 bytes

        Better use this formula, which discards 2 bytes:

        6)      movzx eax, bx                         ;3 bytes

        I use MOVZX every time I can. It is small and it isn't as slow
        as some other instructions.


 4.6. Push shit
────────────────

        Tell me, how will u store 50h to EAX...
        ----------------------------------------

        Badly:

        1)      mov eax, 50h                          ;5 bytes

        Better:

        2)      push 50h                              ;2 bytes
                pop eax                               ;1 byte

        Usin' PUSH and POP is a little slower, but smaller too. When the
        operand fits in one signed byte (-128..127), PUSH takes 2 bytes.
        Otherwise it takes 5 bytes.

        Let's try another thing. Push 7x 0 to stack...
        -----------------------------------------------

        Unoptimizely:

        3)      push 0                                ;2 bytes
                push 0                                ;2 bytes
                push 0                                ;2 bytes
                push 0                                ;2 bytes
                push 0                                ;2 bytes
                push 0                                ;2 bytes
                push 0                                ;2 bytes

        Optimizely, but still biggy X-D:

        4)      xor eax, eax                          ;2 bytes
                push eax                              ;1 byte
                push eax                              ;1 byte
                push eax                              ;1 byte
                push eax                              ;1 byte
                push eax                              ;1 byte
                push eax                              ;1 byte
                push eax                              ;1 byte

        Compactly, but slower:

        5)      push 7                                ;2 bytes
                pop ecx                               ;1 byte
      _label_:  push 0                                ;2 bytes
                loop _label_                          ;2 bytes

        Wow, without any pain, we saved 7 bytes ;))
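
        The three variants scale differently with the number of zeros u
        push, so it can help to tally the sizes before choosin'. A rough
        Python sketch usin' the byte counts from the listings above
        (push_zero_sizes is a name invented for this example, and it assumes
        the count fits the 2 byte PUSH form, i.e. 1..127):

```python
def push_zero_sizes(count):
    """Code size in bytes of three ways to push `count` zero dwords."""
    if not 1 <= count <= 127:
        raise ValueError("count must fit the short PUSH form (1..127)")
    return {
        "push 0 repeated": 2 * count,           # 2 bytes per PUSH 0
        "xor eax + push eax": 2 + count,        # XOR (2) + 1 byte per PUSH
        "push n / pop ecx / loop": 2 + 1 + 2 + 2,  # fixed 7 bytes
    }


if __name__ == "__main__":
    for n in (2, 5, 7):
        print(n, push_zero_sizes(n))
```

        With these numbers the LOOP version only wins on size from 6 pushes
        up, and it trashes ECX, while the XOR version trashes EAX.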

        And now, life story... U wanna move something from one variable
        into another variable. All registers must be preserved.
        U probably do this:
        ----------------------------------------------------------------

        6)      push eax                               ;1 byte
                mov eax, [ebp + xxxx]                  ;6 bytes
                mov [ebp + xxxx], eax                  ;6 bytes
                pop eax                                ;1 byte

        And now, usin' only stack, no registers:

        7)      push dword ptr [ebp + xxxx]            ;6 bytes
                pop dword ptr [ebp + xxxx]             ;6 bytes

        This is useful, when u haven't any register free to use. I use it,
        when I wanna save old entrypoint to another variable...

        8)      push dword ptr [ebp + header.epoint]   ;6 bytes
                pop dword ptr [ebp + originalEP]       ;6 bytes

        This saves wonderful 2 bytes |-). Though it is a little slower than
        normal manipulation via EAX (without savin' it), it still comes in
        handy when u don't wanna (or can't) use any register.


 4.7. Multiply fun
───────────────────
                
        Tell me, how u will calculate offset of last section, when u have
        in EAX number_of_sections-1 ?

        Badly:

        1)      mov ecx, 28h                          ;5 bytes
                mul ecx                               ;2 bytes

        Better:

        2)      push 28h                              ;2 bytes
                pop ecx                               ;1 byte
                mul ecx                               ;2 bytes

        Much better:

        3)      imul eax, eax, 28h                    ;3 bytes

        What does IMUL do ? This form multiplies the second operand by the
        third one and stores the result in the first operand (EAX here).
        Unlike MUL, it leaves EDX alone. So u can multiply EBX by 28h and
        store the result to EAX like this:

        4)      imul eax, ebx, 28h                    ;3 bytes

        Simple and effective (in size and in speed). I don't wanna imagine
        how u would do this with the MUL instruction... X-D


 4.8. Stringz in action
────────────────────────

        I wanna jump into the wall when I see unoptimized string operations.
        Here u have some hints, how can u optimize your code usin' string
        instructions. Do it please, or I will really do it ! X-D

        Startin' from the scratch, how can u load a byte ?
        ---------------------------------------------------

        Faster:

        1)      mov al, [esi]                         ;2 bytes
                inc esi                               ;1 byte

        Smaller:

        2)      lodsb                                 ;1 byte

        I recommend usin' the *Smaller* version. It is a one-byte instruction
        that does the same job as the *Faster* version (LODSB leaves the
        flags alone and follows the direction flag). LODSB is faster on
        80386, but much slower on 80486+. On Pentium, *Faster* takes one
        cycle due to pairin'. However, I think the best to use is still the
        *Smaller* version.

        And how can u load a word ? Ehrm, DO NOT load words, it's too
        slow in a 32bit environment such as Win32. But if u seriously wanna
        load it, here is the clue...
        -----------------------------------------------------------------

        Faster:

        3)      mov ax, [esi]                         ;3 bytes  
                add esi, 2                            ;3 bytes

        Smaller:

        4)      lodsw                                 ;2 bytes

        Whata 'bout speed and size ? See previous description (LODSB).

        Aaaah, loadin' dwords is always funny. Look at this:
        -----------------------------------------------------

        Faster:

        5)      mov eax, [esi]                        ;2 bytes
                add esi, 4                            ;3 bytes

        Smaller:

        6)      lodsd                                 ;1 byte

        See description of LODSB.

        And next very useful thing... Movin' something from somewhere
        to somewhere. It's in fact LODSB/LODSW/LODSD + STOSB/STOSW/STOSD.
        Here is the example of MOVSD:
        ------------------------------------------------------------------

        Faster:

        7)      mov eax, [esi]                        ;2 bytes
                add esi, 4                            ;3 bytes
                mov [edi], eax                        ;2 bytes
                add edi, 4                            ;3 bytes

        Smaller:

        8)      movsd                                 ;1 byte

        *Faster* is faster on 486+, *Smaller* is smaller ;).
                                                     
        Finally, I would like to say that u should always load dwords instead
        of bytes or words, coz u run a 386+ processor, which is 32bit. I.e.
        your processor worx with 32 bits, so if u wanna work with one byte,
        the processor must load a dword and then truncate it. Aaaa, too much
        work, so if it's not necessary to use bytes/words, don't use them.

        Next fun... how can u get the end of string ?
        ----------------------------------------------

        Here is the JQwerty's method:

        9)      lea esi, [ebp + asciiz]               ;6 bytes
       s_check: lodsb                                 ;1 byte
                test al, al                           ;2 bytes
                jne s_check                           ;2 bytes

        And Super's method:

        10)     lea edi, [ebp + asciiz]               ;6 bytes
                xor al, al                            ;2 bytes
       s_check: scasb                                 ;1 byte
                jne s_check                           ;2 bytes

        Now, which is the best one ? Hmmm, hard to say the truth...X-D
        On older processors (80386+) Super's method is faster, but on
        Pentiums JQwerty's method is faster due to pairin'. Hehe, both
        methods have the same size, so choose whichever u like... |-)


 4.9. Complex aritmetix
────────────────────────

        Now my favourite stuff. It's a pity that this great technique hasn't
        found much use among VX coderz. However, the instructions I wanna
        talk about r WELL KNOWN (heh, but still, noone knows how smoothly
        they run and what more they can do), VERY SMALL and VERY FAST on
        every processor.

        Imagine u have a table of DWORDs. The pointer to the table is stored
        in EBX, the index into the table is in ECX. U wanna increment the
        ECX-th dword in the table, i.e. the one at address EBX+(4*ECX). U
        don't want to modify any register.
        U can do it this way (everybody does it):

        1)      pushad                                ;1 byte
                imul ecx, ecx, 4                      ;3 bytes
                add ebx, ecx                          ;2 bytes
                inc dword ptr [ebx]                   ;2 bytes
                popad                                 ;1 byte

        Or do it better (nobody does it):

        2)      inc dword ptr [ebx+4*ecx]             ;3 bytes

        This really rox !!! U save processor time (this is very fast), space
        in memory (very small, as u can see) and make your source code more
        readable !!! U saved 6 bytes with ONE simple INSTRUCTION !!!

        That's not all (not all for the INC instruction). Imagine another
        situation: EBX - pointer to memory, ECX - index into the table, and
        u wanna increment the ECX-th dword of a table that starts 4096 bytes
        further, so this: EBX+(4*ECX)+1000h. Yeah, and u wanna preserve all
        registers. U can do it in an unoptimized way like this:

        3)      pushad                                ;1 byte
                imul ecx, ecx, 4                      ;3 bytes
                add ebx, ecx                          ;2 bytes
                add ebx, 1000h                        ;6 bytes
                inc dword ptr [ebx]                   ;2 bytes
                popad                                 ;1 byte

        Or very optimizely...

        4)      inc dword ptr [ebx+4*ecx+1000h]       ;7 bytes

        Yahoooooo, we saved 8 bytes by one instruction (and we used IMUL
        instead of MUL), great !

        This magic worx with EVERY arithmetic instruction, not only INC.
        Imagine how much space u will save when u use this in
        instructions such as ADD, SUB, ADC, SBB, INC, DEC, OR, XOR, AND, etc.

        The biggest magic is comin' now. Hey guy, tell me, what does the LEA
        instruction do ? U probably know that it's the instruction we use for
        gettin' addresses of variables in a virus. But only some ppl know how
        to use this instruction really effectively.

        LEA stands for Load Effective Address. This name is a little
        misleadin'. Let's have a look at what LEA really does.

        Try to hardcode this:

                lea eax, [12345678h]

        What do u think will be in EAX after executin' this opcode ?
        Rite answer is 12345678h.

        Another example (EBP = 1):

                lea eax, [ebp + 12345678h]

        What will be in register EAX ? Right answer is 12345679h. Yes, on the
        least significant digit is 9h. So let's translate this instruction
        to "normal" language:

                lea eax, [ebp + 12345678h]            ;6 bytes
                ==========================
                mov eax, 12345678h                    ;5 bytes
                add eax, ebp                          ;2 bytes

        As u can see, LEA doesn't access memory at all. It only worx
        with its operands: it computes the address expression and stores the
        result into the first operand (EAX in our example). Now look at the
        sizes. Weird, it does exactly the same thing (well, not exactly, LEA
        preserves flags), but it is shorter. Let's show the whole magic...

        5) Look at this unoptimized stuff:

                mov eax, 12345678h                    ;5 bytes
                add eax, ebp                          ;2 bytes
                imul ecx, 4                           ;3 bytes
                add eax, ecx                          ;2 bytes

        6) Open your mouth and look at this:

                lea eax, [ebp+ecx*4+12345678h]        ;7 bytes

        Close your mouth now. LEA is shorter, faster (much faster), preserves
        flags and doesn't trash ECX. Look at it once again, we saved 5 bytes
        and processor time with one instruction (LEA is much faster on every
        processor).

        I won't explain every arithmetic instruction here, I think it
        wouldn't make sense, coz they all use the same syntax. U saw
        everything important, now u can use it. If u wanna use this
        technique, the only thing u have to keep in mind is the syntax:

                OPCODE <SIZE PTR> [BASE + INDEX*SCALE + DISPLACEMENT]
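
        To make the syntax above concrete: the address part is just
        BASE + INDEX*SCALE + DISPLACEMENT computed modulo 2^32, where SCALE
        can only be 1, 2, 4 or 8. Here is a small Python sketch of that
        calculation (a model for checkin' the examples in this section, with
        effective_address as a made-up name; it says nothin' about timin'):

```python
MASK32 = 0xFFFFFFFF  # addresses wrap around at 2**32


def effective_address(base=0, index=0, scale=1, disp=0):
    """Compute BASE + INDEX*SCALE + DISPLACEMENT like the CPU does."""
    if scale not in (1, 2, 4, 8):
        raise ValueError("scale must be 1, 2, 4 or 8")
    return (base + index * scale + disp) & MASK32


if __name__ == "__main__":
    # lea eax, [12345678h]
    print(hex(effective_address(disp=0x12345678)))
    # lea eax, [ebp + 12345678h] with EBP = 1
    print(hex(effective_address(base=1, disp=0x12345678)))
    # inc dword ptr [ebx+4*ecx+1000h] touches this address
    print(hex(effective_address(base=0x400000, index=3, scale=4, disp=0x1000)))
```

        LEA stores this number into the destination register, while the
        arithmetic instructions (INC, ADD, ...) use it as the address of
        their memory operand.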


 4.10. Delta offset optimization
─────────────────────────────────

        Naaah, u probably think I'm mad. If u, as a reader of this e-paper,
        aren't a beginner, u must know what da fuck delta offset is. However,
        I've seen many VX coderz who don't use delta offset effectively.
        If u have a look at my first viruses, u will see I also wasted the
        space it takes. And I wasn't alone. Let's see it in detail..


        [* Ehrm, let's have a pause. I think u must be tired from this
        BIIIIIG paper. I will tell ya something... A few minutes ago, I
        went out to buy a new cig-box (uuuh, too many drugs in my body now
        X-D). Hot, sunny weather changed a few moments ago to hot, windy
        weather, darky, total STOOOORM, but without any rain, I can see big
        lightnings,
        I like it. It's the best weather to have a minute for thinkin' about
        some things - girls, VX, friends, politix, ... I'm back now. I'm
        plug-inin' some very kewl CD with very kewl music, czech music. Now I
        can hear one very gewd song from one very gewd czech rock-group. Hehe,
        90% of their songs were written when they were totally doped. But wait,
        they r very gewd. Many things u can understand, only when u r doped.
        They r singin' (rite now X-D) about Earth. It's very slow song, it's
        like Indian music (but they also play hard rock, so hard, that Billy
        would like it. Hehe, I will bring this CD sometimes, when we will be
        on some meetin', somewhere, maybe. Billy, u will 100% like it, my
        friend ! X-D). Hmmm, I will tell ya now some lyrix... Very gewd lyrix,
        I hope, u will understand it, I will translate it for ya X-D...

                She defence on and on,
                there r ages, when someone like her,
                Both of nice and cruel,
                U can touch, she will give it also to u,
                Now it is waitin' for that step,
                which makes walk a fly,
                And when then, when, if not now ?????
                Politix can invent only atomic shit,
                let's kick it back to them,
                And when then, when, if not now ?????
                She defence on and on, ....

        Ooooh, my god, whata hell I'm doin' now ? Hehe, if u think, I'm mad,
        be sure it's truth X-DDD. Ok, ok, back to reality... *]


        So, again, let's look at that stuff.
        This is how delta offset is usually handled...

        1)      call gdelta
        gdelta: pop ebp
                sub ebp, offset gdelta

        That's the normal way (but less efficient). Let's look at how we
        work with it...

        2)      lea eax, [ebp + variable]

        Hmmm, if u look at it under some debugger, u will see the followin'
        line:

        3)      lea eax, [ebp + 401000h]              ;6 bytes

        In the first generation of the virus, the EBP register will be zero.
        Ok, but let's look what happens if u code this:

        4)      lea eax, [ebp + 10h]                  ;3 bytes

        Hmmm, weird. Sometimes it's 6 bytes, next time it's 3 bytes. It's
        normal. Many instructions have a SHORT form for values that fit in
        one signed byte (-128..127), e.g. SUB EBX, 3 will be 3 bytes long
        too. If u code SUB EBX, 1234h, it will have 6 bytes. Not only the SUB
        instruction, but also many other instructions.
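
        The rule behind these sizes is simple: if the constant fits in a
        signed byte, the short encodin' can be used. A tiny Python sketch of
        that rule, usin' only the sizes quoted in this article (the function
        names r hypothetical, and it assumes your assembler always picks the
        shortest form):

```python
def fits_in_signed_byte(value):
    """True if value can be stored as a sign-extended 8-bit constant."""
    return -128 <= value <= 127


def sub_reg_imm_size(value):
    """Size in bytes of SUB r32, imm for a register other than EAX."""
    return 3 if fits_in_signed_byte(value) else 6


def lea_ebp_disp_size(disp):
    """Size in bytes of LEA r32, [EBP + disp]."""
    return 3 if fits_in_signed_byte(disp) else 6


if __name__ == "__main__":
    print(sub_reg_imm_size(3), sub_reg_imm_size(0x1234))
    print(lea_ebp_disp_size(0x10), lea_ebp_disp_size(0x401000))
```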

        Look what happens if we use "another" way to get the delta
        offset...

        5)      call gdelta
        gdelta: pop ebp

        That's all ! As I said, in the first generation of the virus, EBP
        will be zero (with the previous version of gdelta) and variable will
        be e.g. 401000h. That's not good. What do u say, what if EBP held the
        401000h value (the address of gdelta) and the displacement was just
        the distance of the variable from gdelta ? Thanx to our new version
        of gdelta, we can use the SHORT version of LEA and so save 3 bytes on
        each variable access. Here is the sample...

        6)      lea eax, [ebp + variable - gdelta]    ;3 bytes

        We got it. The next thing we should do is put all initialized
        variables around the gdelta call. This makes it work (no more
        6 byte instructions, but 3 byte ones) - THIS IS REALLY IMPORTANT. If
        u don't do it, the variable will be somewhere FAR (ehrm, I wanted to
        say NEAR X-D) from gdelta, i.e. more than -128..+127 bytes away, so
        the SHORT version of LEA won't be used.
        Heh, u probably think that there is some trick, that it has some
        limitation or something like that, coz if this worked, everybody
        would use it. Don't worry, apart from that distance there isn't any
        limitation.
        And why da fuck does noone use it ? It's hard to answer. I can say
        that I don't know. Really don't know.

        [* Let me say my feelings. U probably know Super/29A. He is the best
        optimizer I and the VX world know. It's a fact. U probably also know
        JQwerty/29A. He is also a VERY GOOD optimizer, but noone says "Super
        and JQwerty r the best optimizers". I don't know why. I saw this
        delta offset handlin' first in his code, noone used it before him (I
        think). And it is soooo easy to use. If u look at Win32.Cabanas u
        will see MANY and MANY features. And it's only 2999 bytes !!! Who
        else than Super or JQwerty could code it ? I don't know. I only wanna
        say that "someone" forgot about the other kewl guy. *]

        My new virus uses this delta offset handlin' too, and I saved TONS of
        bytes. So why don't u use it too ?


 4.11. Misc optimizations
──────────────────────────

        Here r those optimization techniques that I couldn't sort into the
        groups above... Just read it, something may be useful...


        Zero EDX register, if EAX is less than 80000000h:
        --------------------------------------------------

        1)      xor edx, edx                          ;2 bytes, but faster

        2)      cdq                                   ;1 byte, but slower

        I always use CDQ instead of XOR. Why ? Why not ? X-D
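
        The condition matters: CDQ copies the sign bit of EAX into every
        bit of EDX, so EDX becomes zero only when bit 31 of EAX is clear. A
        minimal Python model of that behaviour (just a sketch; cdq here is
        a plain function named after the instruction):

```python
def cdq(eax):
    """Return the EDX value CDQ produces for a given 32-bit EAX."""
    # EDX is filled with copies of bit 31 (the sign bit) of EAX.
    return 0xFFFFFFFF if eax & 0x80000000 else 0x00000000


if __name__ == "__main__":
    for eax in (0x00000000, 0x7FFFFFFF, 0x80000000, 0xFFFFFFFF):
        print(f"EAX={eax:08X}h -> EDX={cdq(eax):08X}h")
```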


        Save space by usin' other registers instead of EBP and ESP as base:
        --------------------------------------------------------------------

        1)      mov eax, [ebp]                        ;3 bytes
        2)      mov eax, [esp]                        ;3 bytes

        3)      mov eax, [ebx]                        ;2 bytes


        Wanna have mirror effect of register content ? Try BSWAP.
        ---------------------------------------------------------

        Example:

                mov eax, 12345678h                    ;5 bytes

                bswap eax                             ;2 bytes

                ;eax = 78563412h now

        I haven't ever found this instruction useful for any viral work.
        However, someone maybe will X-D.
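
        What BSWAP does is reverse the order of the four bytes in the
        register. If u wanna play with it outside of assembler, this short
        Python sketch models the same byte swap (bswap32 is a name chosen
        for this example):

```python
def bswap32(value):
    """Reverse the byte order of a 32-bit value, like BSWAP does."""
    return int.from_bytes(value.to_bytes(4, "little"), "big")


if __name__ == "__main__":
    print(f"{bswap32(0x12345678):08X}h")
```

        Swappin' twice gives back the original value.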


        Wanna save some bytes replacin' CALL ?
        ---------------------------------------

        1)      call _label_                          ;5 bytes
                ret                                   ;1 byte

        2)      jmp _label_                           ;2/5 (SHORT/NEAR)

        Huh, we saved 4 bytes and processor time. Always replace call/ret
        with a jmp instruction, if the called procedure doesn't need any
        parameters on the stack...


        Wanna save time while comparin' reg/mem ?
        ------------------------------------------

        1)      cmp reg, [mem]                        ;slower

        2)      cmp [mem], reg                        ;1 cycle faster (on 386)


        Wanna save space and CPU time while dividin'/multiplyin' by
        power of 2 ?
        ------------------------------------------------------------

        Dividin' (by 4 = 2^2):

        1)      mov eax, 1000h                        ;example value
                mov ecx, 4                            ;5 bytes
                xor edx, edx                          ;2 bytes
                div ecx                               ;2 bytes

        2)      shr eax, 2                            ;3 bytes

        Multiplyin' (by 4 = 2^2):

        3)      mov ecx, 4                            ;5 bytes
                mul ecx                               ;2 bytes

        4)      shl eax, 2                            ;3 bytes

        No comment...
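
        One thing to watch: the shift count is the exponent, not the
        number itself. Shiftin' by n divides or multiplies by 2^n, so 4
        means a shift by 2, and a shift by 4 means 16. A quick Python check
        of that for unsigned 32bit values (shr32 and shl32 r helper names
        made up here):

```python
MASK32 = 0xFFFFFFFF  # keep results inside 32 bits


def shr32(value, count):
    """Model SHR on an unsigned 32-bit value."""
    return (value & MASK32) >> count


def shl32(value, count):
    """Model SHL on a 32-bit value (bits shifted out are lost)."""
    return (value << count) & MASK32


if __name__ == "__main__":
    eax = 0x1000
    print(hex(shr32(eax, 2)), hex(eax // 4))    # divide by 4
    print(hex(shl32(eax, 2)), hex(eax * 4))     # multiply by 4
    print(hex(shr32(eax, 4)), hex(eax // 16))   # shift by 4 is /16
```

        Unlike DIV and MUL, the shifts also leave ECX and EDX untouched.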


        Loops, loops and loops:
        ------------------------

        1)      dec ecx                               ;1 byte
                jne _label_                           ;2/6 bytes (SHORT/NEAR)

        2)      loop _label_                          ;2 bytes

        Next example:

        3)      je $+5                                ;2 bytes
                dec ecx                               ;1 byte
                jne _label_                           ;2 bytes

        4)      loopXX _label_ (XX = E, NE, Z or NZ)  ;2 bytes

        LOOP is smaller, but slower on 486+.


        And next unforgettable thing. Noone normal can code this:
        ---------------------------------------------------------

        1)      push eax                              ;1 byte
                push ebx                              ;1 byte
                pop eax                               ;1 byte
                pop ebx                               ;1 byte
      
        Do this and only this. Nothing other than this:

        2)      xchg eax, ebx                         ;1 byte

        And again, if one of XCHG's operands is EAX, it takes 1 byte, otherwise
        it takes 2 bytes. So when u wanna exchange ECX with EDX, XCHG will
        be 2 bytes long:

        3)      xchg ecx, edx                         ;2 bytes

        If u only want to move content of one register to another one, use
        simple MOV instruction. It has better pairin' on Pentium and takes
        less CPU time than XCHG without EAX register as operand:

        4)      mov ecx, edx                          ;2 bytes
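
        If size matters more than speed and u don't need the old content of
        the source register anymore, the 1 byte form with EAX can stand in
        for MOV. A sketch, the register choice is just an example:

        5)      mov ecx, eax                          ;2 bytes, EAX unchanged

        6)      xchg eax, ecx                         ;1 byte, but EAX now holds
                                                      ;the old ECX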


        Discard repeated code (and procedure code):
        --------------------------------------------

        1) Unoptimized:

        lbl1:   mov al, 5                             ;2 bytes
                stosb                                 ;1 byte
                mov eax, [ebx]                        ;2 bytes
                stosb                                 ;1 byte
                ret                                   ;1 byte
        lbl2:   mov al, 6                             ;2 bytes
                stosb                                 ;1 byte
                mov eax, [ebx]                        ;2 bytes
                stosb                                 ;1 byte
                ret                                   ;1 byte
                                                      ---------
                                                      ;14 bytes
        2) Optimized:

        lbl1:   mov al, 5                             ;2 bytes
        lbl:    stosb                                 ;1 byte
                mov eax, [ebx]                        ;2 bytes
                stosb                                 ;1 byte
                ret                                   ;1 byte
        lbl2:   mov al, 6                             ;2 bytes
                jmp lbl                               ;2 bytes
                                                      ---------
                                                      ;11 bytes

        Remember, if u have any repeated code and it is bigger than a jump
        instruction, replace the repeat with a jump to the first copy. If u
        write your own poly engine, u will have many opportunities to do that.
        Don't lose them !


        Manipulatin' with variables:
        -----------------------------

        1) Unoptimized:

                mov eax, [ebp + variable]             ;6 bytes
                ...
                ...
                mov [ebp + variable], eax             ;6 bytes
                ...
                ...
       variable dd      12345678h                     ;4 bytes

        2) Optimized:

                mov eax, 12345678h                    ;5 bytes
      variable = dword ptr $ - 4
                ...
                ...
                mov [ebp + variable], eax             ;6 bytes

        Have u got it ? We use variable as hardcode: the dword lives inside the
        MOV instruction itself, as its immediate operand. This is very effective
        for decreasin' the space our code takes. As u can see, we saved
        5 bytes without any pain or losin' stability (writin' to it only
        invalidates cache content, so it will be a little, but VERY little
        slower).


        And finally one Intel undocumented instruction. It's called
        SALC (Set AL on Carry) and it worx on Intel 8086+. I tested on my
        AMD K5 166MHz and it also worked. SALC does this thing:
        ------------------------------------------------------------------

        1)      jc _lbl                               ;2 bytes
                mov al, 0                             ;2 bytes
                jmp _end                              ;2 bytes
          _lbl: mov al, 0ffh                          ;2 bytes
          _end: ...

        2)      SALC   db    0d6h                     ;1 byte ;)

        This is perfect for codin' poly engines. I don't think, that heuristic
        emulator knows all undocumented opcodes X-D
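
        For comparison, a documented way to get the same result in AL, in case
        the undocumented opcode is not wanted. It's a sketch of a well known
        idiom: AL - AL - CF gives 0 when carry is clear and 0FFh when it is
        set. Unlike SALC it changes the arithmetic flags (CF itself keeps its
        value):

        3)      sbb al, al                            ;2 bytes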

        And that's all folx.


 ┌───────────────────────────────────┐
 │ 5. And finally some tips and trix │
 └───────────────────────────────────┘

        I will sum up the most important things here in points. It's only
        a brief theoretical view of optimization techniques. U should remember
        it and try to use it in your own virus.

        -       Avoid usin' the STACK and variables as much as possible.
                Remember, that registers r much faster than memory (and STACK
                and variables r in the memory !), so...
        -       Use registers as much as possible (use MOV instead of PUSH/POP)
        -       Try to use EAX register as frequently as possible
        -       Remove all unnecessary NOPs by increasin' number of passes
                (use TASM /m9)
        -       Do not use JUMPS directive
        -       For calculatin' large expressions use LEA instruction
        -       Use 486/Pentium instructions for faster code
        -       DO NOT fuck with your sister !
        -       Do not use 16bit registers and opcodes in your 32bit code
        -       Use string operations
        -       Do not use instructions to calculate values, that can be
                calculated by preprocessor (use parentheses)
        -       Avoid CALLs if they aren't needed and use direct code
        -       Use 32bit DEC/INC instead of 8/16bit DEC/INC/SUB/ADD
        -       Use coprocessor and undocumented opcodes
        -       Keep in mind, that instructions that haven't any conflict
                with memory/register r pairable, so they can be executed up
                to 2x faster on Pentium processor
        -       If some code is used many times and is greater than 6 bytes
                ("call label" and "ret" instructions r 6 bytes), make it
                procedure and use it instead of writin' repeated code
        -       Keep conditional jumps to a minimum, speculative execution is
                implemented startin' with P6+. Too many conditional jumps will
                slow your code by x-timez. Unconditional jumps r OK, but still,
                every byte can be optimized |-)
        -       For arithmetical calculations + next operations use arithmetical
                extensions of instructions
        -       Try to use every variable of yours as hardcode. Perfect use of
                hardcodes is as semaphores. HardMOVe it to ECX and then test
                it by JECXZ jump instruction. I really recommend it, it will
                solve many of your troubles with semaphores
        -       Ufff, I don't know what more can I recommend u (maybe u could
                send me some credits, hehe). Mmmm, read this stuff again X-D
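
        Two of the points above in numbers (a sketch; the registers and the
        constant r picked only for illustration). LEA computes
        base + index*scale + displacement in one instruction and doesn't touch
        the flags. The 16bit forms of INC/DEC need an operand-size prefix in
        32bit code, so the 32bit form is the smallest one whenever the upper
        bits of the register don't matter:

        1)      mov eax, ecx                          ;2 bytes
                shl eax, 2                            ;3 bytes
                add eax, ebx                          ;2 bytes
                add eax, 10h                          ;3 bytes

        2)      lea eax, [ebx + ecx*4 + 10h]          ;4 bytes

        3)      inc ax                                ;2 bytes (66h prefix)
                inc al                                ;2 bytes

        4)      inc eax                               ;1 byte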

        And that's all folx. Let's meet somewhere in next lives...


 ┌────────────┐
 │ 6. Closin' │
 └────────────┘

 Ufff, u r good if u get here after readin' that looooong paper. What should
 I say ? I hope u understood all things (or at least 50% of them) described
 here and that u will use them in your code. I know, I'm not one of those
 guys that make their code 100% optimized. However, I'm tryin' to do that.
 Generally, I think that optimization of code isn't any luxury or work u can
 (but needn't) do after everything else is done. It's one of many
 things which make u a professional coder. A coder that can't optimize his own
 code isn't a professional coder. Remember it. Hehe, and again my favourite stuff
 ==> If u like this tute, if u know something u think I should know or if u
 only (dis)like it, I will be very grateful to u if u mail me to benny@post.cz.
 Very, very thanx.


 Some greetz: Darkman/29A, Super/29A, Jacky Qwerty/29A, GriYo/29A,
              VirusBust/29A, MDriler/29A, Billy_Bel/???, MrSandman and to all
              I forgot...


                                                    ┌─════════════════════╗
                                                    │  Benny / 29A,  1999 ║
                                                    └─────────────────────┘