 ▄█████▄ ▄█████▄ ▄█████▄       ┌─ Optimization of 32bit code ─┐
 ███ ███ ███ ███ ███ ███       │              by              │
  ▄▄▄██▀ ▀██████ ███████       └──────── Benny / 29A ─────────┘
 ███▄▄▄▄ ▄▄▄▄███ ███ ███
 ███████ ██████▀ ███ ███


┌───────────────┐
│ 1. Disclaimer │
└───────────────┘

The followin' document is for education purposes only. Author isn't
responsible for any misuse of the things written in this document.


┌─────────────┐
│ 2. Foreword │
└─────────────┘

Eeeh, why da fuck did I write this article ? There r many documents about
optimization. Yes, that's true, and there r many very gewd and kewl tutes
[* Billy, your tute rox! *]. But as u can see, not every tute keeps in mind
that the term "optimize" doesn't only mean your code will be small. There r
many aspects of optimization and I wanna discuss them here and give u a
complete view of the thing.

When I started to write this article, I was really drunk and totally under
the drugs (hehe, no lie X-D), so if u feel I made any mistake, or u think
things written here aren't true, or u simply wanna give me some credits (do
it please X-D), u can find me on IRC UnderNet, channels #vir and/or #virus,
or mail benny@post.cz. Thanx for all positive (and also negative) comments.


┌─────────────────┐
│ 3. Introduction │
└─────────────────┘

As I said some seconds before, optimization has many aspects. Generally, we
can optimize our code, so:
        - code will be smaller
        - code will be faster
        - code will be smaller and faster

Well, it gives us some new space for thinkin'. If we optimize our code:
        - code will be smaller, but also slower
        - code will be bigger, but faster
        - code will be smaller and faster

We should find a compromise between the first and second point (if we can't
reach the third point). I'm sure u don't wanna alert user by slowin' down
system performance due to:
        - huge and unoptimized code
        - small, but slow code
or alert user by rapidly decreasin' space on the disk.

It's up to us, which way we choose. Here we have a clue:
        - if our code (or block of code, e.g.
          thread procedure) is small, we should optimize it for faster code
        - if our code (or block of code) is big, we should optimize it for
          smaller/faster (find compromise, prefer speed) code

Ideally we would decrease size and increase speed at the same time, but u
know how difficult that is.

Is it clear ? I think, u already knew this. But there r still many aspects
of optimization. We have for example two instructions that do the same
thing, but:
        - one instruction is bigger
        - one instruction is slower
        - one instruction changes other registers
        - one instruction writes to memory
        - one instruction changes flags
        - one instruction is faster on one processor, but slower on
          another one

Example:
                LODSB                   MOV AL, [ESI] + INC ESI
        ---------------------------------------------------------------
        size:   smaller                 bigger
        speed:  faster on 80386         faster on 80486,
                                        on Pentium only 1 cycle
        flags:  preserved               changed

And why is LODSB faster on 80386, and why does MOV + INC take only 1 cycle
on Pentium ? Pentium is a superscalar processor with two pipelines, so it
can execute a pair of some integer instructions simultaneously, one in each
PIPE. Two instructions that can be executed simultaneously r called
"pairable instructions".

Hehe, don't worry, this article won't be about Pentium processor
architecture, so u can forget the words I said about pipes. Maybe l8r, if I
write another article about Pentium processor optimization, I will explain
in more detail terms such as pipes, V-pipe, U-pipe, pairin' and so on. For
now, u can forget them. Just remember what the word "pairin'" means.

Now I will discuss every optimization technique step by step.


┌─────────────────────────┐
│ 4. Optimizin' our code  │
└─────────────────────────┘

Well, let's go optimize. I will start from the easiest operation. Beginners,
hold on...


4.1. Zero register
────────────────────

I don't wanna see this anymore:

1)      mov     eax, 00000000h          ;5 bytes

This is the worst instruction I've ever seen.
Well, it seems logical that u move zero to a register, but u can do it in a
more optimized way:

2)      sub     eax, eax                ;2 bytes

or

3)      xor     eax, eax                ;2 bytes

3 bytes on one instruction saved, great ! X-D But what's better to use, SUB
or XOR ? I prefer XOR, coz Micro$oft prefers SUB and I know that Windozes r
slooooow, hehe. Noo, that's not the true reason. What do u think, is it
easier (for u) to subtract two numbers, or to say "where's 1 and 1, write
0" ? So u know why I prefer XOR (as I hate mathematix X-D).


4.2. Test if register is zero
───────────────────────────────

Hmmm, let's see the brightest solution:

1)      cmp     eax, 00000000h          ;5 bytes (with 32bit immediate)
        je      _label_                 ;2/6 bytes (short/near)

[* NOTE: Many arithmetical instructions r optimized for register EAX, so
         code usin' EAX register will be faster and smaller.
         Example: CMP EAX, 12345678h (5 bytes). If I used another register
         instead of EAX, CMP instruction would have 6 bytes *]

Argh! Who normal can do this ? That's 7 or 11(!) bytes for a simple
comparison. No, no, no, don't do it and try this:

2)      or      eax, eax                ;2 bytes
        je      _label_                 ;2/6 (short/near)

or

3)      test    eax, eax                ;2 bytes
        je      _label_                 ;2/6 (short/near)

Hmm, much better, 4/8 bytes is really better than 7/11 bytes. So, again,
what's better, OR or TEST ? Micro$oft prefers OR, so again, I prefer
TEST |-). Now seriously, TEST doesn't write to the register (OR does), so
there will be better pairin' => faster code. I hope u still remember what
the word "pairin'" means... If not, read the Introduction section again.

Now, the biggest magic. If u don't care about ECX register, or u don't care
where the content of registers (EAX and ECX) will end up, u can do it this
way:

4)      xchg    eax, ecx                ;1 byte
        jecxz   _label_                 ;2 bytes (short jump only)

[* NOTE: XCHG is optimized for EAX register, so if XCHG uses EAX register,
         it will be 1 byte long, otherwise 2 bytes *]

Great! We optimized our code, so we saved 4 bytes (3 bytes instead of 7).


4.3.
Test if register is 0FFFFFFFFh
─────────────────────────────────────

Many APIs return -1 when the function fails, so it is important to test for
this value. I'm always astonished when I see how some coders test for this
value like I do now:

1)      cmp     eax, 0ffffffffh         ;5 bytes (with 32bit immediate)
        je      _label_                 ;2/6 bytes

I hate this. And now look, how it can be optimized:

2)      inc     eax                     ;1 byte
        je      _label_                 ;2/6 bytes
        dec     eax                     ;1 byte (restores EAX if not taken)

Yes, yes, yes, we saved 3 bytes and made code faster ;) Just remember that
at _label_ EAX holds 0, not -1.


4.4. Move 0FFFFFFFFh to register
──────────────────────────────────

Some APIs need -1 value as parameter. Let's see, how we can set it:

Least optimized:

1)      mov     eax, 0ffffffffh         ;5 bytes

More optimized:

2)      xor     eax, eax / sub eax, eax ;2 bytes
        dec     eax                     ;1 byte

Or this with the same result (by Super/29A):

3)      stc                             ;1 byte
        sbb     eax, eax                ;2 bytes

This code is very useful in some cases, such as:

        jnc     _label_
        sbb     eax, eax                ;2 bytes only! (CF is set here,
                                        ;so EAX = -1)
_label_:
        ...


4.5. Zero register and move something to LSW
───────────────────────────────────────────────

Example of unoptimized code:

1)      xor     eax, eax                        ;2 bytes
        mov     ax, word ptr [esi+xx]           ;4 bytes

386+ supports an instruction called MOVZX (MOVe with Zero eXtension).

[* NOTE: MOVZX is faster on 386, on 486+ it is slower *]

Example of optimized code, where we can save 2 bytes:

2)      movzx   eax, word ptr [esi+xx]          ;4 bytes

Next example of "ugly code":

3)      xor     eax, eax                        ;2 bytes
        mov     al, byte ptr [esi+xx]           ;3 bytes

Now we can save a valuable 1 byte X-D:

4)      movzx   eax, byte ptr [esi+xx]          ;4 bytes

This is very effective when u r readin' bytes/words from PE header. Becoz u
need to work with bytes/words/dwords altogether, MOVZX is the best for this
case.

And last example:

5)      xor     eax, eax                        ;2 bytes
        mov     ax, bx                          ;3 bytes

Better use this form, which discards 2 bytes:

6)      movzx   eax, bx                         ;3 bytes

I use MOVZX every time I can. It is small and it isn't as slow as other
instructions.


4.6. Push shit
────────────────

Tell me, how will u store 50h to EAX...
----------------------------------------

Badly:

1)      mov     eax, 50h                ;5 bytes

Better:

2)      push    50h                     ;2 bytes
        pop     eax                     ;1 byte

Usin' PUSH and POP is a little slower, but smaller too. When the operand is
short (fits into one signed byte, -80h..7Fh), PUSH takes 2 bytes. Otherwise
it takes 5 bytes.

Let's try another thing. Push 7x 0 to stack...
-----------------------------------------------

Unoptimized (14 bytes):

3)      push    0                       ;2 bytes
        push    0                       ;2 bytes
        push    0                       ;2 bytes
        push    0                       ;2 bytes
        push    0                       ;2 bytes
        push    0                       ;2 bytes
        push    0                       ;2 bytes

Optimized, but still big X-D (9 bytes):

4)      xor     eax, eax                ;2 bytes
        push    eax                     ;1 byte
        push    eax                     ;1 byte
        push    eax                     ;1 byte
        push    eax                     ;1 byte
        push    eax                     ;1 byte
        push    eax                     ;1 byte
        push    eax                     ;1 byte

Compact, but slower (7 bytes):

5)      push    7                       ;2 bytes
        pop     ecx                     ;1 byte
_label_:
        push    0                       ;2 bytes
        loop    _label_                 ;2 bytes

Wow, without any pain, we saved 7 bytes ;))

And now, life story... U wanna move something from one variable into
another variable. All registers must be preserved. U probably do this:
----------------------------------------------------------------

6)      push    eax                     ;1 byte
        mov     eax, [ebp + xxxx]       ;6 bytes
        mov     [ebp + xxxx], eax       ;6 bytes
        pop     eax                     ;1 byte

And now, usin' only stack, no registers:

7)      push    dword ptr [ebp + xxxx]  ;6 bytes
        pop     dword ptr [ebp + xxxx]  ;6 bytes

This is useful when u haven't any register free to use. I use it when I
wanna save old entrypoint to another variable...

8)      push    dword ptr [ebp + header.epoint] ;6 bytes
        pop     dword ptr [ebp + originalEP]    ;6 bytes

This saves wonderful 2 bytes |-). Though it is a little slower than normal
manipulation via EAX (without savin' it), it still comes in handy when u
don't wanna (or can't) use any register.


4.7. Multiply fun
───────────────────

Tell me, how will u calculate offset of last section, when u have
number_of_sections-1 in EAX ?

Badly:

1)      mov     ecx, 28h                ;5 bytes
        mul     ecx                     ;2 bytes

Better:

2)      push    28h                     ;2 bytes
        pop     ecx                     ;1 byte
        mul     ecx                     ;2 bytes

Much better:

3)      imul    eax, eax, 28h           ;3 bytes

What does IMUL do ?
IMUL multiplies the second operand (a register) by the third operand (an
immediate) and stores the result in the first register (EAX). So u can
multiply EBX by 28h and store the result to EAX like this:

4)      imul    eax, ebx, 28h           ;3 bytes

Simple and effective (in size and in speed), and unlike MUL it doesn't
touch ECX or EDX. I don't wanna imagine how u would do this by MUL
instruction... X-D


4.8. Stringz in action
────────────────────────

I wanna jump into the wall when I see unoptimized string operations. Here u
have some hints, how u can optimize your code usin' string instructions. Do
it please, or I will really do it ! X-D

Startin' from scratch, how can u load a byte ?
---------------------------------------------------

Faster:

1)      mov     al, [esi]               ;2 bytes
        inc     esi                     ;1 byte

Smaller:

2)      lodsb                           ;1 byte

I recommend the *Smaller* version. This is a one byte instruction that does
the same thing as the *Faster* version (with direction flag cleared, and
without changin' flags). It's faster on 80386, but much slower on 80486+.
On Pentium, *Faster* takes one cycle due to pairin'. However, I think the
best to use is still the *Smaller* version.

And how can u load a word ? Ehrm, DO NOT load words, it's too slow in 32bit
environment such as Win32. But if u seriously wanna load it, here is the
clue...
-----------------------------------------------------------------

Faster:

3)      mov     ax, [esi]               ;3 bytes
        add     esi, 2                  ;3 bytes

Smaller:

4)      lodsw                           ;2 bytes

What 'bout speed and size ? See previous description (LODSB).

Aaaah, loadin' dwords is always funny. Look at this:
-----------------------------------------------------

Faster:

5)      mov     eax, [esi]              ;2 bytes
        add     esi, 4                  ;3 bytes

Smaller:

6)      lodsd                           ;1 byte

See description of LODSB.

And next very useful thing... Movin' something from somewhere to somewhere.
It's in fact LODSB/LODSW/LODSD + STOSB/STOSW/STOSD (only without usin'
EAX). Here is the example of MOVSD:
------------------------------------------------------------------

Faster:

7)      mov     eax, [esi]              ;2 bytes
        add     esi, 4                  ;3 bytes
        mov     [edi], eax              ;2 bytes
        add     edi, 4                  ;3 bytes

Smaller:

8)      movsd                           ;1 byte

*Faster* is faster on 486+, *Smaller* is smaller ;).
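The MOVSD comparison above can be checked with a small model. This is only a sketch in Python, not assembler: the `Cpu` class, its helpers and the test memory layout are hypothetical names made up for the illustration, and it assumes the direction flag is clear (DF=0). It shows that both versions copy the same dword and advance ESI/EDI the same way, and that the long version additionally leaves the copied value in EAX (and, on a real CPU, changes flags through ADD).

```python
import struct

MASK32 = 0xFFFFFFFF

class Cpu:
    """Toy machine state: flat little-endian memory plus EAX, ESI, EDI."""
    def __init__(self, memory, esi=0, edi=0):
        self.mem = bytearray(memory)
        self.eax = 0
        self.esi = esi
        self.edi = edi

    def read32(self, addr):
        return struct.unpack_from("<I", self.mem, addr)[0]

    def write32(self, addr, value):
        struct.pack_into("<I", self.mem, addr, value & MASK32)

def movsd_long(cpu):
    """mov eax,[esi] / add esi,4 / mov [edi],eax / add edi,4  (10 bytes)."""
    cpu.eax = cpu.read32(cpu.esi)
    cpu.esi = (cpu.esi + 4) & MASK32
    cpu.write32(cpu.edi, cpu.eax)
    cpu.edi = (cpu.edi + 4) & MASK32

def movsd_short(cpu):
    """movsd with DF=0 (1 byte): copies one dword, EAX stays untouched."""
    cpu.write32(cpu.edi, cpu.read32(cpu.esi))
    cpu.esi = (cpu.esi + 4) & MASK32
    cpu.edi = (cpu.edi + 4) & MASK32

def run(step):
    """Run one copy step on a fresh 16-byte memory, source 0, destination 8."""
    cpu = Cpu(bytes(range(16)), esi=0, edi=8)
    step(cpu)
    return bytes(cpu.mem), cpu.esi, cpu.edi, cpu.eax

long_result = run(movsd_long)
short_result = run(movsd_short)
```

So the one-byte form is a drop-in replacement only when the code doesn't rely on EAX holdin' the copied dword afterwards.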
Finally, I would like to say that u should always load dwords instead of
bytes or words, coz u run a 386+ processor, which is 32bit. I.e. your
processor worx with 32 bits, so if u wanna work with one byte, processor
must load a dword and then truncate it. Aaaa, too much work, so if it's not
necessary to use bytes/words, don't use them.

Next fun... how can u get the end of string ?
----------------------------------------------

Here is JQwerty's method:

9)      lea     esi, [ebp + asciiz]     ;6 bytes
s_check:
        lodsb                           ;1 byte
        test    al, al                  ;2 bytes
        jne     s_check                 ;2 bytes

And Super's method:

10)     lea     edi, [ebp + asciiz]     ;6 bytes
        xor     al, al                  ;2 bytes
s_check:
        scasb                           ;1 byte
        jne     s_check                 ;2 bytes

Now, which is the best one ? Hmmm, hard to say truth...X-D On 80386+
Super's method is faster, but on Pentiums, Jacky's method is faster due to
pairin'. Hehe, both methods have the same size (11 bytes), so choose which
one u would like to use... |-)


4.9. Complex aritmetix
────────────────────────

Now my favourite stuff. It's a pity that this great technique hasn't found
usage among VX coderz. However, the instructions I wanna talk about r WELL
KNOWN (heh, but still, noone knows how smoothly they run and what more they
can do), VERY SMALL and VERY FAST on every processor.

Imagine u have a table of DWORDs. Pointer to the table is stored in EBX
register, index to the table is in ECX. U wanna increment the ECX-th dword
in the table, so something like this: EBX+(4*ECX). U don't want to modify
any register.

U can do it this way (everybody does it):

1)      pushad                          ;1 byte
        imul    ecx, ecx, 4             ;3 bytes
        add     ebx, ecx                ;2 bytes
        inc     dword ptr [ebx]         ;2 bytes
        popad                           ;1 byte

Or do it better (nobody does it):

2)      inc     dword ptr [ebx+4*ecx]   ;3 bytes

This really rox !!! U saved processor time (this is very fast), space in
memory (very small, as u can see) and made your source code more
readable !!! U saved 6 bytes by simple ONE INSTRUCTION !!!

That's not all (not all for INC instruction).
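To see that the one-instruction form really does the same job, here is a minimal Python sketch of the address arithmetic. The function names and the flat `bytearray` memory are assumptions made up for this illustration; the only x86 facts relied on are that a memory operand has the form base + index*scale + displacement, that scale can be 1, 2, 4 or 8, and that 32bit addresses wrap modulo 2**32.

```python
import struct

MASK32 = 0xFFFFFFFF

def effective_address(base, index=0, scale=1, disp=0):
    """Address formed for [base + index*scale + disp], modulo 2**32."""
    if scale not in (1, 2, 4, 8):
        raise ValueError("x86 only encodes scale factors 1, 2, 4 and 8")
    return (base + index * scale + disp) & MASK32

def inc_dword(mem, addr):
    """Model of: inc dword ptr [addr] on a little-endian bytearray."""
    value = struct.unpack_from("<I", mem, addr)[0]
    struct.pack_into("<I", mem, addr, (value + 1) & MASK32)

def long_way(mem, ebx, ecx):
    """pushad / imul ecx,ecx,4 / add ebx,ecx / inc dword ptr [ebx] / popad."""
    saved = (ebx, ecx)              # pushad
    ecx = (ecx * 4) & MASK32        # imul ecx, ecx, 4
    ebx = (ebx + ecx) & MASK32      # add  ebx, ecx
    inc_dword(mem, ebx)             # inc  dword ptr [ebx]
    return saved                    # popad restores EBX and ECX

def short_way(mem, ebx, ecx):
    """inc dword ptr [ebx+4*ecx]: EBX and ECX are never modified."""
    inc_dword(mem, effective_address(ebx, ecx, 4))
    return ebx, ecx

table_a = bytearray(32)
table_b = bytearray(32)
regs_a = long_way(table_a, 8, 3)    # table at offset 8, index 3
regs_b = short_way(table_b, 8, 3)
```

Both calls bump the dword at offset 8 + 4*3 = 20 and hand back unchanged registers; the CPU does the multiply and add inside address generation, so no scratch register and no PUSHAD/POPAD r needed.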
Imagine another situation: EBX - pointer to memory, ECX - index to table, u
wanna increment the ECX-th dword + 4096 bytes, so this: EBX+(4*ECX)+1000h.
Yeah, and u wanna preserve all registers. U can do it in an unoptimized way
like this:

3)      pushad                          ;1 byte
        imul    ecx, ecx, 4             ;3 bytes
        add     ebx, ecx                ;2 bytes
        add     ebx, 1000h              ;6 bytes
        inc     dword ptr [ebx]         ;2 bytes
        popad                           ;1 byte

Or very optimized...

4)      inc     dword ptr [ebx+4*ecx+1000h]     ;7 bytes

Yahoooooo, we saved 8 bytes by one instruction (and that's even though the
long version already used IMUL instead of MUL), great !

This magic worx with EVERY arithmetical instruction, not only INC. Imagine
how much space u will save when u use this in instructions such as ADD,
SUB, ADC, SBB, INC, DEC, OR, XOR, AND, etc.

The biggest magic is comin' now. Hey guy, tell me, what does the LEA
instruction do ? U probably know that it's the instruction we use for
manipulatin' with variables in virus. But only some ppl know how to use
this instruction really effectively.

LEA stands for Load Effective Address. This name is a little misleadin'.
Let's have a look, what LEA really does. Try to hardcode this:

        lea     eax, [12345678h]

What do u think, what will be in EAX after execution of this opcode ? Rite
answer is 12345678h. Another example (EBP = 1):

        lea     eax, [ebp + 12345678h]

What will be in register EAX ? Right answer is 12345679h. Yes, the least
significant digit is 9h. So let's translate this instruction to "normal"
language:

        lea     eax, [ebp + 12345678h]  ;6 bytes
        ==========================
        mov     eax, 12345678h          ;5 bytes
        add     eax, ebp                ;2 bytes

As u can see, LEA doesn't access memory at all. It only worx with its
operands, does the address calculation with them, and then stores the
result into the first operand (EAX in our example). Now look at sizes.
Weird, it does the same thing (not quite exactly, LEA preserves flags), but
it is shorter. Let's show the whole magic...
5) Look at this unoptimized stuff:

        mov     eax, 12345678h          ;5 bytes
        add     eax, ebp                ;2 bytes
        imul    ecx, ecx, 4             ;3 bytes
        add     eax, ecx                ;2 bytes

6) Open your mouth and look at this:

        lea     eax, [ebp+ecx*4+12345678h]      ;7 bytes

Close your mouth now. LEA is shorter, faster (much faster), preserves flags
and doesn't destroy ECX. Look at it once again, we saved 5 bytes and
processor time by one instruction (LEA is much faster on every processor).

I won't explain every arithmetical instruction here, I think it wouldn't
make sense, coz they all have the same syntax. U saw everything important,
now u can use it. If u wanna use this technique, the only thing u have to
keep in mind is the syntax:

        OPCODE [BASE + INDEX*SCALE + DISPLACEMENT]

where SCALE is 1, 2, 4 or 8.


4.10. Delta offset optimization
─────────────────────────────────

Naaah, u probably think I'm mad. If u, as a reader of this e-paper, aren't
a beginner, u must know what da fuck delta offset is. However, I saw that
many VX coderz don't use delta offset effectively. If u have a look at my
first viruses, u will see I also fucked up the space it takes. And I wasn't
alone. Let's see it in detail..

[* Ehrm, let's have a pause. I think u have to be tired from this BIIIIIG
   paper. I will tell ya something... Some minutes ago, I went out to buy a
   new cig-box (uuuh, too many drugs in my body now X-D). Hot, sunny
   weather changed a moment ago to hot, windy weather, darky, total
   STOOOORM, but without any rain, I can see big lightnings, I like it.
   It's the best weather to have a minute for thinkin' about some things -
   girls, VX, friends, politix, ... I'm back now. I'm pluggin' in some very
   kewl CD with very kewl music, czech music. Now I can hear one very gewd
   song from one very gewd czech rock-group. Hehe, 90% of their songs were
   written when they were totally doped. But wait, they r very gewd. Many
   things u can understand only when u r doped. They r singin' (rite now
   X-D) about Earth.
   It's a very slow song, it's like Indian music (but they also play hard
   rock, so hard that Billy would like it. Hehe, I will bring this CD
   sometime, when we will be on some meetin', somewhere, maybe. Billy, u
   will 100% like it, my friend ! X-D). Hmmm, I will tell ya now some
   lyrix... Very gewd lyrix, I hope u will understand it, I will translate
   it for ya X-D...

        She defence on and on, there r ages, when someone like her,
        Both of nice and cruel,
        U can touch, she will give it also to u,
        Now it is waitin' for that step, which makes walk a fly,
        And when then, when, if not now ?????
        Politix can invent only atomic shit, let's kick it back to them,
        And when then, when, if not now ?????
        She defence on and on, ....

   Ooooh, my god, whata hell I'm doin' now ? Hehe, if u think I'm mad, be
   sure it's truth X-DDD. Ok, ok, back to reality... *]

So, again, let's look at that stuff. This is the standard way of handlin'
delta offset...

1)      call    gdelta
gdelta: pop     ebp
        sub     ebp, offset gdelta

That's the normal way (but less efficient). Let's look, how we can work
with it...

2)      lea     eax, [ebp + variable]

Hmmm, if u look at it under some debugger, u will see followin' line:

3)      lea     eax, [ebp + 401000h]    ;6 bytes

In the first generation of virus, EBP register will be zero. Ok, but let's
look what happens if u code this:

4)      lea     eax, [ebp + 10h]        ;3 bytes

Hmmm, weird. Sometimes it's 6 bytes, next time it's 3 bytes. It's normal.
Many instructions r optimized for SHORT (one byte long) values, e.g.
SUB EBX, 3 will be 3 bytes long too. If u code SUB EBX, 1234h, it will have
6 bytes. Not only SUB instruction, also many other instructions.

Look what happens if we use "another" way to get delta offset...

5)      call    gdelta
gdelta: pop     ebp

Only ! As I said, in first generation of virus, EBP will be zero (in
previous version of gdelta) and variable will be e.g. 401000h. That's not
good. What do u say, we will have the 401000h value in EBP and the
displacement will be only the distance to that variable ?
Thanx to our new version of gdelta, we can use SHORT version of LEA and so
save 3 bytes on variable addressin'. Here is the sample...

6)      lea     eax, [ebp + variable - gdelta]  ;3 bytes

We got it. Next thing we should do is place all initialized variables
around the gdelta call. This will make it work (no more 6 bytes, but 3
bytes instructions) - THIS IS REALLY IMPORTANT. If u don't do it, variable
would be somewhere FAR (ehrm, I wanted to say NEAR X-D) from gdelta, so
SHORT version of LEA wouldn't be used: the displacement has to fit into one
signed byte (-80h..7Fh).

Heh, u probably think that there is some trick, that it has some limitation
or something like that, coz if this worked, everybody would use it. Don't
worry, apart from keepin' variables close to gdelta there isn't any
limitation. And why da fuck does noone use it ? It's not easy to answer. I
can say that I don't know. Really don't know.

[* Let me say my feelings. U probably know Super/29A. He is the best
   optimizer I and VX world know. It's a fact. U probably also know
   JQwerty/29A. He is also a VERY GOOD optimizer, but noone says "Super and
   JQwerty r the best optimizers". I don't know why. I saw this delta
   offset handlin' first in his code, noone used it before him (I think).
   And it is soooo easy to use. If u look at Win32.Cabanas u will see MANY
   and MANY features. And it's only 2999 bytes !!! Who else than Super or
   JQwerty could code it ? I don't know. I wanna only say that "someone"
   forgot about the other kewl guy. *]

My new virus uses this delta offset handlin' too, and I saved TONS of
bytes. So why don't u use it too ?


4.11. Misc optimizations
────────────────────────────

Here r included those optimization techniques that I couldn't sort to
groups above... Just read it, something can be useful...

Zero EDX register, if EAX is less than 80000000h:
--------------------------------------------------

1)      xor     edx, edx                ;2 bytes, but faster
2)      cdq                             ;1 byte, but slower

CDQ copies the sign bit of EAX into every bit of EDX, that's why EAX must
be less than 80000000h. I always use CDQ instead of XOR. Why ? Why not ?
X-D

Save space by usin' other registers instead of EBP and ESP:
-----------------------------------------------------------

1)      mov     eax, [ebp]              ;3 bytes
2)      mov     eax, [esp]              ;3 bytes
3)      mov     eax, [ebx]              ;2 bytes

Wanna have mirror effect of register content ? Try BSWAP.
---------------------------------------------------------

Example:

        mov     eax, 12345678h          ;5 bytes
        bswap   eax                     ;2 bytes
                                        ;eax = 78563412h now

I haven't ever found this instruction useful for any viral work. However,
someone maybe will X-D.

Wanna save some bytes replacin' CALL ?
---------------------------------------

1)      call    _label_                 ;5 bytes
        ret                             ;1 byte

2)      jmp     _label_                 ;2/5 (SHORT/NEAR)

Huh, we saved 4 bytes (with SHORT jump) and processor time. Always replace
call/ret with jmp instruction, if the called code doesn't depend on the
return address bein' on the stack...

Wanna save time while comparin' reg/mem ?
------------------------------------------

1)      cmp     reg, [mem]              ;slower
2)      cmp     [mem], reg              ;1 cycle faster (at least on 80386)

Wanna save space and CPU time while dividin'/multiplyin' by power of 2 ?
------------------------------------------------------------

Dividin' (unsigned) EAX by 4:

1)      mov     ecx, 4                  ;5 bytes
        xor     edx, edx                ;2 bytes
        div     ecx                     ;2 bytes

2)      shr     eax, 2                  ;3 bytes

Multiplyin' EAX by 4:

3)      mov     ecx, 4                  ;5 bytes
        mul     ecx                     ;2 bytes

4)      shl     eax, 2                  ;3 bytes

The shift count is the power, not the number: 4 = 2^2, so shift by 2 (a
shift by 4 would divide/multiply by 16). Otherwise no comment...

Loops, loops and loops:
------------------------

1)      dec     ecx                     ;1 byte
        jne     _label_                 ;2/6 bytes (SHORT/NEAR)

2)      loop    _label_                 ;2 bytes

Next example:

3)      je      $+5                     ;2 bytes
        dec     ecx                     ;1 byte
        jne     _label_                 ;2 bytes

4)      loopXX  _label_ (XX = E, NE, Z or NZ)   ;2 bytes

LOOP is smaller, but slower on 486+. Unlike DEC, it doesn't change flags
and it can only do a SHORT jump.

And next unforgettable thing. Noone normal can code this:
---------------------------------------------------------

1)      push    eax                     ;1 byte
        push    ebx                     ;1 byte
        pop     eax                     ;1 byte
        pop     ebx                     ;1 byte

Do this and only this. Nothing other than this:

2)      xchg    eax, ebx                ;1 byte

And again, if one of XCHG's operands is EAX, it takes 1 byte, otherwise it
takes 2 bytes.
So when u wanna exchange ECX with EDX, XCHG will be 2 bytes long:

3)      xchg    ecx, edx                ;2 bytes

If u only want to move content of one register to another one, use simple
MOV instruction. It has better pairin' on Pentium and takes less CPU time
than XCHG without EAX register as operand:

4)      mov     ecx, edx                ;2 bytes

Discard repeated code (and procedure code):
--------------------------------------------

1) Unoptimized:

lbl1:   mov     al, 5                   ;2 bytes
        stosb                           ;1 byte
        mov     eax, [ebx]              ;2 bytes
        stosb                           ;1 byte
        ret                             ;1 byte
lbl2:   mov     al, 6                   ;2 bytes
        stosb                           ;1 byte
        mov     eax, [ebx]              ;2 bytes
        stosb                           ;1 byte
        ret                             ;1 byte
                                        ---------
                                        ;14 bytes

2) Optimized:

lbl1:   mov     al, 5                   ;2 bytes
lbl:    stosb                           ;1 byte
        mov     eax, [ebx]              ;2 bytes
        stosb                           ;1 byte
        ret                             ;1 byte
lbl2:   mov     al, 6                   ;2 bytes
        jmp     lbl                     ;2 bytes
                                        ---------
                                        ;11 bytes

Remember, if u have any repeated code and it is bigger than a jump
instruction, replace the copy with a jump. If u write your own poly engine,
u will have many opportunities to do that. Don't lose them !

Manipulatin' with variables:
-----------------------------

1) Unoptimized:

        mov     eax, [ebp + variable]   ;6 bytes
        ...
        ...
        mov     [ebp + variable], eax   ;6 bytes
        ...
        ...
variable dd     12345678h               ;4 bytes

2) Optimized:

        mov     eax, 12345678h          ;5 bytes
variable = dword ptr $ - 4
        ...
        ...
        mov     [ebp + variable], eax   ;6 bytes

Have u got it ? We use the immediate operand of MOV as the variable itself
(hardcode). This is very effective for decreasin' the space our code takes.
As u can see, we saved 5 bytes without any pain or losin' stability (we
only write into code, which invalidates cache content, so it will be a
little, but VERY little slower).

And finally one Intel undocumented instruction. We call it SALC (Set AL on
Carry) and it worx on Intel 8086+. I tested it on my AMD K5 166MHz and it
also worked. SALC does this thing:
------------------------------------------------------------------

1)      jc      _lbl1                   ;2 bytes
        mov     al, 0                   ;2 bytes
        jmp     _end                    ;2 bytes
_lbl1:  mov     al, 0ffh                ;2 bytes
_end:   ...

2) SALC db      0d6h                    ;1 byte ;)

This is perfect for codin' poly engines.
I don't think that heuristic emulator knows all undocumented opcodes X-D

And that's all folx.


┌───────────────────────────────────┐
│ 5. And finally some tips and trix │
└───────────────────────────────────┘

I will sum up the most important things here in points. It's only a brief
theoretical view of optimization techniques. U should remember it and try
to use it in your own virus.

- Avoid usin' STACK and variables as much as possible
  Remember that registers r much faster than memory (and STACK and
  variables r in the memory !), so...
- Use registers as much as possible (use MOV instead of PUSH/POP)
- Try to use EAX register as frequently as possible
- Remove all unnecessary NOPs by increasin' number of passes (use TASM /m9)
- Do not use JUMPS directive
- For calculatin' large expressions use LEA instruction
- Use 486/Pentium instructions for faster code
- DO NOT fuck with your sister !
- Do not use 16bit registers and opcodes in your 32bit code
- Use string operations
- Do not use instructions to calculate values that can be calculated by
  preprocessor (use parentheses)
- Avoid CALLs if they aren't needed and use direct code
- Use 32bit DEC/INC instead of 8/16bit DEC/INC/SUB/ADD
- Use coprocessor and undocumented opcodes
- Keep in mind that instructions which haven't any memory/register
  conflict with each other r pairable, so they can be executed up to 2x
  faster on Pentium processor
- If some code is used many times and is bigger than 6 bytes ("call label"
  and "ret" instructions r 6 bytes), make it a procedure and use it instead
  of writin' repeated code
- Reduce conditional jumps to minimum, speculative execution is implemented
  only startin' with P6. Too many conditional jumps will slow your code by
  x-timez. Unconditional jumps r OK, but still, every byte can be
  optimized |-)
- For arithmetical calculations followed by another operation, use the
  arithmetical extension of instructions ([BASE + INDEX*SCALE + DISP])
- Try to use every variable of yours as hardcode. Perfect use of hardcodes
  is as semaphores.
  HardMOVe it to ECX and then test it by JECXZ jump instruction. I really
  recommend it, it will solve many of your troubles with semaphores
- Ufff, I don't know what more I can recommend u (maybe u could send me
  some credits, hehe). Mmmm, read this stuff again X-D

And that's all folx. Let's meet somewhere in next lifes...


┌────────────┐
│ 6. Closin' │
└────────────┘

Ufff, u r good if u got here after readin' that looooong paper. What should
I say ? I hope u understood all things (or at least 50% of them) described
here and that u will use them in your code. I know, I'm not one of those
guys that make their code 100% optimized. However, I'm tryin' to do that.
Generally, I think that optimization of code isn't any luxury or work u can
(but needn't) do after everything else is done. It's one of many things
which make u a professional coder. Coder that can't optimize his own code
isn't a professional coder. Remember it.

Hehe, and again my favourite stuff ==> If u like this tute, if u know
something u think I should know, or if u only (dis)like it, I will be very
grateful to u if u mail me to benny@post.cz. Very, very thanx.

Some greetz: Darkman/29A, Super/29A, Jacky Qwerty/29A, GriYo/29A,
             VirusBust/29A, MDriler/29A, Billy_Bel/???, MrSandman and
             to all I forgot...


                                                ┌─════════════════════╗
                                                │ Benny / 29A, 1999   ║
                                                └─────────────────────┘