ÚÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄ¿ ³ 32 bit optimization ³ Billy Belceb£/DDT ÀÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÙ Ehrm... Super should do this instead me, anyway, as i'm his pupil, i'm gonna write here what i have learnt in the time while i am inside Win32 coding world. I will guide this tutorial through local optimization rather than structural optimization, because this is up to you and your style (for exam- ple, personally i'm *VERY* paranoid about the stack and delta offset calcu- lations, as you could see in my codes, specially in Win95.Garaipena). This article is full of my own ideas and of advices that Super gave to me in Valencian meetings. He's probably the best optimizer in VX world ever. No lie. I won't discuss here how to optimize to the max as he does. No. I only wan't to make you see the most obvious optimizations that could be done when coding for Win32, for example. I won't comment the VERY obvious optimization tricks, already explained in my Virus Writing Guide for MS-DOS. % Check if a register is zero % ÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄ I'm sick of see the same always, specially in Win32 coders, and this is re- ally killing me slowly and very painfully. No, no, my mind can't assimilate the idea of a CMP EAX,0 for example. Ok, let's see why: cmp eax,00000000h ; 6 bytes jz bribriblibli ; 2 bytes (if jz is short) Heh, i know life's a shit, and you are wasting many code in shitty compari- sons. Ok, let't see how to solve this situation, with a code that does the same, but with less bytes. or eax,eax ; 2 bytes jz bribriblibli ; 2 bytes (if jz is short) And there is a way to do this even more optimized, anyway it's okay if it doesn't matter where should be the content of EAX (after what i am going to put here, EAX content will finish in ECX). Here you have: xchg eax,ecx ; 1 byte jecxz bribriblibli ; 2 bytes (if it is short) Do you see? No excuses about "i don' t optimize because i lose stability", because with this tips you will optimize without losing anything besides bytes of code ;) Heh, we passed from a 8 bytes routine to 3 bytes... Heh? what do you say about it? Hahahaha. % Check if a register is -1 % ÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄ As many APIs in Ring-3 return you a value of -1 (0FFFFFFFFh) if the function failed, and as you should compare if it failed, you must compare for that value. But there is the same problem as before, many many people do it by using CMP EAX,0FFFFFFFFh and it could be done more optimized... cmp eax,0FFFFFFFFh ; 6 bytes jz insumision ; 2 bytes (if short) Let's do it as it could be more optimized: inc eax ; 1 byte xchg eax,ecx ; 1 byte jecxz insumision ; 2 bytes (if short) dec ecx ; 1 byte And another thingy could be this: inc eax ; 1 byte jz insumision ; 2 bytes dec eax ; 1 byte Heh, maybe it occupies more lines, but occupies less bytes so far (4 bytes against 8). % Clear a 32 bit register and move something to its LSW % ÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄ The most clear example is what all viruses do when loading the number of sections of PE file in AX (as this value occupies 1 word in the PE header). Well, let's see what do the majority of VX: xor eax,eax ; 2 bytes mov ax,word ptr [esi+6] ; 4 bytes I'm still wondering why all VX use this "old" formula, specially when you have a 386+ instruction that avoids us to make register to be zero before putting the word in AX. This instruction is MOVZX. movzx eax,word ptr [esi+6] ; 4 bytes Heh, we avoided 1 instruction of 2 bytes. Cool, huh? % Calling to an address stored in a variable % ÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄ Heh, this is another thing that some VX do, and makes me to go crazy and scream. Let me remember it to you: mov eax,dword ptr [ebp+ApiAddress] ; 6 bytes call eax ; 2 bytes We can call to an address directly guys... It saves bytes and doesn't use any register that could be useful for another things. call dword ptr [ebp+ApiAddress] ; 6 bytes Another time again, we are saving an unuseful, and not needed instruction, that occupies 2 bytes, and we are making exactly the same. % Fun with push % ÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄ Almost the same as above, but with push. Let's see what to don't do and what to do: mov eax,dword ptr [ebp+variable] ; 6 bytes push eax ; 1 byte We could do the same with 1 byte less. See. push dword ptr [ebp+variable] ; 6 bytes Cool, huh? ;) Well, if we need to push many times (if the value is big, is more optimized if you push that value 2+ times, and if the value is small is more optimized to push it when you need to push the value 3+ times) the same variable is more optimized to put it in a register, and push the register. For example, if we need to push zero 3 times, is more optimized to xor a register with itself and later push the register. Let's see: push 00000000h ; 2 bytes push 00000000h ; 2 bytes push 00000000h ; 2 bytes And let's see how to optimize that: xor eax,eax ; 2 bytes push eax ; 1 byte push eax ; 1 byte push eax ; 1 byte Another thing passes while using SEH, as we need to push fs:[0] and such like. Let's see how to optimize that: push dword ptr fs:[00000000h] ; 6 bytes mov fs:[0],esp ; 6 bytes [...] pop dword ptr fs:[00000000h] ; 6 bytes Instead that we should do this: xor eax,eax ; 2 bytes push dword ptr fs:[eax] ; 3 bytes mov fs:[eax],esp ; 3 bytes [...] pop dword ptr fs:[eax] ; 3 bytes Heh, seems a silly thing, but we have 7 bytes less! Whoa!!! % Get the end of an ASCIIz string % ÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄ This is very useful, specially in our API search engines. And of course, it could be done more optimized rather than the typical way in all viruses. Let's see: lea edi,[ebp+ASCIIz_variable] ; 6 bytes @@1: cmp byte ptr [edi],00h ; 3 bytes inc edi ; 1 byte jz @@2 ; 2 bytes jmp @@1 ; 2 bytes @@2: inc edi ; 1 byte This same code could be very reduced, if you code it in this way: lea edi,[ebp+ASCIIz_variable] ; 6 bytes xor al,al ; 2 bytes @@1: scasb ; 1 byte jnz @@1 ; 2 bytes Hehehe. Useful, short and good looking. What else do you need? ;) % Multiply shitz % ÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄ For example, while seeing the code for get the last section, the code most used includes this (we have in EAX the number of sections - 1): mov ecx,28h ; 5 bytes mul ecx ; 2 bytes And this saves the result in EAX, right? Well, we have a much better way to do this, with an only one instruction: imul eax,eax,28h ; 3 bytes IMUL stores in the first register indicated the result, result that is given to us multiplying the second register indicated with the third operand, in this case, it's an immediate. Heh, we saved 4 bytes of substituing only 2 instructions of code! % Infection mark % ÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄ It should work, anyway i'm not sure, because it doesn't in my computer. Pff, maybe an intel bug, or my system is crazy or something. Not sure, but anyway try it, as it is very interesting. Look how it should be unoptimized: cmp dword ptr [esi+44h],"MARK" ; 7 bytes jz oro_y_grana ; 2 bytes mov dword ptr [esi+44h],"MARK" ; 7 bytes Optimized, this should be in this way (i already said that it SHOULD work, but it doesn't in my PC): mov eax,"MARK" ; 5 bytes cmpxchg dword ptr [esi+44h],eax ; 4 bytes jz oro_y_grana ; 2 bytes Pfff, a really good optimization, 16 bytes reduced to 11 bytes ;) % UNICODE to ASCIIz % ÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄ There are many to do here. Specially done for Ring-0 viruses, there is a VxD service for do that, firstly i'm gonna explain how to do the optimization based in the use of this service, and finally i'll show Super's method, that saves TONS of bytes. Let's see the typical code (assumming EBP as ptr to ioreq structure and EDI pointing to file name: xor eax,eax ; 2 bytes push eax ; 1 byte mov eax,100h ; 5 bytes push eax ; 1 byte mov eax,[ebp+1Ch] ; 3 bytes mov eax,[eax+0Ch] ; 3 bytes add eax,4 ; 3 bytes push eax ; 1 byte push edi ; 1 byte @@3: int 20h ; 2 bytes dd 00400041h ; 4 bytes Well, particulary only 1 improve could be done to that code, substitute the third line with this: mov ah,1 ; 2 bytes Heh, but i said that Super improved this to the max. I haven't copied his code to get the ptr to the unicode name of file, because is almost ununders- tandable, but i catched the concept. Assumptions are EBP as ptr to ioreq structure and buffer as a 100h bytes buffer. Here goes some code: mov esi,[ebp+1Ch] ; 3 bytes mov esi,[esi+0Ch] ; 3 bytes lea edi,[ebp+buffer] ; 6 bytes @@l: movsb ; 1 byte Ä¿ dec edi ; 1 byte ³ This loop was cmpsb ; 1 byte ³ made by Super ;) jnz @@l ; 2 bytes ÄÙ Heh, the first of all routines (without local optimization) is 26 bytes, the same with that local optimization is 23 bytes, and the last routine, the structural optimization is 17 bytes. Whoaaaa!!! % VirtualSize calculation % ÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄ This title is an excuse for show you another strange opcode, very useful for VirtualSize calculations, as we have to add to it a value, and get the value that was there before our addition. Of course, the opcode i am talking about is XADD. Ok, ok, let's see the unoptimized VirtualSize calculation (i assume ESI as a ptr to last section header): mov eax,[esi+8] ; 3 bytes push eax ; 1 byte add dword ptr [esi+8],virus_size ; 7 bytes pop eax ; 1 byte And let's see how it should be with XADD: mov eax,virus_size ; 5 bytes xadd dword ptr [esi+8],eax ; 4 bytes With XADD we saved 3 bytes ;) Btw, XADD is a 486+ instruction. % Setting STACK frames % ÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄ Another Ring-0 thingy. Let's see it unoptimized: push ebp ; 1 byte mov ebp,esp ; 2 bytes sub esp,20h ; 3 bytes And if we optimize... enter 20h,00h ; 4 bytes Charming, isn't it? ;) % Tips & tricks % ÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄ Here i will put unclassificalble tricks for optimize, or if i assumed that you know them while making this article ;) - Never use JUMPS directive in your code. - Use string operations (MOVS, SCAS, CMPS, STOS, LODS) - Use LEA reg,[ebp+imm32] rather than MOV reg,offset imm32 / add reg,ebp - Make your assembler pass many times over the code (in TASM, /m5 is good) - Use the STACK, and avoid as much as possible to use variables - Try to avoid use AX,BX,CX,DX,SP,SI,DI and BP, as they occupy 1 byte more - Many operations (logical ones specially) are optimized for EAX/AL register - Use CDQ for clean EDX if EAX is lower than 80000000h - Use XOR reg,reg or SUB reg,reg for make a register to be zero - For bit operations use the "family" of BT (BT,BSR,BSF,BTR,BTF,BTS) - Use XCHG instead MOV if the register order doesn't matter - While pushing all values of IOREQ structure, use a loop - Use the HEAP as much as possible (API addresses, temp infection vars, etc) - If you like, use conditional MOVs (CMOVs), but they are 586+ - If you know how to, use the coprocessor (its stack, for example) - Use SET family of opcodes for use kinda semaphores in yer code % Final words % ÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄ I expect you understood at least the first optimizations put in this article because they are the ones that make me go mad. I know i am not the best at optimization, neither one of them. For me, the size doesn't matter. Anyway, the obvious optimizations must be done, at least for demonstrate you know to something in your life. Less unuseful bytes means a better virus, believe me. And don't come to me using the same words that QuantumG used in his Next Step virus. The optimizations i showed here WON'T make your virus to lose stability. Just try to use them, ok? It's very logic, guyz. Billy Belceb£, mass killer and ass kicker.