┌─────────────────────────────┐
│ Xine - issue #5 - Phile 114 │
└─────────────────────────────┘

C to assembly, language point of view
-------------------------------------

Intro
-------

Many of you think C is useless for virus writers, even though many people
who start vxing handle C better than assembly. The real reason C is shut
out of virus writing is the dependency on the compiler, which blocks some
special manipulations.

But using a C compiler is not totally senseless. Even if the optimization
is not perfect, it can build code very close to hand-written assembly, if
you follow some rules. Of course you will lose a little on size, but you
will gain in stability, in portability, and finally in how fast you can code
your nice little routines, which you can still customize in assembly
afterwards.

The first step, code parser, analyzer
---------------------------------------

a) syntax analyzer
------------------

The major improvement of an HLL is the ability to analyse syntactic
expressions and evaluate them, something the processor can't do. The syntax
analyzer is much the same in every language: an expression comes down to a
series of mathematical operations on variables and known values. The best
way I found to build a syntax analyzer was a binary tree and recursion.

So, let's assume the simple case:

    b+c             +
                   / \
                  B   C

For this case, you need to have:

    b+c*d           +
                   / \
                  B   *
                     / \
                    C   D

The second operand of the plus has changed: it is now an expression itself.
For such operations you have to take care of operator priority. The * has
priority over the +, so you have to evaluate c*d before doing the +.

Now, to see why recursion and a binary tree are useful, imagine you have
(b+c*d)+a. Part of the expression tree stays the same, but on top of it you
have to add a "+a". After that, try (b+c*d)-(a*b+d).

Note: it is not important here whether a leaf is B or a value, it is just
symbolic.
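
The binary tree and recursion idea can be sketched in C roughly as follows.
This is only a minimal illustration, not the analyzer described above: the
Node layout and the names parse_expr, eval and eval_string are assumptions
made up for the example, variables are single letters a..z, and only +, -,
* and parentheses are handled. Operator priority falls out of the call
structure: parse_expr handles + and -, and calls parse_term, which handles *.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>

/* One node of the binary expression tree (hypothetical layout). */
typedef struct Node {
    char op;              /* '+', '-', '*', 'v' = variable, 'n' = number */
    long value;           /* the number, or a variable index 0..25 (a..z) */
    struct Node *left, *right;
} Node;

static Node *new_node(char op, long value, Node *left, Node *right)
{
    Node *n = malloc(sizeof *n);
    assert(n != NULL);
    n->op = op; n->value = value; n->left = left; n->right = right;
    return n;
}

static void skip_spaces(const char **s)
{
    while (isspace((unsigned char)**s)) (*s)++;
}

Node *parse_expr(const char **s);

/* factor := number | letter | '(' expr ')' */
static Node *parse_factor(const char **s)
{
    skip_spaces(s);
    if (**s == '(') {
        (*s)++;
        Node *n = parse_expr(s);          /* recursion: a full sub-expression */
        skip_spaces(s);
        if (**s == ')') (*s)++;
        return n;
    }
    if (isdigit((unsigned char)**s)) {
        char *end;
        long v = strtol(*s, &end, 10);
        *s = end;
        return new_node('n', v, NULL, NULL);
    }
    if (isalpha((unsigned char)**s)) {
        long idx = tolower((unsigned char)**s) - 'a';
        (*s)++;
        return new_node('v', idx, NULL, NULL);
    }
    return new_node('n', 0, NULL, NULL);  /* malformed input: treated as 0 */
}

/* term := factor { '*' factor }   ('*' binds tighter than '+' and '-') */
static Node *parse_term(const char **s)
{
    Node *left = parse_factor(s);
    for (;;) {
        skip_spaces(s);
        if (**s != '*') return left;
        (*s)++;
        left = new_node('*', 0, left, parse_factor(s));
    }
}

/* expr := term { ('+' | '-') term } */
Node *parse_expr(const char **s)
{
    Node *left = parse_term(s);
    for (;;) {
        skip_spaces(s);
        char op = **s;
        if (op != '+' && op != '-') return left;
        (*s)++;
        left = new_node(op, 0, left, parse_term(s));
    }
}

/* Walk the tree recursively; vars[0..25] hold the values of a..z. */
long eval(const Node *n, const long vars[26])
{
    switch (n->op) {
    case 'n': return n->value;
    case 'v': return vars[n->value];
    case '*': return eval(n->left, vars) * eval(n->right, vars);
    case '+': return eval(n->left, vars) + eval(n->right, vars);
    default:  return eval(n->left, vars) - eval(n->right, vars);
    }
}

/* Parse a string and evaluate it (the tree is leaked, for brevity). */
long eval_string(const char *text, const long vars[26])
{
    const char *p = text;
    return eval(parse_expr(&p), vars);
}
```

Parsing "b+c*d" with this sketch gives a '+' root whose right child is the
'*' node, which is the second tree drawn above. Wrapping it in parentheses
and adding "+a" only puts a new '+' node on top of that same subtree.
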

b) language needs
-----------------

You also need to know what type of data you are manipulating. Their
"classes" are different, and this has many repercussions later in compiling.
Normally an assembly programmer handles just nice little integers, which is
quite sufficient. Booleans are integers too, and most tests only check
whether the value is zero or not, instead of using the complex processor
flags, which seem better but are much more problematic for boolean
operations, for example.

Every language has the same skeleton: it handles tests, jumps, function
calls, and loops. For C it's easy: if true (non-zero, as seen above),
execute the next instruction. The next instruction can be a group of
instructions, which introduces the instruction block. Instruction blocks
should be seen like a directory tree with a main directory, where ordinary
instructions are subdirectories of their enclosing instruction. For that
you have to handle the text file in a certain way, line by line, so you are
forced to reconstruct the main body.

c) memory, variable, the magic key ?
------------------------------------

The only extensible area the processor has is memory, so variables, whose
number is extensible too, make the perfect link between memory and
constants.

Now we have two types of data, global and local. Globals have to be stored
with the program itself, while locals have to be allocated. In C the
compiler takes care of the allocation and deallocates when the block
finishes, which calls for a simple and fast memory allocator/deallocator.
The stack is just about perfect for that: as often seen in programs,
mov ebp,esp saves the current stack state. The job is then to work out, for
each of these values, a dynamic place on the stack.
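
Working out a place on the stack for each local is pure bookkeeping at
compile time, and a sketch of it fits in a few lines of C. Everything here
is an assumption made for the example (the Frame structure, the frame_*
names, byte offsets counted from the frame base, no alignment handling): on
entering a block the current top is remembered, each local gets the next
free offset, and leaving the block gives the space back so that a sibling
block can reuse it.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_DEPTH 32            /* deepest block nesting this sketch accepts */

typedef struct {
    size_t top;                 /* bytes currently handed out */
    size_t peak;                /* largest value 'top' ever reached */
    size_t saved[MAX_DEPTH];    /* value of 'top' at entry of each open block */
    int depth;                  /* number of open blocks */
} Frame;

void frame_init(Frame *f)
{
    f->top = 0;
    f->peak = 0;
    f->depth = 0;
}

/* Called when the parser meets '{' : remember where the block starts. */
void frame_enter_block(Frame *f)
{
    assert(f->depth < MAX_DEPTH);
    f->saved[f->depth++] = f->top;
}

/* Reserve 'size' bytes for a local; returns its offset in the frame. */
size_t frame_alloc(Frame *f, size_t size)
{
    size_t offset = f->top;
    f->top += size;
    if (f->top > f->peak)
        f->peak = f->top;
    return offset;
}

/* Called when the parser meets '}' : every local of the block is released. */
void frame_leave_block(Frame *f)
{
    assert(f->depth > 0);
    f->top = f->saved[--f->depth];
}
```

With this scheme the generated code only has to reserve peak bytes once at
function entry, and every local is then addressed at a fixed offset.
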

C++ also defines the built-in new and delete, which are just a typical
malloc(sizeof) and the matching free. They are not to be handled by the
dynamic stack calculation.

The second step, the meta-code
--------------------------------

Translating an expression for the processor is not as easy as you might
think, because today there are two styles of processor: the
accumulator-based one and the stack-based one, i.e. CPU/FPU. Often an
accumulator CPU can, with some manipulation, act like the FPU, but that
makes the code, the memory use and the data manipulation larger.

First, for the CPU, you often work with a working register, the accumulator
on Intel for example. Let's take our first example, b + c*d. Analysing the
tree, we get:

    load c
    mul  d
    save #1
    load b
    add  #1

Now the accumulator equals the expression. Note that each instruction here
takes just one operand: because we have no predefined processor, we assume
we are working with a very restricted one. There are three types of
operation: load, save, and the arithmetic operations. The tricky part is
handling that first save and the add that uses it.

Now let's look at how the same thing works with stack operations:

    push b
    push d
    push c
    mul
    plus

The principle of a stack processor is that the result is always pushed on
top of the stack, which plays the role of the accumulator of the first type.
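
Both listings can be produced by walking the expression tree recursively.
The following C sketch shows one way to do it. The Node layout, the
function names gen_acc and gen_stack and the text output format are all
assumptions made for the example. For the accumulator machine, the rule
used is: if the right operand is not a plain variable, compute it first and
save it in a fresh temporary #n, then compute the left side and apply the
operator. For the stack machine, a plain post-order walk is enough.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Expression tree node (hypothetical layout). */
typedef struct Node {
    char op;                         /* '+', '-', '*', or 0 for a leaf */
    const char *name;                /* variable name when op == 0 */
    const struct Node *left, *right;
} Node;

static const char *mnemonic(char op)
{
    switch (op) {
    case '+': return "add";
    case '-': return "sub";
    default:  return "mul";
    }
}

/* Append one line of meta-code to the string 'out' (capacity 'cap'). */
static void emit(char *out, size_t cap, const char *word, const char *arg)
{
    size_t used = strlen(out);
    assert(used < cap);
    snprintf(out + used, cap - used, "%s%s%s\n", word, *arg ? " " : "", arg);
}

/* Accumulator machine: after the call, the accumulator holds the value of n. */
void gen_acc(const Node *n, char *out, size_t cap, int *next_tmp)
{
    char tmp[16];

    if (n->op == 0) {
        emit(out, cap, "load", n->name);
    } else if (n->right->op == 0) {
        /* right side is a plain variable: use it directly as the operand */
        gen_acc(n->left, out, cap, next_tmp);
        emit(out, cap, mnemonic(n->op), n->right->name);
    } else {
        /* right side is an expression: compute it first and park it in #n */
        gen_acc(n->right, out, cap, next_tmp);
        snprintf(tmp, sizeof tmp, "#%d", ++*next_tmp);
        emit(out, cap, "save", tmp);
        gen_acc(n->left, out, cap, next_tmp);
        emit(out, cap, mnemonic(n->op), tmp);
    }
}

/* Stack machine: post-order walk, the result ends up on top of the stack. */
void gen_stack(const Node *n, char *out, size_t cap)
{
    if (n->op == 0) {
        emit(out, cap, "push", n->name);
        return;
    }
    gen_stack(n->left, out, cap);
    gen_stack(n->right, out, cap);
    emit(out, cap, n->op == '+' ? "plus" : mnemonic(n->op), "");
}
```

For the tree of b + c*d, gen_acc produces the five accumulator lines listed
above. gen_stack pushes c before d, where the listing above pushes d first.
For * and + the operand order makes no difference, but for a subtraction it
would have to be fixed one way or the other.
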

Third step, translating the code
--------------------------------

Translating the meta-code to assembly output is much easier, because the
meta-code is a series of macros, and these macros can be turned directly
into small groups of instructions. Let's see:

    load c   ->  mov eax,[esp+c]
    mul d    ->  mov ecx,[esp+d], xor edx,edx, mul ecx
    save #1  ->  mov [esp+#1],eax     // # are dynamically allocated
    load b   ->  mov eax,[esp+b]
    add #1   ->  add eax,[esp+#1]

And for the second one, let's see:

    push b   ->  fld qword ptr [esp+b]
    push d   ->  fld qword ptr [esp+d]
    push c   ->  fld qword ptr [esp+c]
    mul      ->  fmul st(0),st(1), fxch, fstp *
    plus     ->  fadd st(0),st(1), fxch, fstp *

Here fstp * means we send ST(0) to the trash. It can be done, for example,
with sub esp,8 , fstp qword ptr [esp] , add esp,8.

The operands of fmul and fadd are written out because some assemblers encode
the bare mnemonics as the popping forms fmulp/faddp, in which case the
following fxch, fstp * would throw away a value that is still needed.

Conclusion
----------

I hope you found this interesting. I hope to see multiplatform viruses
soon :)