Beginner Tutorial #10 Anti-tampering Techniques Theory v1.0 |
| The Target: none |
| The Tools: Ollydbg 1.10 |
| The Protection: none |
Other Information: |
Best viewed in Firefox at 1280x1024
|
1. Introduction |
|
This tutorial is meant to cover a topic that is usually just slightly pointed in common tutorials. The argument is the anti-tampering techniques used by programs to counteract the modifications of their code. Told this way, it could seem something you never seen around, but this is the name of techniques such as CRC self-checking of an application, MD5, SHA1, and so on. Did you realize it? Self-checking (also called self-validation or integrity checking) is a technique in which a program, while running, checks itself to verify that it has not been modified. Usually it's useful to distinguish between static self-checking, in which the program checks its integrity only once, during startup, and dynamic self-checking, in which the program repeatedly verifies its integrity, as it is running. While self-checking alone is not sufficient to robustly protect software, is nowadays one of the most easier protections to implement and widely used either in commercial or custom protections. To correctly discuss this issue I'll resemble some things coming from different sources and some contributions on my own. As usual the distinction between my original work and what is coming from another source is left intentionally hidden. This tutorial presents the anti-tampering problem and some details of the CPU memory management which will help understanding the whole thing. An example in C is then built and debugged as a proof of concetpts. have phun.
|
|
2. Classification of Attacks |
|
"The security of a cypher scheme shouldn't rely on
keeping secret the algorithm. Before starting to discuss the argument it is better to introduce a classification of attacks types, which helps keeping ideas clear. Broadly speaking the attacks to a system (software or not) can be classified into three categories based on the nature of malicious agent.
Under the heading of software protection, there is another little classification to discuss. The techniques for protecting programs can be distinguished rougly into
|
|
3. Tamper Resistance Techniques and Checksumming |
|
Introduction. There are of course different techniques for self-checking software (e.g. program or result checking and generation of the correct executable or part of code on the fly, decryption according to a calculated digest) partially derived from asimmetric signature algorithms, but we'll focus more on checksumming. The standard threat model for software tamper resistance is the hostile host model. The challenge is to protect an application running on a system controlled by a malicious, intelligent user. Because such a user can in theory change any code on the computer, other software on such a system, including the operating system, is untrusted; in the case of particularly determined adversaries, even the hardware is untrusted. This situation is in contrast with the hostile client problem in which we assume a trusted host and untrusted application. The hostile client problem appears to be an easier problem to solve; numerous solutions have been developed, e.g. sandboxing and is typical of mobile agents such as smartphones or PDAs. Since a single checksum is relatively easy for an attacker to disable, stronger proposals rely on a network of inter-connected checksums, all of which must be disabled to defeat tamper resistance. A tester reads the area of memory occupied by code and readonly data, building up a checksum result based on the data read. A subsequent section of the code may operate on the checksum result, affecting program stability or correctness in a negative way if a checksum result is not the same as a known good value pre-computed at compile time. The sections of code which perform the checksumming operations may be further hidden using code obfuscation techniques to prevent static analysis. Ideally the effects of a bad checksum result in the program are subtle (e.g. causing mysterious failures much later in execution) thus making it much more difficult for an attacker to locate the checksum code.
Figure 1 gives a simplified view of a typical distribution of checksum code within an application. In practise, there may be hundreds of checksum blocks hidden within the main application code. Each allows verification of the integrity of a predetermined section of the code segment. The read-only data segment may also be similarly checked. The checksumming code is inserted at compile time and integrated with
regular execution code. The application also requires a correct checksum
result for each block in order to work properly. High level view of how the protection works, Testers and Guards Two of the most common guards are checksumming and repairing.
The security of the guards actually comes from guards watching each other's backs. Each guard protects fragments of code, and that includes protecting other guards. Figure 1.1 shows an example of a possible network of guards.
Figure 1.1 - A possible distribution of cross-checking guards of an application's integrity. A group of guards working together implement a sophisticated
protection scheme that is more resilient against attacks than a single
guard. These schemes are used by the most modern protectors, such as
Armadillo also: the nanomities trick is just a way to implements a Guard
scheme and an interlocking scheme, made of two processes, controlling
one each other.
Defeating complex Guard Networks can be, generally speaking, a nightmare. Anti-Tampering details
Note that a critical (implicit) assumption of checksumming
algorithms is that D(x) = I(x), where D(x)
is the bit-string result of a “data read” from memory
address x, and I(x) is the bit-string result of an “instruction
fetch” of corresponding length from x. In the
practical everyday cases this is the situation, because D(x) and
I(x) are the same portion of memory "read as data and as
instructions". While current checksumming proposal critically rely on this assumptions there are opportunities to violate this assumption on several modern processors..
|
|
4. CPU Support for Virtual Memory |
|
This section provides background information for those less familiar with the virtual memory subsystems of modern processors, including translation look-aside buffers (TLBs). Thos of you already familiar with processor architecture can jump directly to the next section. Modern processors do much more than execute a sequence of instructions. Advances in processor speed and flexibility have resulted in a very complex architecture. A significant part of this complexity comes from mechanisms designed to efficiently support virtual memory. Virtual memory, first introduced in the late 1950’s, involves splitting main memory into an array of frames (pages) which can be subsequently manipulated. Virtual addresses used by an application program are mapped into physical addresses by the virtual memory system.
Even though the page table translation algorithmmay vary slightly between processors and may sometimes be implemented in software, modern processors all use roughly the same method for translating a virtual page number to a physical frame number. Specifically, this translation is performed through the use of page tables, which are arrays that associate a selected number of virtual page numbers with physical frame numbers. Because the virtual address spaces of most processes are both large and sparse, page table entries are only allocated for the portions of the address space that are actually used. To determine the physical address corresponding to a given virtual address, the appropriate page table, and the correct entry within that page table must be located. For systems that uses 3-level page tables, a virtual address is divided into four fields, x1 through x4. The x1 bits (the directory offset) specify an entry in a perprocess page directory. The entry contains the address of a page map table. The x2 bits (the map offset) are used as an offset within the specified page map table, giving the address of a page table. The x3 bits (the table offset) index into the chosen page table, returning the number of a physical page frame. x4, then, specifies the offset within a physical frame that contains the data referred to by the original virtual address. This resolution process is illustrated in Figure 3. Note that if memory segments are used, segment translation typically occurs before operations involving the page table.
TLB Translation look Aside Buffer
A TLB caches recently used mappings of virtual page numbers to physical
page frames. On every virtual memory access, all entries in a TLB are
checked to see whether any of them contain the correct virtual page
number. If an entry is found for the virtual page number, a TLB hit
has
occurred, and the corresponding physical page frame is immediately
accessed. Otherwise, we have a TLB miss, and the appropriate page tables
are consulted in the fashion
Because of the principal of locality, TLB translation works very well in
practise. System designers have noticed, however, that code and data
exhibit different patterns of locality. To prevent interference between
these patterns, caches of code and data are often separated; for similar
reasons, most modern CPUs have separate code and data TLBs.
Page swapping Since not all pages of virtual memory may map to a physical page, there must be some way for the processor to inform the OS when a virtual address does not have a physical mapping. The processor does this through the use of a page fault interrupt. The processor will store the virtual address which caused the page fault in a register, and then signal the operating system through an interrupt handler. The operating system updates the mapping of virtual to physical addresses, so that the requested virtual address can be mapped to a physical address. This may mean bringing the section of the program into physical memory from disk or some other external storage. The OS then signals the processor to retry the instruction by returning from the interrupt. The OS also has the choice of aborting execution of the application if it determines that the virtual address is invalid, e.g. if the virtual address refers to memory that has not been allocated.
Access Controls on Memory. In addition, there are protection mechanisms for pages which are in a process’s address space. Each mapped page is restricted in the types of operations that may be performed on its contents: read, write, and instruction fetch (also called execute). Permitted operations are specified using control bits associated with each page table entry. Read and write are common operations on data pages, while executing code is commonly associated with a page containing executable code.
Modern operating systems take advantage of the protection mechanisms
implemented by the processor to distinguish various types of memory
usage. The ability to set no-execute permission on a per-page basis produces the restriction that many programs are confined to executing code from their code segment, unless they take specific action to make their data executable. Although such changes can interfere with systems that generate machine code at runtime (e.g. modern Java Virtual Machines), many types of code injection attacks can be defeated by nonexecutable data pages
Table 1 shows the ideal separation of privileges for different sections of an application. This separation of privileges is currently assumed in executable file formats. All processors implementing page level access controls must check for disallowed operations and signal the operating system appropriately. Most often, the operating system is signalled through the page fault interrupt, which indicates the memory reference that caused the invalid operation. |
|
5. The memory model of the x86 architecture, the segments |
|
In addition to supporting memory pages, the x86 can also manage memory in
variable sized chunks known as segments. Associated with each segment is a base address, size, permissions, and other meta-data. Together this information forms a segment descriptor. To use a given segment descriptor, its value is loaded into one of the segment registers. Other than segment descriptor numbers, the contents of these registers are inaccessible to all software. In order to update a segment register, the corresponding segment descriptor must be modified in kernel memory and then reloaded into the segment register. A logical address consists of a segment register specifier and offset. To derive a linear address, a segment register’s segment base (named by the segment specifier) is added to the segment offset. An illustration of the complete translation mechanism for the x86 architecture is shown in Figure 6. Code reads are always relative to the code segment (CS) register, while normally, if no segment register is specified data reads use the data segment (DS) register. Through segment overrides a data read can use any segment register including CS. After obtaining a linear address, normal page table translation is done as shown in Figure 6 and Figure 7.
Unlike pages on the x86, segments can be set to only allow instruction reads (execute-only). Data reads and writes to an execute-only segment will generate an exception. This execute-only permission can be used to detect when an application attempts to read memory relative to CS. As soon as the exception is delivered to an OS modified for our attack, the OS can automatically modify the memory map (see Figure 7) to make it appear as if the unmodified data was present at that memory page. Most operating systems for x86, however, now implement a flat memory model. This means that the base value for the CS and DS registers are equal; an application need not use the CS register to read its code. A flat memory model will ensure that both linear addresses are the same, resulting in the same physical address (see the dash-dot-dot line in Figure 7).
|
|
6. Writing and debugging a selfchecking program in C |
||||
Reasuming what we said till now it can be stated that: "detecting
whether portions of a binary have been modified is essentially an
error-detection problem; therefore, a checksum algorithm such as CRC32,
MD5, or SHA1 can be used to generate a signature for an arbitrary block
of code or data. This signature can then be checked at runtime to
determine whether any modification has taken place."
Implementation of a checksum API in C #define
CRC_START_BLOCK(label) void label(void) { } #include <stdio.h> It is tempting to generate a checksum of the entire program and use this to determine whether any bytes have been changed; however, this causes a loss of granularity for the checksum and can degrade performance. Instead, as explained in previous sections is always better to generate multiple checksums for vital sections of code. These checksums can be encrypted, and they can even be supplemented with checksums of trivial blocks of code to disguise which portions of code are significant. Notes about compiling the example1 with VisualC++ 6.0 To compile use "cl crc1.c /DTEST_BUILD" or "cl crc1.c"
I properly placed the comments to help you find the correspondence with the written code (refers to the modified code used for the example1 and not to the code above). Place a Breakpoint with F2 at the JMP at 0x0040114C and press F9 to reach the breakpoint. First of all if you step with F8 the program at 0x00401180 you can see that the real CRC just-in-time calculated is 0x6C8673E6 (it's in ECX). Well you have the opportunity then to run again the application from the beginning and to stop again the application at 0x0040114C. Now go to location 0x401152 and select "Follow in dump" option to let you view in Ollydbg the memory at that location. Now insert the real CRC value you've just seen before (0x6C8673E6), remember to insert it as 0xE673866C.
Weel, obviouslly the markers are not mandatory and are useful only to easily find the proper point where to insert the real CRC value, the whole routine can be more efficiently hidden and the check can be made less evident. Analizying a more more complex example, example2 The next program demonstrates how to use a table of function pointers
to test the value of a checksum. Each nibble or half-byte in the
checksum is used as an index into a 16-entry table of function pointers;
only the correct table entry calls the function to check the next
nibble. This method requires 8 tables of 16 function pointers so that
one table is used for each nibble in the checksum. When this program is compiled with TEST_BUILD defined, the resulting binary will print the CRC32 computed for the function test_routine( ). If the computed CRC32 is 0xFFF7FB7C, the following table indices will represent valid function pointers: b1[12], b2[7], b3[11], b4[15], b5[7], b6[15], b7[15], b8[15]. Each of these contains a pointer to the function that will process the next nibble in the checksum, except for b8[15], which contains a pointer to the function that is called when the checksum has proven valid. The tables in the source can now be rewritten to reflect these correct values: crc_check_fn b1[16] = { 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, crc_nib2, 0, 0, 0 }, /*C*/b2[16] = { 0, 0, 0, 0, 0, 0, 0, crc_nib3, 0, 0, 0, 0, 0, 0, 0, 0 }, /*7*/ b3[16] = { 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, crc_nib4, 0, 0, 0, 0 }, /*B*/ b4[16] = { 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, crc_nib5 }, /*F*/ b5[16] = { 0, 0, 0, 0, 0, 0, 0, crc_nib6, 0, 0, 0, 0, 0, 0, 0, 0 }, /*7*/ b6[16] = { 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, crc_nib7 }, /*F*/ b7[16] = { 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, crc_nib8 }, /*F*/ b8[16] = { 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, crc_good }; /*F*/ Obviously, the NULL bytes will have to be replaced with other values to disguise the fact that they are invalid entries. They can be replaced with pointers to functions that handle incorrect checksums, or they can be filled with garbage values to make the program unstable. For example: crc_check_fn b8[16] = { crc_good - 64, crc_good - 60, crc_good - 56, crc_good - 52,crc_good - 48, crc_good - 44, crc_good - 40, crc_good - 36, crc_good - 32, crc_good - 28, crc_good - 24, crc_good - 20, crc_good - 16, crc_good - 12, crc_good - 8, crc_good - 4, crc_good }; In this table, the use of incrementally increasing values causes the table to appear to be valid data, as opposed to addresses in the code segment Notes about compiling the example1 with VisualC++ 6.0
Looking the example2 into OllyDbg crc_check_fn b1[16] = { 0, 0, 0, 0, 0, 0,
crc_nib2, 0, 0, 0, 0, 0, 0, 0, 0, 0 }, /*6*/ If you try to open the crc2_step2 into Olly you can see that it is not a so trivial task to modify the code and still obtain a valid running program. Looking for the references to the string "CRC is valid" you should land at 0x004012C4. If you place a breakpoint with F2 and press run to the program with F9 you land to the correct place. Now if you press CTRL-K to see the call stack you should easily see that this subroutine is called from the 004012BA |. FF55 F8 CALL [LOCAL.2] ; crc2_ste.004012C4
Well, indeed there's still space for a more intelligent reversing of this application. We will see now how crypto algorithms can be identified using the constants they use. CRC32 is a very easy algorithm but also all the other more complex are not so complex..we'll see it in the next, last, section! Have a little more patient, we arriving at the end ^_°
|
|
7. Identifying self-checking algorithms into programs |
|
Breaking the example1 program To introduce you to the problem let approach to the program crc1_ok.exe inside the example1 folder in the classical way. First of all be sure to have PEiD with the KryptoAnalizer (aka KANAL) installed. Run PEiD and select the crc1_ok.exe program, then select plugins and run the plugin just mentioned. You should obtain (I assume you are alredy expert on PEiD, otherwise return to Beginner tut #1) something like below:
The plugin reports clearly an address at 0x004010DF. Well, fire up Olly and go there to see what's happening there..
You landed here, to get the meaning of the place where you are move a little the cursors, to reallign to the actual real code (you landed in the middle of an instruction, why it will be clear in a moment). Your Olly then should show something like below:
Well, we were at the address of a constant, 0xEDB88320 which was part of an instruction XOR ECX,EDB88320 Now look the CRC32 implementation I did in Section 6, you can find this line: #define CRC_POLY 0xEDB88320L This is the same constant KANAL plugin found. If you have the opportunity/time to see the sources of KANAL (freely available on PEiD site) you will realize that KANAL is doing nothing more than searching for specific constants in memory, any crypto algorithm (better hashing algorithm) uses some specific constants somewhere in the algorithm. At this stage we then understood we landed in the middle of CRC32 algorithm implementation. What we should do is to go at the entrypoint of the CRC32 routine that is, in the example1 sources, the crc32_calc() function. The function where we landed starts here: 0040108F /$ 55 PUSH EBP and is called here: 00401016 |. E8 74000000 CALL crc1_ok.0040108F which is inside a function which is called here:
00401172 |. E8 89FEFFFF CALL crc1_ok.00401000 ; \crc1_ok.00401000 Breaking the example2 program
As done before go to the address shown by KANAL, go up to the entry point of the CRC32 function and you finally land here:
004012EA |. E8 11FDFFFF CALL crc2_ste.00401000 ; \crc2_ste.00401000 Well, here's how the program is compiled and I'll leave to you the opportunity to understand how to fix it when a change is done.
This is the routine into which the CRC32 calculation is performed. I commented it with reference to the sources we wrote in Section 6. The crc_check() function is called at:
004012F9 |. E8 09000000 CALL crc2_ste.00401307 ; \call crc_check(&crc)
The function elaborates a bit the CRC32 value computed and uses it to call the next correct function here:
00401336 |. FF55 F8 CALL [LOCAL.2] ; call the function returned at
address b1[6], where 6 is the value found at 00401312 For example the sequence of calls with a correct CRC32 is:
00401307 -> 00401135 -> 0040116E -> 004011A7 -> 004011E0 -> 00401219 ->
00401252 -> 0040128B -> 004012C4
|
|
Conclusions |
|
What I shown here is a long journey on self-checking programs and the
anti-tampering techniques programs uses. I also shown the theoretical
assumptions that are behind these solutions and two practical examples.
You should have realized how easy is to build custom protections for
programs, that combined with known commercial protectors give extra
barrires to crackers. Fortunately on the other hand much programmers do
not ever consider them doing our cracking-lifes much of the time easy.
|
|
References |
|
I suggest the following further readings from now on to complete this argument..
..and essentially all the tutorials seens around (also others on our tutorials page) which often target checksummed applications..
|
|
Greetings |
|
[Nilrem] [JDog45] [Shub - Nigurrath] [MaDMAn_H3rCuL3s] [Ferrari] [Kruger] [Teerayoot] [R@dier] [ThunderPwr] [Eggi] [EJ12N] [Stickman 373] [Bone Enterprise] Thanks to all the people who take time to write tutorials. Thanks to all the people who continue to develop better tools. Thanks to Exetools, Woodmann, SND, TSRH, MP2K and all the others for being a great place of learning. Thanks also to The Codebreakers Journal, and the Anticrack forum. If you have any suggestions, comments or corrections contact me in usual places.. |