Wednesday, 15 November 2017

Mastering stack and heap for system reliability

The stack and the heap are random access memory (RAM) allocations that are fundamental to an embedded system. Setting them up properly is essential to system stability and reliability. Incorrectly used, they may cause your system to wreak havoc in strange ways. Stack and heap memory must be allocated statically by the programmer, but calculating the space required for the stack space is notoriously difficult for all but the smallest embedded systems.
Underestimating stack usage can lead to serious runtime errors that can be difficult to find, while over-estimating stack usage wastes memory resources. Heap memory overflows also can have significant an impact on system behavior, and can be notoriously difficult to debug. In this article, we’ll take a closer look at stacks and heaps for embedded systems, discussing principles and methods for reliable stack and heap design and some special issues that arise in small embedded systems.

Desktop systems and embedded systems share some common stack and heap design errors and considerations, but differ completely in many other aspects. One example of a difference between these environments is the available memory. Windows and Linux default to 1 and 8 Mbytes of stack space; a number that can be increased even more. Heap space is only limited by the available physical memory and/or page file size.
Embedded systems, on the other hand, have very limited memory resources, especially when it comes to RAM space. There is clearly a need to minimize stack and heap in this restricted memory environment. A common issue for small embedded systems is that there is no virtual memory mechanism; allocation of stack, heap and global data (i.e. variables, TCP/IP, USB buffers, etc.) is static and performed at the time the application is built.
The stack
The stack is an area of RAM where a program stores temporary data during the execution of code blocks. The stack is statically allocated and operates on a “last in first out” basis. The life span of variables on the stack is limited to the duration of the function. As soon as the function returns, the used stack memory will be free for use by subsequent function calls.
The types of data stored in the stack include:
  • local variables
  • return addresses
  • function arguments
  • compiler temporaries
  • interrupt contexts

Stack memory has to be allocated statically by the developer. The stack usually grows downward in memory. If the memory area allocated for the stack isn’t large enough, the executing code writes to the area allocated below the stack and an overflow situation occurs. The area written to is usually the area where global and static variables are stored (see Figure 1). As a result, underestimating stack usage can lead to serious runtime errors such as overwritten variables, wild pointers, and corrupted return addresses. All of these errors can be difficult to find. At the same time, overestimating stack usage wastes memory resources, which increases cost.
 
Figure 1: When the data to be stored is larger than stack capacity, it typically gets written to the memory area allocated for global and static variables.
Determining worst-case maximum stack depth is useful in most embedded projects, as it allows you to allocate only as much RAM as needed for the stack while leaving the rest free for the developer to use in their application. We will highlight some methods that can be used to reliably calculate the required stack size and detect stack related problems.
The heap
The heap is an area of RAM that represents the dynamic memory of the system. When one module does not need its allocated memory anymore, the developer should return it to the memory allocator to be reused by some other module. Dynamic memory makes memory sharing possible between different pieces of a program. As with the stack allocation, it is important to minimize heap usage in small embedded systems; indeed, dynamic memory and the heap can in many cases be considered optional in small embedded systems.
Some examples of data that is placed on the heap include:
  • Transient data objects
  • C++ new/delete
  • C++ STL containers
  • C++ exceptions
Because of the dynamic behavior of the application, calculating heap space in larger systems ranges from difficult to impossible. Moreover, there is not much tool support in the embedded world for measuring heap utilization, but we will discuss some methods to approximate and measure the heap usage of an application.
It is important to maintain heap integrity. Allocated data space is typically interspersed with critical memory allocator housekeeping data. Inefficient use of the allocated data space will corrupt the entire heap memory area and most likely result in an application crash with few traces of how the crash happened. We will discuss some methods to aid in checking for heap integrity.
Another aspect to consider is that the real-time performance of the heap is not deterministic. Memory allocation depends on such factors as previous use and the requested data space size. This ad hoc behavior makes testing and debugging heap-related problems exceedingly difficult.
Managing stacks and heaps
Stretching the limits in everyday life can be rewarding but can sometimes cause you trouble. Stretching the limits in programming when it comes to allocated data will definitely cause you trouble. If you’re lucky, that trouble will arise during system testing, but it might also manifest itself only when the product has already been delivered to thousands or millions of customers, thus necessitating an expensive recall or an embarrassing need for a software patch.
Overflowing allocated data can occur in all three storage areas: global memory, the stack, and the heap. Writing to arrays or pointer references can cause accesses outside of the memory allocated to the object. Some array accesses can be validated by static analysis, for example by the compiler issuing a warning message, a MISRA C checker, or by standalone static analysis tools) as follows:
int array[32];
array[35] = 0x1234;
When the array index is a variable expression, it is difficult for static analysis to find all problems. Pointer references are also hard to trace by static analysis, as this example shows:
int* p = malloc(32 * sizeof(int));
p += 35;
*p = 0x1234;
Runtime methods to catch object overflow errors have been available for desktop systems for a long time, e.g. Purify, Insure++, and Valgrind, to name a few. These tools work by instrumenting the application code to validate memory references at runtime. This comes at the price of slowing down application execution speed, dramatically increasing code size, and has thus not become a viable method for small embedded systems.


The stack is an area of system RAM dedicated to storing temporary variables and program data during the execution of code blocks. This is static memory that operates on a last-in, first-out basis.

A number of factors combine to make it difficult to calculate maximum stack usage. Many applications are complex and event driven, consisting of hundreds of functions and many interrupts. There interrupt functions can be triggered at any time and, if they are allowed to be nested, it becomes even more difficult to calculate the required stack size. There might be indirect calls using function pointers where the destination of the call could be a number of different functions. Recursion and un-annotated assembly routines will also cause problems for anyone who wants to calculate the maximum stack usage. Ergo, there is not a natural execution flow that can be easily followed.

Many microcontrollers implement multiple stacks, for example a system stack and a user stack. Multiple stacks are also a reality if you use an embedded RTOS like µC/OS, ThreadX, and others, in which case each task has its own stack area. Runtime libraries and third-party software are other factors that complicate the calculation because the source code for libraries and the RTOS may not be available.

It is also important to remember that changes to the code and the scheduling of the application can have a large impact on the stack usage. Different compilers and different optimization levels also generate different code that uses different amounts of stack. All of these factors make it essential to continuously track the maximum stack requirement.

Setting stack size
If you're designing an application, the stack size requirement is one factor that you must consider during the design phase, which means that you need a method for determining the amount of stack that you require. Allocating too much stack will waste RAM, but sometimes even if you allocate the entire remaining RAM for stack usage, you still cannot be sure it will be enough.

One approach is to test the system under conditions that produce worst-case stack behavior. During these tests you will need a method for finding out how much stack has been used. This can be done in two ways: from printouts of the current stack usage or by making sure that you can find traces of stack usage in memory after your test run has been completed. Worst case conditions are very hard to provoke in complex systems, however. Moreover, one fundamental problem with testing an event-driven system with many interrupts is that it is likely that some execution paths will never be tested.

Another approach is to calculate the theoretical maximum stack requirement. It’s easy to understand that calculating stack usage for a complete system manually is exceedingly difficult. Quickly and accurately performing this calculation requires a tool that can analyze the complete system. The tool must operate on either the binary image or the source code. A binary tool works at the machine instruction level to find all possible movements of the program counter through your code in order to find the worst-case execution path. A source code static analysis tool reads all of the compilation units of the application. In both cases, the tool must be able to determine direct function calls and indirect function calls through pointers in the compilation unit, and therefore compute a conservative stack usage profile across the entire system for all call graphs. The source code tool has to take into account the demands the compiler places on the stack, such as alignments and compiler temporaries; this can be done by the tool examining the object/executable code.

Writing this kind of tool yourself is a difficult task. Commercial alternatives exist, in the form of either stand-alone static stack calculation tools like PC-Lint, or those provided by a solution vendor, like the StackX stack calculation tool that is available from Express Logic. Compiler and linker also have the information needed to calculate the maximum stack requirement. This functionality is available, for example, in the IAR Embedded Workbench for ARM.

Estimating stack depth
One way of calculating the stack depth is to use the address of the current stack pointer. This requires simply taking the address of a function's argument or local variable. If this is done in the beginning of the main function and for each of the functions that you think are using the most stack, you can calculate the amount of stack your application requires.

Here is an example where we assume that the stack is growing from high to low addresses:

char *highStack, *lowStack;
int main(int argc, char *argv[])
{
    highStack = (char *)&argc;
    // ...
    printf("Current stack usage: %d\n", highStack - lowStack);
}


void deepest_stack_path_function(void)
{
    int a;
    lowStack = (char *)&a;
    // ...
}


This method can yield good results in small and deterministic systems. It is not perfect, however. For many systems it can be difficult to determine the nested function with the deepest stack usage and to provoke the worst case situation. In addition, the results obtained with this method do not take into account stack usage by interrupt functions.

A variant of this approach involves periodically sampling the stack pointer using a high frequency timer interrupt. The interrupt frequency should be set as high as possible without impacting the real-time performance of the application. Typical frequencies would be in the range of 10 to 250 kHz. The advantage of this variant is that there is no need to manually find the function with the deepest stack usage. It is also possible to find stack usage by interrupt functions if the sampling interrupt is allowed to preempt other interrupts. Care should be taken, however, as interrupt functions tend to be short in duration and might be missed by the sampling interrupt.

void sampling_timer_interrupt_handler(void)
{
    char* currentStack;
    int a;
    currentStack = (char *)&a;
    if (currentStack < lowStack) lowStack = currentStack;
}


The stack guard zone
A stack guard zone is a memory area allocated just below the stack, where the stack leaves traces if it overflows (see Figure 1). This method is always implemented on desktop systems where the operating system can easily be set up to detect memory protection errors for a stack overflow situation. On a small embedded system without MMU, a guard zone can still be inserted that will be quite useful. For a guard zone to be effective, it must be of a reasonably large size in order to catch writes to the guard zone.


Figure 1: Located just below the stack and memory, the guard zone captures traces of stack overflow.

The consistency checking of the guard zone can be made in software by regularly checking that the guard zone fill pattern is intact.

A better method can be implemented if the MCU is equipped with a memory protection unit. In that case, the memory protection unit can be set up to trigger on writes to the guard zone. If an access occurs, an exception will be triggered, and the exception handler can record what happened for later analysis.

Filling the stack area with a dedicated pattern
One technique to detect stack overflow is to fill the entire amount of memory allocated to the stack area with a dedicated fill value, for example 0xCD, before the application starts executing. Whenever the execution stops, the stack memory can be searched upwards from the end of the stack until a value that is not 0xCD is found, which is assumed to be how far the stack has been used. If the dedicated value cannot be found, the stack has consumed all stack space and most likely has overflowed.

Although this is a reasonably reliable way to track stack usage, there is no guarantee that it will detect a stack overflow. For example, a stack can incorrectly grow outside its bounds, and even modify memory outside the stack area, without actually modifying any of the bytes near the stack range. Likewise, your application might modify memory within the stack area by mistake.

This method of monitoring stack usage is commonly used by debuggers. This means that the debugger can display a graphical representation of the stack usage like that shown in Figure 1. The debugger does normally not detect a stack overflow when it happens; it can only detect the signs it leaves behind.

Linker-calculated maximum stack requirement
We will now take a closer look at the way build tools like compilers and linkers can calculate maximum stack requirements, using the IAR Embedded Workbench as an example (Figure 2). The linker can accurately calculate the maximum stack usage for each call graph root (each function that is not called from another function, like the start of the application), but it may require some input from the developer because this is only accurate if there is accurate stack usage information for each function in the application, of course. Stack usage provided for the depth of recursive functions and nested interrupts, among other things, is difficult to determine at compile-time.



Click on image to enlarge.

Figure 2: The IAR Embedded Workbench leverages the dedicated pattern technique to track stack calls for calculating maximum stack usage. In general, the compiler will generate this information for each C function, but in some situations you must provide stack-related information to the system. For example, if there are indirect calls (calls using function pointers) in your application, you must supply a list of possible functions that can be called from each calling function. You can do this by using pragma directives in the source file, or by using a separate stack usage control file when linking.

void
foo(int i)
{
    #pragma calls = fun1, fun2, fun3
    func_arr[i]();
}


If you use a stack usage control file, you can also supply stack usage information for functions in modules that do not have stack usage information.

The linker will also generate warnings if some necessary information is missing, for example under the following circumstances:
  • There is at least one function without stack usage information
  • There is at least one indirect call site in the application for which a list of possible called functions has not been supplied
  • There are no known indirect calls, but there is at least one uncalled function that is not known to be a call graph root
  • The application contains recursion (a cycle in the call graph)
  • There are calls to a function declared as a call graph root

When stack usage analysis is enabled, a stack usage chapter will be added to the linker map file, which provides a call chain for each call graph root and subsequently the maximum stack depth for each root.




Click on image to enlarge.

Table 1: Result of linker-calculated maximum stack usage

The total maximum stack usage for the complete system is calculated by adding together the result for each call graph root. In this analysis, the maximum possible stack usage will be:

500+24+24+12+92+8+1144+8+24+32+152 = 2020 bytes

It is important to remember that this type of stack usage analysis produces a worst case result. The application might not actually ever end up in the maximum call chain, by design or by coincidence.

The heap stores dynamic elements generated by the code during runtime and allows different elements of the program to share data and memory. Underestimating heap usage can lead to out-of-memory errors in malloc(). This situation can be detected easily by checking the return status of malloc(), but by then it will be too late. This is a serious situation in most embedded systems because there is typically no acceptable way to recover, which means that the only available response is to restart the application. Overestimating heap usage is necessary due to the dynamic nature of the heap; on the other hand, too large an estimate will waste memory resources. Two other failures that can occur when using the heap are overwritten heap data (variables and pointers), and corruption of the heap’s internal structure.

Before we continue, we'll recap the dynamic memory API:

void* malloc(size_t size)
  • allocates size bytes
  • returns a pointer to the allocated memory block
  • does not clear the memory block
  • returns NULL if no memory is left

free(void* p)
  • frees the memory space pointed to by p
  • requires p to have been returned by a previous call to malloc(), calloc(), or realloc()
  • calling free(p) more than once for the same p must be avoided

void* calloc(size_t nelem, size_t elsize)
  • is similar to malloc()
  • clears the memory block

void* realloc(void* p, size_t size)
  • is similar to malloc()
  • grows/shrinks a previously allocated block
  • the returned block might have a new address

C++
  • new operator similar to malloc()
  • new[]
  • delete operator similar to free()
  • delete[]

There are a number of dynamic memory allocator implementations available. The most commonly used today is Dlmalloc (Doug Lea’s memory allocator). Dlmalloc is found in Linux and also in many development tools for embedded systems. Dlmalloc is in the public domain and is freely available.

The internal structure for the heap is interspersed with data allocated by the application (Figure 1). If the application writes outside of the allocated data it might easily corrupt the internal structure of the heap.


Figure 1: Internal structure of the heap shows that writing outside of the allocated user data will corrupt the heap.

Calculating the amount of memory required for the heap is a non-trivial task. Most designers choose a trial and error approach just because the alternatives are too tedious. A typical approach would be to find the smallest heap for which the application still works and then increase that size by another 50%. If they are lucky, this works. If they are not, problems arise.

Mistakes to guard against
To reduce the risk of launching products with heap errors, programmers and code reviewers should be aware of several potential problem areas.
  • Initialization mistakes - Uninitialized global data is always initialized to zero. That well-known fact makes it easy to assume the same of the heap. Malloc(), realloc() and C++ new do not initialize the allocated data, although a special variant of malloc() called calloc() does initialize allocated data to zero. In C++ new, the appropriate constructor will be called, so make sure it initializes all elements.
  • Failure to distinguish between scalars and arrays - C++ has different operators for scalars and arrays: new and delete for scalars, new[] and delete[] for arrays.
  • Writing to already freed memory - This will either corrupt the internal memory allocator data structures or be overwritten later by a write to a subsequently legitimately allocated area. In any case, these errors are really hard to catch.
  • Failing to check return values - All of malloc(), realloc(), and calloc() return a NULL pointer to indicate an out-of-memory condition. Desktop systems will generate a memory fault when de-referencing the null pointer, which makes out-of-memory conditions easy to detect during development. Embedded systems may have flash at address zero and may survive with more subtle errors. If your MCU has a memory protection unit, you could configure it to generate a memory protection fault on write accesses to flash and other execute-only areas.
  • Freeing the same memory multiple times - This is likely to corrupt the internal memory allocator data structures, and is another error that can be challenging to detect.
  • Writing outside of the allocated area - This is likely to corrupt the internal memory allocator data structures, and is very difficult to discover.

The last three errors can be detected fairly easily if you put wrappers around the standard malloc(), free() and related functions. The wrappers must allocate a few extra bytes of memory to accommodate the extra information needed for the consistency checks.

One commonly used method is the “Magic Number” where a specially-chosen value is written at the top of the heap and is checked whenever memory is freed. If this value is not correct, then the heap has been corrupted (Figure 2).


Figure 2: Wrapper data layout shows a field size (bottom) used to calculate the magic number (top) that is used to detect corruption.

The size field below the data is used by the free() wrapper to find the ‘magic number’ (see sample code). The wrapper example uses 8 bytes of overhead per allocation, which should be acceptable for most applications. The example also shows how to override the C++ new and delete operators globally.

#include <stdint.h>
#include <stdlib.h>
#define MAGIC_NUMBER 0xefdcba98
uint32_t myMallocMaxMem;

void* MyMalloc(size_t bytes)
{
   uint8_t *p, *p_end;
   static uint8_t* mLow = (uint8_t*)0xffffffff; /* lowest address 
   returned by malloc() */
   static uint8_t* mHigh; /* highest address + data returned by
   malloc() */
   bytes = (bytes + 3) & ~3; /* ensure alignment for magic number */
   p = (uint8_t*)malloc(bytes + 8); /* add 2x32-bit for size and
   magic number */
   if (p == NULL)
   {
      abort(); /* out of memory */
   }
   *((uint32_t*)p) = bytes; /* remember size */
   *((uint32_t*)(p + 4 + bytes)) = MAGIC_NUMBER; /* write magic
   number after user allocation */
   /* crude method of estimating maximum used size since application
   start */
   if (p < mLow) mLow = p;
   p_end = p + bytes + 8;
   if (p_end > mHigh) mHigh = p_end;
   myMallocMaxMem = mHigh - mLow;
   return p + 4; /* allocated area starts after size */
}

void MyFree(void* vp)
{
   uint8_t* p = (uint8_t*)vp - 4;
   int bytes = *((uint32_t*)p);
   /* check that magic number is not corrupted */
   if (*((uint32_t*)(p + 4 + bytes)) != MAGIC_NUMBER)
   {
      abort(); /* error: data overflow or freeing already freed
      memory */
   }
   *((uint32_t*)(p + 4 + bytes)) = 0; /* remove magic number to be
   able to detect freeing already freed memory */
   free(p);
}

#ifdef __cplusplus
// global override of operator new, delete, new[] and delete[]
void* operator new (size_t bytes) { return MyMalloc(bytes); }
void operator delete (void *p) { MyFree(p); }
#endif

It is important to note that the example above will catch all such errors only if all allocated memory is also freed at some point. This may not be the case for some applications. In that case, the wrapper must maintain a list of all memory allocations and periodically validate all allocations recorded in the allocation list. The overhead of implementing this might not be as large as it sounds, though, since most embedded systems make relatively small use of dynamic memory, keeping the allocation list within a reasonable limit.

Setting the heap size
Now that we've reviewed challenges and pitfalls, let's consider techniques we can use to determine the minimum heap size needed by the application. The dynamic behavior and fragmentation that may occur make the task non-trivial. The approach recommended here is to run the application under a system test case that has been designed to force dynamic memory to be used as much as possible. It is important to repeatedly exercise transitions from low memory usage in order to see the effects of possible fragmentation. When the test case is completed, the maximum usage level of the heap should be compared to the actual heap size. A margin of 25 to 100% should be applied, depending on the nature of the application.

For systems that mimic desktop systems by emulating sbrk(), the maximum heap usage will be given by malloc_max_footprint(). For embedded systems that do not emulate sbrk(), it is common to give the entire heap to the memory allocator in one chunk. In that case, malloc_max_footprint() becomes worthless; it will simply return the size of the entire heap. One solution, for example, in the wrapper function described earlier, would be to call mallinfo() after every call to malloc() and observe the total allocated space (mallinfo->uordblks). It is important to note that Mallinfo() is computationally intensive, however, and will impact performance. A better method involves recording the maximum distances between the allocated areas. This can be easily performed and is shown in the wrapper example; the maximum value is recorded in the variable myMallocMaxMem. This method works, provided that the heap is one contiguous memory area.

Tool support
When it comes to detecting heap-related problems on desktop systems, designers can choose from well-known analysis tools such as Purify, Insure++, and Valgrind. Tools of this class are not available for small embedded systems due to performance reasons. however. Designers either have to rely on the methods described above or they must replace the dynamic memory allocator with one of the variants designed for consistency analysis and debugging. One such debug implementation of a memory allocator is Dmalloc.

Summary
Properly setting up the stack in the heap is an essential part of developing a safe, reliable embedded system. Even if calculating the stack and heap requirements is a complex and difficult task, a number of useful tools and methods exist to make the process easier. The time and money spent on good calculation during the development phase is well rewarded by not having to find problems related to overwritten stack and heap areas in a deployed system. Ultimately the exercise provides bigger benefits in the form of having designers develop and commercialize successful embedded products.

References
1. Nigel Jones, "Computing Your Stack Size: Stack Overflow"
2. John Regehr, “Say no to stack overflow,” EE Times Design, 2004.
3. Carnegie Mellon University, “Secure Coding in C and C++, Module 4, Dynamic Memory Management,” 2010.

No comments:

Post a Comment