Boost C++ Libraries Preliminary

...one of the most highly regarded and expertly designed C++ library projects in the world. Herb Sutter and Andrei Alexandrescu, C++ Coding Standards



The Counter Tree Library





4.1.- Problems with allocators in Windows and Linux

 

The allocator is the data structure defined in the STL, which is the interface between the data structures and the memory provided generally by the Operating System. The allocators have a well defined interface. The data structures don't need to know anything about the allocator except the interface. The allocator manages the memory received from the Operating System, and the memory requested in the allocate operations and the memory returned in the deallocate operations.

The structure of an allocator is :

Example of Allocator

The allocator manage the requests from the data structure, and the request to the Operating System. If you need a specific allocator ( for example if you use shared memory  .. ), the management of the elements is the included in the allocator.
 
Working with allocators we have several problems related
 

a) THE SPEED

To write an allocator for to manage a wide range of size elements and a big number of elements (several millions) of each size in a fast way, it's a very difficult task. When the number of elements grows, many allocators begin to have speed problems.


A very hard test for the allocators are the data structures which need a very high number of elements of the same size, like the STL data structures  list, set, multiset, map or multimap.

Some years ago, only supercomputers have 4 GB of memory. Now, the cheap laptop have 6 GB, and is expected the memory of the computers grow Many allocators of modern compilers are not prepared for to manage hundred of millions of elements. With a small number of elements they are efficient, but when have a great number the speed down, and appear problems with the memory management. It is a challenge for the HW designers and mainly for the SW designers


For to improve the speed allocating the small size elements, many allocators request to the Operating System big chucks of memory. With this, the allocator don't need request memory to the operating system for each allocation. These allocators are named pool allocators.


b) MEMORY USED BY THE ALLOCATOR, MEMORY CONSUMPTION BY THE PROGRAM

Many pool allocators don't return well the unused chucks of memory to the Operating System  and the memory  used by the allocator is the maximum used, never decrease . If you have a small number of elements, you have a small problem, small resources and small time operations. But, if you have several millions of elements allocated, perhaps you are using several GB of memory. Running a program with GB of unused memory, because the allocator don't return the memory request, is a great waste of resources.

This problem is specially important in two environments :
  •  When you have a limited resources, as occurs in the mobile phones, pads , tablets...
  •  When the programs must run continuously in a multitask environment

The std::allocator of GCC 4.6 and CLANG 3.0, don't return the memory to the Operating System with the allocation of several millions of small and fixed size elements. The std::allocator of Visual Studio 10, return the memory to the Operating System, but is slower than the allocator of GCC 4.6 and CLANG 3.0

If you use an additional allocator for to improve the speed of the small fixed size elements, this improve the speed, but increase the memory consumption, which is the sum of the maximum memory of the pool allocator and the memory used by the  allocator. If the pool allocator don't free the unused memory,  this memory can't be reused by the allocator.

Other additional problem is the information around the memory allocate. Many allocators, when ou request memory for to allocate a double number. The allocator return you a pointer to a 8 bytes of memory, but around that memory you have additional information used by the allocator, for to deallocate that memory. By example, allocate 50.000.000 elements of 64 bits, the std::allocator of GCC 4.6 use 1.56 Gigas of memory and the boost::fast_pool_allocator 0.52 Gigas


c) THE CACHE PERFORMANCE

The last problem associated to the allocators is the cache performance. When you have a big data structure (imagine a std::set with 50.000.000 elements), travel the structure for to find an element or for to insert an element, you must cross though many nodes. Calculate the hit rate of the cache is extremely difficult. But some measures can provide us a very useful information.

Take in mind, that the performance can have great variations, depending of the processor, and mainly of the cache size. On my computer in the allocation of 50.000.000 elements of 64 bits, on Linux 64 bits

NAME Time Spent Time of individual allocation
std::allocator 1.37 seconds 31.2 nanoseconds
boost::fast_pool_allocator  0.58 seconds 11.6 nanoseconds

 If we check the time with the std::set and different allocators in the insertion of 30.000.000 random elements of 64 bits (In the two test insert the same sequence of numbers)

NAME Time Spent Time of individual insertion
std::allocator 95.35seconds 3.17 microseconds
boost::fast_pool_allocator 61.22 seconds 2.04 microseconds

The time difference between allocate and insert one element is more than 1 microsecond. The time difference between in the allocation is 20 nanoseconds. If you subtract 20 nanoseconds to the
time of the std::allocator you have 3.15 microseconds. Where is the reason of the difference of more than a microsecond in each operation? The response is the cache performance.


4.2.- Description of the Suballocators


The suballocator born with the idea of correct these problems. But, what is a suballocator ?  The suballocator is a layer over the allocator.  It manage in very fast way a greater number (hundred millions) of fixed size elements.

The suballocator receives an allocator as template parameter. When the suballocator needs memory, request memory from the allocator, and when the memory is not used, is returned to the allocator. The suballocator present the same interface than the STL allocator. Form the view point of the data structures, the suballocator is other allocator.


Example of Allocator


The suballocator is a pool allocator with two important characteristics :

a) Always provide the first element free. This strategy improves the compactation of the area used for to allocate the elements.  This improve the cache performance and the speed of the data structures.

b) The suballocator have a aggressive strategy in order to free and deallocate from the allocator tha last chunk obtained. With each element deallocated , the suballocator check it. This simple memory schema permit to the allocators return memory to the Operating System and decrease the memory used by the program, as can see in the benchmarks


The results obtained in the benchmarks confirm the theory described here. These result depend in a very important part of the cache size. In a QuadCore with 6M of cache, the suballocator is 3 times faster than the GCC 4.6 std::allocator, and in a AMD Athlon with 1 M of cache only 1.5 times faster)


With the suballocator

a) We have a very fast allocation (between 1.5 and 3 times faster than the  std::allocator of  GCC 4.6, CLANG 3.0  and Visual Studio 10 *See details in the Suballocator Benchmark  )

b) Return memory to the allocator, for to be used by others types of data. Many allocators,  return this memory to the Operating System, and decrease the memory used by the program, ( as you can see in the Suballocator Benchmark  )

c) You can use with any allocator if it is according with the STL definition. The suballocator provides speed and memory management to any allocator                  

d)  The suballocator always provide the first position free. With this we obtain a very compact data areas, which improve the cache performance ( 35% of the time saved in the insertion in a std::set. See details in the second graph in Benchmarks)

The next step is the join of the two previous concepts.  The idea is to integrate the suballocator inside the data structure. The STL vector have a management of the memory used by itself. This is the same idea applied to the data structures. I had done with the cntree data structures : vector_tree, set , map , multiset and multimap,  but it is trivial to do with any other data structure.with fixed size elements.

The results are data structures, defined in the cntree namespace with the suffix pool ( vector_tree_pool, set_pool, multiset_pool, map_pool, multimap_pool) . They have identical interface than the STL structures with the same name with the suffix pool, but with an internal suballocator , which increase the speed of the data structure.


Example of Allocator

INTERNAL DESIGN

The internal algorithms and data structures are described in the document The_suballocator_algorithms.pdf.

The suballocator permit to do  the splice and merge operations of the std::list, and for move elements from one data structure to other as need with the rvalues. This permit to the iterators to the elements remain valid after these operations. They pointed to the same element, but in different data structure.

The suballocator support the void elements  cntree::suballocator < std::allocator <void> >


FUNCTIONAL DESCRIPTION

The idea is simple, the suballocator is a template class which receives an allocator as template parameter. This allocator provide memory to the suballocator and deallocate when is freed from the suballocator.

There are two classes suballocator32 and suballocator64. The two classses can run in a 32 bits or a 64 bits environment. In a 32 bits environment, the suballocator64 is slower than suballocator32. In a 64 bits environment  suballocator64 is slightly faster than suballocator32. The class suballocator detect the environment and select the version according with the environment

The class suballocator receives an allocator as template parameter. You need to include the file boost/cntree/suballocator.hpp. The classes are defined in the namespace cntree .The interface of suballocator32, suballocator64 and suballocator are identical to the definition of the class allocator in the STL. It is so simple as appear.



#include <boost/cntree/suballocator.hpp>

cntree::suballocator< std::allocator<double> >A ;

cntree::suballocator32< std::allocator<double> >B ;

cntree::suballocator64< std::allocator<double> >C ;



The suballocators run well with any number of element allocated, but when show all the power is when the number of element up to several millions

You can use the suballocator with all the allocations, but you must know that when the number of elements requested to the allocator is greater than 1, the memory is obtained directly from the allocator, not from the static pool.

cntree::suballocator32          Documentation and Code
cntree::suballocator64          Documentation and Code
cntree::suballocator              Documentation and Code


FUTURE IMPROVEMENTS




4.3.- Examples

 



///----------------------------------------------------------------------------
// FILE : example_suballocator_00.cpp
//
// DESCRIPTION : Test program of the class suballocator
//
// MODIFICATIONS (AUTHOR,DATE,REASON) :
//  Copyright (c) 2010 2012 Francisco José Tapia (fjtapia@gmail.com )
//
// Distributed under the Boost Software License, Version 1.0. (See accompanying
// file LICENSE_1_0.txt or copy at http://www.boost.org/LICENSE_1_0.txt)
//
// NOTES : This program allocate 100.000.000 double elements and
//   deallocate measuring the times with the function clock()
//-----------------------------------------------------------------------------
#include <boost/cntree/suballocator.hpp>
#include <iostream>
#include <time.h>

#define NELEM 100000000


int main ( void)
{   //------------------------------ begin ----------------------------------
    double duracion ;
    clock_t start, finish ;
    double ** K = new double* [NELEM];

    //-------- Creating the suballocator --------------------------
    cntree::suballocator< std::allocator<double> >A ;

    //--------------------------------------------------------
    //         Allocate of the elements
    //-------------------------------------------------------
    std::cout<<"Begin  ---------------------->\n";
    start = clock();
    for ( uint32_t i =0  ; i < NELEM ; ++i)
        K[i] = A.allocate(1);
    finish = clock() ;

    duracion = (double)(finish - start) / CLOCKS_PER_SEC;
    std::cout<<"Time Alloc  :"<<duracion<<" seconds\n";

    //--------------------------------------------------------
    //         Deallocate of the elements
    //-------------------------------------------------------
    std::cout<<"Full ---------------------->\n";
    start = clock();
    for ( uint32_t i =0  ; i < NELEM ; ++i)
        A.deallocate(K[i]);
    finish = clock() ;

    duracion = (double)(finish - start) / CLOCKS_PER_SEC;
    std::cout<<"Time Dealloc  :"<<duracion<<" seconds\n";

    std::cout<<"Empty ---------------------->\n";

    delete K;
}



4.4.- Benchmarks (GCC 4.6, VC++ 10, CLANG 3.0)


IMPORTANT! The results of the benchmarks have a great dependency of the processor, and the size of the cache. Different processors can have great differences in the results of the benchmarks.  These benchmarks had been done over a quadcore 2.4GHz with a cache of 6M , and memory of 533MHz

Allocate 50.000.000 elements of uint64_t

The benchmark of the suballocator is a vector of pointers to elements uint64_t. The size of the vector is 50.000.000 elements. This vector is obtained from dynamic memory, and is used for to link the elements obtained from the allocator.The allocator is selected from a menu .
The left graph shows the time of the 5 operations over the allocator . We take the time with the function clock( )
  1. Allocate 50000000 elements of 64 bits
  2. Deallocate all from the last to the first
  3. Deallocate all from the first to the last
  4. Deallocate odd elements
  5. Deallocate odds elements and reallocate

The right  graph shows the memory used by the program in 3 moments:
  1. The vector of pointers had been created and 50.000.000 elements uint64_t allocated
  2. The vector of pointers had been created and 50.000.000 elements uint64_t had been deallocated
  3. The vector of pointers had been deleted and 50.000.000 elements uint64_t had been deallocated

This second graph shows the memory consumption, and the capability of the allocators for to return memory to the operating system and decrease the memory used by the program





* The source code of malloc_allocator, mt_allocator and pool_allocator is in a directory of the GCC compiler ( In my computer /usr/include/c++/4.6/ext). Check if your CLANG compiler access to this directory, for to compile the benchmark programs


Allocate 30.000.000 elements uint64_t and insert in a std::set

The second Benchmark is the insertion of several collections of 30.000.000 numbers of 64 bits in a std::tree. You select the allocator from a menu

This graph is very important, because show other important property of the allocators. The time spend  by the allocator in the insertion of a number in the set is less than 1%. As you can see  we  save around 35%.of the time  The allocator always provide the first position free. This provide us a very compact data area, and this improve the cache performance.


The left graph is the time consumed, and the right is the memory used by the allocator when the 30.000.000 elements are allocated , and after when are freed, and the std::set is empty. This shows the memory consumption, and the capability of the allocators for to return memory to the operating system and decrease the memory used by the program

 




* The source code of malloc_allocator, mt_allocator and pool_allocator is in a directory of the GCC compiler ( In my computer /usr/include/c++/4.6/ext). Check if your CLANG compiler access to this directory, for to compile the benchmark programs




 

Last revised: April  16, 2012