The Counter Tree Library
4.1.-
Problems with allocators in Windows and Linux
The allocator is the data structure defined in the STL, which is the
interface between the data structures and the memory provided generally
by the Operating System. The allocators have a well defined interface.
The data structures don't need to know anything about the allocator
except the interface. The allocator manages the memory received from the
Operating System, and the memory requested in the allocate operations
and the memory returned in the deallocate operations.
The structure of an allocator is :
The allocator manage the requests from the data structure, and the
request to the Operating System. If you need a specific allocator ( for
example if you use shared memory .. ), the management of the elements
is the included in the allocator.
Working with allocators we have several problems related
a) THE SPEED
To write an allocator for to manage a wide range of size elements
and a big number of elements (several millions) of each size in a
fast way, it's a very difficult task. When the number of elements
grows, many allocators begin to have speed problems.
A very hard test for the allocators are the data structures which
need a very high number of elements of the same size, like the STL
data structures list, set, multiset, map or multimap.
Some years ago, only supercomputers have 4 GB of memory. Now, the
cheap laptop have 6 GB, and is expected the memory of the computers
grow Many allocators of modern compilers are not prepared for to
manage hundred of millions of elements. With a small number of
elements they are efficient, but when have a great number the speed
down, and appear problems with the memory management. It is a
challenge for the HW designers and mainly for the SW designers
For to improve the speed allocating the small size elements, many
allocators request to the Operating System big chucks of memory. With
this, the allocator don't need request memory to the operating system
for each allocation. These allocators are named pool allocators.
b) MEMORY USED BY THE ALLOCATOR, MEMORY
CONSUMPTION BY THE PROGRAM
Many pool allocators don't return well
the unused chucks of memory to the Operating System and the memory
used by the allocator is the maximum used, never decrease . If you
have a small number of elements, you have a small problem, small
resources and small time operations. But, if you have several millions
of elements allocated, perhaps you are using several GB of memory.
Running a program with GB of unused memory, because the allocator
don't return the memory request, is a great waste of resources.
This problem is specially important in two environments :
- When you have a limited resources, as occurs in the mobile
phones, pads , tablets...
- When the programs must run continuously in a multitask
environment
The std::allocator of GCC 4.6 and CLANG 3.0, don't return the memory
to the Operating System with the allocation of several millions of
small and fixed size elements. The std::allocator of Visual Studio 10,
return the memory to the Operating System, but is slower than the
allocator of GCC 4.6 and CLANG 3.0
If you use an additional allocator for to improve the speed of the
small fixed size elements, this improve the speed, but increase the
memory consumption, which is the sum of the maximum memory of the pool
allocator and the memory used by the allocator. If the pool allocator
don't free the unused memory, this memory can't be reused by the
allocator.
Other additional problem is the information around the memory
allocate. Many allocators, when ou request memory for to allocate a
double number. The allocator return you a pointer to a 8 bytes of
memory, but around that memory you have additional information used by
the allocator, for to deallocate that memory. By example, allocate
50.000.000 elements of 64 bits, the std::allocator of GCC 4.6 use 1.56
Gigas of memory and the boost::fast_pool_allocator 0.52 Gigas
c) THE CACHE PERFORMANCE
The last problem associated to the
allocators is the cache performance. When you have a big data
structure (imagine a std::set with 50.000.000 elements), travel the
structure for to find an element or for to insert an element, you must
cross though many nodes. Calculate the hit rate of the cache is
extremely difficult. But some measures can provide us a very useful
information.
Take in mind, that the performance can have great variations,
depending of the processor, and mainly of the cache size. On my
computer in the allocation of 50.000.000 elements of 64 bits, on Linux
64 bits
| NAME |
Time Spent |
Time of individual allocation |
| std::allocator |
1.37 seconds |
31.2 nanoseconds |
| boost::fast_pool_allocator |
0.58 seconds |
11.6 nanoseconds |
If we check the time with the std::set and different allocators in
the insertion of 30.000.000 random elements of 64 bits (In the two
test insert the same sequence of numbers)
| NAME |
Time Spent |
Time of individual insertion |
| std::allocator |
95.35seconds |
3.17 microseconds |
| boost::fast_pool_allocator |
61.22 seconds |
2.04 microseconds |
The time difference between allocate and insert one element is more
than 1 microsecond. The time difference between in the allocation is
20 nanoseconds. If you subtract 20 nanoseconds to the
time of the std::allocator you have 3.15 microseconds. Where is the
reason of the difference of more than a microsecond in each operation?
The response is the cache performance.
4.2.-
Description of the Suballocators
The suballocator born with the idea of correct these problems. But, what
is a suballocator ? The suballocator is a layer over the allocator. It
manage in very fast way a greater number (hundred millions) of fixed
size elements.
The suballocator receives an allocator as template parameter. When the
suballocator needs memory, request memory from the allocator, and when
the memory is not used, is returned to the allocator. The suballocator
present the same interface than the STL allocator. Form the view point
of the data structures, the suballocator is other allocator.
The suballocator is a pool allocator with two important characteristics
:
a) Always provide the first element
free. This strategy improves the compactation of the area used for to
allocate the elements. This improve the cache performance and the
speed of the data structures.
b) The suballocator have a aggressive strategy in order to free and
deallocate from the allocator tha last chunk obtained. With each
element deallocated , the suballocator check it. This simple memory
schema permit to the allocators return memory to the Operating System
and decrease the memory used by the program, as can see in the
benchmarks
The results obtained in the benchmarks confirm the theory described
here. These result depend in a very important part of the cache size. In
a QuadCore with 6M of cache, the suballocator is 3 times faster than the
GCC 4.6 std::allocator, and in a AMD Athlon with 1 M of cache only 1.5
times faster)
With the suballocator
a) We have a very fast allocation
(between 1.5 and 3
times faster than the std::allocator of GCC
4.6, CLANG
3.0 and Visual Studio 10 *See details in the Suballocator
Benchmark )
b) Return memory to the allocator, for to be used by others
types of data. Many allocators, return this memory to the
Operating System, and decrease the memory used by the
program, (
as you can see in the Suballocator
Benchmark )
c) You can use with any allocator if it is according with
the STL definition. The suballocator provides speed and
memory management to any allocator
d) The suballocator always provide the first position free. With this
we obtain a very compact data areas, which improve the cache
performance ( 35% of
the time saved in the insertion in a std::set. See details in the
second graph in Benchmarks)
The next step is the join of the two previous concepts. The idea is to
integrate the suballocator inside the data structure. The STL vector
have a management of the memory used by itself. This is the same idea
applied to the data structures. I had done with the cntree data
structures : vector_tree, set , map , multiset and multimap, but it is
trivial to do with any other data structure.with fixed size elements.
The results are data structures, defined in the cntree namespace with
the suffix pool (
vector_tree_pool,
set_pool, multiset_pool, map_pool, multimap_pool) . They have
identical interface than the STL structures with the same name with the
suffix pool, but with an internal suballocator , which increase the
speed of the data structure.
INTERNAL
DESIGN
The internal algorithms and data structures are described in the
document
The_suballocator_algorithms.pdf.
The suballocator permit to do the splice and merge operations of the
std::list, and for move elements from one data structure to other as
need with the rvalues. This permit to the iterators to the elements
remain valid after these operations. They pointed to the same element,
but in different data structure.
The suballocator support the void elements
cntree::suballocator
< std::allocator <void> >
FUNCTIONAL
DESCRIPTION
The idea is simple, the suballocator is a template class which receives
an allocator as template parameter. This allocator provide memory to the
suballocator and deallocate when is freed from the suballocator.
There are two classes
suballocator32
and
suballocator64. The two
classses can run in a 32 bits or a 64 bits environment. In a 32 bits
environment, the suballocator64 is slower than suballocator32. In a 64
bits environment suballocator64 is slightly faster than suballocator32.
The class
suballocator detect the
environment and select the version according with the environment
The class suballocator receives an allocator as template parameter. You
need to include the file
boost/cntree/suballocator.hpp.
The classes are defined in the namespace cntree .The interface of
suballocator32, suballocator64 and suballocator are identical to the
definition of the class allocator in the STL. It is so simple as appear.
#include <boost/cntree/suballocator.hpp>
cntree::suballocator< std::allocator<double> >A ;
cntree::suballocator32< std::allocator<double> >B ;
cntree::suballocator64< std::allocator<double> >C ;
The suballocators run well with any number of element allocated, but
when show all the power is when the number of element up to several
millions
You can use the suballocator with all the allocations, but you must know
that when the number of elements requested to the allocator is greater
than 1, the memory is obtained directly from the allocator, not from the
static pool.
cntree::suballocator32
Documentation and
Code
cntree::suballocator64
Documentation and
Code
cntree::suballocator
Documentation and
Code
FUTURE
IMPROVEMENTS
- Improve the thread safe model with the new C++11 atomic variables
- Add support for rvalues
4.3.-
Examples
///----------------------------------------------------------------------------
// FILE : example_suballocator_00.cpp
//
// DESCRIPTION : Test program of the class suballocator
//
// MODIFICATIONS (AUTHOR,DATE,REASON) :
// Copyright (c) 2010 2012 Francisco José Tapia (fjtapia@gmail.com )
//
// Distributed under the Boost Software License, Version 1.0. (See accompanying
// file LICENSE_1_0.txt or copy at http://www.boost.org/LICENSE_1_0.txt)
//
// NOTES : This program allocate 100.000.000 double elements and
// deallocate measuring the times with the function clock()
//-----------------------------------------------------------------------------
#include <boost/cntree/suballocator.hpp>
#include <iostream>
#include <time.h>
#define NELEM 100000000
int main ( void)
{ //------------------------------ begin ----------------------------------
double duracion ;
clock_t start, finish ;
double ** K = new double* [NELEM];
//-------- Creating the suballocator --------------------------
cntree::suballocator< std::allocator<double> >A ;
//--------------------------------------------------------
// Allocate of the elements
//-------------------------------------------------------
std::cout<<"Begin ---------------------->\n";
start = clock();
for ( uint32_t i =0 ; i < NELEM ; ++i)
K[i] = A.allocate(1);
finish = clock() ;
duracion = (double)(finish - start) / CLOCKS_PER_SEC;
std::cout<<"Time Alloc :"<<duracion<<" seconds\n";
//--------------------------------------------------------
// Deallocate of the elements
//-------------------------------------------------------
std::cout<<"Full ---------------------->\n";
start = clock();
for ( uint32_t i =0 ; i < NELEM ; ++i)
A.deallocate(K[i]);
finish = clock() ;
duracion = (double)(finish - start) / CLOCKS_PER_SEC;
std::cout<<"Time Dealloc :"<<duracion<<" seconds\n";
std::cout<<"Empty ---------------------->\n";
delete K;
}
4.4.-
Benchmarks (GCC 4.6, VC++ 10, CLANG 3.0)
IMPORTANT! The results of the
benchmarks have a great dependency of the processor, and the size of the
cache. Different processors can have great differences in the results of
the benchmarks. These benchmarks had been done over a quadcore 2.4GHz
with a cache of 6M , and memory of 533MHz
Allocate
50.000.000 elements of uint64_t
The benchmark of the suballocator is a vector of pointers to elements
uint64_t. The size of the vector is 50.000.000 elements. This vector is
obtained from dynamic memory, and is used for to link the elements
obtained from the allocator.The allocator is selected from a menu .
The left graph shows the time of the 5 operations over the allocator .
We take the time with the function
clock(
)
- Allocate 50000000 elements of 64 bits
- Deallocate all from the last to the first
- Deallocate all from the first to the last
- Deallocate odd elements
- Deallocate odds elements and reallocate
The right graph shows the memory used by the program in 3 moments:
- The vector of pointers had been created and 50.000.000 elements
uint64_t allocated
- The vector of pointers had been created and 50.000.000 elements
uint64_t had been deallocated
- The vector of pointers had been deleted and 50.000.000 elements
uint64_t had been deallocated
This second graph shows the memory consumption, and the capability of
the allocators for to return memory to the operating system and decrease
the memory used by the program

* The source code of malloc_allocator, mt_allocator and pool_allocator
is in a directory of the GCC compiler (
In my computer /usr/include/c++/4.6/ext). Check if your CLANG
compiler access to this directory, for to compile the benchmark programs
Allocate
30.000.000 elements uint64_t and insert in a std::set
The second Benchmark is the insertion of several collections of
30.000.000 numbers of 64 bits in a std::tree. You select the allocator
from a menu
This graph is very important,
because show other important property of the allocators. The time
spend by the allocator in the insertion of a number in the set is
less than 1%. As you can see we save around 35%.of the time The
allocator always provide the first position free. This provide us a
very compact data area, and this improve the cache performance.
The left graph is the time consumed, and the right is the memory used
by the allocator when the 30.000.000 elements are allocated , and
after when are freed, and the std::set is empty. This shows the memory
consumption, and the capability of the allocators for to return memory
to the operating system and decrease the memory used by the program
* The source code of malloc_allocator, mt_allocator and pool_allocator
is in a directory of the GCC compiler (
In my computer /usr/include/c++/4.6/ext). Check if your CLANG
compiler access to this directory, for to compile the benchmark programs