Patch to make D's GC partially precise
by Leandro Lucarella on 2009-11-06 11:43 (updated on 2009-11-06 20:09)- with 0 comment(s)
David Simcha has announced a couple of weeks ago that he wanted to work on making the D's GC partially precise (only the heap). I was planning to do it myself eventually because it looked like something doable with not much work that could yield a big performance gain, and particularly useful to avoid memory leaks due to false pointers (which can keep huge blocks of data artificially alive). But I didn't had the time and I had other priorities.
Anyway, after some discussion, he finally announced he got the patch, which he added as a bug report. The patch is being analyzed for inclusion, but the main problem now is that it is not integrated with the new operator, so if you want to get precise heap scanning, you have to use a custom function to allocate (that creates a map of the type to allocate at compile-time to pass the information about the location of the pointers to the GC).
I'm glad David could work on this and I hope this can be included in D2, since is a long awaited feature of the GC.
Update
David Schima has been just added to the list of Phobos developers. Maybe he can integrate his work on associative arrays too.
Stats for the basic GC
by Leandro Lucarella on 2009-10-08 20:08 (updated on 2009-10-08 20:08)- with 0 comment(s)
Here are some graphs made from my D GC benchmarks using the Tango (0.99.8) basic collector, similar to the naive ones but using histograms for allocations (time and space):
Some comments:
- The Wasted space is the Uncommitted space (since the basic GC doesn't track the real size of the stored object).
- The Stop-the-world time is the time all the threads are stopped, which is almost the same as the time spent scanning the heap.
- The Collect time is the total time spent in a collection. The difference with the Stop-the-world time is almost the same as the time spent in the sweep phase, which is done after the threads have being resumed (except the thread that triggered the collection).
There are a few observations to do about the results:
- The stop the world time varies a lot. There are tests where is almost unnoticeable (tree), tests where it's almost equals to the total collection time (rnd_data, rnd_data_2, split) and test where it's in the middle (big_arrays). I can't see a pattern though (like heap occupancy).
- There are tests where it seems that collections are triggered for no reason; there is plenty of free space when it's triggered (tree and big_arrays). I haven't investigated this yet, so if you can see a reason, please let me know.
Life in hell
by Leandro Lucarella on 2009-09-06 18:24 (updated on 2009-09-06 18:24)- with 0 comment(s)
Warning
Long post ahead =)
As I said before, debug is hell in D, at least if you're using a compiler that doesn't write proper debug information and you're writing a garbage collector. But you have to do it when things go wrong. And things usually go wrong.
This is a small chronicle about how I managed to debug a weird problem =)
I had my Naive GC working and getting good stats with some small micro-benchmarks, so I said let's benchmark something real. There is almost no real D applications out there, suitable for an automated GC benchmark at least [1]. Dil looked like a good candidate so I said let's use Dil in the benchmark suite!.
And I did. But Dil didn't work as I expected. Even when running it without arguments, in which case a nice help message like this should be displayed:
dil v1.000 Copyright (c) 2007-2008 by Aziz Köksal. Licensed under the GPL3. Subcommands: help (?) compile (c) ddoc (d) highlight (hl) importgraph (igraph) python (py) settings (set) statistics (stats) tokenize (tok) translate (trans) Type 'dil help <subcommand>' for more help on a particular subcommand. Compiled with Digital Mars D v1.041 on Sat Aug 29 18:04:34 2009.
I got this instead:
Generate an XML or HTML document from a D source file. Usage: dil gen file.d [Options] Options: --syntax : generate tags for the syntax tree --xml : use XML format (default) --html : use HTML format Example: dil gen Parser.d --html --syntax > Parser.html
Which it isn't even a valid Dil command (it looks like a dead string in some data/lang_??.d files).
I ran Valgrind on it and detected a suspicious invalid read of size 4 when reading the last byte of a 13 bytes long class instance. I thought maybe the compiler was assuming the GC allocated block with size multiples of the word size, so I made gc_malloc() allocate multiples of the word size, but nothing happened. Then I thought that maybe the memory blocks should be aligned to a multiple of a word, so I made gc_malloc() align the data portion of the cell to a multiple of a word, but nothing.
Since Valgrind only detected that problem, which was at the static constructor of the module tango.io.Console, I though it might be a Tango bug, so I reported it. But it wasn't Tango's fault. The invalid read looked like a DMD 1.042 bug; DMD 1.041 didn't have that problem, but my collector still failed to run Dil. So I was back to zero.
I tried the Tango stub collector and it worked, so I tried mine disabling the collections, and it worked too. So finally I could narrow the problem to the collection phase (which isn't much, but it's something). The first thing I could think it could be wrong in a collection is that cells still in use are swept as if they were unused, so I then disabled the sweep phase only, and it kept working.
So, everything pointer to prematurely freed cells. But why my collector was freeing cells prematurely being so, so simple? I reviewed the code a couple of times and couldn't find anything evidently wrong. To confirm my theory and with the hope of getting some extra info, I decided to write a weird pattern in the swept cells and then check if that pattern was intact when giving them back to the mutator (the basic GC can do that too if compiled with -debug=MEMSTOMP). That would confirm that the swept memory were still in use. And it did.
The I tried this modified GC with memory stomp with my micro-benchmarks and they worked just fine, so I started to doubt again that it was my GC's problem. But since those benchmarks didn't use much of the GC API, I thought maybe Dil was using some strange features of making some assumptions that were only true for the current implementation, so I asked Aziz Köksal (Dil creator) and he pointed me to some portion of code that allocated memory from the C heap, overriding the operators new and delete for the Token struct. There is a bug in Dil there, because apparently that struct store pointers to the GC heap but it's not registered as a root, so it looks like a good candidate.
So I commented out the overridden new and delete operators, so the regular GC-based operators were used. But I still got nothing, the wrong help message were printed again. Then I saw that Dil was manually freeing memory using delete. So I decided to make my gc_free() implementation a NOP to let the GC take over of all memory management... And finally all [2] worked out fine! =)
So, the problem should be either my gc_free() implementation (which is really simple) or a Dil bug.
In order to get some extra information on where the problem is, I changed the Cell.alloc() implementation to use mmap to allocate whole pages, one for the cell's header, and one or more for the cell data. This way, could easily mprotect the cell data when the cell was swept (and un-mprotecting them when they were give back to the program) in order to make Dil segfault exactly where the freed memory was used.
I ran Dil using strace and this is what happened:
[...]
(a) write(1, "Cell.alloc(80)\n", 15) = 15
(b) mmap2(NULL, 8192, PROT_READ|PROT_WRITE, ...) = 0xb7a2e000
[...]
(c) mprotect(0xb7911000, 4096, PROT_NONE) = 0
mprotect(0xb7913000, 4096, PROT_NONE) = 0
[...]
mprotect(0xb7a2b000, 4096, PROT_NONE) = 0
mprotect(0xb7a2d000, 4096, PROT_NONE) = 0
(d) mprotect(0xb7a2f000, 4096, PROT_NONE) = 0
mprotect(0xb7a43000, 4096, PROT_NONE) = 0
mprotect(0xb7a3d000, 4096, PROT_NONE) = 0
[...]
mprotect(0xb7a6b000, 4096, PROT_NONE) = 0
(e) mprotect(0xb7a73000, 4096, PROT_NONE) = 0
(f) mprotect(0xb7a73000, 4096, PROT_READ|PROT_WRITE) = 0
mprotect(0xb7a6b000, 4096, PROT_READ|PROT_WRITE) = 0
[...]
mprotect(0xb7a3f000, 4096, PROT_READ|PROT_WRITE) = 0
(g) mprotect(0xb7a3d000, 4096, PROT_READ|PROT_WRITE) = 0
--- SIGSEGV (Segmentation fault) @ 0 (0) ---
+++ killed by SIGSEGV (core dumped) +++
(a) is a debug print, showing the size of the gc_malloc() call that got the address 0xb7a2e000. The mmap (b) is 8192 bytes in size because I allocate a page for the cell header (for internal GC information) and another separated page for the data (so I can only mprotect the data page and keep the header page read/write); that allocation asked for a new fresh couple of pages to the OS (that's why you see a mmap).
From (c) to (e) you can see a sequence of several mprotect, that are cells being swept by a collection (protecting the cells against read/write so if the mutator tries to touch them, a SIGSEGV is on the way).
From (f) to (g) you can see another sequence of mprotect, this time giving the mutator permission to touch that pages, so that's gc_malloc() recycling the recently swept cells.
(d) shows the cell allocated in (a) being swept. Why the address is not the same (this time is 0xb7a2f000 instead of 0xb7a2e000)? Because, as you remember, the first page is used for the cell header, so the data should be at 0xb7a2e000 + 4096, which is exactly 0xb7a2f000, the start of the memory block that the sweep phase (and gc_free() for that matter) was protecting.
Finally we see the program getting his nice SIGSEGV and dumping a nice little core for touching what it shouldn't.
Then I opened the core with GDB and did something like this [3]:
Program terminated with signal 11, Segmentation fault.
(a) #0 0x08079a96 in getDispatchFunction ()
(gdb) print $pc
(b) $1 = (void (*)()) 0x8079a96 <getDispatchFunction+30>
(gdb) disassemble $pc
Dump of assembler code for function
getDispatchFunction:
0x08079a78 <getDispatchFunction+0>: push %ebp
0x08079a79 <getDispatchFunction+1>: mov %esp,%ebp
0x08079a7b <getDispatchFunction+3>: sub $0x8,%esp
0x08079a7e <getDispatchFunction+6>: push %ebx
0x08079a7f <getDispatchFunction+7>: push %esi
0x08079a80 <getDispatchFunction+8>: mov %eax,-0x4(%ebp)
0x08079a83 <getDispatchFunction+11>: mov -0x4(%ebp),%eax
0x08079a86 <getDispatchFunction+14>: call 0x80bccb4 <objectInvariant>
0x08079a8b <getDispatchFunction+19>: push $0xb9
0x08079a90 <getDispatchFunction+24>: mov 0x8(%ebp),%edx
0x08079a93 <getDispatchFunction+27>: add $0xa,%edx
(c) 0x08079a96 <getDispatchFunction+30>: movzwl (%edx),%ecx
[...]
(gdb) print /x $edx
(d) $2 = 0xb7a2f000
First, in (a), GDB tells where the program received the SIGSEGV. In (b) I print the program counter register to get a more readable hint on where the program segfaulted. It was at getDispatchFunction+30, so I disassemble that function to see that the SIGSEGV was received when doing movzwl (%edx),%ecx (moving the contents of the ECX register to the memory pointed to by the address in the register EDX) at (c). In (d) I get the value of the EDX register, and it's 0xb7a2f000. Do you remember that value? Is the data address for the cell at 0xb7a2e000, the one that was recently swept (and mprotected). That's not good for business.
This is the offending method (at dil/src/ast/Visitor.d):
Node function(Visitor, Node) getDispatchFunction()(Node n)
{
return cast(Node function(Visitor, Node))dispatch_vtable[n.kind];
}
Since I can't get any useful information from GDB (I can't even get a proper backtrace [4]) except for the mangled function name (because the wrong debug information produced by DMD), I had to split that function into smaller functions to confirm that the problem was in n.kind (I guess I could figure that out by eating some more assembly, but I'm not that well trained at eating asm yet =). This means that the Node instance n is the one prematurely freed.
This is particularly weird, because it looks like the node is being swept, not prematurely freed using an explicit delete. So it seems like the GC is missing some roots (or there are non-aligned pointers or weird stuff like that). The fact that this works fine with the Tango basic collector is intriguing too. One thing I can come up with to explain why it works in the basic collector is because it makes a lot less collections than the naive GC (the latter is really lame =). So maybe the rootless object becomes really free before the basic collector has a chance to run a collection and because of that the problem is never detected.
I spent over 10 days now investigating this issue (of course this is not near a full-time job for me so I can only dedicate a couple of days a week to this =), and I still can't find a clear cause for this problem, but I'm a little inclined towards a Dil bug, so I reported one =). So we'll see how this evolves; for now I'll just make gc_free() a NOP to continue my testing...
| [1] | Please let me know if you have any working, real, Tango-based D application suitable for GC benchmarks (i.e., using the GC and easily scriptable to run it automatically). |
| [2] | all being running Dil without arguments to get the right help message =) |
| [3] | I have shortened the name of the functions because they were huge, cryptic, mangled names =). The real name of getDispatchFunction is _D3dil3ast7Visitor7Visitor25__T19getDispatchFunctionZ19getDispatchFunctionMFC3dil3ast4Node4NodeZPFC3dil3ast7Visitor7VisitorC3dil3ast4Node4NodeZC3dil3ast4Node4Node (is not much better when demangled: class dil.ast.Node.Node function(class dil.ast.Visitor.Visitor, class dil.ast.Node.Node)* dil.ast.Visitor.Visitor.getDispatchFunction!().getDispatchFunction(class dil.ast.Node.Node) =). The real name of objectInvariant is D9invariant12_d_invariantFC6ObjectZv and has no demagled name that I know of, but I guessed is the Object class invariant. |
| [4] | Here is what I get from GDB: (gdb) bt #0 0x08079a96 in getDispatchFunction () #1 0xb78d5000 in ?? () #2 0xb789d000 in ?? () Backtrace stopped: previous frame inner to this frame (corrupt stack?) (function name unmangled and shortened for readbility) |
Allocations graphs
by Leandro Lucarella on 2009-08-26 21:54 (updated on 2009-08-26 21:54)- with 0 comment(s)
Here are a set of improved statistics graphs, now including allocation statistics. All the data is plotted together and using the same timeline to ease the analysis and comparison.
Again, all graphs (as the graph title says), are taken using the Naive GC (stats code still not public yet :) and you can find the code for it in my D GC benchmark repository.
This time the (big) graphs are in EPS format because I could render them in PNG as big as I wanted and I didn't had the time to fix that =S
The graphs shows the same as in the previous post with the addition of allocation time (how long it took to perform the allocation) and space (how many memory has been requested), which are rendered in the same graph, and an histogram of cell sizes. The histogram differentiates cells with and without the NO_SCAN bit, which might be useful in terms on seeing how bad the effect of false positives could be.
You can easily see how allocation time peeks match allocations that triggered a collection for example, and how bad can it be the effect of false positives, even when almost all the heap (99.99%) has the NO_SCAN bit (see rnd_data_2).
Graphs
by Leandro Lucarella on 2009-08-18 00:26 (updated on 2009-08-18 00:26)- with 0 comment(s)
It's been exactly 3 months since the last post. I spent the last months writing my thesis document (in Spanish), working, and being unproductive because of the lack of inspiration =)
But in the last couple of days I decided to go back to the code, and finish the statistics gathering in the Naive GC (the new code is not published yet because it needs some polishing). Here are some nice graphs from my little D GC benchmark:
The graphs shows the space and time costs for each collection in the programs life. The collection time is divided in the time spent in the malloc() that triggered the collection, the time spent in the collection itself, and the time the world has to be stopped (meaning the time all the threads were paused because of the collection). The space is measured before and after the collection, and the total memory consumed by the program is divided in 4 areas: used space, free space, wasted space (space that the user can't use but it's not used by the collector either) and overhead (space used by the collector itself).
As you can see, the naive collector pretty much sucks, specially for periods of lots of allocation (since it just allocated what it's asked in the gc_malloc() call if the collection failed).
The next step is to modify the Tango's Basic collector to gather the same data and see how things are going with it.
Naive GC fixes
by Leandro Lucarella on 2009-05-17 19:09 (updated on 2009-05-17 19:09)- with 0 comment(s)
I haven't been posting very often lately because I decided to spend some time writing my thesis document (in Spanish), which was way behind my current status, encouraged by my code-wise bad weekend =P.
Alberto Bertogli was kind enough to review my Naive GC implementation and sent me some patches, improving the documentation (amending my tarzanesque English =) and fixing a couple of (nasty) bugs [1] [2].
I'm starting to go back to the code, being that LDC is very close to a new release and things are starting to settle a little, so I hope I can finish the statistics gathering soon.
Naive Garbage Collector
by Leandro Lucarella on 2009-04-26 22:49 (updated on 2009-04-26 22:49)- with 2 comment(s)
I was working in a naive garbage collector implementation for D, as a way to document the process of writing a GC for D.
From the Naive Garbage Collector documentation:
The idea behind this implementation is to document all the bookkeeping and considerations that has to be taken in order to implement a garbage collector for D.
The garbage collector algorithm itself is extremely simple so focus can be held in the specifics of D, and not the algorithm. A completely naive mark and sweep algorithm is used, with a recursive mark phase. The code is extremely inefficient in order to keep the code clean and easy to read and understand.
Performance is, as expected, horrible, horrible, horrible (2 orders of magnitude slower than the basic GC for the simple Tango GC Benchmark) but I think it's pretty good as documentation =)
I have submitted the implementation to Tango in the hope that it gets accepted. A git repository is up too.
If you want to try it out with LDC, you have to put the files into the naive directory in tango/lib/gc and edit the file runtime/CMakeLists.txt and search/replace "basic" for "naive". Then you have to search for the line:
file(GLOB GC_D ${RUNTIME_GC_DIR}/*.d)
And replace it with:
file(GLOB GC_D ${RUNTIME_GC_DIR}/gc/*.d)
Comments and reviews are welcome, and please let me know if you try it =)
Immix mark-region garbage collector
by Leandro Lucarella on 2009-04-25 17:39 (updated on 2009-04-25 17:39)- with 0 comment(s)
Yesterday Fawzi Mohamed pointed me to a Tango forums post (<rant>god! I hate forums</rant> =) where Keith Nazworth announces he wants to start a new GC implementation in his spare time.
He wants to progressively implement the Immix Garbage Collector.
I read the paper and it looks interesting, and it looks like it could use the parallelization I plan to add to the current GC, so maybe our efforts can be coordinated to leave the possibility to integrate both improvements together in the future.
A few words about the paper: the heap organization is pretty similar to the one in the current GC implementation, except Immix proposes that pages should not be divided in fixed-size bins, but do pointer bump variable sized allocations inside a block. Besides that, all other optimizations that I saw in the paper are somehow general and can be applied to the current GC at some point (but some of them maybe don't fit as well). Among these optimizations are: opportunistic moving to avoid fragmentation, parallel marking, thread-local pools/allocator and generations. Almost all of the optimizations can be implemented incrementally, starting with a very basic collector which is not very far from the actual one.
There were some discussion on adding the necessary hooks to the language to allow a reference counting based garbage collector in the newsgroup (don't be fooled by the subject! Is not about disabling the GC =) and weak references implementation. There's a lot of discussion about GC lately in D, which is really exciting!
Understanding the current GC, conclusion
by Leandro Lucarella on 2009-04-11 16:36 (updated on 2009-04-11 16:36)- with 0 comment(s)
Now that I know fairly deeply the implementation details about the current GC, I can compare it to the techniques exposed in the GC Book.
Tri-colour abstraction
Since most literature speaks in terms of the tri-colour abstraction, now it's a good time to translate how this is mapped to the D GC implementation.
As we all remember, each cell (bin) in D has several bits associated to them. Only 3 are interesting in this case:
- mark
- scan
- free (freebits)
So, how we can translate this bits into the tri-colour abstraction?
- Black
Cells that were marked and scanned (there are no pointer to follow) are coloured black. In D this cells has the bits:
mark = 1 scan = 0 free = 0
- Grey
Cells that has been marked, but they have pointers to follow in them are coloured grey. In D this cells has the bits:
mark = 1 scan = 1 free = 0
- White
Cells that has not been visited at all are coloured white (all cells should be colored white before the marking starts). In D this cells has the bits:
mark = 0 scan = X
Or:
free = 1
The scan bit is not important in this case (but in D it should be 0 because scan bits are cleared before the mark phase starts). The free bit is used for the cells in the free list. They are marked before other cells get marked with bits mark=1 and free=1. This way the cells in the free list don't get scanned (mark=1, scan=0) and are not confused with black cells (free=1), so they can be kept in the free list after the mark phase is done. I think this is only necessary because the free list is regenerated.
Improvements
Here is a summary of improvements proposed by the GC Book, how the current GC is implemented in regards to this improvements and what optimization opportunities can be considered.
Mark stack
The simplest version of the marking algorithm is recursive:
mark(cell)
if not cell.marked
cell.marked = true
for child in cell.children
mark(child)
The problem here is, of course, stack overflow for very deep heap graphs (and the space cost).
The book proposes using a marking stack instead, and several ways to handle stack overflow, but all these are only useful for relieving the symptom, they are not a cure.
As a real cure, pointer reversal is proposed. The idea is to use the very same pointers to store the mark stack. This is constant in space, and needs only one pass through the help, so it's a very tempting approach. The bad side is increased complexity and probably worse cache behavior (writes to the heap dirties the entire heap, and this can kill the cache). Nevertheless it's
Current implementation
The D GC implementation does none of this. Instead it completes the mark phase by traversing the heap (well, not really the heap, only the bit sets) in several passes, until no more data to scan can be found (all cells are painted black or white). While the original algorithm only needs one pass through the heap, this one need several. This trades space (and the complexity of stack overflow handling) for time.
Optimization opportunities
This seems like a fair trade-off, but alternatives can be explored.
Bitmap marking
The simplest mark-sweep algorithm suggests to store marking bits in the very own cells. This can be very bad for the cache because a full traversal should be done across the entire heap. As an optimization, a bitmap can be used, because they are much small and much more likely to fit in the cache, marking can be greatly improved using them.
Current implementation
Current implementation uses bitmaps for mark, scan, free and other bits. The bitmap implementation is GCBits and is a general approach.
The bitmap stores a bit for each 16 bytes chunks, no matter what cell size (Bins, or bin size) is used. This means that 4096/16 = 256 bits (32 bytes) are used for each bitmap for every page in the GC heap. Being 5 bitmaps (mark, scan, freebits, finals and noscan), the total spaces per page is 160 bytes. This is a 4% space overhead in bits only.
This wastes some space for larger cells.
Optimization opportunities
The space overhead of bitmaps seems to be fairly small, but each byte counts for the mark phase because of the cache. A heap with 64 MiB uses 2.5 MiB in bitmaps. Modern processors come with about that much cache, and a program using 64 MiB doesn't seems very rare. So we are pushing the limits here if we want our bitmaps to fit in the cache to speed up the marking phase.
I think there is a little room for improvement here. A big object, lets say it's 8 MiB long, uses 640 KiB of memory for bitmaps it doesn't need. I think some specialized bitmaps can be used for large object, for instance, to minimize the bitmaps space overhead.
There are some overlapping bits too. mark=0 and scan=1 can never happen for instance. I think it should be possible to use that combination for freebits, and get rid of an entire bitmap.
Lazy sweep
The sweep phase is done generally right after the mark phase. Since normally the collection is triggered by an allocation, this can be a little disrupting for the thread that made that allocation, that has to absorb all the sweeping itself.
Another alternative is to do the sweeping incrementally, by doing it lazy. Instead of finding all the white cells and linking them to the free list immediately, this is done on each allocation. If there is no free cells in the free list, a little sweeping is done until new space can be found.
This can help minimize pauses for the allocating thread.
Current implementation
The current implementation does an eager sweeping.
Optimization opportunities
The sweeping phase can be made lazy. The only disadvantage I see is (well, besides extra complexity) that could make the heap more likely to be fragmented, because consecutive requests are not necessarily made on the same page (a free() call can add new cells from another page to the free list), making the heap more sparse, (which can be bad for the cache too). But I think this is only possible if free() is called explicitly, and this should be fairly rare in a garbage collected system, so I guess this could worth trying.
Lazy sweeping helps the cache too, because in the sweep phase, you might trigger cache misses when linking to the free list. When sweeping lazily, the cache miss is delayed until it's really necessary (the cache miss will happen anyway when you are allocating the free cell).
Conclusion
Even when the current GC is fairly optimized, there is plenty of room for improvements, even preserving the original global design.
Understanding the current GC, the end
by Leandro Lucarella on 2009-04-11 01:46 (updated on 2009-04-15 01:10)- with 0 comment(s)
In this post I will take a closer look at the Gcx.mark() and Gcx.fullcollect() functions.
This is a simplified version of the mark algorithm:
mark(from, to)
changes = 0
while from < to
pool = findPool(from)
offset = from - pool.baseAddr
page_index = offset / PAGESIZE
bin_size = pool.pagetable[page_index]
bit_index = find_bit_index(bin_size, pool, offset)
if not pool.mark.test(bit_index)
pool.mark.set(bit_index)
if not pool.noscan.test(bit_index)
pool.scan.set(bit_index)
changes = true
from++
anychanges |= changes // anychanges is global
In the original version, there are some optimizations and the find_bit_index() function doesn't exist (it does some bit masking to find the right bit index for the bit set). But everything else is pretty much the same.
So far, is evident that the algorithm don't mark the whole heap in one step, because it doesn't follow pointers. It just marks a consecutive chunk of memory, assuming that pointers can be at any place in that memory, as long as they are aligned (from increments in word-sized steps).
fullcollect() is the one in charge of following pointers, and marking chunks of memory. It does it in an iterative way (that's why mark() informs about anychanges (when new pointer should be followed to mark them, or, speaking in the tri-colour abstraction, when grey cells are found).
fullcollect() is huge, so I'll split it up in smaller pieces for the sake of clarity. Let's see what are the basic blocks (see the second part of this series):
fullcollect()
thread_suspendAll()
clear_mark_bits()
mark_free_list()
rt_scanStaticData(mark)
thread_scanAll(mark, stackTop)
mark_root_set()
mark_heap()
thread_resumeAll()
sweep()
Generaly speaking, all the functions that have some CamelCasing are real functions and the ones that are all_lowercase and made up by me.
Let's see each function.
- thread_suspendAll()
- This is part of the threads runtime (found in src/common/core/thread.d). A simple peak at it shows it uses SIGUSR1 to stop the thread. When the signal is caught it pushes all the registers into the stack to be sure any pointers there are scanned in the future. The threads waits for SIGUSR2 to resume.
- clear_mark_bits()
foreach pool in pooltable pool.mark.zero() pool.scan.zero() pool.freebits.zero()- mark_free_list()
foreach n in B_16 .. B_PAGE foreach node in bucket pool = findPool(node) pool.freebits.set(find_bit_index(pool, node)) pool.mark.set(find_bit_index(pool, node))- rt_scanStaticData(mark)
- This function, as the name suggests, uses the provided mark function callback to scan the program's static data.
- thread_scanAll(mark, stackTop)
- This is another threads runtime function, used to mark the suspended threads stacks. I does some calculation about the stack bottom and top, and calls mark(bottom, top), so at this point we have marked all reachable memory from the stack(s).
- mark_root_set()
mark(roots, roots + nroots) foreach range in ranges mark(range.pbot, range.ptop)- mark_heap()
This is where most of the marking work is done. The code is really ugly, very hard to read (mainly because of bad variable names) but what it does it's relatively simple, here is the simplified algorithm:
// anychanges is global and was set by the mark()ing of the // stacks and root set while anychanges anychanges = 0 foreach pool in pooltable foreach bit_pos in pool.scan if not pool.scan.test(bit_pos) continue pool.scan.clear(bit_pos) // mark as already scanned bin_size = find_bin_for_bit(pool, bit_pos) bin_base_addr = find_base_addr_for_bit(pool, bit_pos) if bin_size < B_PAGE // small object bin_top_addr = bin_base_addr + bin_size else if bin_size in [B_PAGE, B_PAGEPLUS] // big object page_num = (bin_base_addr - pool.baseAddr) / PAGESIZE if bin == B_PAGEPLUS // search for the base page while pool.pagetable[page_num - 1] != B_PAGE page_num-- n_pages = 1 while page_num + n_pages < pool.ncommitted and pool.pagetable[page_num + n_pages] == B_PAGEPLUS n_pages++ bin_top_addr = bin_base_addr + n_pages * PAGESIZE mark(bin_base_addr, bin_top_addr)The original algorithm has some optimizations for proccessing bits in clusters (skips groups of bins without the scan bit) and some kind-of bugs too.
Again, the functions in all_lower_case don't really exist, some pointer arithmetics are done in place for finding those values.
Note that the pools are iterated over and over again until there are no unvisited bins. I guess this is a fair price to pay for not having a mark stack (but I'm not really sure =).
- thread_resumeAll()
- This is, again, part of the threads runtime and resume all the paused threads by signaling a SIGUSR2 to them.
- sweep()
mark_unmarked_free() rebuild_free_list()
- mark_unmarked_free()
This (invented) function looks for unmarked bins and set the freebits bit on them if they are small objects (bin size smaller than B_PAGE) or mark the entire page as free (B_FREE) in case of large objects.
This step is in charge of executing destructors too (through rt_finalize() the runtime function).
- rebuild_free_list()
This (also invented) function first clear the free list (bucket) and then rebuild it using the information collected in the previous step.
As usual, only bins with size smaller than B_PAGE are linked to the free list, except if the pages they belong to have all the bins freed, in which case the page is marked with the special B_FREE bin size. The same goes for big objects freed in the previous step.
I think rebuilding the whole free list is not necessary, the new free bins could be just linked to the existing free list. I guess this step exists to help reducing fragmentation, since the rebuilt free list group bins belonging to the same page together.
Understanding the current GC, part IV
by Leandro Lucarella on 2009-04-10 18:33 (updated on 2009-04-10 18:33)- with 0 comment(s)
What about freeing? Well, is much simpler than allocation =)
GC.free(ptr) is a thread-safe wrapper for GC.freeNoSync(ptr).
GC.freeNoSync(ptr) gets the Pool that ptr belongs to and clear its bits. Then, if ptr points to a small object (bin size smaller than B_PAGE), it simply link that bin to the free list (Gcx.bucket). If ptr is a large object, the number of pages used by the object is calculated then all the pages marked as B_FREE (done by Pool.freePages(start, n_pages)).
Then, there is reallocation, which is a little more twisted than free, but doesn't add much value to the analysis. It does what you think it should (maybe except for a possible bug) using functions already seen in this post or in the previous ones.
Understanding the current GC, part III
by Leandro Lucarella on 2009-04-10 02:28 (updated on 2009-04-10 02:28)- with 0 comment(s)
In the previous post we focused on the Gcx object, the core of the GC in druntime (and Phobos and Tango, they are all are based on the same implementation). In this post we will focus on allocation, which a little more complex than it should be in my opinion.
It was not an easy task to follow how allocation works. A GC.malloc() call spawns into this function calls:
GC.malloc(size, bits)
|
'---> GC.mallocNoSync(size, bits)
|
|---> Gcx.allocPage(bin_size)
| |
| '---> Pool.allocPages(n_pages)
| |
| '---> Pool.extendPages(n_pages)
| |
| '---> os_mem_commit(addr, offset, size)
|
|---> Gcx.fullcollectshell()
|
|---> Gcx.newPool(n_pages)
| |
| '---> Pool.initialize(n_pages)
| |
| |---> os_mem_map(mem_size)
| |
| '---> GCBits.alloc(bits_size)
|
'---> Gcx.bigAlloc(size)
|
|---> Pool.allocPages(n_pages)
| '---> (...)
|
|---> Gcx.fullcollectshell()
|
|---> Gcx.minimize()
| |
| '---> Pool.Dtor()
| |
| |---> os_mem_decommit(addr, offset, size)
| |
| |---> os_mem_map(addr, size)
| |
| '---> GCBits.Dtor()
|
'---> Gcx.newPool(n_pages)
'---> (...)
Doesn't look so simple, ugh?
The map/commit differentiation of Windows doesn't exactly help simplicity. Note that Pool.initialize() maps the memory (reserve the address space) while Pool.allocPages() (through Pool.extendPages()) commit the new memory (ask the OS to actually reserve the virtual memory). I don't know how good is this for Windows (or put in another way, how bad could it be for Windows if all mapped memory gets immediately committed), but it adds a new layer of complexity (that's not even needed in Posix OSs). The whole branch starting at Gcx.allocPage(bin_size) would be gone if this distinction it's not made. Besides this, it worsen Posix OSs performance, because there are some non-trivial lookups to handle this non-existing non-committed pages, even when the os_mem_commit() and os_mem_decommit() functions are NOP and can be optimized out, the lookups are there.
Mental Note
See if getting rid of the commit()/decommit() stuff improves Linux performance.
But well, let's forget about this issue for now and live with it. Here is a summary of what all this functions do.
Note
I recommend to give another read to the (updated) previous posts of this series, specially if you are not familiar with the Pool concept and implementation.
- GC.malloc(size, bits)
- This is just a wrapper for multi-threaded code, it takes the GCLock if necessary and calls GC.mallocNoSync(size, bits).
- GC.mallocNoSync(size, bits)
This function has 2 different algorithms for small objects (less than a page of 4KiB) and another for big objects.
It does some common work for both cases, like logging and adding a sentinel for debugging purposes (if those feature are enabled), finding the bin size (bin_size) that better fits size (and cache the result as an optimization for consecutive calls to malloc with the same size) and setting the bits (NO_SCAN, NO_MOVE, FINALIZE) to the allocated bin.
- Small objects (bin_size < B_PAGE)
- Looks at the free list (Gcx.bucket) trying to find a page with the minimum bin size that's equals or bigger than size. If it can't succeed, it calls Gcx.allocPage(bin_size) to find room in uncommitted pages. If there still no room for the requested amount of memory, it triggers a collection (Gcx.fullcollectshell()). If there is still no luck, Gcx.newPage(1) is called to ask the OS for more memory. Then it calls again Gcx.allocPage(bin_size) (remember the new memory is just mmap'ped but not commit'ed) and if there is no room in the free list still, an out of memory error is issued.
- Big objects (B_PAGE and B_PAGEPLUS)
- It simply calls Gcx.bigAlloc(size) and issue an out of memory error if that call fails to get the requested memory.
- Gcx.allocPage(bin_size)
- This function linearly search the pooltable for a Pool with an allocable page (i.e. a page already mapped by not yet committed). This is done through a call to Pool.allocPages(1). If a page is found, its bin size is set to bin_size via the Pool's pagetable, and all the bins of that page are linked to the free list (Gcx.bucket).
- Pool.allocPages(n_pages)
- Search for n_pages consecutive free pages (B_FREE) in the committed pages (pages in the pagetable with index up to ncommited). If they're not found, Pool.extendPages(n_pages) is called to commit some more mapped pages to fulfill the request.
- Pool.extendPages(n_pages)
- Commit n_pages already mapped pages (calling os_mem_commit()), setting them as free (B_FREE) and updating the ncommited attribute. If there are not that many uncommitted pages, it returns an error.
- Gcx.newPool(n_pages)
- This function adds a new Pool to the pooltable. It first adjusts the n_pages variable using various rules (for example, it duplicates the current allocated memory until 8MiB are allocated and then allocates 8MiB pools always, unless more memory is requested in the first place, of course). Then a new Pool is created with the adjusted n_pages value and it's initialized calling to Pool.initialize(n_pages), the pooltable is resized to fit the new number of pools (npools) and sorted using Pool.opCmp() (which uses the baseAddr to compare). Finally the minAddr and maxAddr attributes are updated.
- Pool.initialize(n_pages)
- Initializes all the Pool attributes, mapping the requested number of pages (n_pages) using os_mem_map(). All the bit sets (mark, scan, freebits, noscan) are allocated (using GCBits.alloc()) to n_pages * PAGESIZE / 16 bits and the pagetable too, setting all bins to B_UNCOMMITTED and ncommitted to 0.
- Gcx.bigAlloc(size)
This is the weirdest function by far. There are very strange things, but I'll try to explain what I understand from it (what I think it's trying to do).
It first make a simple lookup in the pooltable for n_pages consecutive pages in any existing Pool (calling Pool.allocPages(n_pages) as in Gcx.allocPage()). If this fails, it runs a fullcollectshell() (if not disabled) then calls to minimize() (to prevent bloat) and then create a new pool (calling newPool() followed by Pool.allocPages()). If all that fails, it returns an error. If something succeed, the bin size for the first page is set to B_PAGE and the remaining pages are set to B_PAGEPLUS (if any). If there is any unused memory at the end, it's initialized to 0 (to prevent false positives when scanning I guess).
The weird thing about this, is that a lot of lookups into the pooltable are done in certain condition, but I think they are not needed because there are no changes that can make new room.
I don't know if this is legacy code that never got updated and have a lot of useless lookups or if I'm getting something wrong. Help is welcome!
There is not much to say about os_mem_xxx(), Gcx.minimize() and Gcx.fullcollectshell() functions, they were briefly described in the previous posts of this series. Pool.Dtor() just undo what was done in Pool.initialize().
A final word about the free list (Gcx.bucket). It's just a simple linked list. It uses the first size_t bytes of the free bin to point to the next free bin (there's always room for a pointer in a bin because their minimum size is 16 bytes). A simple structure is used to easy this:
struct List {
List *next;
}
Then, the memory cell is casted to this structure to use the next pointer, like this:
p = gcx.bucket[bin] gcx.bucket[bin] = (cast(List*) p).next
I really have my doubts if this is even a little less cryptic than:
p = gcx.bucket[bin] gcx.bucket[bin] = *(cast(void**) p)
But what the hell, this is no really important =)
Understanding the current GC, part II
by Leandro Lucarella on 2009-04-05 21:00 (updated on 2009-04-15 01:10)- with 0 comment(s)
Back to the analysis of the current GC implementation, in this post I will focus on the Gcx object structure and methods.
Gcx attributes
- Root set
- roots (nroots, rootdim)
- An array of root pointers.
- ranges (nranges, rangedim)
- An array of root ranges (a range of memory that should be scanned for root pointers).
- Beginning of the stack (stackBottom)
- A pointer to the stack bottom (assuming it grows up).
- Pool table (pooltable, npools)
- An array of pointers to Pool objects (the heap itself).
- Free list (bucket)
- A free list for each Bins size.
- Internal state
- anychanges
- Set if the marking of a range has actually marked anything (and then using in the full collection.
- inited
- Set if the GC has been initialized.
- Behaviour changing attributes
- noStack
- Don't scan the stack if activated.
- log
- Turn on logging if activated.
- disabled
- Don't run the collector if activated.
- Cache (for optimizations and such)
- p_cache, size_cache
- Querying the size of a heap object is an expensive task. This caches the last query as an optimization.
- minAddr, maxAddr
- All the heap is in this range. It's used as an optimization when looking if a pointer can be pointing into the heap (if the pointer is not in this range it can be safely discarded, but if it's in the range, a full search in the pooltable should be done).
Gcx main methods
- initialize()
- Initialization, set the Gcx object attributes to 0, except for the stackBottom (which is set to the address of a dummy local variable, this works because this function is one of the first functions called by the runtime) and the inited flag, that is set to 1. The log is initialized too.
- Dtor()
- Destruction, free all the memory.
- Root set manipulation
- addRoot(p), removeRoot(p), rootIter(dg)
- Add, remove and iterate over single root pointers.
- addRange(pbot, ptop), remove range(pbot), rangeIter(dg)
- Add, remove and iterate over root pointer ranges. This methods are almost the same as the previous ones, so the code duplication here can be improved here.
- Flags manipulation
Each Bin has some flags associated (as explained before). With this functions the user can manipulate some of them:
- FINALIZE: this pool has destructors to be called (final flag)
- NO_SCAN: this pool should not be scanned for pointers (noscan flag)
- NO_MOVE: this pool shouldn't be moved (not implemented)
- getBits(pool, biti)
- Get which of the flags specified by biti are set for the pool Pool.
- setBits(pool, mask)
- Set the flags specified by mask for the pool Pool.
- clrBits(pool, mask)
- Clear the flags specified by mask for the pool Pool.
- Searching
- findPool(p)
- Find the Pool object that pointer p is in.
- findBase(p)
- Find the base address of block containing pointer p.
- findSize(p)
- Find the size of the block pointed by p.
- getInfo(p)
- Get information on the pointer p. The information is composed of: base (the base address of the block), size (the size of the block) and attr (the flags associated to the block, as shown in Flag manipulation). This information is returned as a structure called the BlkInfo.
- findBin(size)
- Compute Bins (bin size) for an object of size size.
- Heap (pagetable) manipulation
The pooltable is kept sorted always.
- reserve(size)
- Allocate a new Pool of at least size bytes.
- minimize()
- Minimizes physical memory usage by returning free pools to the OS.
- bigAlloc(size)
- Allocate a chunk of memory that is larger than a page.
- newPool(npages)
- Allocate a new Pool with at least npages pages in it.
- allocPage(bin)
- Allocate a page of bin size.
- Collection
- mark(pbot, ptop)
This is the mark phase. It search a range of memory values and mark any pointers into the GC heap. The mark bit is set, and if the noscan bit is unset, the scan bit is activated (indicating that the block should be scanned for pointers, equivalent to coloring the cell grey in the tri-colour abstraction).
The mark phase is not recursive (nor a mark stack is used). Only the passed range is marked, pointers are not followed here.
That's why the anychanges flag is used, if anything has got marked, anychanges is set to true. The marking phase is done iteratively until no more blocks are marked, in which case we can safely assume that we marked all the live blocks.
- fullcollectshell()
- The purpose of the shell is to ensure all the registers get put on the stack so they'll be scanned.
- fullcollect(stackTop)
Collect memory that is not referenced by the program. The algorithm is something like this:
- Stop the world (all other threads)
- Clear all the mark/scan bits in the pools
- Manually mark each free list entry (bucket), so it doesn't get scanned
- mark() the static data
- mark() stacks and registers for each paused thread
- mark() the root set (both roots and ranges)
- mark() the heap iteratively until no more changes are detected (anychanges is false)
- Start the world (all other threads)
- Sweep (free up everything not marked)
- Free complete pages, rebuild free list
Note
This is a very summarized version of the algorithm, what I could understand from a quick look into the code, which is pretty much undocumented. A deeper analysis should be done in a following post.
GC optimization for contiguous pointers to the same page
by Leandro Lucarella on 2009-04-01 20:41 (updated on 2009-04-01 20:41)- with 0 comment(s)
This optimization had a patch, written by Vladimir Panteleev, sitting on Bugzilla (issue #1923) for a little more than an year now. It was already included in both Tango (issue #982) and DMD 2.x but DMD 1.x was missing it.
Fortunately is now included in DMD 1.042, released yesterday.
This optimization is best seen when you do word splitting of a big text (as shown in the post that triggered the patch):
import std.file, std.string;
void main() {
auto txt = cast(string) read("text.txt"); // 6.3 MiB of text
auto words = txt.split();
}
Now in words we have an array of slices (a contiguous area in memory filled with pointers) about the same size of the original text, as explained by Vladimir.
The GC heap is divided in (4KiB) pages, each page contains cells of a fixed type called bins. There are bin sizes of 16 (B_16) to 4096 (B_PAGE), incrementing in steps of power of 2 (32, 64, etc.). See Understanding the current GC for more details.
For large contiguous objects (like txt in this case) multiple pages are needed, and that pages contains only one bin of size B_PAGEPLUS, indicating that this object is distributed among several pages.
Now, back with the words array, we have a range of about 3 millions interior pointers into the txt contiguous memory (stored in about 1600 pages of bins with size B_PAGEPLUS). So each time the GC needs to mark the heap, it has to follow this 3 millions pointers and find out where is the beginning of that block to see its mark-state (if it's marked or not). Finding the beginning of the block is not that slow, but when you multiply it by 3 millions, it could get a little noticeable. Specially when this is done several times as the dynamic array of words grow and the GC collection is triggered several times, so this is kind of exponential.
The optimization consist in remembering the last page visited if the bin size was B_PAGE or B_PAGEPLUS, so if the current pointer being followed points to the last visited (cached) page, we can skip this lookup (and all the marking indeed, as we know we already visited that page).
Understanding the current GC
by Leandro Lucarella on 2009-01-04 18:37 (updated on 2009-04-09 19:53)- with 1 comment(s)
Oh, yeah! A new year, a new air, and the same thesis =)
After a little break, I'm finally starting to analyze the current D (druntime) GC (basic) implementation in depth.
First I want to say I found the code really, but really, hard to read and follow. Things are split in several parts without apparent reason, which make it really hard to understand and it's pretty much undocumented.
I hope I can fully understand it in some time to be able to make a full rewrite of it (in a first pass, conserving the main design).
Overview
I'll start with a big picture overview, and then I'll try to describe each component with more detail.
The implementation in split in several files:
- gcstats.d
- I didn't took a look at this one yet, but I guess it's about stats =).
- gcbits.d
- A custom bitset implementation for collector bit/flags (mark, scan, etc.).
- gcalloc.d
- A wrapper for memory allocation with several versions (malloc, win32, mmap and valloc). 4 functions are provided: map, unmap, commit and decommit. The (de)commit stuff if because (in Sean Kelly's words) Windows has a 2-phase allocation process. You can reserve the address space via map and unmap, but the virtual memory isn't actually created until you call commit. So decommit gets rid of the virtual memory but retains ownership of the address space.
- gcx.d
- The real GC implementation, split in 2 main classes/structs: GC and Gcx. GC seems to be a thin wrapper over Gcx that only provides the allocation logic (alloc/realloc/free) and Gcx seems to be the responsible for the real GC work (and holding the memory).
- gc.d
- This is just a thin wrapper over gcx.d to adapt it to the druntime GC interface.
The Gcx struct is where most magic happens. It holds the GC memory organized in pools. It holds the information about roots, the stack and free list, but in this post I'll focus in the memory pools:
Pool Concept
A pool is a group of pages, each page has a bin size (Bins) and host a fixed number of bins (PAGESIZE / Bins, for example, if Bins == 1024 and PAGESIZE == 4096, the page holds 4 bins).
Each bin has some bits of information:
- mark
- Setted when the Bin is visited by the mark phase.
- scan
- Setted when the Bin is has been visited by the mark phase (the mark bit is set) but it has pointers yet to be scanned.
- free
- Setted when the Bin is free (linked to a free list).
- final
- The object stored in this bin has a destructor that must be called when freed.
- noscan
- This bin should be not scanned by the collector (it has no pointers).
+----------------------------------------+-----+-----------------+ | Page 0 (bin size: Bins) | ... | Page (npages-1) | | | | | | +--------+-----+---------------------+ | | | | | Bin 0 | ... | Bin (PAGESIZE/Bins) | | | | | +--------+-----+---------------------+ | | | | | mark | ... | | | | | | | scan | ... | | | | ... | | | free | ... | ... | | | | | | final | ... | | | | | | | noscan | ... | | | | | | +--------+-----+---------------------+ | | | +----------------------------------------+-----+-----------------+
Pool Implementation
A single chunk of memory is allocated for the whole pool, the baseAddr points to the start of the chunk, the topAddr, to the end. A pagetable holds the bin size (Bins) of each page
. ,-- baseAddr topAddr --,
| ncommitted = i |
| |
|--- committed pages ---,------ uncommitted pages -------|
V | V
+--------+--------+-----+--------+-----+-----------------+
memory | Page 0 | Page 1 | ... | Page i | ... | Page (npages-1) |
+--------+--------+-----+--------+-----+-----------------+
/\ /\ /\ /\ /\ /\
|| || || || || ||
+--------+--------+-----+--------+-----+-----------------+
pagetable | Bins 0 | Bins 1 | ... | Bins i | ... | Bins (npages-1) |
(bin size) +--------+--------+-----+--------+-----+-----------------+
The bin size can be one of:
- B_XXX
- The XXX is a power of 2 from 16 to 4096. The special name B_PAGE is used for the size 4096.
- B_PAGEPLUS
- The whole page is a continuation of a large object (the first page of a large object has size B_PAGE).
- B_FREE
- The page is completely free.
- B_UNCOMMITED
- The page is not committed yet.
- B_MAX
- Not really a value, used for iteration or allocation. Pages can't have this value.
The information bits are stored in a custom bit set (GCBits). npages * PAGESIZE / 16 bits are allocated (since the smallest bin is 16 bytes long) and each bit is addressed using this formula:
bit(pointer) = (pointer - baseAddr) / 16
This means that a bit is reserved each 16 bytes. For large bin sizes, a lot of bits are wasted.
The minimum pool size is 256 pages. With 4096 bytes pages, that is 1 MiB.
The GCBits implementation deserves another post, it's a little complex and I still don't understand why.


















