Wednesday, April 23, 2014

GSoC Progress Report #1: Complete Repository Garbage Collection

In my first week, I worked on completing garbage collection for repositories.

Darcs stores all the information it needs under the _darcs directory. In this part of the project we are only interested in the files stored in three directories:
  • _darcs/patches/: stores the patches.
  • _darcs/pristine.hashed/: stores the last saved state of the working copy.
  • _darcs/inventories/: stores the inventories (lists of patches).
While working on a project under version control, these directories grow in size.
Every time we record a new patch:
  • A new inventory file is stored in _darcs/inventories/ containing the augmented list of patches. The old inventory file (without the new patch) is then no longer needed, at least in most cases.
  • A new patch file is stored in _darcs/patches/. If we later unrecord this patch, the patch file is no longer needed.
  • The same happens with _darcs/pristine.hashed/. 

So, why do we keep these files if we no longer need them? Well, darcs wants to be fast, so it does not delete these files as it goes. Also, if the repository is public and someone is cloning it, you don’t want files disappearing in the middle of the process.

Until now, the "darcs optimize" command only knew how to clean up the _darcs/pristine.hashed/ directory, and the only way to clean the other two directories was to do a "darcs get". With the changes introduced, "darcs optimize" now also cleans these directories.

Algorithms:

The implemented algorithm is pretty straightforward. In pseudo-code:

- inventory = _darcs/hashed_inventory
- while (inventory) 
    - useful_inventories += inventory
    - inventory = next_inventory(inventory) 
- remove files in _darcs/inventories/ that are not in useful_inventories.

- inventory = _darcs/hashed_inventory
- while (inventory) 
    - useful_patches += get_patches(inventory)    
    - inventory = next_inventory(inventory)
- remove files in _darcs/patches/ that are not in useful_patches.

We can see that we traverse the inventory chain twice, once for the inventories and once for the patches. Although this is not optimal, I think it is more modular, since we now have a function that returns the list of patches.
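
To make the two passes more concrete, here is a minimal Python sketch of the same idea. It is only an illustration, not the darcs implementation (darcs is written in Haskell), and it does not read the real on-disk format: the Inventory record, the store dictionary, the file names and every function name are hypothetical, and the head inventory lives in the same dictionary as the others for simplicity.

```python
from typing import Dict, Iterator, List, NamedTuple, Optional, Set


class Inventory(NamedTuple):
    """Toy in-memory inventory (hypothetical, not the darcs file format)."""
    parent: Optional[str]  # name of the previous inventory, or None at the end
    patches: List[str]     # names of the patches listed in this inventory


def walk_inventories(head: Optional[str],
                     store: Dict[str, Inventory]) -> Iterator[str]:
    """Yield every inventory name reachable from head via parent links."""
    current = head
    while current is not None:
        yield current
        current = store[current].parent


def useful_inventories(head: str, store: Dict[str, Inventory]) -> Set[str]:
    """First pass: collect the inventories that are still referenced."""
    return set(walk_inventories(head, store))


def useful_patches(head: str, store: Dict[str, Inventory]) -> Set[str]:
    """Second pass: collect the patches listed by the reachable inventories."""
    return {patch
            for name in walk_inventories(head, store)
            for patch in store[name].patches}


def garbage(files_on_disk: Set[str], useful: Set[str]) -> Set[str]:
    """Files that exist on disk but are no longer referenced."""
    return files_on_disk - useful


# Example data: "inv-old" and "patch-x" stand for leftovers of an unrecord.
store = {
    "inv-3": Inventory("inv-2", ["patch-c"]),
    "inv-2": Inventory("inv-1", ["patch-b"]),
    "inv-1": Inventory(None, ["patch-a"]),
    "inv-old": Inventory("inv-2", ["patch-x"]),
}
patch_files = {"patch-a", "patch-b", "patch-c", "patch-x"}

stale_inventories = garbage(set(store), useful_inventories("inv-3", store))
stale_patches = garbage(patch_files, useful_patches("inv-3", store))
# stale_inventories == {"inv-old"} and stale_patches == {"patch-x"}
```

Both passes reuse the same walk_inventories helper, which is the kind of modularity mentioned above: the chain is walked twice, but the patch list is available as a function of its own.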


Commands affected:

- darcs optimize

Use cases:

It is useful when you need to free space on your hard disk.
For example:
- Record a new patch.
- Unrecord the new patch.
- Run "darcs optimize" to garbage collect the unused files corresponding to the unrecorded patch. Details at: http://pastebin.com/vYHiYV0F
You can find more use cases in the regression test script:
http://hub.darcs.net/darcs/darcs-screened/browse/tests/issue1987.sh

Issues solved:

http://bugs.darcs.net/issue1987

Patches created:

http://bugs.darcs.net/patch1134