Database overview

Gw files

GeneWeb can create a genealogy database from one or more source files with the .gw extension. These files describe, in a structured way, persons, families, all kinds of relationships, various events, and so on. You can read more about the file structure here. The gwc binary reads .gw files, extracts all information about persons and families, and passes it to the Gwdb module, which creates the database.

Database entries

The information passed to Gwdb consists mainly of:

  • An array of all strings, holding any kind of information encoded as a string, for example a person’s name, birth place, or marriage place. An istr identifier references a string in this array (it is the index of the element in the array).

  • An array of persons, where each element holds the information about one person. Every string field of a person (such as name or birthplace) is an istr identifier; the actual string is stored in the strings array described above. References to other persons use an iper identifier, which is the index of an element in the persons array.

  • An array of families, where each element holds the information about one family (couple, children, marriage date, etc.). An ifam identifier references a family in this array (it is the index of the element in the array).

The elements of each array are data structures defined in the Def module. These entries are the main source for every database request.
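
The relationship between the three arrays and their identifiers can be sketched as follows. This is a minimal illustration in Python, not GeneWeb's OCaml code: the field names (first_name, surname, birth_place, father, mother, children) and the intern and add_person helpers are assumptions made for the example. The actual record types are the ones defined in Def.

```python
from dataclasses import dataclass, field
from typing import List

# Identifiers are plain indices into the corresponding array.
Istr = int  # index into `strings`
Iper = int  # index into `persons`
Ifam = int  # index into `families`


@dataclass
class Person:
    # Hypothetical fields: each one is an istr, not the string itself.
    first_name: Istr
    surname: Istr
    birth_place: Istr


@dataclass
class Family:
    # Hypothetical fields: persons are referenced by iper.
    father: Iper
    mother: Iper
    children: List[Iper] = field(default_factory=list)


strings: List[str] = []
persons: List[Person] = []
families: List[Family] = []


def intern(s: str) -> Istr:
    """Return the index of `s` in `strings`, appending it if absent (hypothetical helper)."""
    try:
        return strings.index(s)
    except ValueError:
        strings.append(s)
        return len(strings) - 1


def add_person(first_name: str, surname: str, birth_place: str) -> Iper:
    """Append a person and return its iper (hypothetical helper)."""
    persons.append(Person(intern(first_name), intern(surname), intern(birth_place)))
    return len(persons) - 1


pierre = add_person("Pierre", "Dupont", "Paris")
marie = add_person("Marie", "Durand", "Lyon")
jean = add_person("Jean", "Dupont", "Paris")
families.append(Family(father=pierre, mother=marie, children=[jean]))

# Resolving a string field always goes through the strings array.
child = persons[families[0].children[0]]
print(strings[child.first_name], strings[child.surname])  # prints "Jean Dupont"
```

Because persons store indices rather than strings, "Dupont" and "Paris" are stored once and shared by every person that refers to them.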

Storage

Gwdb is responsible for creating the database on disk from the provided inputs. It creates a directory dbname.gwb containing several files. The main file, base, contains a marshalled representation of each array, and base.acc stores the offset of every entry, which allows constant-time access. In addition, Gwdb creates index files that associate information useful for requests with the identifier of the entry in the base file. This lets a request find an entry directly, without iterating over every entry in the database. For example, strings.inx is a string index that gives the id of a searched string. One file is slightly different: the patches file, which stores every modification made to the base (see the Modifications subsection). The storage manipulation interface is described in lib/gwdb_driver.mli/gwdb_driver.mli. This is a virtual module whose current implementation is available in gwdb-legacy. The format and description of every database file are listed below:

base - the base itself
  magic number (magic_gwb)                 : string of length 8
  number of persons                        : binary_int
  number of families                       : binary_int
  number of strings                        : binary_int
  persons array offset in file             : binary_int
  ascends array offset in file             : binary_int
  unions array offset in file              : binary_int
  families array offset in file            : binary_int
  couples array offset in file             : binary_int
  descends array offset in file            : binary_int
  strings array offset in file             : binary_int
  notes origin file                        : value
  persons array                            : value
  ascends array                            : value
  unions array                             : value
  families array                           : value
  couples array                            : value
  descends array                           : value
  strings array                            : value

base.acc - direct accesses to arrays inside base
  persons offsets   : array of binary_ints
  ascends offsets   : array of binary_ints
  unions offsets    : array of binary_ints
  families offsets  : array of binary_ints
  couples offsets   : array of binary_ints
  descends offsets  : array of binary_ints
  strings offsets   : array of binary_ints

names.inx - index for names, strings of first names and surnames
  offset to sindex : binary_int
  offset to findex : binary_int
  1st index (mixes between names) : value 
    array, length = 16383, associating:
      - a hash value of a "crushed" (module "Name") name 
        (modulo length)
      - to the array of ids of the corresponding persons
  2nd index (surnames sub-strings) : value
    array, length = "table_size", associating:
      - a hash value of the "crushed" (module "Name") surname 
        sub-string (modulo length)
      - to the array of the corresponding surnames (string ids)
      that contain the given surname sub-string
  3rd index (first name sub-strings) : value 
    array, length = 16383, associating:
      - a hash value of the "crushed" (module "Name") first name 
      sub-string (modulo length)
      - to the array of the corresponding string ids that contain
      the given first name sub-string

names.acc - direct accesses to values inside arrays in names.inx

strings.inx - index for all strings
  length of the strings offset array : binary_int
  strings hash table index           : 2 arrays of binary_ints
    strings offset array (length = prime after 10 * strings 
    array length)
      - associating a hash value of the string modulo length
      - to its id in the string array
    strings list array (length = string array length)
      - associating a string id
      - to the id of the next index (previous value) holding the 
      same hash value

snames.inx - index for surnames
  array ordered by surname  
    - associating the string id of a surname
    - to a pointer (offset) inside snames.dat

snames.dat - data associated with snames.inx
  array of lists of persons holding a surname

fnames.inx - index for first names
  array ordered by first name 
    - associating the string id of a first name
    - to a pointer (offset) inside fnames.dat

fnames.dat - data associated with fnames.inx
  array of lists of persons holding a first name

notes - text file containing database notes

notes_d - directory containing .txt for each extended page

particles.txt - text file with authorised name particles

patches - modification inside the database
  When updated, none of the previous files are modified. 
  Only this one is written and rewritten. It holds a record 
  of type "patches", composed of association lists 
  "index" - "new value".

nb_persons - number of real persons (including those added by patches)

synchro_patches - timestamped history of base's modifications. 

restrict - defines visibility of each person in the base 
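
The two arrays described for strings.inx form a hash table with chaining: the offset array gives, for each hash slot, one string id, and the list array links each string id to the previous id that fell into the same slot. The lookup can be sketched as below. This is an illustration only: toy_hash is a stand-in and not GeneWeb's hash function, the -1 end-of-chain marker is an assumption, and "prime after" is read here as the first prime strictly greater than the given number.

```python
from typing import List, Optional, Tuple


def is_prime(n: int) -> bool:
    """Trial-division primality test, good enough for a sketch."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True


def next_prime(n: int) -> int:
    """First prime strictly greater than n."""
    n += 1
    while not is_prime(n):
        n += 1
    return n


def toy_hash(s: str) -> int:
    """Stand-in hash (assumption); the real hash function is not reproduced here."""
    h = 0
    for b in s.encode("utf-8"):
        h = (h * 31 + b) & 0x3FFFFFFF
    return h


def build_index(strings: List[str]) -> Tuple[List[int], List[int]]:
    """Build the slot array and the chain array for `strings`."""
    size = next_prime(10 * len(strings))
    slots = [-1] * size            # hash slot -> most recently inserted string id
    chain = [-1] * len(strings)    # string id -> previous id in the same slot
    for istr, s in enumerate(strings):
        slot = toy_hash(s) % size
        chain[istr] = slots[slot]
        slots[slot] = istr
    return slots, chain


def find(strings: List[str], slots: List[int], chain: List[int], s: str) -> Optional[int]:
    """Return the id of `s`, or None if it is not in the strings array."""
    istr = slots[toy_hash(s) % len(slots)]
    while istr != -1:
        if strings[istr] == s:
            return istr
        istr = chain[istr]
    return None


strings = ["", "?", "Pierre", "Dupont", "Paris"]
slots, chain = build_index(strings)
print(find(strings, slots, chain, "Dupont"))  # prints 3
print(find(strings, slots, chain, "Marie"))   # prints None
```

Only the strings sharing a slot with the searched string are compared, so the lookup does not scan the whole strings array.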

Modifications

When a modification is requested, GeneWeb does not update the base file itself. Instead, it records the change in the patches file, which holds all the latest modifications to the base. Every modification (patch) remains pending until the patches are committed with a commit_patches request. The commit updates the patches file.

Patching covers only operations that add or modify an entry. Deleting an entry works differently: the entry is replaced by a dummy entry and later removed by GeneWeb’s garbage collector, gwgc, which compacts the database arrays. Another useful tool, fixbase, locates and fixes inconsistencies in the base and updates all database files.
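
The patching scheme can be pictured as an overlay on a read-only array. The sketch below is a simplified model, not GeneWeb's implementation: the PatchedArray class and its method names other than commit_patches are hypothetical, a dictionary stands in for the "index" - "new value" association lists, and the choice to make pending patches invisible to reads until they are committed is an assumption of this sketch.

```python
from typing import Dict, Sequence


class PatchedArray:
    """Read-only base array plus an overlay of committed patches (hypothetical model)."""

    def __init__(self, base: Sequence[str]) -> None:
        self._base = tuple(base)              # stands in for an array stored in `base`
        self._committed: Dict[int, str] = {}  # stands in for the content of `patches`
        self._pending: Dict[int, str] = {}    # patches not yet committed

    def patch(self, index: int, value: str) -> None:
        """Record an added or modified entry; the base array is left untouched."""
        self._pending[index] = value

    def commit_patches(self) -> None:
        """Move pending patches into the committed overlay."""
        self._committed.update(self._pending)
        self._pending.clear()

    def get(self, index: int) -> str:
        """Read an entry, preferring a committed patch over the base array."""
        if index in self._committed:
            return self._committed[index]
        return self._base[index]

    def __len__(self) -> int:
        """Number of entries, counting those added by committed patches."""
        return max([len(self._base)] + [i + 1 for i in self._committed])


arr = PatchedArray(["Pierre", "Marie"])
arr.patch(1, "Marie-Anne")   # modify an existing entry
arr.patch(2, "Jean")         # add a new entry past the end of the base array
before = arr.get(1)          # still "Marie": nothing has been committed yet
arr.commit_patches()
print(before, arr.get(1), arr.get(2), len(arr))
```

The length computed here plays the same role as the nb_persons file: it counts the entries of the base array together with those that exist only as patches.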

Example

Here is an example of how GeneWeb displays the birth dates of persons with a given first name (say “Pierre”), ignoring caches:

  • First, it performs a dichotomous (binary) search in fnames.inx for the string id (istr) that references “Pierre”.

  • Then, using the offset associated with “Pierre” in fnames.inx, it reads the position in the file fnames.dat where the list of ids of persons (iper) with first name “Pierre” is stored.

  • For each person id, it gets the offset of the person’s entry from the base.acc file.

  • Then it reads the person’s entry at that offset and gets the field holding the birth date.

  • Finally, it displays all extracted birth dates.
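
The steps above can be sketched end to end with in-memory stand-ins for the four files. This is a toy model under stated assumptions, not GeneWeb's on-disk format: binary_int is taken to be the 4-byte big-endian integer written by OCaml's output_binary_int, the entries in base and fnames.dat use a made-up length-prefixed encoding instead of marshalled OCaml values, a person entry is reduced to its birth date, and first names are ordered by plain string comparison.

```python
import struct
from typing import List


def put_int(n: int) -> bytes:
    """Encode a 4-byte big-endian integer (assumed binary_int layout)."""
    return struct.pack(">i", n)


def get_int(buf: bytes, pos: int) -> int:
    """Decode a 4-byte big-endian integer at position `pos`."""
    return struct.unpack_from(">i", buf, pos)[0]


strings = ["Anne", "Pierre", "Zoé"]
# Toy persons: (istr of the first name, birth date).
people = [(1, "1850-03-02"), (0, "1871-11-20"), (1, "1902-07-14")]

# base: one toy entry per person; base.acc: offset of each entry in base.
base, base_acc = bytearray(), bytearray()
for _, birth in people:
    base_acc += put_int(len(base))
    data = birth.encode("utf-8")
    base += put_int(len(data)) + data

# fnames.inx: (istr, offset) pairs ordered by first name;
# fnames.dat: at each offset, a count followed by that many iper values.
fnames_inx, fnames_dat = [], bytearray()
for istr in sorted(range(len(strings)), key=lambda i: strings[i]):
    ipers = [ip for ip, (fn, _) in enumerate(people) if fn == istr]
    fnames_inx.append((istr, len(fnames_dat)))
    fnames_dat += put_int(len(ipers)) + b"".join(put_int(ip) for ip in ipers)


def birth_dates(first_name: str) -> List[str]:
    # Step 1: dichotomous search in fnames.inx.
    lo, hi = 0, len(fnames_inx)
    while lo < hi:
        mid = (lo + hi) // 2
        if strings[fnames_inx[mid][0]] < first_name:
            lo = mid + 1
        else:
            hi = mid
    if lo == len(fnames_inx) or strings[fnames_inx[lo][0]] != first_name:
        return []
    # Step 2: read the list of iper values from fnames.dat.
    dat_offset = fnames_inx[lo][1]
    count = get_int(fnames_dat, dat_offset)
    ipers = [get_int(fnames_dat, dat_offset + 4 * (j + 1)) for j in range(count)]
    dates = []
    for iper in ipers:
        # Step 3: constant-time offset lookup in base.acc.
        offset = get_int(base_acc, 4 * iper)
        # Step 4: read the entry in base and extract the birth date.
        size = get_int(base, offset)
        dates.append(bytes(base[offset + 4:offset + 4 + size]).decode("utf-8"))
    return dates


# Step 5: display the extracted birth dates.
print(birth_dates("Pierre"))  # prints ['1850-03-02', '1902-07-14']
```

At no point does the lookup scan all persons: the search in fnames.inx is logarithmic in the number of first names, and each person entry is reached through a single offset read in base.acc.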