SQL and GEDCOM – VOL 1

Introduction

Two of my hobbies – SQL and genealogy – led me to create an SQL database of genealogical data, analyze it, compute some statistics and solve graph exercises on it.

The main storage for my genealogical data (and not only mine) is Geni.com, and from Geni I can export data to GEDCOM. Some of my friends asked me to make a local copy of the info stored in Geni – so this export also has practical value.

GEDCOM export is a bit limited – 50K person records at a time (but in Geni there are close to 100M person records). So, if you want to download a big “family tree”, you need to make several exports. Fortunately, the same person in different exports is always identified by the same id – this allows merging those exports later.

Today I can play with quite big (by local standards) databases – approx 2M persons, the result of close to 200 exports. I can test both performance and graph algorithms.

Along the way I have solved several SQL-oriented problems and built one nice Excel tool for finding different paths between persons. I can’t share the data, but I’m happy to share the tools I’ve created, so anyone interested can try them with their own exports.

This article is the 1st in a series – preparation. In the next article I plan to describe how to extract info from the prepared data. The third will demonstrate some path-finders, and after that I plan to play a bit with Excel and genealogical data. The last one will try to generate a new GEDCOM with a different scope from the combined GEDCOMs – targeted at other software like FamilyTreeMaker or similar.

0. GEDCOM structure (short review)

GEDCOM is a typical, standardized way to store and exchange genealogical (and other) information between different applications.

0 @I1448552@ INDI
1 NAME Henn /Sarv/
2 GIVN Henn
2 SURN Sarv
2 NICK सींग
1 SEX M
1 BAPL
2 DATE 20 APR 1929
1 ENDL
2 DATE 21 NOV 1929
1 BIRT
2 DATE 7 MAR 1955
2 ADDR
3 CITY Tallinn
3 CTRY Estonia
1 GRAD
2 DATE 1961
…
1 FAMC @F6000000001528333119@
1 FAMS @F6000000001656449693@
2 NOTE {geni:marriage_order} 1
1 FAMS @F6000000001646452992@
2 NOTE {geni:marriage_order} 2

A GEDCOM file is a text file containing rows with objects, attributes, sub-attributes and references between them: objects (INDI – person, FAM – family, etc.), object attributes (NAME, BIRT), components of attributes (components of NAME: GIVN, SURN; components of BIRT: DATE and ADDR), components of components (components of ADDR: CITY, CTRY) and so on. Each row begins with a number indicating its level (0 – object, 1 – attribute, 2 – sub-attribute, etc.). A row with a higher level belongs to the nearest preceding row with a lower level – a hierarchical structure.
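To make the level-based structure concrete, here is a small Python sketch (an illustration only – the article’s processing is done in T-SQL) that splits one GEDCOM row into its level, optional @…@ pointer, tag and value:

```python
def parse_gedcom_line(line):
    # "0 @I1448552@ INDI" -> level, pointer, tag, value
    level, _, rest = line.strip().partition(" ")
    xref = None
    if rest.startswith("@"):                 # level-0 object rows carry the pointer first
        xref, _, rest = rest[1:].partition("@ ")
        xref = "@" + xref + "@"
    tag, _, value = rest.partition(" ")
    return int(level), xref, tag, value

print(parse_gedcom_line("0 @I1448552@ INDI"))   # (0, '@I1448552@', 'INDI', '')
print(parse_gedcom_line("2 DATE 7 MAR 1955"))   # (2, None, 'DATE', '7 MAR 1955')
```

The function name is made up for this sketch; the SQL view in section 2 does the same slicing with CHARINDEX and SUBSTRING.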

The sample GEDCOM fragment shown above tells that the PERSON with ID=“I1448552” has NAME=“Henn Sarv” (first name GIVN “Henn”, family name SURN “Sarv”, nickname NICK “सींग”) and that he was born (BIRT) on the date (DATE “7 MAR 1955”) in the place (ADDR CITY “Tallinn”, CTRY “Estonia”).

An object may refer to other objects. Henn’s “object set” refers to three families (FAM) – one where Henn is a child (FAMC with ID=“F6000000001528333119”) and two where Henn is a spouse/parent (FAMS).

Families appear later in the GEDCOM file as objects (row sets) like:

0 @F6000000001528333119@ FAM
1 MARR
2 DATE 28 AUG 1936
1 HUSB @I6000000001528333113@
1 WIFE @I6000000001528640081@
1 CHIL @I6000000001592998311@
1 CHIL @I6000000001604210254@
1 CHIL @I6000000001604632474@
1 CHIL @I6000000001528309143@
1 CHIL @I6000000001617582391@
1 CHIL @I6000000012433193658@
1 CHIL @I1448552@

This example describes a family with ID=“F6000000001528333119”. The parents were married (MARR) on the date (DATE “28 AUG 1936”); the place of the marriage is missing (not known). The father of the family (HUSB) is the person with ID=“I6000000001528333113”, the mother (WIFE) is the person with ID=“I6000000001528640081”, and there are seven children (CHIL) in the family.
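For illustration, such a family record could be collected into one structure like this (a hypothetical Python sketch, not part of the SQL pipeline – the function name is made up):

```python
def read_family(rows):
    # rows: the lines of one FAM record, in file order
    fam = {"id": None, "husb": None, "wife": None, "chil": []}
    for line in rows:
        parts = line.split()
        if parts[0] == "0":                       # "0 @F...@ FAM" - the object row
            fam["id"] = parts[1].strip("@")
        elif parts[1] == "HUSB":
            fam["husb"] = parts[2].strip("@")
        elif parts[1] == "WIFE":
            fam["wife"] = parts[2].strip("@")
        elif parts[1] == "CHIL":
            fam["chil"].append(parts[2].strip("@"))
        # MARR, DATE etc. are ignored in this sketch
    return fam
```

Running it over the fragment above would give the family id, both parent ids and the list of seven child ids.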

1. First phase – reading GEDCOM rows into the database

  • problems – UTF-8 / Unicode
  • line numbering

The BULK INSERT command is suitable for reading a GEDCOM file into the database, but there are three problems:

1st – the filename parameter can’t be a variable or an expression, so I have to use dynamically built SQL (exec(@x));

2nd – SQL Server doesn’t accept (interpret correctly) UTF-8, the format I get from the Geni export, so I have to convert from UTF-8 to Unicode (UTF-16). For that I can use PowerShell:

(Get-Content -Encoding UTF8 -Path $x) | Set-Content -Encoding Unicode -Path $x

3rd – numbering of the lines. In relational databases all sets (tables) are unordered – even when a table has a clustered index, reading it may happen in parallel. So during the load I have to assign a unique, increasing number to each row, because row order is extremely important in GEDCOM. BULK INSERT loads only the columns that come from the file, and INSERT … FROM OPENROWSET doesn’t work here.

I solved this in the following way:

  • created a table with an IDENTITY column for storing the file rows
  • created a view containing only the line column of that table
  • made this view the target of BULK INSERT

The result is that the rows from the file are stored in the table in their original order, numbered by the IDENTITY column, and each row can be uniquely referenced by this number.

For logging and history I created one more table to store the loading history.

1.1 Creating necessary objects

  • Table gedcom_data:
CREATE TABLE dbo.gedcom_data
(
	nr bigint IDENTITY(1,1) PRIMARY KEY,
	line nvarchar(1024) NULL
)
  • View gedcom_file:
CREATE VIEW dbo.gedcom_file
AS
SELECT line FROM dbo.gedcom_data
  • Table load_history:
CREATE TABLE dbo.load_history
(
jrknr int IDENTITY(1,1) PRIMARY KEY,  -- jrknr means [loading] order number
first_nr bigint NOT NULL,
ridu bigint NOT NULL,  -- ridu means count of lines
filename nvarchar(258) NULL,
load_date datetime NULL
)
  • Procedure Loe_Gedcom (loe means read):
CREATE PROCEDURE [dbo].[Loe_Gedcom] (@file nvarchar(256))
as
select h.load_date, h.filename from load_history h where h.filename = @file
IF @@ROWCOUNT = 0
BEGIN
	declare @x nvarchar(max) = 
'
DECLARE @i int = (select coalesce(max(nr), 0) from dbo.gedcom_data) + 1
IF @i = 1 set @i = (select coalesce(max(nr), 0) from dbo.gedcom_data_arhiiv) + 1 -- gedcom_data_arhiiv is the archive table (see 2.3)
begin tran;
bulk insert dbo.gedcom_file
from ''' + @file + '''
with (
rowterminator = ''\r''
, datafiletype = ''widechar''
);
insert dbo.load_history (first_nr, ridu, filename) values (@i, @@rowcount, ''' + @file + ''');
commit;
'
execute(@x)
end

else
print 'already loaded: ' + @file

1.2 Execution:

exec Loe_Gedcom 'c:\kataloog\failinimi.ged'

Run this once per file exported from Geni. Sometimes I create these exec statements manually, sometimes using PowerShell.

1.3 Correction:

Because the load used rowterminator = '\r', every Windows line ending (\r\n) leaves its line feed (char(10)) at the start of the next row – the following update strips it, together with leading spaces:

update gedcom_data set line = ltrim(replace(line, char(10), ''))

2. Initial extraction

Next I need to make some row-level modifications. With 1–2 GEDCOM files this isn’t a challenge, but when I load several tens or hundreds of files together, I need batch-wise processing to keep the transaction log and transaction duration under control.

The exported (and now loaded) GEDCOM files contain a lot of rows that are meaningless for my goal (approx 50%). It’s reasonable to filter those rows out without deleting them, because I might still need them later to create a summary GEDCOM file from the merged ones.

For this reason I created a new table containing only the required rows, plus new columns for parsed dates and ids.

At the beginning I created a view showing only the rows I need, and then a table to store those rows. With a small number of rows this technique worked, but when the large number of rows forced me to process batch-wise, I had to convert the view into a procedure.

2.1 Create objects:

  • View parsed_view:
CREATE VIEW [dbo].[parsed_view]
as
with t1 as
(
	select
	nr
	, TRY_PARSE(left(line,2) as int) lvl
	, line
	, CHARINDEX('@', line) [start] -- 1st occurrence of @
	from dbo.gedcom_data
),
t2 as
(
	select *
	, case when [start] > 0 then CHARINDEX('@', line, [start]+1) end [stop]
	from t1
	where lvl is not null
),
t3 as
(
	select *
	, case when [start] > 0 and [stop] > 0 then SUBSTRING(line, [start], [stop]-[start]+1) end kood
	, case when [start] = 3 then SUBSTRING(line, [stop] + 2, 4) else SUBSTRING(line, 3, 4) end class
	, case when [start] <> 3 then ltrim(substring(line, 7, 100)) end data
	from t2
)
select nr, lvl,
case when class in ('indi', 'fam') then kood end kood,   -- important to read codes only from right rows
class, data
, case when class = 'DATE' then try_parse (data as date using 'en-us') end as datevalue
, case when class = 'DATE' then data end as datetext
, case when class = 'NAME' then REPLACE(data, '/', '') end as name
from t3
where class not in -- unnecessary 'classes'
(
'head','titl', 'sour', 'vers', 'subm',
'gedc', 'vers', 'form', 'char', 'nick',
'cont', 'note', 'addr', 'rfn', 'time',
'obje', 'occu', 'phon', 'plac', 'post',
'chan', 'form', 'file', 'city'
)

This view was created step by step using CTEs (common table expressions). First I needed to find the code in each row, between the first ‘@’ and the second ‘@’.

However, the MS SQL implementation (T-SQL) doesn’t allow reusing column aliases within the same SELECT list, so I needed three expressions: t1 finds the first occurrence of ‘@’ (if it exists), t2 finds the second occurrence of ‘@’ (if it exists) and t3 extracts the required code with SUBSTRING.

The level number at the beginning of the row goes into [lvl], the class name (type of row) into [class]. NB! Expression t3 (over t2) must differentiate: in level-0 rows the class comes after the code near the end of the row, but in all other rows it comes from the beginning.

And finally, three simple steps:

  1. for class NAME, fill column [name] (with ‘/’ removed)
  2. for class DATE, try to fill [datevalue] and fill [datetext]
  3. and last – exclude the rows (classes) unused in the system
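The start/stop/code extraction that t1–t3 perform can be sketched in Python for readability (an illustration mirroring the T-SQL logic, not code used by the pipeline; the function name is made up):

```python
def split_row(line):
    # Mirrors t1-t3: find the two @s, slice out the code, class and data.
    start = line.find("@") + 1                              # CHARINDEX('@', line); 0 = not found
    stop = line.find("@", start) + 1 if start else 0        # 2nd occurrence, after the first
    kood = line[start - 1:stop] if start and stop else None # code between the @s, inclusive
    if start == 3:                                          # level-0 object row: class follows the code
        klass = line[stop + 1:stop + 5]
    else:                                                   # other rows: class starts at position 3
        klass = line[2:6]
    data = line[6:].lstrip() if start != 3 else None        # ltrim(substring(line, 7, ...))
    return kood, klass.strip(), data
```

For example, `split_row("0 @I1448552@ INDI")` yields the code `@I1448552@` and class `INDI`, while `split_row("1 NAME Henn /Sarv/")` yields no code, class `NAME` and data `Henn /Sarv/`.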
  • Table parsed_data:
CREATE TABLE dbo.parsed_data(
nr bigint constraint parsed_data_x0 primary key,
lvl int NULL,
kood char(24) NULL,
class char(4) NULL,
data nvarchar(1000) NULL,
datevalue date NULL,
datetext nvarchar(40) NULL,
name nvarchar(128) NULL
)

CREATE NONCLUSTERED INDEX x1 ON dbo.parsed_data (kood)
INCLUDE (class, data, datevalue, datetext, name)

CREATE NONCLUSTERED INDEX x2 ON dbo.parsed_data (class)
INCLUDE (nr, datevalue, datetext)

CREATE NONCLUSTERED INDEX x3 ON dbo.parsed_data (lvl, kood)
INCLUDE (nr)

CREATE NONCLUSTERED INDEX x4 ON dbo.parsed_data (class)
INCLUDE (nr, kood, data)

This table stores (materializes) the results of the previous view. The indexes were added during the work, following execution plan recommendations (missing index details).

  • Procedure Parsed_Load:

Initially I used INSERT … SELECT to store the data, but batch-wise processing required some modifications. I can find the row where the next batch should start, but a WHERE placed at view level wasn’t visible to the query optimizer at the t1-expression level, so `where nr > @start` was applied only at the last expression level. To improve performance I created a procedure doing exactly the same as the INSERT … SELECT.

Initial version of the batch-processing:

declare @start bigint =
(select top 1 nr from parsed_data order by nr desc)
-- fastest way to get the first number to process

declare @batch int = 1000000

insert top (@batch) parsed_data
select * from parsed_view
where nr > @start

The procedure converted from the initial view:

CREATE PROCEDURE dbo.Parsed_Load
@batch int = 1000000
as
declare @start bigint = (select coalesce( (select top 1 nr from parsed_data order by nr desc), 0))
;
with t1 as
(
	select nr
	, TRY_PARSE(left(line,2) as int) lvl
	, line
	, CHARINDEX('@', line) algus -- algus means start ([start] in the view)
	from dbo.gedcom_data
	where nr > @start
	-- main difference: the WHERE is moved into expression t1
),
t2 as
(
	select *
	, case when algus > 0 then CHARINDEX('@', line, algus+1) end lõpp -- lõpp means end ([stop] in the view)
	from t1
	where lvl is not null
),
t3 as
(
	select *
	, case when algus > 0 and lõpp > 0 then SUBSTRING(line, algus, lõpp-algus+1) end kood
	, case when algus = 3 then SUBSTRING(line, lõpp + 2, 4) else SUBSTRING(line, 3, 4) end class
	, case when algus <> 3 then ltrim(substring(line, 7, 100)) end data
	from t2
),
t4 as
(
	select nr, lvl,
	case when class in ('indi', 'fam') then kood end kood,   
	-- important: codes are read only from the right rows
	class, data
	, case when class = 'DATE' then try_parse (data as date using 'en-us') end as datevalue
	, case when class = 'DATE' then data end as datetext
	, case when class = 'NAME' then REPLACE(data, '/', '') end as name
	from t3
	where class not in
	(
	'head','titl', 'sour', 'vers', 'subm',
	'gedc', 'vers', 'form', 'char', 'nick',
	'cont', 'note', 'addr', 'rfn', 'time',
	'obje', 'occu', 'phon', 'plac', 'post',
	'chan', 'form', 'file', 'city'
	)
)
insert top (@batch) parsed_data
select * from t4

2.2 Batch-wise execution

The last procedure could be executed in the following loop:

select 1 -- to guarantee that @@rowcount > 0
while @@ROWCOUNT > 0
begin
	checkpoint -- to minimize transaction log
	exec Parsed_Load 1000000 --  batchsize
end

Other batch-wise loops I execute in the same way.

The original row numbers (column [nr]) are good for referencing the last processed row, starting a new batch and, later, referencing back to the “source”.
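The pattern of that loop – run the batch step until it reports zero affected rows – can be sketched language-neutrally (a Python illustration; `drain` and `load_batch` are made-up names standing in for the while loop and `exec Parsed_Load`):

```python
def drain(source, batch_size, load_batch):
    # Keep calling the batch step until it reports no rows processed,
    # like `while @@ROWCOUNT > 0` around `exec Parsed_Load`.
    processed = 1                          # like the priming `select 1`
    while processed > 0:
        processed = load_batch(source, batch_size)

done = []
def load_batch(src, n):
    take, src[:] = src[:n], src[n:]        # consume up to n "rows" from the source
    done.extend(take)
    return len(take)                       # the @@ROWCOUNT analogue

rows = list(range(10))
drain(rows, 3, load_batch)                 # processes batches of 3, 3, 3, 1, then 0 -> stop
```

The key point is the same as in the T-SQL loop: the step must return how many rows it handled, and the loop stops on the first empty batch.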

2.3 Once loaded, archive

  • Table Gedcom_Data_arhiiv (arhiiv means archive)
  • batch-wise delete
  • TRUNCATE TABLE instead of DELETE

Once the rows are loaded, it is good to archive them. I intentionally didn’t use a partitioned table – that would probably have been better than the way I used.

The simplest way to archive the loaded rows (I used this at the beginning):

set xact_abort on
begin tran

insert dbo.gedcom_data_arhiiv select * from dbo.gedcom_data
delete dbo.gedcom_data
commit

Again, when I load exported GEDCOM files one at a time, this solution works. But when I tried to reload all my 200 GEDCOM files, it failed – the transaction log got full.

I made a loop:

declare @batch int
declare @start bigint
select @batch = 1000000 -- this assignment sets @@rowcount to 1, so the loop body runs at least once
while @@rowcount > 0
begin

checkpoint
select @start = coalesce((select top 1 nr from gedcom_data_arhiiv order by nr desc), 0)

insert top (@batch) gedcom_data_arhiiv
select * from gedcom_data where nr > @start

end

After that I need to delete the archived rows. Deleting rows (one by one or batch-wise) isn’t the fastest way, but truncating is risky because I need to keep the identity value and TRUNCATE resets it.

The solution is simply to save the last identity value and restore it after the truncate:

declare @nr bigint
= (select top 1 nr from gedcom_data order by nr desc)

truncate table gedcom_data

dbcc checkident (gedcom_data, reseed, @nr)

3. Transfer data between rows

  • Date transfer to event rows
  • Code transfer between master (object) and detail (rest) rows

The following step makes the rows “independent”. Currently the rows depend on info in other rows. The simplest and most important transfers can be combined to simplify the following queries.

  1. The 1st is transferring date info. Each event row (BIRT, DEAT, MARR, etc. – level 1) is immediately followed by a DATE row (level 2). It is reasonable to copy the DATE info into the matching event row.
  2. The 2nd is transferring codes. Person and family rows (level 0) contain the codes of the person or family. For the following queries it is best to have the person and family codes on every row that belongs to them. All rows with level > 0 belong to the nearest preceding level-0 object row.

The slowest way is to use cursors – I tested it. One GEDCOM file processed with a cursor may run 15 minutes or longer; with 200 GEDCOM files that adds up to an unacceptable amount of time.

3.1 Procedure Date_Update – 2 versions

The initial procedure simply self-joined the table on neighboring row numbers. Still, the growing number of rows degraded performance too much.

CREATE PROCEDURE [dbo].[Date_Update]
as
update d
set datevalue = coalesce(d.datevalue,e.datevalue), datetext = coalesce(d.datetext,e.datetext)
--select *
from dbo.parsed_data d
join dbo.parsed_data e on d.nr+1 = e.nr
where d.class in ('birt', 'deat', 'marr', 'div') and e.class = 'date'

I assumed the sort could be avoided because the table is self-joined on the same [nr], but during testing I didn’t get the best execution plan.

I had to create another version – the new one works so well that batch-wise processing isn’t needed:

CREATE PROCEDURE [dbo].[Date_Update_2]
as
with e as
(
	select nr, datevalue, datetext
	, lead(datevalue) over (order by nr) n_datevalue
	, lead(datetext) over (order by nr) n_datetext
	, lead(class) over (order by nr) n_class
	from parsed_data d
)
update parsed_data

set datevalue = coalesce(d.datevalue,n_datevalue), datetext = coalesce(d.datetext,n_datetext)

from parsed_data d
join e on e.nr = d.nr
where n_class = 'date'
and d.class in ('birt', 'deat', 'marr', 'div')
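The LEAD-based idea can be sketched in Python: walk the rows in [nr] order, look at the next row, and when it is a DATE row, copy its date into the preceding event row (an illustration of the logic, not the article’s code; the function name is made up):

```python
def fill_event_dates(rows):
    # rows: dicts with 'class' and 'datetext', ordered by nr.
    # zip(rows, rows[1:]) pairs each row with its "lead" row.
    for cur, nxt in zip(rows, rows[1:]):
        if cur["class"] in ("birt", "deat", "marr", "div") and nxt["class"] == "date":
            # coalesce: keep an already-filled value
            cur["datetext"] = cur["datetext"] or nxt["datetext"]
    return rows

rows = [
    {"class": "birt", "datetext": None},
    {"class": "date", "datetext": "7 MAR 1955"},
]
fill_event_dates(rows)        # the BIRT row now carries "7 MAR 1955"
```

Unlike the self-join version, a single ordered pass (what LEAD gives the optimizer) touches each row once.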

3.2 Procedure Kood_Flow – flowing codes from master rows to the others

CREATE PROCEDURE [dbo].[Kood_Flow]  
@batch int = 1000000
as
with koodid as -- koodid means codes
(
	select nr, kood
	from dbo.parsed_data where lvl = 0 and kood is not null
)
, vahemik as -- vahemik means range
(
	select
	nr, kood,
	(lead(nr) over (order by nr))-1 nnr
	from koodid
)
update top (@batch) v
set kood = i.kood
from dbo.parsed_data v
join vahemik i on v.nr between i.nr and i.nnr
where v.kood is null

This procedure virtually builds two tables – one consists of the master rows (level 0) that contain the codes to transfer, and the other consists of the detail rows (level > 0) that belong to the master rows and are waiting for codes.

Once the master table is formed (expression ‘koodid’), LEAD finds the row number of the next master row, and every detail row with [nr] between two masters gets its code from the preceding master.

Because only code-less rows are updated, the procedure can be executed batch-wise.

select 1
while @@ROWCOUNT > 0
exec Kood_Flow 1000000
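Kood_Flow is essentially a forward fill. A Python sketch of the same logic (illustrative only – the real work is done set-wise in SQL; the function name is made up):

```python
def flow_codes(rows):
    # rows: dicts with 'lvl' and 'kood', ordered by nr.
    # Every detail row (lvl > 0) inherits the code of the nearest
    # preceding level-0 master row.
    current = None
    for row in rows:
        if row["lvl"] == 0 and row["kood"]:
            current = row["kood"]        # a new master row starts a new range
        elif row["kood"] is None:
            row["kood"] = current        # fill only code-less rows
    return rows
```

In SQL the same effect is achieved set-wise with LEAD, because a sequential "remember the last master" scan is not how UPDATE works.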

Next time …

… I will explain how to easily extract person, family and relations.

This entry was posted in Geni, MS SQL Server, Sugupuud.
