2010-02-25: by Eugene Reimer; notes made while writing related.ll, became too long to be included in the program, so moved to my website under genealogy. About the literature on the subject: ==================================== Note that Relatedness is also known as the Consanguinity-coefficient, or as the Inbreeding-coefficient when applied to mating individuals, or prospective mates. The definition by Sewall Wright that I'm using: Relatedness of two persons is the expected fraction of their DNA that's identical by descent. For example, parent and child are 0.5, grandparent and grandchild 0.25, half-siblings 0.25, full-siblings 0.5. Each A-to-B path through a common-ancestor consisting of N parent-child links contributes 0.5^N; contributions are summed over all such paths. Confusingly, some authors define relatedness as twice that fraction, for example: http://en.wikipedia.org/wiki/Coefficient_of_relationship, although subsequent prose in that article contradicts that doubling. Some employ a peculiar method of counting which yields path-lengths one more than than the way I count, for example: http://biology.semo.edu/agathman/genetics/inbreeding.htm, and that seems to be why they introduce the notion that doubling is needed. Still others deliberately introduce a spurious halving and then a later doubling, apparently just to make a simple calculation sound difficult. Why does this subject bring out the worst in people? [2010-11-13: while explaining "relatedness" to a friend, I believe I discovered how this all-too-frequent miscounting arises: because we tend to show these paths by showing the "nodes" although what we're counting are the "edges", and the number of "edges" (or "links") is one less than the number of "nodes". Anyone not familar with Graph- theory may have trouble with this distinction. For an introduction to this language of "nodes", "edges", etc, I'll refer the reader to http://en.wikipedia.org/wiki/Graph_theory.] Teschler says: consanguinity-coefficient C(A,B) = 2 * F(A,B), where inbreeding-coefficient F(A,B) = sum(0.5^(n(i)+p(i)) * (1+F(J(i))/2), is summed over all distinct paths connecting A and B, the ith path having n(i)+p(i) links connecting A and B through common ancestor J(i) and having no point in common except J(i). Teschler uses without defining F(x) for an individual, although it's usually defined as C(A,B) for the person's parents A,B. For the common-ancestor we'll rarely know enough about that person's ancestry to get a nonzero answer. In those rare situations where we do, the reporting becomes confusing and seems anomalous rather than useful, since it is a great deal more complex for such a minute difference. Furthermore, although that factor for the Parent-Relatedness of a common-ancestor may seem sensible when computing an Inbreeding-metric, it is not clear that it makes any sense for a Relatedness-metric. Upon thinking it over I'm having a hard time seeing any sense at all in the idea that my father's parents being related makes my sister and me be more closely related than we'd otherwise be -- we each still get half our DNA from him, and some mighty peculiar reasoning is required to think otherwise. Ergo, I'm inclined to use the simpler definitions without that factor. (If the user-interface made options convenient...) By the way, that "inbreeding-factor" will possibly seem less absurd when contemplating how my father and I are related, however thinking about such cases may lead to headaches. Teschler contributes to the confusion over doubling by reporting each path's contribution one way, then doubling the sum at the end. Note that his per-path numbers are consistently one-half of what his prose says they ought to be, so one doubling is needed just to make them be what he defines as the inbreeding-coefficient. His prose calls for another doubling, but that he doesn't do. All that leaves his final answer agreeing with the definitions I'm using (although not with his). The good features of Teschler's program include being a great deal faster than most of the alternatives, as well as more complete, and more accurate. One testcase I'll be using is "me and Elmer (Al) Reimer" which, together with output from all the report-programs I'm mentioning, and additional testcases is found in: http://ereimer.net/genealogy. The Lifelines-report genetics for "degree of blood relatedness" by Eggert has a much fuller explanation than Alexander Ottl's genetics2. Its admitted flaws: not handling nontraditional families (more than one husband and/or wife) although it appears to handle half-siblings, and not checking for adoptions. Its asking the user to state whether or not apparent twins are identical-twins can become annoying since one usually has no way of knowing, and asking for all ancestors instead of only relevant ones is a serious flaw. Furthermore treating same year-of-birth as proof of twins seems dubious, however it appears to apply only to dates-of-birth of same-gender siblings that match, are impossibly close but different, or of the "born Abt 1762" sort. A confounding flaw is showing the answer to the identical-twins question as "are twins" or "not twins". After all the questions, it runs very quickly indeed. Its output uses "Expected degree of genetic overlap". The output strikes me as neither fish nor fowl, not enough for those who want details, yet almost certainly too much for anyone who doesn't. The output appears to be wrong due to missing some paths, although the answer 41/1024 is very close to what I'm expecting. For each common-ancestor it shows the distance expressed as "second half-cousin once removed", "second half-cousin twice removed", "fourth half-cousin once removed", and these are correct (although "half" in "half-cousin" better omitted). However it only found 8 of the 10 ways we're related. The 2 missing ones together contribute 1/1024, and that's exactly the error in the final answer. On another testcase, it only found 20 of the 56 ways Duane Friesen and I are related, for 51/8192 = 102/16384 where 157/16384 was expected, and that's not what I'd call close. [Note: when considering various algorithms (below) I wondered whether discarding paths containing more than one CA (common-ancestor) might be what's wrong with genetics; I tried it, getting answers much closer to those of genetics but not the same, so its flaw can't be that simple. In the me-and-Duane example, discarding such paths leaves 18 paths and relatedness 49/8192, whereas genetics finds those 18, plus two of the ones that go through another CA, these involving I100623 Elizabeth Fast and one of her spouses I101335 Bernard Rempel. I100623 Elizabeth Fast is our CA in 8 ways, only 2 having a path not involving another CA, but a 3rd is found by genetics in an exception to the hypothesis. I101335 Bernard Rempel is my ancestor in 4 ways, Duane's in one, making him 4-times a candidate for common-ancestorhood, but 3 of those are ruled out where his son Peter Bernhard Rempel is the relevant CA; the other one is legitimate but also an exception to the hypothesis that genetics misses paths through more than one CA. The half-siblings may be relevant as to why these exceptions occur, but I haven't studied the code and don't know.] The Lifelines-report genetics2 for "degree of blood relatedness" by Alexander Ottl (ottl@informatik.uni-muenchen.de) sounds as though it computes the same Relatedness-coefficient we are. On my testcase it seemed to run forever and I'd pretty much concluded that it never ends when after 3.25-hours it finally did, reporting 21/512, which agrees with both my definitions and with cons.ll. The output also uses "Expected degree of genetic overlap". I like its answers but not the waiting for them. Aha, a blessedly straightforward explanation free of the usual halving and doubling obfuscation: http://taumoda.com/web/PD/library/kin.html About the algorithm =================== Identical-twins should be treated specially, however since genealogical records don't indicate whether twins are identical-twins, we'll just ignore that possibility. Adoptions: there are Gedcom proposals for Adoption tags, but I'm not aware of anything one can rely on. Those whose records show adoptive-parents as parents will see faulty output in those cases. ==[Research needed; see "adopt" in grand.ll, and in GEDCOM specs.] The program Teschler set himself to write is far from easy. It badly needs procedures within procedures, so that a recursive procedure can have local variables accessible by its sub-procedures, but the lifelines-report-language lacks nested procedures, so he makes do with global variables, and uses them in a way that is not easily understood. This complexity comes about for 2 reasons: wanting to print the paths as well as compute the relatedness, and the recursion from needing the same algorithm for the parents of each common-ancestor. The 2nd of those I've decided to drop. Or to report it but in a footnote-like manner after the basic output. Either of those changes will make the problem considerably simpler and the solution more comprehensible. A 2nd testcase where a common-ancestor has nonzero Parent-Relatedness: http://ereimer.net/relation-EugeneReimer-IrisReimer-LLcons.txt shows how my sister and I are related based on GRANDMA data which is complete enough to show how our father's parents are related. Full siblings always have a relatedness of at least 0.5000; with my (Sewall Wright's as I understand them) definitions my sister and I have relatedness of 0.5288; whereas with Teschler's we have 0.5366. The difference seems significant, albeit barely. Data-structures: First I improved Teschler's by pushing dup(Path), rather than marker + individual elements, to aid comprehensibility, and to solve A-to-B path becoming B-to-A; NOTE: the A<->B swap makes comparisons to old output nearly useless!! It works though, and actually got faster. However I'm tempted to re-design from scratch, before implementing the reordering of paths. Computing a set of Half-Paths, from person to common-ancestor, is appealing: simplifies re-ordering, same algorithm for A-to-anc as for B-to-anc (whereas Teschler needs both an Ascending and a Descending algorithm). NOTE: enumerating "all" paths not as easy as it sounds: stopping each half-path at 1st common-ancestor would miss some paths? YUP, its being a COMMON-ancestor means there's a path to OTHER-person, however one of its ancestors may have another path to OTHER-person that doesn't go thru this common-anc, and such path ought to be included!! Teschler solution: continue, but abort the descent-part when hitting a person already in the path; Using half-paths: compute half-paths to all common-ancestors, but prune them while combining by discarding a combination containing same person within both halves. [Stopping at 1st common-ancestor is close but not exactly the flaw in genetics.ll.] Ordering: (A) for closest-first order, sort by: total-length, A-to-anc-len, A-to-anc-string, B-to-anc-string (String: F=father, M=mother; or bitstring: 0=father, 1=mother) (Note: the F/M-strings will ensure comanc spouses are adjacent) (B) to group by A-to-anc paths, sort by: A-to-anc-len, A-to-anc-string, B-to-anc-len, B-to-anc-string; (C) to group by common-ancestor, with ancestors ordered by min-path-len-thru-this-anc, sort by: min-len-this-anc, anc-key, ...(as in A/B) (C') variant of C (no guarantee of closest-first tho almost certain): min-A-to-anc-len, min-B-to-anc-len, ... (D) closest-first, grouped by A-to-anc paths (feasible w Teschler paths): total-length, A-to-B-string; --tried this, using person-keys to approximate A-to-B-string, looks to be doing what I want for common-ancestor spouses; lacks gender-bias: smaller KEY comes first, and that's fine by me; same-length paths grouped not by comanc, but by A's-parent, etc, seems fine; SORT examples: anniver.ll: uses rsort with 2nd list, of dates; by Stephen Dum; cron.ll: uses rsort with 2nd list, of dates; by Stephen Dum; grand.ll: uses sort with 2nd list, to sort by date; marriages.ll: uses sort with 2nd list, to sort by date or place; --background on sort/rsort being added to LL-language, in anniver.ll; --found no evidence that comparing lists works with builtin sort?? --all examples supply 2nd-list of integer or string; ahnenliste, gender_order, names_freq, stats: all by Jim Eggert, have a quicksort with 2000-01-15 fix, where user supplies compare function; (ll2html4, namefreq: by JRE, Chandler, also have quicksort); VERSIONS: consER.ll: the Teschler program with minor fixes and minor changes to the output by me. related-withHalfPaths.ll: has unfinished code to compute half-paths -- see "athlist"; abandoned it and went back to improving the Teschler-like version, upon realizing: that converting an A-to-B path into half-paths is perfectly feasible; the much nicer solution I'd been envisioning was not going to happen because Tables in LL lack the features it needed (most of the programs I've written using "Tables" couldn't be done in the LL-language because it provides no way to step through the entries, nor has it a Table-to-List conversion which is one way to provide that stepping-through capability); related.ll: is cleaned-up, simplified, still has Teschler-like data-structures (full A-to-B paths), but has my defn of Relatedness, has halving eliminated, has re-ordered paths by sorting on: length||keys-of-persons-in-path; has pathlength expressed as Nth-cousin-Mx-removed; has 2-column BK-like display of paths; (for "BK" output converted into Half-Paths => could use in sorting) fixed anomaly when B is the common-ancestor (NOT in withHalfPaths); --may want to show Date-of-Birth for each person? --am "publishing" this version;