For some of the posts on this blog I’ll be using one way to measure the similarity of two sample sets of data. The statistic is called the Jaccard Index, or the Jaccard Similarity Coefficient. This post is a technical explanation of the calculation itself.
The sets of data are the unique ancestral surnames of my DNA matches. The question I’m asking for any two of my matches is: how similar are their lists of direct ancestral surnames?
If two lists of unique surnames are identical, they will have the exact same surnames. They will also have the same number of surnames in their lists, as each surname is only represented once regardless of how many times it appears in the direct tree. They will be 100% similar.
However, I’m also interested in trees that are “nearly” the same. Suppose two siblings create separate trees, and both get as far as all their great great grandparents. Tom’s research leads him to one pair of 3rd greats, and Joe finds a different pair. Neither are aware yet of the other’s research, but both have one extra maiden name each in their trees. Those lists will be very similar, and I’d like to highlight their similarity in some way.
So I need a way of defining the “similarity” of two lists of surnames. The Jaccard Similarity Index compares two sets (or lists) to see which members (surnames) are shared and which are different. It calculates the percentage of similarity from 0 to 100%. The math is pretty simple, and is described here in understandable terms.
In the simplest terms, we count the intersection of the lists i.e. the number of surnames common to both trees. We count the differences for each side, and we count the total number of surnames in all. The Jaccard index expresses this mathematically as:
J(X,Y) = |X∩Y| / |X∪Y| or (|X∩Y| / |X| + |Y| – |X∩Y|
Taking our two brothers, Tom and Joe:
|X∩Y is the number of shared surnames: 8 for the brothers.
|X| is the length of the set, or the number of surnames for Tom’s tree: 9.
|Y| is the length of the set, or the number of surnames for Joe’s tree: also 9.
So our equation is: 8 / (9 + 9 – 8) * 100 = 80% similarity for our brothers.
If brother’s had exactly the same trees, they’d be 100% similar. If the postman’s tree had no overlapping surnames with the brothers, his index compared to both would be 0%.
So the ultimate task is to compare every surname list within my matches with every every other surname list. As the Jaccard index only works on two sets at a time, to calculate the similarity across N sets requires N squared calculations.
This becomes unfeasible for large numbers of sets, and there are other methods that can be brought into play to reduce processing time. I had about 4.4 million pairs of sets to compare, which took a matter of hours to complete.
Note that for my current purposes, I am using unique surnames. If one match has entered father, grandfather and great-grandfather John Smith, his list has Smith represented once. This is to simplify data collection and computation.
Note also that For my current purposes, the direction of surnames is unimportant. Match #1 may have a two-person tree with Mary Smith as the mother of Bob Jones, while Match #2 has Anne Jones as the mother of Bob Smith. That is “Smith->Jones” and “Jones->Smith”. If I include direction, these lists are different. I am treating the lists as a “bag of words”, where direction is not important – so these two lists “Jones, Smith”, and “Smith, Jones” are the same. This is to simplify data collection and computation.
Two caveats must be considered with the Jaccard Index. One is that it can be erroneous for small sample sizes, so I intend to exclude small trees.
The other problem for the index is when there are missing observations in the data sets. It’s safe to say that most of my lists have missing observations, as I’m not drawing from a sample of relatives with perfect trees to four generations. The trees tend to be ragged i.e. people know more about one branch than another.
2 thoughts on “Measuring Genealogical Similarity using the Jaccard Index”
This is interesting – I’m looking forward to seeing what you do with it. How are you planning to handle surname spelling variations?
A great question that I am grappling with, and am undecided. What to do with O Raghallaigh/O’Reilly/Reilly/Riley: a particular Irish ancestral line may see all variants through successive generations.
The easiest sledgehammer is to collapse the variants into one, the first step being to strip the “O'”/”Mc”/”Mac” from surnames (I’m being very Irish-centric here but that is my personal domain challenge). Then going further, using various sources of name variations to collapse names to a single version.
Yet this could lose the richness that allows historical tracing. For example, a set of my matches share 4th/5th generation surnames of a very unusual and distinct variant of a common Irish surname. This allows easy tracking through U.S. records of their line.
Instead, I am thinking of using a similarity measure within surnames themselves, which would ensure that variants are not treated as completely dissimilar but do have an effect of lowering the overall similarity index. The challenge there is that the computation times get higher.