Description
Pelias supports advanced deduplication for records that represent the same "thing" but live in different layers. In particular, we support custom layers that Pelias users can fill with their own data.
Pelias knows or cares very little about what this custom data represents, so for deduplication purposes it effectively treats custom records as part of the venue layer, with priority in deduplication. This works well in most cases, as the custom data will replace any duplicate from the standard open data sources. However, there are other options that might be useful in some cases.
For example, consider the following three records. Assume that other than the columns below, they have effectively the same information and would be considered duplicates:
| Name | Source | Layer |
|---|---|---|
| Some Place | openstreetmap | venue |
| Some Place | custom | custom-layer-1 |
| Some Place | custom | custom-layer-2 |
Right now, we would deduplicate all three records, returning a single result. It would be one of the custom records, but there's no easy way to control which one.
Other possibilities
Since there are three records, there are (at least) three ways we could handle deduplication.
Deduplicate all three into a single record
This is what we do now.
Don't deduplicate anything
Pelias currently has a built-in assumption that custom layers are equivalent to venue. This assumption has usually proven reasonable, but there's no reason it holds across all possible use cases. Another straightforward way to handle this would be to not deduplicate at all, essentially removing this special venue layer equivalence. The end result would then be that all three records are displayed.
Display all custom records, deduplicate away canonical records
This is a more complicated but potentially useful way to handle things. Here, we would not display the OpenStreetMap record, but we'd show both the custom layers.
This is potentially more in the spirit of custom data. Since it doesn't come "out of the box" with Pelias, anyone using Pelias likely added it intentionally and wants to see it.
Of the two new options, this is the one that would be most beneficial.
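To make the three options concrete, here is a minimal Python sketch applied to the example table above. It is an illustration only, not Pelias code: the mode names (`all`, `none`, `keep_custom`), the `Record` type, and the rule that a record counts as custom when its source is `custom` are all assumptions made for this example. It also assumes the input is a group of records already judged to be duplicates of each other.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    name: str
    source: str
    layer: str


def is_custom(record):
    # Assumption for this sketch: custom data is identified by its source,
    # as in the example table.
    return record.source == "custom"


def dedupe(duplicates, mode):
    """Pick which records to display from a group of duplicate records."""
    custom = [r for r in duplicates if is_custom(r)]
    if mode == "all":
        # Current behavior: a single result, with custom records taking
        # priority. Which custom record wins is arbitrary (here: the first).
        return [(custom or duplicates)[0]]
    if mode == "none":
        # No deduplication: every record is displayed.
        return list(duplicates)
    if mode == "keep_custom":
        # Display every custom record and drop the canonical ones.
        # Assumption: with no custom records, fall back to a single result.
        return custom if custom else [duplicates[0]]
    raise ValueError(f"unknown deduplication mode: {mode!r}")


RECORDS = [
    Record("Some Place", "openstreetmap", "venue"),
    Record("Some Place", "custom", "custom-layer-1"),
    Record("Some Place", "custom", "custom-layer-2"),
]

for mode in ("all", "none", "keep_custom"):
    print(mode, [r.layer for r in dedupe(RECORDS, mode)])
```

With the example records, `all` yields one custom record, `none` yields all three, and `keep_custom` yields both custom records without the OpenStreetMap one.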
Pitfalls and considerations
Our deduplication logic is, after many refactors, both fairly powerful and well organized now, so overall these changes shouldn't be too hard to make.
The main issue I see is that deduplicating so that only custom records are displayed would break the transitivity of our duplicate detection. Right now, there are basically two steps to deduplication: find a set of equivalent records, and then choose from them the single record to display. This relies on transitivity: given three records A, B, and C, if A is equivalent to B and B is equivalent to C, then A is also equivalent to C. In the new case, with two custom records and a single regular record, that would no longer hold: each custom record is a duplicate of the regular record, but the two custom records must not be treated as duplicates of each other. There might be places where this presents problems, so we'll have to investigate a bit more.
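The transitivity problem can be checked mechanically. The sketch below is a hedged illustration, not the actual Pelias duplicate detection: records are reduced to `(source, layer)` tuples, and both rule functions are hypothetical stand-ins. The "keep custom" rule says two records may be merged unless both are custom, which is the pairwise behavior the new option would need.

```python
from itertools import permutations


def is_transitive(items, related):
    """True if related(a, b) and related(b, c) always imply related(a, c)."""
    return all(
        related(a, c)
        for a, b, c in permutations(items, 3)
        if related(a, b) and related(b, c)
    )


def is_custom(record):
    source, _layer = record
    return source == "custom"


def current_rule(a, b):
    # Today, all three example records are considered equivalent.
    return True


def keep_custom_rule(a, b):
    # Hypothetical rule: custom records never deduplicate each other,
    # but each still deduplicates the canonical record.
    return not (is_custom(a) and is_custom(b))


records = [
    ("openstreetmap", "venue"),
    ("custom", "custom-layer-1"),
    ("custom", "custom-layer-2"),
]

print(is_transitive(records, current_rule))
print(is_transitive(records, keep_custom_rule))
```

Under the second rule, `custom-layer-1` merges with the OpenStreetMap record and the OpenStreetMap record merges with `custom-layer-2`, yet the two custom records do not merge with each other, so the relation is no longer an equivalence.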
Proposal summary
My proposal is that we add a new API configuration option to control deduplication behavior with custom records. We'd offer all three options listed above, with the current behavior as the default.
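As a rough sketch of how such an option might be resolved, assuming a hypothetical key named `customLayerDeduplication` and hypothetical mode names (neither exists in Pelias today), the current behavior stays the default whenever the option is absent:

```python
# Hypothetical mode names, for illustration only.
VALID_MODES = {"all", "none", "keep_custom"}
DEFAULT_MODE = "all"  # the current behavior: deduplicate into a single record


def resolve_dedupe_mode(api_config):
    """Read the (hypothetical) option from an API config dict."""
    mode = api_config.get("customLayerDeduplication", DEFAULT_MODE)
    if mode not in VALID_MODES:
        raise ValueError(
            f"customLayerDeduplication must be one of {sorted(VALID_MODES)}, "
            f"got {mode!r}"
        )
    return mode


print(resolve_dedupe_mode({}))
print(resolve_dedupe_mode({"customLayerDeduplication": "keep_custom"}))
```

Rejecting unknown values up front, rather than silently falling back to the default, would make a misconfigured deployment fail loudly instead of quietly returning differently deduplicated results.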