Proposal: configurable deduplication behavior for custom layers #1715

@orangejulius

Pelias supports advanced deduplication for records that represent the same "thing" but appear in different layers. In particular, we support custom layers that Pelias users can fill with their own data.

Pelias knows or cares very little about what this custom data represents, and so for deduplication purposes considers them to effectively be part of the venue layer, with priority in deduplication. This works well in most cases, as the custom data will replace any duplicate from the standard open data sources. However, there are other options that might be useful in some cases.

For example, consider the following three records. Assume that other than the columns below, they have effectively the same information and would be considered duplicates:

| Name | Source | Layer |
| --- | --- | --- |
| Some Place | openstreetmap | venue |
| Some Place | custom | custom-layer-1 |
| Some Place | custom | custom-layer-2 |

Right now, we would deduplicate all three records, returning a single result. It would be one of the custom records, but there's no easy way to control which one.

Other possibilities

Since there are three records, there are (at least) three ways we could handle deduplication.

Deduplicate all three into a single record

This is what we do now.

Don't deduplicate anything

Pelias currently has a built-in assumption that custom layers are equivalent to venue. This assumption has usually proven reasonable, but there's no reason it must hold across all possible use cases. Another straightforward way to handle this would be to not deduplicate at all, essentially removing this special venue layer equivalence. The end result would then be that all three records are displayed.

Display all custom records, deduplicate away canonical records

This is a more complicated but potentially useful way to handle things. Here, we would not display the OpenStreetMap record, but we'd show both custom records.

This is potentially more in the spirit of custom data. Since it doesn't come "out of the box" with Pelias, anyone using Pelias likely added it intentionally and wants to see it.

This is the new use case that would be most beneficial.
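To make the three options concrete, here is a minimal sketch applied to the example records above. It is not Pelias code: the `DedupeMode` values, the `PlaceRecord` shape, and the `resolveDuplicates` function are hypothetical names chosen for illustration, and it assumes a custom record can be recognized by `source === "custom"`, as in the example table.

```typescript
// Hypothetical sketch; none of these names come from the Pelias codebase.
type DedupeMode = "all" | "none" | "keep-custom";

interface PlaceRecord {
  name: string;
  source: string;
  layer: string;
}

// Assumption: custom records are identified by their source, as in the example table.
function isCustom(record: PlaceRecord): boolean {
  return record.source === "custom";
}

// Given a group of records already judged to be duplicates of one another,
// return the records that should be displayed under the chosen mode.
function resolveDuplicates(group: PlaceRecord[], mode: DedupeMode): PlaceRecord[] {
  if (mode === "none") {
    // Option 2: no deduplication, every record is displayed.
    return [...group];
  }
  const custom = group.filter(isCustom);
  if (mode === "keep-custom") {
    // Option 3: show every custom record; if there are none, show one canonical record.
    return custom.length > 0 ? custom : group.slice(0, 1);
  }
  // Option 1 (current behavior): a single record, preferring custom data.
  // Which custom record wins is arbitrary here (the first one found).
  return custom.length > 0 ? custom.slice(0, 1) : group.slice(0, 1);
}

const records: PlaceRecord[] = [
  { name: "Some Place", source: "openstreetmap", layer: "venue" },
  { name: "Some Place", source: "custom", layer: "custom-layer-1" },
  { name: "Some Place", source: "custom", layer: "custom-layer-2" },
];

const modes: DedupeMode[] = ["all", "none", "keep-custom"];
for (const mode of modes) {
  const layers = resolveDuplicates(records, mode).map((r) => r.layer);
  console.log(mode, layers);
}
```

The sketch deliberately takes an already-formed group of duplicates as input, so it only covers the "choose what to display" step, not duplicate detection itself.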

Pitfalls and considerations

Our deduplication logic is, after many refactors, both fairly powerful and well organized now, so overall these changes shouldn't be too hard to make.

The main issue I see is that deduplicating so that only custom records are displayed would break the transitivity of our duplicate detection. Right now, there are basically two steps to deduplication: find a set of equivalent records, and then choose from them the single record to display. Given three records A, B, and C, if A is equivalent to B and B is equivalent to C, then A is also equivalent to C. In this new case, with two custom records and a single regular record, that would no longer hold: each custom record is a duplicate of the regular record, but the two custom records are not treated as duplicates of each other. There might be places where this presents problems; we'll have to investigate a bit more.
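The transitivity concern can be shown with a small sketch. This is an illustration only: the `Rec` type and the `collapses` relation are hypothetical, and the real duplicate check is reduced here to a name comparison. Under the "display all custom records" option, two records collapse into one only if they match and are not both custom.

```typescript
// Hypothetical sketch; not the actual Pelias duplicate-detection logic.
interface Rec {
  name: string;
  source: string;
  layer: string;
}

// Assumption: the real content comparison is simplified to a name match.
function sameContent(a: Rec, b: Rec): boolean {
  return a.name === b.name;
}

// Pairwise relation under the "display all custom records" option:
// matching records collapse unless both of them are custom.
function collapses(a: Rec, b: Rec): boolean {
  const bothCustom = a.source === "custom" && b.source === "custom";
  return sameContent(a, b) && !bothCustom;
}

const osm: Rec = { name: "Some Place", source: "openstreetmap", layer: "venue" };
const custom1: Rec = { name: "Some Place", source: "custom", layer: "custom-layer-1" };
const custom2: Rec = { name: "Some Place", source: "custom", layer: "custom-layer-2" };

// custom1 ~ osm and osm ~ custom2 both hold, yet custom1 ~ custom2 does not,
// so the relation is not transitive.
console.log(collapses(custom1, osm));     // true
console.log(collapses(osm, custom2));     // true
console.log(collapses(custom1, custom2)); // false
```

Any code that first builds a set of mutually equivalent records and then picks one winner relies on transitivity, so that is where a relation like this one would need extra care.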

Proposal summary

My proposal is that we add a new API configuration option to control deduplication behavior with custom records. We'd offer all three options listed above, with the current behavior as the default.
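As a rough sketch of what reading such an option could look like, assuming a hypothetical `customLayerDedupe` key (the name, its values, and the `ApiConfig` shape are placeholders, not an existing Pelias setting), the current behavior stays the default when the key is absent:

```typescript
// Hypothetical sketch; the option name and values are placeholders for discussion.
type DedupeMode = "all" | "none" | "keep-custom";

const DEDUPE_MODES: readonly DedupeMode[] = ["all", "none", "keep-custom"];

// "all" mirrors the current behavior: collapse every duplicate into one record.
const DEFAULT_DEDUPE_MODE: DedupeMode = "all";

// Assumption: the option would live alongside other API settings.
interface ApiConfig {
  customLayerDedupe?: string;
}

function dedupeModeFromConfig(api: ApiConfig): DedupeMode {
  const value = api.customLayerDedupe;
  if (value === undefined) {
    // Option not set: keep the existing behavior.
    return DEFAULT_DEDUPE_MODE;
  }
  const match = DEDUPE_MODES.find((mode) => mode === value);
  if (match === undefined) {
    // Fail loudly on typos rather than silently changing behavior.
    throw new Error(`unknown customLayerDedupe value: ${value}`);
  }
  return match;
}

console.log(dedupeModeFromConfig({}));                                   // "all"
console.log(dedupeModeFromConfig({ customLayerDedupe: "keep-custom" })); // "keep-custom"
```

Rejecting unknown values is one possible design choice; falling back to the default with a warning would be another.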
