Hundreds of thousands of digitised photographs stored across Zurich's major institutional archives contain duplicate entries — identical or near-identical image files catalogued under multiple reference numbers — and the technical and legal decisions about how to clean them up are now overdue. The problem is not new, but pressure to resolve it has sharpened this year as the city prepares to migrate several legacy collections onto a unified open-access platform by the first quarter of 2027.
Why does it matter now? The Stadtarchiv Zürich on Alfred-Escher-Strasse and the ETH Zürich image library both participate in the planned consolidation, which will bring their holdings together with collections from the Zentralbibliothek Zürich on Zähringerplatz into a single searchable portal. If duplicate records are carried over uncorrected, archivists warn, the merged database will surface conflicting metadata (different dates, photographers, or usage rights attached to what is in fact the same image), making the new platform unreliable from day one. That would undermine a project that has already drawn institutional funding and years of preparatory work.
What Deduplication Actually Involves — and Where It Gets Complicated
Deduplication is not simply a matter of deleting extra files. Each duplicate record may carry its own rights annotations, donor agreements, or researcher citations. Retiring the wrong record could erase a legitimate usage licence or break an external link cited in a published academic paper. ETH Zürich's e-pics platform alone hosts more than 1.2 million images, and internal assessments have identified duplication rates of between four and nine percent in certain legacy collections imported before 2018, according to documentation circulated at a Swiss digital heritage conference in Bern last November.
The core technical choice facing the institutions is between two approaches. The first is automated hash-matching, which quickly identifies byte-for-byte identical files but misses near-duplicates, such as the same photograph scanned twice at slightly different resolutions or with different crops. The second is perceptual hashing combined with manual review, which catches near-duplicates but requires trained staff time measured in months, not weeks. A pilot run conducted by the Zentralbibliothek in late 2025 found that perceptual hashing flagged roughly 23,000 candidate pairs in one subcollection of 180,000 images, or about one candidate pair for every eight images. If that ratio held across the full merged archive, the review workload could not be absorbed without dedicated temporary staffing or external contract support.
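The difference between the two approaches can be illustrated with a minimal Python sketch. It is not the institutions' actual tooling: the "average hash" below is just one simple perceptual hash, the images are synthetic grids of grayscale values instead of real scans, and the function names and the distance threshold of 5 bits are assumptions chosen for illustration.

```python
import hashlib


def exact_hash(data: bytes) -> str:
    """SHA-256 of the raw bytes; equal only for byte-identical files."""
    return hashlib.sha256(data).hexdigest()


def average_hash(pixels: list[list[int]], size: int = 8) -> int:
    """Simple perceptual hash of a grayscale image (rows of 0-255 values).

    The image is reduced to a size x size grid by block averaging, then one
    bit is set per cell that is brighter than the grid's mean. The image
    must be at least size pixels in each dimension.
    """
    height, width = len(pixels), len(pixels[0])
    cells = []
    for r in range(size):
        r0, r1 = r * height // size, (r + 1) * height // size
        for c in range(size):
            c0, c1 = c * width // size, (c + 1) * width // size
            block = [pixels[y][x] for y in range(r0, r1) for x in range(c0, c1)]
            cells.append(sum(block) / len(block))
    mean = sum(cells) / len(cells)
    bits = 0
    for value in cells:
        bits = (bits << 1) | int(value > mean)
    return bits


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two hashes differ."""
    return bin(a ^ b).count("1")


def is_near_duplicate(img_a, img_b, max_distance: int = 5) -> bool:
    """Flag a candidate pair for manual review (threshold is an assumption)."""
    return hamming_distance(average_hash(img_a), average_hash(img_b)) <= max_distance


def to_bytes(pixels: list[list[int]]) -> bytes:
    """Flatten a pixel grid into raw bytes, standing in for a file on disk."""
    return bytes(v for row in pixels for v in row)


# Synthetic stand-ins: the same gradient "scanned" at 64x64 and at 128x128,
# plus an unrelated image with a differently oriented gradient.
scan_low = [[3 * x + y for x in range(64)] for y in range(64)]
scan_high = [[3 * (x // 2) + (y // 2) for x in range(128)] for y in range(128)]
other = [[3 * y + x for x in range(64)] for y in range(64)]

print(exact_hash(to_bytes(scan_low)) == exact_hash(to_bytes(scan_high)))
print(is_near_duplicate(scan_low, scan_high))
print(is_near_duplicate(scan_low, other))
```

In this sketch the two resolutions of the same image produce different SHA-256 digests but matching perceptual hashes, which is the case exact hash-matching misses. A hash this simple tolerates rescaling far better than cropping, and any flagged pair is only a candidate, which is why the manual review step remains.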
Cost is already a friction point. Zurich's cultural institutions operate under the city's Kulturleitbild framework, which sets multi-year budget envelopes rather than allowing flexible mid-cycle hiring. The Stadtarchiv's current budget period runs through December 2026, meaning any request for supplementary funds to hire deduplication contractors would need cantonal approval — a process that typically takes three to four months under normal circumstances.
The Decisions That Will Define the Timeline
Three choices will determine whether the 2027 launch date holds. First, the institutions must agree on a shared metadata standard before October 2026, so that when one record is designated canonical and duplicates are retired, the rights and citation data migrate cleanly rather than being lost. Discussions between the three institutions are ongoing, with a working group meeting scheduled at the Zentralbibliothek in September.
Second, they must decide whether to phase the public launch, opening the collections least prone to duplicates first while continuing to clean others behind the scenes, or to delay the entire platform until the full corpus is resolved. A phased launch carries reputational risk if users encounter obvious errors early; a full delay pushes the project past municipal election season in spring 2027, when institutional priorities tend to shift.
Third, and most consequential, is the question of public transparency. Under Swiss data protection law and the city's own open-government commitments, researchers who have previously cited archived image reference numbers deserve notification if those numbers are retired. Building that notification system adds development cost but avoids the alternative: a wave of broken citations in academic literature that would damage the credibility of Zurich's archival infrastructure for years to come.
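What retiring a duplicate without losing its rights data or stranding its citations might look like can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the shared metadata standard the working group is negotiating: the `ImageRecord` fields, the function names, and every reference number and citation below are invented.

```python
from dataclasses import dataclass, field


@dataclass
class ImageRecord:
    """Hypothetical catalogue record; field names are illustrative only."""
    reference: str
    rights: set[str] = field(default_factory=set)
    citations: set[str] = field(default_factory=set)


def retire_duplicates(canonical: ImageRecord, duplicates: list[ImageRecord]):
    """Return (merged record, redirects) without mutating the inputs.

    Rights and citations from every duplicate are copied onto a new
    canonical record, and each retired reference number is mapped to the
    canonical one so that old citations can still be resolved.
    """
    merged = ImageRecord(
        canonical.reference, set(canonical.rights), set(canonical.citations)
    )
    redirects: dict[str, str] = {}
    for dup in duplicates:
        merged.rights |= dup.rights
        merged.citations |= dup.citations
        redirects[dup.reference] = canonical.reference
    return merged, redirects


def resolve(reference: str, redirects: dict[str, str]) -> str:
    """Follow redirects until a live reference number is reached."""
    seen = set()
    while reference in redirects:
        if reference in seen:
            raise ValueError(f"redirect cycle at {reference!r}")
        seen.add(reference)
        reference = redirects[reference]
    return reference


def citation_notices(duplicates: list[ImageRecord], redirects: dict[str, str]):
    """List (citing work, retired reference, current reference) triples."""
    return sorted(
        (work, dup.reference, resolve(dup.reference, redirects))
        for dup in duplicates
        for work in dup.citations
    )


# Example with invented reference numbers, rights notes, and citation.
keep = ImageRecord("ZB-0001", rights={"CC BY-SA 4.0"})
drop = ImageRecord(
    "SA-7731", rights={"donor agreement 1998"}, citations={"Muster 2019"}
)
merged, redirects = retire_duplicates(keep, [drop])
notices = citation_notices([drop], redirects)
print(merged)
print(notices)
```

The sketch simply takes the union of rights annotations. Where two records carry conflicting rights, dates, or photographers, a union does not settle which is correct, so those cases would still need a human decision before the duplicate is retired.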
The September working group meeting is the next firm deadline. What comes out of it — a phased timeline, a funding request, or a revised scope — will set the terms for everything that follows before the 2027 go-live date.