Hundreds of thousands of duplicate images are cluttering Zurich's public digital archives, municipal databases, and university repositories, and the cost of storing, managing, and manually reviewing them is rising faster than the institutions responsible can keep pace with. That is the central finding of a review of data management practices across several of the city's largest digital infrastructure holders, compiled ahead of a cantonal digitisation policy update expected later this year.
The issue matters now because Zurich is in the middle of an ambitious push to digitise civic records, urban planning documents, and cultural heritage collections. The city's Stadtarchiv on Neumarkt, ETH Zurich's institutional image repository, and the Zentralbibliothek on Zähringerplatz are all expanding their digital holdings. But without systematic deduplication protocols, the same image — a building permit photograph, a heritage site scan, a public event picture — can exist in dozens of variant copies across separate systems, each consuming server space and staff hours.
The Numbers Behind the Clutter
Storage is not cheap. Enterprise-grade archival storage in Switzerland currently costs roughly CHF 0.08 to CHF 0.15 per gigabyte per month depending on redundancy requirements, according to published rates from Swiss data centre operators. A single high-resolution institutional photograph can run to 50 megabytes in uncompressed form. Multiply that across a collection where duplication rates commonly reach 30 to 40 percent — a figure consistent with benchmarks published by the International Council on Archives — and the wasted expenditure becomes significant at institutional scale.
ETH Zurich's library system alone manages more than 12 million digitised objects, a figure the institution has published in its annual reports. If even a conservative 15 percent of those objects were redundant image files, that would amount to roughly 1.8 million files. At current Swiss storage pricing, eliminating that redundancy would release recurring storage spend for new digitisation work rather than maintenance overhead.
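The arithmetic behind that estimate can be sketched in a few lines. This is a rough illustration only: it uses the figures quoted above, and it assumes, for the sake of the example, that every redundant file is a full-size 50-megabyte image and that a gigabyte is 1,000 megabytes. The function name is hypothetical.

```python
# Back-of-envelope estimate built only from figures quoted in the article.
TOTAL_OBJECTS = 12_000_000              # digitised objects in ETH Zurich's library system
DUPLICATION_RATE = 0.15                 # the "conservative" rate used above
FILE_SIZE_MB = 50                       # upper-end size of one uncompressed photograph (assumed for all files)
PRICE_CHF_PER_GB_MONTH = (0.08, 0.15)   # quoted range for Swiss archival storage


def redundant_storage_cost(total_objects, duplication_rate, file_size_mb, price_per_gb_month):
    """Return (redundant_files, redundant_gb, monthly_cost_chf)."""
    redundant_files = round(total_objects * duplication_rate)
    redundant_gb = redundant_files * file_size_mb / 1000  # decimal gigabytes (assumption)
    monthly_cost = redundant_gb * price_per_gb_month
    return redundant_files, redundant_gb, monthly_cost


for price in PRICE_CHF_PER_GB_MONTH:
    files, gb, monthly = redundant_storage_cost(
        TOTAL_OBJECTS, DUPLICATION_RATE, FILE_SIZE_MB, price
    )
    print(
        f"{files:,} files, {gb / 1000:,.0f} TB, "
        f"CHF {monthly:,.0f}/month, CHF {monthly * 12:,.0f}/year at CHF {price:.2f}/GB"
    )
```

Under those assumptions, the redundant files would occupy about 90 terabytes and cost between roughly CHF 7,200 and CHF 13,500 a month to keep. Real collections mix file sizes and formats, so the true figure could be considerably lower.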
The Zentralbibliothek, which holds the canton's largest collection of historical photographs and maps, began an internal audit of its image metadata in early 2025. The process revealed that multiple scanning runs of the same physical documents, often conducted years apart by different project teams, had produced near-identical files catalogued under different identifiers. Without automated tools for exact hash matching or perceptual hashing to flag visual similarity, staff had to cross-reference records manually, a process estimated to have consumed several hundred staff hours over the audit period.
What Deduplication Actually Requires
The technical fix is well understood. Perceptual hashing algorithms — which generate a fingerprint based on an image's visual content rather than its exact file data — can identify near-duplicate photographs even when they differ in resolution, compression, or slight cropping. Tools built on this approach have been deployed by major European national libraries, including the Bibliothèque nationale de France and the British Library, both of which have published case studies on the efficiency gains.
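To make the idea concrete, here is a minimal sketch of one of the simplest perceptual hashes, the average hash. It is not the method used by any institution named in this article. It works on a plain grid of greyscale values so that it runs without an image library, and the helper names and test images are invented for illustration. A production system would more likely rely on an established library and a more robust hash.

```python
def average_hash(pixels, hash_size=8):
    """Compute a simple average hash of a greyscale image.

    `pixels` is a list of rows of 0-255 intensities, at least
    hash_size pixels in each dimension. The image is shrunk to a
    hash_size x hash_size grid by averaging blocks; each cell then
    becomes one bit: 1 if brighter than the grid's mean, else 0.
    """
    height, width = len(pixels), len(pixels[0])
    cells = []
    for gy in range(hash_size):
        y0, y1 = gy * height // hash_size, (gy + 1) * height // hash_size
        for gx in range(hash_size):
            x0, x1 = gx * width // hash_size, (gx + 1) * width // hash_size
            block = [pixels[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            cells.append(sum(block) / len(block))
    mean = sum(cells) / len(cells)
    bits = 0
    for value in cells:
        bits = (bits << 1) | (value > mean)
    return bits


def hamming_distance(hash_a, hash_b):
    """Count the bits that differ between two hashes."""
    return bin(hash_a ^ hash_b).count("1")


def halve(pixels):
    """Downscale an image to half resolution by averaging 2x2 blocks."""
    return [
        [
            (pixels[y][x] + pixels[y][x + 1] + pixels[y + 1][x] + pixels[y + 1][x + 1]) / 4
            for x in range(0, len(pixels[0]) - 1, 2)
        ]
        for y in range(0, len(pixels) - 1, 2)
    ]


# Synthetic 64x64 test image: a smooth gradient standing in for a scan.
original = [[3 * x + y for x in range(64)] for y in range(64)]
low_res = halve(original)                                        # same picture, half the resolution
negative = [[255 - value for value in row] for row in original]  # a visibly different picture

print(hamming_distance(average_hash(original), average_hash(low_res)))   # 0
print(hamming_distance(average_hash(original), average_hash(negative)))  # large: most bits differ
```

The half-resolution copy produces the same 64-bit fingerprint as the original, while the negative differs in most bits. In practice a file is flagged as a likely duplicate when the distance falls below a chosen threshold, and that threshold has to be tuned per collection.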
For Zurich's institutions, the barrier is less technical than it is procedural and financial. Integrating deduplication workflows into existing content management systems requires upfront investment and staff retraining. The Stadtarchiv's collections management runs on a legacy platform that would require a custom integration layer. Budget cycles for such infrastructure work typically run two to three years in cantonal procurement processes.
The cantonal government is expected to release updated digital infrastructure guidelines before the end of 2026, following a consultation period that closed in the spring. Those guidelines are likely to address minimum metadata standards and interoperability requirements across public institutions — both of which are prerequisites for any city-wide deduplication effort to work.
For institutions that cannot wait for top-down policy, the practical path is to begin with new acquisitions rather than attempting to retrospectively clean entire legacy collections. Establishing a deduplication checkpoint at the point of ingest, before a file enters the main repository, prevents the backlog from growing while longer-term remediation is planned. Several Swiss university libraries have already adopted this approach without fanfare, treating it as standard archival hygiene rather than a special project. Given current staffing levels and the scale of the problem, the Zurich institutions with the largest collections may find that starting small, one collection at a time, is the only realistic way forward.
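An ingest checkpoint of this kind can be very simple in its first version. The sketch below is a hypothetical illustration, not a description of any library's actual system: it keeps an in-memory index of SHA-256 digests and reports when an incoming file is byte-for-byte identical to one already accepted. The class name and the identifiers are invented for the example.

```python
import hashlib


class IngestCheckpoint:
    """Hypothetical gate that flags byte-identical files before they enter a repository."""

    def __init__(self):
        # Maps SHA-256 hex digest -> identifier of the first file accepted with that content.
        self._seen = {}

    def check(self, identifier, data):
        """Return the identifier of an existing exact duplicate, or None if the file is new.

        New files are recorded in the index as a side effect.
        """
        digest = hashlib.sha256(data).hexdigest()
        existing = self._seen.get(digest)
        if existing is None:
            self._seen[digest] = identifier
        return existing


checkpoint = IngestCheckpoint()
print(checkpoint.check("scan-2019-0042", b"example image bytes"))    # None
print(checkpoint.check("scan-2024-1187", b"example image bytes"))    # scan-2019-0042
print(checkpoint.check("scan-2024-1188", b"different image bytes"))  # None
```

An exact hash only catches identical files. Repeat scans of the same physical document differ at the byte level, so a real checkpoint would pair this step with a perceptual comparison and would store its index in a database rather than in memory.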