Thousands of duplicate images are sitting inside Zürich's public and institutional digital systems, consuming server capacity, slowing retrieval times, and complicating the work of archivists, researchers, and city planners who depend on clean data. The problem is not new, but pressure to address it has intensified in 2026 as storage costs climb and several major digitisation projects approach completion deadlines.
The timing matters. The city of Zürich's Stadtarchiv on Neumarkt has been expanding its digital holdings as part of a multi-year migration programme, while ETH Zürich's library has simultaneously pushed forward with the digitisation of historical scientific collections. Both institutions are now grappling with a common byproduct of bulk scanning and automated ingestion: thousands of near-identical image files that entered their systems because no deduplication check was applied at the point of upload.
What the Institutions Are Dealing With
The Stadtarchiv's migration programme, which covers physical records going back several decades, relies on third-party scanning contractors who deliver batches of files in standardised formats. Experience at comparable European city archives, including those in Vienna and Hamburg, suggests that bulk deliveries can carry duplicate rates of between eight and fifteen percent when metadata tagging is inconsistent. No official figure has been released for Zürich's current holdings, but archivists working in the field describe the problem as systemic rather than exceptional.
At ETH Zürich, the research library's digital collections include scientific photography, microscopy outputs, and historical expedition imagery. These files pass through multiple research groups before landing in the central repository, and the same base image frequently reappears with different file names, compression levels, or crop dimensions — each version technically distinct but functionally redundant. The library has been testing hash-based deduplication software since early 2025 as part of its open-access infrastructure upgrade, according to publicly available project documentation on the ETH library's website.
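The limits of exact hashing for this kind of collection can be illustrated with a minimal sketch. The project documentation is not quoted here, so the choice of SHA-256 and the function name `content_fingerprint` are assumptions made for illustration, not a description of the library's actual software. The placeholder byte strings stand in for real image files.

```python
import hashlib


def content_fingerprint(data: bytes) -> str:
    """Return the SHA-256 hex digest of a file's raw bytes (assumed algorithm)."""
    return hashlib.sha256(data).hexdigest()


# Stand-in byte strings; a real run would read each file with open(path, "rb").
master_scan = b"placeholder bytes for one scanned image"
renamed_copy = master_scan                   # same bytes stored under another name
recompressed_copy = master_scan + b"\x00"    # any byte-level change at all

# A renamed copy has identical bytes, so its fingerprint matches.
print(content_fingerprint(master_scan) == content_fingerprint(renamed_copy))       # True
# A re-saved or re-compressed copy differs at the byte level, so it does not.
print(content_fingerprint(master_scan) == content_fingerprint(recompressed_copy))  # False
```

An exact hash therefore finds renamed files cheaply but treats every re-compressed or cropped version as a new image.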
The Zentralbibliothek Zürich on Zähringerplatz, which serves as the cantonal, city and university library, faces a related challenge. Its partnership with the Europeana cultural heritage platform, which pools digitised content from institutions across Europe, means that images submitted by Zürich institutions sometimes return to domestic databases as apparent duplicates after passing through the aggregation pipeline. It is a circular problem that manual review alone cannot solve efficiently.
What Experts Are Recommending
Specialists in digital preservation point to two complementary fixes. The first is perceptual hashing, an algorithmic approach that fingerprints what an image looks like rather than its exact bytes, so that visually similar files are matched even when file names, compression levels, or metadata differ. This catches the rename-and-recompress duplicates, and many lightly cropped ones, that exact-match tools miss. The second is upstream governance: stricter ingestion standards that flag duplicates before they enter the archive rather than requiring costly retrospective cleaning.
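To make the idea concrete, here is a minimal sketch of one of the simplest perceptual hashes, the average hash. It is an illustration only: the function names, the 8x8 grid, and the threshold of 5 differing bits are assumptions, images are represented as plain lists of grayscale values rather than real files, and production tools generally rely on more robust variants.

```python
from typing import List

Image = List[List[float]]  # rows of grayscale pixel values, 0-255


def shrink(img: Image, size: int = 8) -> Image:
    """Downsample to size x size by averaging blocks (needs width, height >= size)."""
    h, w = len(img), len(img[0])
    out = []
    for r in range(size):
        r0, r1 = r * h // size, (r + 1) * h // size
        row = []
        for c in range(size):
            c0, c1 = c * w // size, (c + 1) * w // size
            block = [img[y][x] for y in range(r0, r1) for x in range(c0, c1)]
            row.append(sum(block) / len(block))
        out.append(row)
    return out


def average_hash(img: Image, size: int = 8) -> int:
    """One bit per cell of the shrunken image: 1 if brighter than the mean."""
    flat = [v for row in shrink(img, size) for v in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for v in flat:
        bits = (bits << 1) | (1 if v > mean else 0)
    return bits


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two hashes differ."""
    return bin(a ^ b).count("1")


def is_near_duplicate(a: Image, b: Image, threshold: int = 5) -> bool:
    """Treat two images as duplicates if their hashes differ in few bits."""
    return hamming(average_hash(a), average_hash(b)) <= threshold


# Synthetic test images: a left-to-right gradient and three variants.
original = [[x * 8 for x in range(32)] for _ in range(32)]
recompressed = [[min(255, max(0, x * 8 + (7 * x + 3 * y) % 5 - 2))
                 for x in range(32)] for y in range(32)]   # small pixel noise
resized = [[x * 4 for x in range(64)] for _ in range(64)]  # same picture, 64x64
unrelated = [[y * 8 for _ in range(32)] for y in range(32)]  # top-to-bottom gradient

print(is_near_duplicate(original, recompressed))  # True
print(is_near_duplicate(original, resized))       # True
print(is_near_duplicate(original, unrelated))     # False
```

Because the hash is built from coarse brightness patterns, small pixel-level changes and resizing leave it intact, while a heavy crop can shift enough cells to defeat it.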
The Swiss Federal Archives in Bern adopted a revised ingestion protocol in January 2026 that includes mandatory deduplication checks at transfer, a standard that cantonal and municipal institutions are being encouraged — though not yet required — to match. For Zürich, where the Stadtarchiv operates under city rather than federal authority, adoption of any such standard would require a formal directive from the Stadtrat or a procedural update within the relevant departmental guidelines.
Storage is not cheap. Enterprise-grade archival storage in Switzerland currently runs at roughly CHF 0.03 to CHF 0.06 per gigabyte per month on managed platforms, according to published pricing from Swiss hosting providers. For a mid-sized archive holding several hundred terabytes of image data, duplicate rates in even the lower range of industry estimates can translate to avoidable annual costs in the tens of thousands of francs.
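The arithmetic behind that estimate can be checked with a short calculation. The archive size of 500 terabytes is a hypothetical figure chosen to represent "several hundred", the eight percent rate is the bottom of the range cited for comparable archives, and the function name is illustrative.

```python
def annual_duplicate_cost_chf(holdings_tb: float,
                              duplicate_rate: float,
                              chf_per_gb_month: float) -> float:
    """Yearly cost of storing duplicates, using decimal units (1 TB = 1000 GB)."""
    duplicate_gb = holdings_tb * 1000 * duplicate_rate
    return duplicate_gb * chf_per_gb_month * 12


# Hypothetical 500 TB archive with 8% duplicates, at both ends of the price range.
for price in (0.03, 0.06):
    cost = annual_duplicate_cost_chf(500, 0.08, price)
    print(f"CHF {price:.2f}/GB/month -> CHF {cost:,.0f} per year")
# CHF 0.03/GB/month -> CHF 14,400 per year
# CHF 0.06/GB/month -> CHF 28,800 per year
```

A smaller archive or a cheaper storage tier would bring the figure below CHF 10,000, while a fifteen percent duplicate rate would nearly double it.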
The practical path forward involves three steps that archivists and software specialists broadly agree on: an audit of existing holdings using perceptual hashing tools, a revised contractor brief that mandates pre-delivery deduplication, and a metadata governance framework that assigns canonical identifiers to master image files. Institutions that have completed similar exercises, among them the Bibliothèque nationale de France, which finished a major deduplication audit of its Gallica platform in 2024, report significant reductions in retrieval errors as well as storage savings. For Zürich's digital archivists, the question is no longer whether to act, but how quickly the administrative machinery can move.
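The third step, assigning canonical identifiers to master files, might look something like the sketch below. Everything in it is hypothetical: the `ImageRecord` fields, the `IMG-000001` identifier format, and the rule that the largest file in each duplicate group becomes the master are illustrative assumptions, and the fingerprints are taken as already computed by whatever hashing tool an audit uses.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class ImageRecord:
    path: str
    fingerprint: str  # precomputed hash; duplicates share the same value
    size_bytes: int


def assign_canonical_ids(records: List[ImageRecord]) -> Dict[str, Tuple[str, bool]]:
    """Map each path to (canonical_id, is_master) for its duplicate group."""
    groups: Dict[str, List[ImageRecord]] = defaultdict(list)
    for rec in records:
        groups[rec.fingerprint].append(rec)

    mapping: Dict[str, Tuple[str, bool]] = {}
    # Sort by fingerprint so identifiers are stable from run to run.
    for n, fingerprint in enumerate(sorted(groups), start=1):
        members = groups[fingerprint]
        master = max(members, key=lambda r: r.size_bytes)  # assumed rule: largest file
        canonical_id = f"IMG-{n:06d}"
        for rec in members:
            mapping[rec.path] = (canonical_id, rec is master)
    return mapping


records = [
    ImageRecord("batch1/scan_0001.tif", "aaaa", 52_000_000),
    ImageRecord("batch7/copy_of_scan.jpg", "aaaa", 4_000_000),
    ImageRecord("batch2/scan_0002.tif", "bbbb", 48_000_000),
]
for path, (canonical_id, is_master) in assign_canonical_ids(records).items():
    print(path, canonical_id, "master" if is_master else "duplicate")
```

Keeping the duplicates linked to a master record, rather than deleting them outright, lets an archive defer removal decisions until the mapping has been reviewed.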