Zurich's Digital Archives Are Drowning in Duplicate Images — and the Numbers Are Staggering
New data from municipal digitisation projects reveals the hidden cost of redundant image files piling up inside public institutions across the city.

Zurich's publicly funded cultural institutions are carrying a growing dead weight inside their digital storage systems: tens of thousands of duplicate image files that consume server capacity, distort collection counts, and quietly inflate operational costs. The problem has a name — duplicate image proliferation — and, according to internal assessments circulating among archivists in the city's library and museum sector, it has reached a scale that demands systematic attention.
The issue has sharpened in urgency because several of Zurich's largest institutions are midway through major digitisation programmes. Zentralbibliothek Zürich, which holds one of the most extensive manuscript collections in the German-speaking world, completed a milestone in 2025 when its digital portal crossed two million indexed objects. Staff working on quality control found that a measurable share of those records contained at least one duplicate image attachment, in some cases the same high-resolution scan uploaded three or four times under slightly different file names or metadata tags. Internal estimates, shared in sector briefings but not yet published, put the redundancy rate in the mid-single-digit percentages across the broader collection. On a two-million-item base, a rate of five percent would mean about 100,000 unnecessary files.
Storage is not free. Enterprise-grade archival storage for cultural institutions typically runs between CHF 0.03 and CHF 0.08 per gigabyte per month on managed Swiss-hosted infrastructure, figures consistent with pricing published by Swiss cloud providers operating under the Federal Act on Data Protection. A single uncompressed TIFF scan of an A3 document can exceed 200 megabytes. Multiply that by 100,000 duplicate files and the result is about 20 terabytes of redundant data, with a carrying cost of roughly CHF 600 to CHF 1,600 a month, or up to about CHF 19,000 a year, before staff time spent on remediation is counted.
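For readers who want to check the storage arithmetic, here is a minimal sketch in Python. It uses only the figures quoted above (100,000 files, 200 megabytes each, CHF 0.03 to CHF 0.08 per gigabyte per month). The function name and the convention of 1,000 megabytes per gigabyte are assumptions made for illustration, not anything drawn from an institution's own accounting.

```python
def monthly_storage_cost_chf(file_count, size_mb, chf_per_gb_month):
    """Monthly cost of keeping file_count files of size_mb megabytes each.

    Assumes 1 GB = 1000 MB (decimal units), an illustrative convention.
    """
    total_gb = file_count * size_mb / 1000
    return total_gb * chf_per_gb_month


# Figures quoted in the article: 100,000 duplicates at 200 MB each.
low = monthly_storage_cost_chf(100_000, 200, 0.03)
high = monthly_storage_cost_chf(100_000, 200, 0.08)

print(f"monthly: CHF {low:,.0f} to CHF {high:,.0f}")
print(f"annual:  CHF {low * 12:,.0f} to CHF {high * 12:,.0f}")
```

Actual bills would depend on compression, storage tier and contract terms, none of which are modelled here.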
Museum Rietberg, which digitised significant portions of its ethnographic photographic archive ahead of a 2024 gallery renovation in the Rieter Park complex near Gablerstrasse, encountered a related challenge: photographers and digitisation contractors had delivered overlapping image batches with inconsistent naming conventions, seeding duplicates that took curators months to untangle. The museum does not publish its storage expenditure separately, but the broader Canton of Zurich cultural budget allocated roughly CHF 12 million to digital infrastructure across its institutions in the 2025 fiscal year, a figure drawn from cantonal budget documents. Even a two percent efficiency loss from managing duplicate data would amount to about CHF 240,000 on that base.
ETH Zurich's scientific computing researchers have published peer-reviewed work on perceptual hashing algorithms — techniques that can identify visually identical or near-identical images even when file names and metadata differ. The core technology is mature. What has lagged is its adoption inside municipal cultural institutions, which tend to run smaller IT teams and operate on procurement cycles that favour stability over experimentation.
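To give a sense of how perceptual hashing works, the sketch below implements one of the simplest variants, often called an average hash. It is an illustration only and not the method described in the ETH research: the function names are hypothetical, and it assumes each image has already been reduced to a small grid of grayscale values, a step real tools perform with an imaging library.

```python
def average_hash(pixels):
    """Hash a small grayscale grid: one bit per pixel, set to 1 when the
    pixel is brighter than the grid's mean brightness."""
    flat = [value for row in pixels for value in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for value in flat:
        bits = (bits << 1) | (1 if value > mean else 0)
    return bits


def hamming_distance(hash_a, hash_b):
    """Number of bit positions in which two hashes differ."""
    return bin(hash_a ^ hash_b).count("1")


# A toy 4x4 "scan": dark left half, bright right half.
original = [
    [10, 20, 200, 210],
    [15, 25, 205, 215],
    [10, 20, 200, 210],
    [15, 25, 205, 215],
]
# The same picture saved slightly brighter (every pixel +10).
brighter_copy = [[value + 10 for value in row] for row in original]
# A different picture: dark top half, bright bottom half.
different = [
    [10, 20, 15, 25],
    [10, 20, 15, 25],
    [200, 210, 205, 215],
    [200, 210, 205, 215],
]

print(hamming_distance(average_hash(original), average_hash(brighter_copy)))
print(hamming_distance(average_hash(original), average_hash(different)))
```

A small distance between two hashes marks the images as likely duplicates even though their bytes, file names and metadata differ. A large distance suggests different pictures.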
The practical pathway involves three steps that archivists at Stadtarchiv Zürich, located on Neumarkt in the Niederdorf district, have begun piloting on a subset of their photographic holdings. First, a batch scan using open-source perceptual hash tools flags candidate duplicates. Second, human reviewers confirm genuine redundancies versus legitimately similar images. Third, a canonical version is retained and secondary copies are either deleted or demoted to a low-priority cold-storage tier.
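The three steps can be sketched in a few lines of Python. This is a hedged illustration, not the Stadtarchiv's actual tooling: the file names, hash values, distance threshold and the `confirm` callback standing in for a human reviewer are all hypothetical, and the hashes are assumed to have been computed beforehand by a perceptual hash tool.

```python
from itertools import combinations


def flag_candidates(hashes, max_distance=2):
    """Step 1: list pairs of files whose hashes differ in at most
    max_distance bits. The threshold is an illustrative assumption."""
    pairs = []
    for (name_a, hash_a), (name_b, hash_b) in combinations(sorted(hashes.items()), 2):
        if bin(hash_a ^ hash_b).count("1") <= max_distance:
            pairs.append((name_a, name_b))
    return pairs


def resolve(pairs, confirm):
    """Steps 2 and 3: for each pair a reviewer confirms, keep the first
    file as canonical and mark the second for deletion or cold storage."""
    demoted = set()
    for keep, drop in pairs:
        if keep in demoted:
            continue  # never treat an already-demoted file as canonical
        if confirm(keep, drop):
            demoted.add(drop)
    return demoted


# Hypothetical precomputed 16-bit hashes for four scans.
hashes = {
    "scan_001.tif": 0b1111000011110000,
    "scan_001_copy.tif": 0b1111000011110000,
    "scan_001_v2.tif": 0b1111000011110001,
    "scan_417.tif": 0b0000111100001111,
}

candidates = flag_candidates(hashes)


def reviewer_decision(keep, drop):
    """Stand-in for human review: the reviewer judges scan_001_v2.tif
    to be a legitimately different image and declines to demote it."""
    return drop != "scan_001_v2.tif"


to_cold_storage = resolve(candidates, reviewer_decision)
print(candidates)
print(to_cold_storage)
```

Keeping the human decision as a separate callback mirrors the pilot's second step: the software only proposes candidates, and nothing is demoted without a reviewer's confirmation.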
The pilot, covering approximately 40,000 image files in the municipal photograph collection, is expected to conclude by the end of the third quarter of 2026. Early internal results, described in general terms at a May conference on digital preservation held at the Literaturhaus Zürich on Limmatquai, suggested a redundancy rate of around eight percent in the test batch — higher than most administrators had anticipated going in.
For institutions weighing whether to act now or wait for a sector-wide standard, the arithmetic is fairly blunt. Every month of delay is another month of storage costs for data that, by definition, adds nothing to the collection. Zurich has set itself an ambitious goal of making its entire publicly held cultural heritage digitally accessible by 2030. Getting the data clean is not an obstacle to that target — it is a precondition.
Published by The Daily Zurich