Zurich's public institutions hold millions of digital image files — and a growing share of them are exact or near-exact copies of each other. New internal assessments from the city's archival and research sector, reviewed this spring, suggest that duplicate image data accounts for between 20 and 35 percent of total stored assets in mid-sized institutional repositories, inflating storage costs and complicating the work of archivists, researchers and developers who depend on clean data.
The issue has moved up the priority list in 2026 for two reasons. First, cloud storage costs have risen sharply across Europe since 2024, with enterprise-tier object storage prices in Swiss data centres running at roughly CHF 0.022 per gigabyte per month — a rate that turns a bloated image library into a serious line item. Second, institutions across Zurich are in the middle of large-scale digitisation drives, feeding new material into repositories that already carry years of unaudited legacy uploads.
Where the Problem Lives in Zurich
ETH Zurich's Research Collection, based on Rämistrasse in the Hochschulquartier, manages tens of thousands of research-related image assets. The platform, which supports open-access deposit by researchers across dozens of faculties, has publicly acknowledged the challenge of deduplication as part of its ongoing infrastructure roadmap. The Stadt Zürich's own digital archive, housed under the Stadtarchiv at Neumarkt 4 in the Altstadt, faces a related but distinct version of the problem: historical photograph collections digitised at different resolutions and by different contractors over multiple project cycles have produced layered duplication across formats.
At the Zentralbibliothek Zürich on Zähringerplatz, librarians working on the e-manuscripta platform — which hosts digitised manuscripts and rare printed materials shared across Swiss institutions — have noted that cross-institutional uploads create a specific duplication vector. When two partner libraries independently digitise the same item, two image sets enter the shared pool. The platform spans more than 20 contributing Swiss institutions, which multiplies the risk.
Private-sector pressure adds another dimension. Zurich's pharmaceutical corridor — anchored by major research campuses in the greater metropolitan area — generates vast quantities of scientific imagery, from microscopy outputs to clinical trial documentation. Regulatory requirements under Swiss data governance rules mean these image sets must be retained in full, even when duplicates exist, unless a verified deduplication audit trail can be established. That compliance burden has pushed several life-sciences firms to invest in automated image-fingerprinting tools in the past 18 months.
The Numbers Underneath the Problem
The deduplication problem is measurable. A 2024 study published in the Journal of Digital Preservation found that cultural heritage image repositories across Europe carried an average raw duplication rate of 28 percent when counting perceptually similar images — not just byte-for-byte identical files. For institutions using JPEG compression at varying quality settings, that figure climbed to 41 percent when near-duplicate detection algorithms were applied. Correcting this in a repository of 500,000 images can reduce active storage requirements by 80 to 140 terabytes, depending on average file size.
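The gap between byte-for-byte and perceptually similar duplicates can be illustrated with a minimal sketch. It assumes file contents are already in memory; the function name and sample file names are hypothetical and not drawn from any institution's tooling. A cryptographic digest such as SHA-256 groups only files whose bytes match exactly, so a recompressed copy of the same picture is not caught.

```python
import hashlib
from collections import defaultdict
from typing import Dict, List


def group_exact_duplicates(files: Dict[str, bytes]) -> List[List[str]]:
    """Group file names whose contents are byte-for-byte identical."""
    by_digest = defaultdict(list)
    for name, content in files.items():
        by_digest[hashlib.sha256(content).hexdigest()].append(name)
    # Keep only digests shared by more than one file.
    return [sorted(names) for names in by_digest.values() if len(names) > 1]


# Hypothetical sample: two identical scans and one recompressed variant.
sample = {
    "scan_a.tif": b"pixels-v1",
    "scan_a_copy.tif": b"pixels-v1",
    "scan_a_q80.jpg": b"pixels-v1-recompressed",
}
print(group_exact_duplicates(sample))
```

The recompressed file ends up in no group, which is one reason near-duplicate detection reports higher rates than exact matching alone.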
The cost case is not trivial. At the Swiss data-centre rate of roughly CHF 0.022 per gigabyte per month, a 100-terabyte reduction translates to roughly CHF 2,200 per month, or about CHF 26,400 per year, in direct storage savings. That is modest on its own, but significant when multiplied across dozens of repositories and compounded over multi-year contracts. More consequential are the downstream costs: duplicate images degrade the performance of AI-assisted search and cataloguing tools, which several Zurich institutions have been piloting since 2023. A search index built on a dataset with 30 percent duplication will surface redundant results, reducing the utility of the tool and eroding researcher trust in digital collections.
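The storage arithmetic can be checked with a short calculation. This is a sketch under two assumptions: decimal units (1 terabyte = 1,000 gigabytes) and the CHF 0.022 per gigabyte per month rate quoted earlier in the article. The function name is illustrative only.

```python
# Assumption: the article's quoted rate for Swiss enterprise object storage.
CHF_PER_GB_MONTH = 0.022


def annual_storage_cost_chf(terabytes: float, rate: float = CHF_PER_GB_MONTH) -> float:
    """Yearly cost in CHF of keeping `terabytes` of data in object storage."""
    gigabytes = terabytes * 1_000  # decimal terabytes, an assumption
    return gigabytes * rate * 12


saving = annual_storage_cost_chf(100)
print(f"CHF {saving:,.0f} per year")  # CHF 26,400 per year
```

Real contracts often add retrieval, egress and redundancy charges, so the figure is best read as a lower bound on what a 100-terabyte reduction saves.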
Institutions that have run structured deduplication projects report that the process is not quick. A repository of 200,000 images typically requires three to six months of combined automated scanning and human review to resolve ambiguous near-duplicates correctly — a timeline that demands dedicated staff hours and, often, temporary project funding.
The practical path forward for Zurich's institutions involves three steps that archivists and data managers consistently recommend: establishing a deduplication policy before new digitisation projects begin rather than after; building a perceptual hashing method, such as the open-source pHash algorithm, into ingest workflows; and scheduling annual audits of legacy collections. For institutions using shared platforms like e-manuscripta, coordinating deduplication governance across partner organisations is essential. That coordination work is already on the agenda for the platform's next steering committee meeting, scheduled for later in 2026.
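How a perceptual hash flags near-duplicates at ingest can be sketched in a few lines. The example below uses a simple average hash, not pHash's DCT-based method, and it assumes each image has already been reduced to a small grayscale grid by an imaging library. All names, the sample grids and the distance threshold are hypothetical.

```python
from typing import List

Grid = List[List[int]]  # grayscale pixel values, 0-255, already downscaled


def average_hash(pixels: Grid) -> int:
    """One bit per pixel: 1 if the pixel is brighter than the grid's mean."""
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for p in flat:
        bits = (bits << 1) | (1 if p > mean else 0)
    return bits


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two hashes differ."""
    return bin(a ^ b).count("1")


def is_near_duplicate(a: Grid, b: Grid, max_distance: int = 2) -> bool:
    # max_distance is an illustrative threshold, not a recommended value.
    return hamming_distance(average_hash(a), average_hash(b)) <= max_distance


original = [
    [10, 200, 30, 220],
    [15, 210, 25, 230],
    [12, 205, 35, 225],
    [11, 198, 28, 215],
]
# Simulate a slightly brighter re-scan, and an unrelated (inverted) image.
rescanned = [[min(p + 3, 255) for p in row] for row in original]
unrelated = [[255 - p for p in row] for row in original]

print(is_near_duplicate(original, rescanned))  # True
print(is_near_duplicate(original, unrelated))  # False
```

In an ingest workflow, pairs that fall under the threshold would typically go to a human reviewer rather than being deleted automatically, in line with the combined scanning and review process described above.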