Zurich's public institutions collectively manage tens of millions of digital image files, and a growing share of that storage is being eaten up by exact or near-exact duplicates — a problem that, according to digital asset management specialists, now costs Swiss cultural institutions an estimated CHF 2–4 million annually in unnecessary infrastructure spending. The figure reflects wasted server capacity, redundant backup cycles, and the staff hours required to manually review bloated image catalogues.
The issue has sharpened this summer because several major digitisation contracts are coming up for renewal across the city. The Stadtarchiv Zürich on Neumarkt is mid-way through a five-year plan to digitise its pre-1950 photographic holdings. ETH Zürich's library, headquartered on Rämistrasse, is expanding its e-rara.ch platform. Both institutions face the same underlying problem: when images are ingested from multiple sources — scanning bureaus, donor collections, legacy hard drives — duplicates multiply faster than cataloguing teams can catch them.
The Scale of the Problem in Swiss Institutional Archives
Duplicate image rates in large-scale digitisation projects vary significantly depending on workflow. Industry benchmarks from the European Commission's Europeana aggregation network, which includes contributions from Swiss partners, suggest that between 8 and 22 percent of images in newly ingested batches can be flagged as duplicates or near-duplicates when automated hash-matching tools are applied. Even at the lower end, roughly one file in twelve is occupying storage it does not need.
For ETH Zürich's image collections alone, which span over 1.2 million digitised objects as of the institution's last published annual report, an 8 percent duplication rate would represent roughly 96,000 redundant files. Cold storage on enterprise-grade servers in Switzerland runs at approximately CHF 0.03–0.05 per gigabyte per month, depending on the provider and redundancy tier. A high-resolution archival scan can weigh anywhere from 80 to 400 megabytes. The arithmetic compounds quickly.
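To make that arithmetic concrete, the short Python sketch below plugs in the figures quoted above. It assumes decimal units (1 GB = 1,000 MB) and a single stored copy of each file; the function name and constants are illustrative only and do not describe any institution's actual billing.

```python
# Figures quoted in the article. Decimal units and a single stored copy
# per file are assumptions made for this illustration.
OBJECTS = 1_200_000                    # digitised objects
DUPLICATION_RATE = 0.08                # low end of the 8-22 percent range
SCAN_SIZE_MB = (80, 400)               # smallest and largest archival scan
PRICE_CHF_PER_GB_MONTH = (0.03, 0.05)  # cold-storage price range


def redundant_storage_cost(objects, rate, size_mb, price_per_gb_month):
    """Return (redundant files, gigabytes occupied, CHF per month)."""
    redundant_files = round(objects * rate)
    gigabytes = redundant_files * size_mb / 1000
    monthly_chf = gigabytes * price_per_gb_month
    return redundant_files, gigabytes, monthly_chf


# Cheapest case: small scans at the low price. Costliest: large scans, high price.
low = redundant_storage_cost(
    OBJECTS, DUPLICATION_RATE, SCAN_SIZE_MB[0], PRICE_CHF_PER_GB_MONTH[0])
high = redundant_storage_cost(
    OBJECTS, DUPLICATION_RATE, SCAN_SIZE_MB[1], PRICE_CHF_PER_GB_MONTH[1])

for label, (files, gb, chf) in (("low", low), ("high", high)):
    print(f"{label}: {files:,} files, {gb / 1000:.2f} TB, CHF {chf:,.0f} per month")
```

On these inputs the 96,000 redundant files occupy roughly 7.7 to 38.4 terabytes, or about CHF 230 to CHF 1,920 per month for one stored copy. Redundancy tiers, backup cycles and staff time, all mentioned earlier as cost drivers, would add to that and are not modelled here.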
The Zentralbibliothek Zürich on Zähringerplatz has publicly documented its digitisation pipeline in procurement notices published on simap.ch, the federal tender platform. Those documents note that image quality verification — including duplicate checks — is a contractual deliverable. But verification and deletion are not the same thing. Files flagged as duplicates often sit in quarantine folders for months pending curatorial sign-off, continuing to consume storage in the interim.
Why Automation Has Not Yet Solved It
Hash-matching identifies bit-for-bit identical files reliably. The harder problem is perceptual duplication — two scans of the same photograph made at different resolutions, or slightly different crops of the same archival print. Perceptual hashing algorithms, which compare image structure rather than raw bytes, have improved substantially since 2020, but they generate false positives that require human review. In a collection spanning millions of objects, even a 0.5 percent false-positive rate means thousands of items that curators must manually examine before anything is deleted permanently.
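The difference between the two approaches can be sketched in a few lines of Python. This is a minimal illustration and not the tooling any of the institutions named here uses: the exact match relies on SHA-256 from the standard library, while the perceptual side uses a simple average hash, one of several known perceptual hashing schemes, applied to a tiny grayscale grid that is assumed to have been downscaled already. The function names and the `max_distance` threshold are hypothetical.

```python
import hashlib
from collections import defaultdict


def exact_duplicate_groups(files):
    """Group file names whose bytes are identical, using SHA-256 digests."""
    groups = defaultdict(list)
    for name, data in files.items():
        groups[hashlib.sha256(data).hexdigest()].append(name)
    return [sorted(names) for names in groups.values() if len(names) > 1]


def average_hash(pixels):
    """One bit per pixel of a small grayscale grid: 1 if brighter than the mean."""
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    return [1 if p > mean else 0 for p in flat]


def hamming_distance(a, b):
    """Count the bit positions in which two equal-length hashes differ."""
    return sum(x != y for x, y in zip(a, b))


def is_near_duplicate(pixels_a, pixels_b, max_distance=1):
    """Flag two grids as near-duplicates if their hashes differ in few bits.

    max_distance is a hypothetical threshold; a looser value catches more
    true duplicates but also produces more false positives.
    """
    distance = hamming_distance(average_hash(pixels_a), average_hash(pixels_b))
    return distance <= max_distance


# Toy data: two byte-identical files and one that differs by a single byte.
files = {"a.tif": b"scan-bytes", "b.tif": b"scan-bytes", "c.tif": b"scan-bytez"}
exact_groups = exact_duplicate_groups(files)

# Toy 2x2 grids: the same picture scanned slightly brighter, and a different one.
scan = [[10, 200], [220, 30]]
brighter_scan = [[20, 210], [230, 40]]
other_image = [[200, 10], [30, 220]]

same_picture = is_near_duplicate(scan, brighter_scan)
different_picture = is_near_duplicate(scan, other_image)
```

In this toy case the brighter rescan has different pixel values, so a byte-level hash would treat it as a new file, yet its average hash is identical to the original's. Where the threshold is set is the judgment call that produces the false positives curators then have to review.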
The Zürcher Hochschule der Künste (ZHdK), based at the Toni-Areal in Zürich-West, has been piloting machine-learning-based deduplication tools as part of a research collaboration with its interaction design department. The pilot, which began in autumn 2025, focuses on the institution's own digital image archive rather than public collections, but the methodology is directly transferable. Results from that internal project have not yet been published.
For institutions operating under Switzerland's Bundesgesetz über die Archivierung (the Federal Act on Archiving), permanent deletion of any digitised record carries legal risk and requires documented approval chains. That regulatory reality means the storage problem tends to persist longer in the public sector than in commercial contexts, where a straightforward cost-benefit calculation usually drives faster action.
Institutions with upcoming contract renewals or digitisation tenders would benefit from including explicit deduplication benchmarks — stated as maximum acceptable duplication percentages — in their service-level agreements rather than treating cleanup as an afterthought. The Stadtarchiv's contract cycle runs through late 2027, making the next 18 months the practical window for embedding those standards before the next major ingestion phase begins.