Zurich's public institutions are sitting on a problem that has quietly ballooned for more than a decade: digital archives riddled with duplicate images, redundant scans and conflicting file versions that are costing storage budget, confusing researchers and, in some cases, producing errors in official records. The cantonal administration, ETH Zurich's library network and the Stadtarchiv on Alfred-Escher-Strasse have each reached a decision point. How they handle deduplication over the next 18 months will shape how the city's digital memory is managed for a generation.
The issue has sharpened because several institutions are now migrating to new long-term preservation systems. When a legacy archive is imported into new infrastructure, every duplicate is carried across, amplified and re-indexed, turning a manageable annoyance into a structural liability. For institutions like the Zentralbibliothek Zürich on Zähringerplatz, which holds photographic collections running into the hundreds of thousands of images, the duplication rate in digitised historical holdings can be high enough to distort search results and inflate reported collection sizes.
Why the Deduplication Decision Is Harder Than It Looks
The instinct is to delete duplicates automatically. Archivists resist that instinct, and for good reason. Two image files that look identical may carry different metadata, different provenance records or different rights clearances. A scan of a 1930s photograph of Niederdorf taken for one purpose may have been re-scanned for a separate project with corrected colour calibration. Both matter. Deleting the wrong version destroys context.
ETH Zurich's library has been piloting a structured deduplication workflow as part of its broader e-Rara digitisation programme, which hosts historical Swiss printed works and periodicals. The pilot has required staff to distinguish between true duplicates, meaning byte-for-byte identical files, and near-duplicates, where human review is essential. That distinction sounds simple; operationally, at scale, it is not. Deciding which file becomes the master record, which is archived as a secondary copy and which is flagged for deletion requires a policy framework, and most cantonal institutions have not yet formalised one.
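For readers who want to see what the "true duplicate" half of that distinction looks like in practice, the following is a minimal sketch, not a description of ETH's actual workflow. It assumes files sit in an ordinary directory tree and that SHA-256 is an acceptable fingerprint; the function names `sha256_of` and `group_true_duplicates` are hypothetical. It only reports groups of byte-identical files. Near-duplicates, such as a re-scan with corrected colour calibration, produce different hashes and would still need human review.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large scans are never read into memory whole."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def group_true_duplicates(root: Path) -> list[list[Path]]:
    """Return groups of files under `root` whose contents are byte-identical.

    Hypothetical helper: it reports duplicates and deletes nothing, leaving
    the choice of master record to whatever policy the institution adopts.
    """
    by_digest: dict[str, list[Path]] = defaultdict(list)
    for path in sorted(root.rglob("*")):
        if path.is_file():
            by_digest[sha256_of(path)].append(path)
    # Keep only digests shared by more than one file.
    return [paths for paths in by_digest.values() if len(paths) > 1]
```

The sketch stops at reporting on purpose: as the archivists quoted above argue, two identical files can still carry different provenance or rights records, so deletion is a separate, policy-driven step.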
The cost question is also pressing. Cloud storage is not free. Swiss government data-hosting requirements push many cantonal bodies toward domestic providers or the Federal Data Processing Centre, where costs per terabyte are substantially higher than commercial international alternatives. Every duplicate image file kept indefinitely has a real, recurring price. Institutions that have not audited their holdings in the past three years are almost certainly paying for redundant data they cannot easily locate or justify.
What the Next Decisions Look Like in Practice
Three concrete choices are coming. First, institutions need to decide whether deduplication is a one-time cleanup or a continuous process embedded in ingest workflows. The Stadtarchiv model, if adopted more broadly, would require every new batch of images to pass through a hash-matching check before being added to the permanent collection — stopping the problem from regenerating. Second, a shared metadata standard across cantonal bodies would allow the Zentralbibliothek, the Staatsarchiv des Kantons Zürich on Winterthurerstrasse and ETH's collections to cross-reference holdings without duplicating them across institutions. Talks toward a unified standard have been ongoing since at least 2023 but have not produced a binding agreement.
Third, and most politically sensitive, is the question of who pays for remediation. Digitisation grants from Pro Helvetia and the Swiss National Science Foundation funded the creation of many of these archives; neither body has a standing mandate to fund the cleanup of the problems that digitisation created.
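The first of those choices, a hash-matching check at ingest, is the most mechanical of the three and can be illustrated briefly. What follows is a sketch under stated assumptions rather than the Stadtarchiv's system: the class name `IngestGate`, the in-memory dictionary standing in for a persistent index, and the "accepted"/"duplicate" labels are all hypothetical. A flagged file is reported alongside the record it matches, not discarded, so that a person can compare metadata before anything is removed.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class IngestGate:
    """Hypothetical ingest check: flags files already held in the collection."""

    # Maps SHA-256 digest -> identifier of the record already in the collection.
    # A real system would keep this index in a database, not in memory.
    known: dict[str, str] = field(default_factory=dict)

    def check(self, record_id: str, content: bytes) -> tuple[str, Optional[str]]:
        """Return ("accepted", None) for new content, or
        ("duplicate", existing_record_id) for a byte-identical match."""
        digest = hashlib.sha256(content).hexdigest()
        existing = self.known.get(digest)
        if existing is not None:
            # Do not register the newcomer; hand it to review instead.
            return ("duplicate", existing)
        self.known[digest] = record_id
        return ("accepted", None)


gate = IngestGate()
print(gate.check("scan-001", b"image bytes"))  # ('accepted', None)
print(gate.check("scan-002", b"image bytes"))  # ('duplicate', 'scan-001')
```

Running the check on every incoming batch is what turns deduplication from a one-time cleanup into a continuous process; the cost is that the hash index itself becomes a piece of infrastructure someone has to maintain.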
The timetable is not abstract. The canton of Zurich's digital government programme has a review milestone set for early 2027, at which point agencies will report on storage efficiency and data quality. Institutions that arrive at that review without a deduplication policy risk having one imposed on them from above — a scenario archivists regard as far worse than designing their own solution now. The decisions made in the coming months will determine whether Zurich's digital collections become a model of managed, reliable public memory or an increasingly unwieldy heap of redundant files that no one quite trusts.