Duplicate images have quietly become a significant budget and efficiency problem for Zurich's public sector. A 2025 internal audit of digital asset management across several cantonal departments found that between 28 and 34 percent of all stored image files were exact or near-exact duplicates: redundant copies occupying server space that the canton pays to maintain. The audit's summary is published on the canton's open-data portal.
The timing matters. Zurich is mid-way through a CHF 40 million digitisation programme launched in 2023 to modernise cantonal records management, migrate paper archives to searchable databases, and consolidate storage infrastructure across departments clustered around the Walcheturm administrative complex in Kreis 5. When roughly a third of what you are storing turns out to be a copy of something you already have, the economics of that programme shift uncomfortably.
What the Data Actually Shows
The numbers are more specific than the headline figure suggests. The cantonal IT directorate's 2025 report identified approximately 1.4 petabytes of total image data held across 11 departments. Of that, an estimated 420 terabytes, or about 30 percent, consisted of duplicate or near-duplicate files: images that differ only in compression level, filename, or minor metadata tagging. Under the canton's contract with a European-domiciled provider, managed cloud storage costs roughly CHF 18 per terabyte per month. At that rate the redundant files cost about CHF 90,700 a year to keep (420 TB × CHF 18 × 12 months), money that could fund two junior archivists' salaries at cantonal pay scales.
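The arithmetic behind that annual figure is easy to check. The short Python sketch below uses only the numbers quoted from the report; the conversion of 1.4 petabytes to 1,400 terabytes assumes decimal units, and the function name is illustrative rather than anything the canton uses.

```python
# Figures as quoted from the cantonal IT directorate's 2025 report.
TOTAL_TB = 1400          # 1.4 PB, assuming decimal units (1 PB = 1,000 TB)
DUPLICATE_TB = 420       # estimated duplicate or near-duplicate image data
CHF_PER_TB_MONTH = 18    # approximate managed cloud storage rate


def annual_storage_cost(terabytes: float, chf_per_tb_month: float) -> float:
    """Yearly cost in CHF of keeping `terabytes` in storage."""
    return terabytes * chf_per_tb_month * 12


duplicate_share = DUPLICATE_TB / TOTAL_TB
duplicate_cost = annual_storage_cost(DUPLICATE_TB, CHF_PER_TB_MONTH)

print(f"Duplicate share of image data: {duplicate_share:.0%}")
print(f"Annual cost of duplicates: CHF {duplicate_cost:,.0f}")
```

The 30 percent share sits inside the audit's 28 to 34 percent range, and the yearly cost comes out at CHF 90,720.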
The problem is not unique to government. ETH Zurich's library and research data services team, based on the Hönggerberg campus, published a working paper in late 2024 noting that research image datasets submitted for long-term preservation contained duplication rates averaging 19 percent across 60 submitted projects in the natural sciences. For a university that processes hundreds of terabytes of microscopy, satellite and clinical imaging data annually, even a 19 percent redundancy rate translates into non-trivial infrastructure costs and, more critically, into slower retrieval times for researchers querying large datasets.
The city of Zurich's own Stadtarchiv, housed on Neumarkt in the Altstadt, faces a different version of the same problem. Its photographic collection, which spans analogue scans, press photography donations and municipal event documentation, had grown to more than 2.3 million digital image files by the end of 2024. In a pilot project using open-source deduplication tools, archive staff found that roughly 310,000 of those files, about 13 percent, were duplicates introduced during batch scanning between 2017 and 2022, when file-naming conventions were inconsistent across scanning contractors.
Fixing It Is Not Simple
Automated deduplication software can catch exact copies using hash matching, which computes a cryptographic digest of each file's bytes and treats files with identical digests as identical. Near-duplicates, where an image has been cropped, resized or re-exported, are harder, because any change to the bytes produces a completely different digest. Perceptual hashing algorithms, which compare visual similarity rather than raw data, can identify those near-matches but carry a false-positive risk. For an archive holding historically significant photographs of the Langstrasse neighbourhood's transformation or the 2002 opening of the Letten riverside path, deleting a file that merely looks like another could mean losing a genuinely distinct image.
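The difference between the two approaches can be shown with a toy example. The Python sketch below is an illustration, not the method any of the institutions named here use: it treats a tiny 4×4 grid of grayscale values as an "image", compares SHA-256 digests of the raw values, and then compares a simple average hash, one of several perceptual hashing schemes. Real tools would first decode and downscale each image with an image library, a step skipped here. All sample data is invented.

```python
import hashlib

Pixels = list[list[int]]  # rows of grayscale values, 0 to 255


def sha256_of(pixels: Pixels) -> str:
    """Exact fingerprint: changes completely if any single value changes."""
    return hashlib.sha256(bytes(v for row in pixels for v in row)).hexdigest()


def average_hash(pixels: Pixels) -> int:
    """Perceptual fingerprint: one bit per pixel, set if brighter than the mean."""
    flat = [v for row in pixels for v in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for value in flat:
        bits = (bits << 1) | int(value > mean)
    return bits


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two hashes differ."""
    return bin(a ^ b).count("1")


original = [
    [10, 10, 200, 200],
    [10, 10, 200, 200],
    [200, 200, 10, 10],
    [200, 200, 10, 10],
]
# Simulated re-export: every value nudged slightly, as lossy compression might.
reexported = [[min(255, v + 3) for v in row] for row in original]
# A genuinely different picture with the same overall brightness.
different = [[200, 10, 200, 10] for _ in range(4)]

print("Exact digests match:", sha256_of(original) == sha256_of(reexported))
print("Distance to re-export:", hamming(average_hash(original), average_hash(reexported)))
print("Distance to other image:", hamming(average_hash(original), average_hash(different)))
```

The exact digests do not match, yet the perceptual distance to the re-export is 0 while the distance to the unrelated image is 8 of 16 bits. Where to set the cut-off between those extremes is a judgment call, and that is where the false-positive risk comes from.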
The practical path forward for Zurich's institutions involves three steps that archivists and IT departments are already discussing. First, standardise file-naming and metadata tagging at the point of ingestion rather than retrospectively. Second, run perceptual hash audits on existing collections, with human review of flagged pairs before any deletion. Third, establish a shared deduplication infrastructure across cantonal departments rather than having each run its own tools independently. The canton's IT co-ordination office has included deduplication standards in the draft specifications for the next phase of the digitisation programme, expected to go to the cantonal parliament for approval in autumn 2026.
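The second of those steps, flagging pairs for a person to review rather than deleting anything automatically, might look something like the following Python sketch. It assumes perceptual hashes have already been computed and stored as integers; the file names, hash values and the `max_distance` threshold are all hypothetical.

```python
from itertools import combinations


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two hashes differ."""
    return bin(a ^ b).count("1")


def flag_pairs_for_review(
    hashes: dict[str, int], max_distance: int = 2
) -> list[tuple[str, str, int]]:
    """Return (file_a, file_b, distance) for pairs that look alike.

    Nothing is deleted here: the result is a worklist for a human reviewer,
    ordered so the closest matches come first.
    """
    flagged = []
    for (name_a, hash_a), (name_b, hash_b) in combinations(sorted(hashes.items()), 2):
        distance = hamming(hash_a, hash_b)
        if distance <= max_distance:
            flagged.append((name_a, name_b, distance))
    return sorted(flagged, key=lambda pair: pair[2])


# Hypothetical 16-bit perceptual hashes for three scans.
sample_hashes = {
    "scan_0001.tif": 0b1111000011110000,
    "scan_0001_copy.tif": 0b1111000011110001,
    "scan_0417.tif": 0b0000111100001111,
}

review_queue = flag_pairs_for_review(sample_hashes)
for file_a, file_b, distance in review_queue:
    print(f"Review: {file_a} vs {file_b} (distance {distance})")
```

Comparing every pair is fine for a demonstration but does not scale: a collection of 2.3 million files would mean trillions of comparisons, so a production audit would need some form of indexing or bucketing before the pairwise check.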
For institutions that have not yet run an audit, the starting point is straightforward: an open-source tool like dupeGuru can generate a duplication report on a file system in hours, and Apache Tika, a content and metadata extraction toolkit, can supply the file metadata that more archival-focused workflows compare. The Stadtarchiv's pilot project cost an estimated CHF 8,000 in staff time. The 310,000 files it flagged for review represent roughly CHF 67,000 in annual storage costs if left unaddressed, a return on that investment that is difficult to argue against.
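How quickly such a pilot pays for itself depends on how many flagged files survive human review and are actually removed. The Python sketch below works through the Stadtarchiv figures quoted above; the `confirmed_share` parameter is a hypothetical knob added for illustration, not a number from the pilot.

```python
PILOT_COST_CHF = 8_000            # estimated staff time for the pilot
ANNUAL_STORAGE_COST_CHF = 67_000  # yearly cost of storing the flagged files


def payback_months(
    one_off_cost: float, annual_saving: float, confirmed_share: float = 1.0
) -> float:
    """Months until the saving covers the one-off cost.

    `confirmed_share` is the fraction of flagged files that reviewers confirm
    as duplicates and delete (1.0 means every flagged file is removed).
    """
    monthly_saving = annual_saving * confirmed_share / 12
    return one_off_cost / monthly_saving


all_confirmed = payback_months(PILOT_COST_CHF, ANNUAL_STORAGE_COST_CHF)
half_confirmed = payback_months(PILOT_COST_CHF, ANNUAL_STORAGE_COST_CHF, 0.5)

print(f"Payback if all flagged files are removed: {all_confirmed:.1f} months")
print(f"Payback if half are removed: {half_confirmed:.1f} months")
```

Even if reviewers confirm only half of the flagged files, the pilot's cost is recovered in about three months on these figures.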