Zurich's public institutions collectively manage hundreds of terabytes of digital image data, and a growing share of it is exact or near-exact copies of files that already exist somewhere else in the same system. That is not a trivial housekeeping problem. According to storage benchmarking research published by ETH Zurich's Information Science group, duplicate and near-duplicate files can account for between 20 and 40 percent of total unstructured data held by large organisations — a figure that translates directly into wasted infrastructure spend and inflated asset counts that mislead decision-makers.
The issue lands with particular force right now because Swiss federal and cantonal digitisation programmes are accelerating. Under Digitale Verwaltung Schweiz, the joint federal and cantonal digital administration agenda, and the canton of Zurich's own digital transformation roadmap, dozens of offices have been pushed to scan legacy paper records and ingest photographic archives at volume. When ingestion pipelines lack deduplication logic at the point of entry, the same image (a planning photograph of a Kreis 5 building facade, for example, or a satellite tile over Zürichsee) can enter a system five or six times via different upload channels, each instance logged as a discrete asset.
What the Numbers Look Like on the Ground
Storage is not cheap, even at institutional scale. Colocation rack space at data centres in the greater Zurich area, including facilities along the Glattal corridor near Kloten, costs roughly CHF 800 to CHF 1,400 per rack unit per year depending on redundancy tier and power draw. If a cantonal archive is carrying 30 percent duplicate image data across 500 terabytes of object storage, roughly 150 terabytes are redundant. The direct infrastructure cost of those files runs into tens of thousands of francs annually before staff time is counted, with the exact figure depending on how much usable capacity each rack unit holds.
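The arithmetic behind that estimate can be sketched in a few lines. The 500 terabytes, the 30 percent duplicate share, and the CHF 800 to CHF 1,400 price range come from the figures above. The storage density of 5 terabytes of usable capacity per rack unit is a hypothetical value chosen only for illustration, and denser hardware lowers the result proportionally.

```python
def duplicate_storage_cost(total_tb, duplicate_share, tb_per_rack_unit, chf_per_rack_unit_year):
    """Annual colocation cost in CHF attributable to duplicate data."""
    duplicate_tb = total_tb * duplicate_share
    rack_units = duplicate_tb / tb_per_rack_unit
    return rack_units * chf_per_rack_unit_year

# From the text: 500 TB total, 30 percent duplicates, CHF 800 to 1,400 per rack unit.
# ASSUMPTION (not from the text): 5 TB of usable capacity per rack unit.
TB_PER_RACK_UNIT = 5.0

low = duplicate_storage_cost(500, 0.30, TB_PER_RACK_UNIT, 800)
high = duplicate_storage_cost(500, 0.30, TB_PER_RACK_UNIT, 1400)
print(f"CHF {low:,.0f} to CHF {high:,.0f} per year")
```

Under that assumed density the redundant 150 terabytes occupy 30 rack units. Any institution repeating the calculation should substitute its own measured density and contract price.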
The problem compounds in analytics. When a researcher at the University of Zurich's Digital Society Initiative on Rämistrasse pulls an image dataset for a computer-vision training run, duplicate files skew class distributions and inflate apparent dataset size. A model trained on a corpus where 35 percent of images are duplicates will usually perform worse on unseen data than its benchmark results suggest, particularly when copies of the same image land in both the training and evaluation splits. That gap between reported dataset size and effective dataset size is a reproducibility hazard that has drawn formal attention from the Swiss National Science Foundation, which funds a significant portion of applied machine-learning research at Zurich institutions.
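A minimal sketch of the gap between reported and effective dataset size, using a toy corpus of hypothetical data. It removes only byte-identical copies via SHA-256, so it illustrates the counting problem and would miss resized or recompressed near-duplicates.

```python
import hashlib
from collections import Counter

def effective_dataset(samples):
    """Drop byte-identical images, keeping the first copy of each.

    samples: iterable of (image_bytes, label) pairs.
    Returns the deduplicated list of (image_bytes, label) pairs.
    """
    seen = set()
    unique = []
    for image_bytes, label in samples:
        digest = hashlib.sha256(image_bytes).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append((image_bytes, label))
    return unique

# Toy corpus (hypothetical data): the facade image was uploaded three times.
corpus = [
    (b"facade-pixels", "building"),
    (b"facade-pixels", "building"),
    (b"facade-pixels", "building"),
    (b"lake-pixels", "water"),
    (b"tram-pixels", "vehicle"),
]

unique = effective_dataset(corpus)
raw_classes = Counter(label for _, label in corpus)
unique_classes = Counter(label for _, label in unique)

print(len(corpus), len(unique))   # reported size vs effective size
print(dict(raw_classes))          # class distribution with duplicates
print(dict(unique_classes))       # class distribution after deduplication
```

In this toy case the "building" class looks three times as large as it is, which is the kind of skew described above. Deduplicating before the train and evaluation split also keeps copies of one image from landing on both sides.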
Deduplication itself has a measurable track record. Perceptual hashing algorithms, which compare visual content rather than raw file bytes, can identify near-duplicates (resized, recompressed, or re-encoded copies) that byte-level checksums miss. A 2024 audit framework piloted by a German federal archive and later referenced in European digital preservation circles found that content-aware deduplication reduced effective image corpus size by 28 percent in a collection of approximately 4 million files, with processing time under 72 hours on commodity hardware. Zurich's Stadtarchiv on Neumarkt has not published comparable figures for its own holdings, but archivists in peer institutions describe similar ratios as typical.
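To illustrate why a perceptual hash catches what a checksum misses, here is a minimal sketch of an average hash (aHash), one of the simplest perceptual hashes. It is not the DCT-based method used by pHash, and it is not the method of the audit mentioned above. The 8x8 grayscale grids are hypothetical test data, and a real pipeline would first downscale and convert each image.

```python
import hashlib

def average_hash(pixels):
    """Average hash (aHash) of a small grayscale image.

    pixels: 2D list of grayscale values (0-255), assumed already
    downscaled to a small grid such as 8x8. Returns an int whose bits
    are 1 where a pixel is brighter than the image mean.
    """
    flat = [value for row in pixels for value in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for value in flat:
        bits = (bits << 1) | (1 if value > mean else 0)
    return bits

def hamming_distance(a, b):
    """Number of bit positions where two hashes differ."""
    return bin(a ^ b).count("1")

def checksum(pixels):
    """Byte-level SHA-256 of the raw pixel values."""
    return hashlib.sha256(bytes(v for row in pixels for v in row)).hexdigest()

# Hypothetical 8x8 image: a horizontal gradient, dark on the left.
original = [[col * 32 for col in range(8)] for _ in range(8)]
# Near-duplicate: the same picture, uniformly brightened by 5 levels.
brightened = [[min(255, v + 5) for v in row] for row in original]
# Different picture: the gradient mirrored left to right.
mirrored = [list(reversed(row)) for row in original]

print(checksum(original) == checksum(brightened))                           # False
print(hamming_distance(average_hash(original), average_hash(brightened)))   # 0
print(hamming_distance(average_hash(original), average_hash(mirrored)))     # 64
```

The brightened copy has a different checksum but an identical perceptual hash, while the mirrored image differs in every bit. Production tools typically treat hashes within a small Hamming distance as near-duplicates, and the right threshold has to be tuned per collection.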
What Organisations Should Do Next
The practical path forward involves three steps that cost relatively little compared to the ongoing storage bill. First, any institution running image ingest pipelines should run perceptual hash checks before files are written to long-term storage; open-source tools such as pHash (a C++ library) and ImageHash (a Python package) fit into standard ingest workflows. Second, existing archives should schedule a one-time deduplication audit; for a collection under 50 terabytes, the compute cost at Zurich cloud-market rates is a few hundred francs. Third, asset counts used in budget reporting and grant applications should be recalibrated to reflect unique-image totals rather than raw file counts, since raw counts overstate holdings at institutions that have not yet run deduplication passes.
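The first and third steps can be sketched together as an ingest-time gate that also tracks raw and unique counts. This is a minimal illustration under stated assumptions: `IngestGate`, the asset ids, the 64-bit hash values, and the threshold of 4 bits are all hypothetical. In a real pipeline the hashes would come from a perceptual hashing library such as ImageHash.

```python
class IngestGate:
    """Rejects images whose perceptual hash is close to one already stored."""

    def __init__(self, max_distance=4):
        # ASSUMPTION: hashes differing in at most 4 of 64 bits count as duplicates.
        self.max_distance = max_distance
        self.stored = {}      # asset_id -> perceptual hash (int)
        self.raw_count = 0    # every upload seen, duplicates included

    def admit(self, asset_id, phash):
        """Store the asset and return None, or return the id it duplicates."""
        self.raw_count += 1
        for existing_id, existing_hash in self.stored.items():
            if bin(phash ^ existing_hash).count("1") <= self.max_distance:
                return existing_id
        self.stored[asset_id] = phash
        return None

    @property
    def unique_count(self):
        """Number of distinct images actually written to storage."""
        return len(self.stored)

gate = IngestGate()
# Hypothetical uploads arriving via different channels.
uploads = [
    ("web-001", 0x0F0F0F0F0F0F0F0F),
    ("email-017", 0x0F0F0F0F0F0F0F0E),  # 1 bit off: near-duplicate of web-001
    ("scan-204", 0xF0F0F0F0F0F0F0F0),   # unrelated image
    ("ftp-332", 0x0F0F0F0F0F0F0F0F),    # exact duplicate of web-001
]
results = {asset_id: gate.admit(asset_id, h) for asset_id, h in uploads}

print(results)
print(gate.raw_count, gate.unique_count)  # raw file count vs unique-image total
```

The linear scan over stored hashes is fine for a demonstration but grows with collection size. An archive with millions of files would likely need an index built for Hamming-distance lookups.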
For organisations that procure storage through UBS or other large corporate IT frameworks (still relevant given the ongoing integration of Credit Suisse infrastructure into unified UBS systems), the deduplication ratios negotiated in enterprise-level contracts can materially affect the true cost per gigabyte. The numbers are there. Acting on them is now the question.