Abstract
Large-scale web-scraped datasets have contributed significantly to progress in deep learning, yet the pervasive presence of biometric data, such as faces, raises legitimate legal, ethical, and privacy concerns. Existing approaches address this either by removing sensitive images entirely, often at the cost of downstream performance, or by purchasing the right to use licensed images. We present FaceSafe, a novel privacy-preserving transformation pipeline that uses a diffusion-based inpainting model to systematically replace detected faces in images with synthetic variants conditioned on different demographic attributes, yielding a privacy-preserving dataset. Evaluated on 12,000 transformed images drawn from LAION-400M and CelebA-HQ, FaceSafe eliminates privacy risks without significant loss of image quality or diversity.