Subject: Protecting your audio from Artificial Intelligence (Video at bottom of page)
TapeRot INTRODUCTION: Neural voice cloning has progressed rapidly from needing many hours of training data to "zero-shot" or few-shot models that can produce convincing imitations of a target speaker from short audio samples. The public availability of these systems, including Retrieval-based Voice Conversion (RVC) and Chatterbox among many others, has created a dire need for proactive protection of voice recordings. Existing protection approaches fall into two broad classes: reactive systems and proactive systems. Reactive systems address misuse after a clone has been produced and distributed. Proactive systems modify the source audio before publication so that downstream cloning receives a degraded signal. TapeRot belongs to the second class. The goal of TapeRot is to introduce carefully shaped adversarial perturbations that remain near or below the threshold of human auditory detectability while actively redirecting the internal representations that a neural cloning system relies on.
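To make "near or below the threshold of detectability" a bit more concrete, here is a minimal sketch that measures how loud a perturbation is relative to the clip it is added to. This is only a crude proxy and not how TapeRot decides audibility: plain signal-to-perturbation ratio ignores frequency masking entirely. The function name `snr_db`, the 440 Hz test tone and the 3 kHz stand-in perturbation are all hypothetical values chosen for illustration.

```python
import math

def snr_db(clean, protected):
    """Signal-to-perturbation ratio in dB between two equal-length clips."""
    if len(clean) != len(protected):
        raise ValueError("clips must have the same length")
    signal_power = sum(x * x for x in clean)
    noise_power = sum((p - x) ** 2 for x, p in zip(clean, protected))
    if noise_power == 0.0:
        return math.inf       # identical clips: nothing was added
    if signal_power == 0.0:
        return -math.inf      # silent clip: any perturbation dominates
    return 10.0 * math.log10(signal_power / noise_power)

# Toy one-second clip: a 440 Hz tone at 16 kHz (assumed values).
sr = 16000
clean = [0.5 * math.sin(2 * math.pi * 440 * n / sr) for n in range(sr)]
# Hypothetical perturbation at 1% of the tone's amplitude.
delta = [0.005 * math.sin(2 * math.pi * 3000 * n / sr) for n in range(sr)]
protected = [x + d for x, d in zip(clean, delta)]

print(round(snr_db(clean, protected), 1))  # amplitude ratio of 100, so about 40 dB
```

A higher number means a quieter perturbation. A real audibility check would compare the perturbation against a masking threshold per frequency band rather than against total energy.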
Subject: Protecting your audio from Artificial Intelligence
TapeRot THE PROBLEM: The human voice is a unique sonic identity. Artists, voice actors, content creators and the average person all rely on their voice for communication, and herein lies the PROBLEM: as little as a few seconds of audio is enough to successfully clone your voice. That's all that's needed to make you say things you never would have. It has become so simple and easy that anyone with an internet connection can do it. Your voice can be used to convince your family to send a scammer money, a voice clone can be made to say a racial slur and cost you your job, or it can be used to simply cut you out of a business deal if you're, say, a voice actor or content creator. Stan Lee is dead and they are STILL using his likeness to make videos and content, which is ethically gross and brings us right to our next point: the law cannot keep up. There is virtually NO legislation that prevents someone from stealing your voice and likeness and using it. So I created TapeRot.
Subject: Protecting your audio from Nonconsensual AI Voice Cloning
TapeRot ABSTRACT: The core premise rests on perturbation. TapeRot is a per-clip adversarial perturbation infusion (PCAPI) system for protecting voice recordings from unauthorized cloning. Given an audio file containing a target speaker, TapeRot produces a perceptually similar protected output whose extracted features actively redirect downstream voice cloning systems away from reproducing the speaker's identity and content. TapeRot v combines a psychoacoustically masked projected gradient descent (PGD) optimization loop with a stacked surrogate ensemble that includes both generic speaker and content embedding models and target-aligned surrogates: components that are bit-identical to the actual encoders used inside specific voice cloning systems. A differentiable salience divergence loss (DSDL) attacks RVC's primary F0 information, which bypasses RVC's FAISS content retrieval and therefore cannot be "denoised" back towards the training distribution at inference time. We describe the threat model, the surrogate-target gap problem and our target alignment response, the role of FAISS retrieval as an unintended adversarial defense in RVC, the F0 attack mechanism, and the full attack composition behind a single user-facing "strength" parameter. We report qualitative effects against Chatterbox (a popular zero-shot voice cloner) and discuss the limitations of surrogate-space metrics relative to ground-truth cloner audits.
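The masked PGD loop and surrogate ensemble described above can be illustrated with a toy sketch. This is not TapeRot's implementation. Everything here is an assumption made so the example runs with the standard library alone: the "surrogates" are small random linear maps standing in for neural encoders, the "decoys" are hypothetical target embeddings to steer towards, and `mask` is a per-sample amplitude bound used as a crude stand-in for a psychoacoustic masking threshold (a real one would be computed per frequency band). The names `loss_and_grad` and `masked_pgd` are made up for this example.

```python
import random

def matvec(W, v):
    """Multiply matrix W (list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def loss_and_grad(surrogates, x, delta, decoys):
    """Sum over surrogates of ||W(x + delta) - decoy||^2, plus its gradient in delta."""
    xp = [a + b for a, b in zip(x, delta)]
    total, grad = 0.0, [0.0] * len(x)
    for W, target in zip(surrogates, decoys):
        residual = [e - t for e, t in zip(matvec(W, xp), target)]
        total += sum(r * r for r in residual)
        for row, r in zip(W, residual):
            for j, w in enumerate(row):
                grad[j] += 2.0 * r * w   # d/d(delta_j) of r^2
    return total, grad

def masked_pgd(surrogates, x, decoys, mask, step=0.002, iters=50):
    """Signed-gradient descent on delta, projected onto the box |delta_j| <= mask_j."""
    delta = [0.0] * len(x)
    for _ in range(iters):
        _, grad = loss_and_grad(surrogates, x, delta, decoys)
        for j, g in enumerate(grad):
            sign = (g > 0) - (g < 0)
            moved = delta[j] - step * sign                    # descent step
            delta[j] = max(-mask[j], min(mask[j], moved))     # projection
    return delta

rng = random.Random(0)
n, d = 32, 4                                   # toy clip length and embedding size
x = [rng.uniform(-1, 1) for _ in range(n)]     # stand-in for an audio clip
surrogates = [[[rng.gauss(0, 1) for _ in range(n)] for _ in range(d)]
              for _ in range(2)]               # two linear "encoders" (assumed)
decoys = [[rng.gauss(0, 1) for _ in range(d)] for _ in surrogates]
mask = [0.01 + 0.04 * abs(v) for v in x]       # louder samples tolerate larger edits

delta = masked_pgd(surrogates, x, decoys, mask)
before, _ = loss_and_grad(surrogates, x, [0.0] * n, decoys)
after, _ = loss_and_grad(surrogates, x, delta, decoys)
print(f"distance to decoys: {before:.2f} -> {after:.2f}")
```

The two points to take from the sketch are that the loss is summed across every surrogate so a single perturbation has to move all of them, and that the projection step is what keeps the perturbation inside the allowed bound no matter how hard the gradient pushes.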
Subject: Protecting your audio from Artificial Intelligence
TapeRot LIMITATIONS: While TapeRot shows promise, it also has flaws. In this document we describe the v architecture and try to be explicit about what TapeRot can do and what it cannot. Notably, the threat model is per-clip protection under a dominant-corpus assumption: if an adversary's training corpus is dominated by TapeRot-protected audio of the target speaker, then cloning quality is meaningfully degraded. We make no claims of robustness under dilution, where protected audio is a minority of a much larger corpus; the exact ratio at which protection breaks down is still unknown and under active testing. Surrogate-space metrics in this document are reported as proxies, and the final word on protection is largely qualitative.
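As a small worked example of the dominant-corpus assumption, the sketch below computes what share of a scraped corpus, by duration, is protected audio. The function name `protected_fraction` and the clip durations are hypothetical. No pass/fail threshold is applied, because the ratio at which protection stops holding is not yet known.

```python
def protected_fraction(clips):
    """Share of total duration that is protected.

    clips: iterable of (duration_seconds, is_protected) pairs.
    """
    clips = list(clips)  # allow generators, since we iterate twice
    total = sum(duration for duration, _ in clips)
    if total <= 0:
        raise ValueError("corpus has no audio")
    protected = sum(duration for duration, is_protected in clips if is_protected)
    return protected / total

# Hypothetical corpus: two protected uploads and one older unprotected clip.
corpus = [(120.0, True), (300.0, True), (60.0, False)]
frac = protected_fraction(corpus)
print(f"{frac:.3f}")  # 420 s of 480 s -> 0.875
```

In this made-up corpus the protected audio dominates. The dilution case discussed above is the opposite situation, where this fraction is small.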