“Hi kids, do you like violence? Wanna see me dye my hair bright green just like Billie Eilish?” spits Eminem in an impudent flow that recalls the mischievous energy present on his 1999 landmark The Slim Shady LP. He goes on to joke about snatching Donald Trump’s wig, boasts of taking “Xannies” to keep his head straight, and finally ponders which K-Pop girl he’d like to impregnate, as the lo-fi beat hits a similar groove to Eminem’s Labi Siffre-sampling breakthrough single “My Name Is.”
Yet one of the most classic-sounding Eminem verses in recent memory wasn’t created by a resurgent Detroit rap legend, but a bored twentysomething experimenting in his bedroom. “I guess Eminem’s new songs are okay, but the vocal delivery is kind of boring. He’s shouting too much into the mic!” the deepfake song’s mysterious creator 30 Hz tells Billboard. “The lyrics just aren’t [as satirical] and he’s too angry. As a fan, I missed the nasally way Eminem used to sound.”
30 Hz has long produced his own beats, but he’s always been too shy to rap over them himself. Yet thanks to the rise of free deepfake audio software, he has discovered a novel way to get one of his favorite rappers to spit over his production, while also ensuring Eminem eternally sounds 26 years old. “I found out I could use technology to make my favorite artists sound like they were back when they were at their best,” he explains. “I thought it seemed kind of neat.”
The bedroom producer creates deepfakes with Tacotron 2, a text-to-speech model developed by Google that lets users build a convincing model of an artist’s voice by processing hours of audio from their songs and breaking it down into words that can be re-formed into original sentences and flows. 30 Hz uses LJ Speech (a public domain speech dataset consisting of 13,100 short audio clips of a single speaker reading passages) as his base model and then attempts to re-create as many of its phrases as possible in a secondary vocal model, which is fed by hours and hours of reverb-less, studio-quality a cappella recordings featuring a golden-era Slim Shady.
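For readers curious what such a dataset looks like on disk, here is a minimal sketch in Python. It assumes the pipe-delimited `metadata.csv` layout that LJ Speech uses (clip ID, transcript, normalized transcript). The clip IDs, the example sentences and the helper names `write_metadata` and `read_metadata` are hypothetical illustrations, not details of 30 Hz’s own setup.

```python
def write_metadata(rows):
    """Serialize (clip_id, transcript) pairs as LJ Speech-style lines.

    Each line has three pipe-separated fields: clip ID, transcript,
    normalized transcript. This sketch reuses the cleaned transcript
    for both text fields (an assumption made for simplicity).
    """
    lines = []
    for clip_id, transcript in rows:
        # Collapse runs of whitespace so every clip stays on one line.
        text = " ".join(transcript.split())
        if "|" in clip_id or "|" in text:
            raise ValueError("fields must not contain the '|' delimiter")
        lines.append(f"{clip_id}|{text}|{text}")
    return "\n".join(lines) + "\n"


def read_metadata(text):
    """Parse LJ Speech-style lines into {clip_id: normalized transcript}."""
    result = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        clip_id, _raw, normalized = line.split("|")
        result[clip_id] = normalized
    return result


# Hypothetical clips cut from a cappella recordings of the target voice.
example_rows = [
    ("voice-0001", "Hi  kids, do you like parody?"),
    ("voice-0002", "This is a short test sentence."),
]
metadata_text = write_metadata(example_rows)
parsed = read_metadata(metadata_text)
```

Pairing each short clip with its exact transcript is what lets a text-to-speech model learn how written words map onto a particular voice, which is why clean, reverb-free vocals matter so much.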
“My Name Is [the 2021 version]” has over 1 million views on YouTube and has been covered with intrigue by media outlets across the world. Its lyrics were written by a fan of 30 Hz’s YouTube channel and then performed by the deepfake Eminem voice model itself. All of 30 Hz’s songs are written by his subscribers, because he says he has always struggled to write lyrics himself and wants to champion a truly collaborative, human-based creative process that isn’t a million miles away from how pop songs are created in a studio. (It’s more labor of love than work for profit; 30 Hz claims he ensures he doesn’t make any money off his Eminem songs.)
Whether it’s a manically laughing Tom Cruise playing golf, Donald Trump being superimposed into Better Call Saul to talk about the ins and outs of money laundering, or Kim Kardashian boasting about the joys of manipulating people online in order to make stacks of money, video deepfakes have taken the internet by storm in recent years. There were even fears such clips could play an active role in spreading disinformation during the 2020 U.S. election, with The Brookings Institution warning in a report that the rise of hyper-realistic video deepfakes would ultimately “distort democratic discourse; manipulate elections; erode trust in institutions; weaken journalism; exacerbate social divisions; undermine public safety; and inflict hard-to-repair damage on the reputation of prominent individuals.”
However, so far in the music world deepfakes have mostly been used for parody rather than manipulation, with 30 Hz suggesting the time that it takes to create a truly decent voice model means they’re far from becoming a true facet of mainstream music. “It has taken me around 10 months of work to get my Eminem voice model to a place where it sounds really good,” he says. “At best, I found about 70 minutes of audio for Eminem [from that time period] that were good enough for me to use and process into a convincing voice model. Some of the deepfake voices of Trump and Biden have been built from over 48 hours of audio data and they’re nearly indistinguishable from the real thing. If there’s lots of high-quality a cappella audio available of someone talking on the internet then it’s a lot easier to re-create their voice in a convincing way.”
It isn’t just hobbyists creating deepfake audio recreations of famous rappers and uploading the results to YouTube. Los Angeles-based creative tech agency space150, which has high-profile clients including Nike, Fox and Activision, fed all of Travis Scott’s songs into the Lyrebird A.I. algorithm to create a convincing vocal model of the Astroworld rapper. A generative adversarial network then generated a list of words from the Travis voice model, with the agency using the TextGenn program to filter them into what it considered to be cohesive lyrics. The space150 team then curated these lyrics into an original, Auto-Tune-heavy track (“Jack Park Canny Dope Man”) by an artist they lovingly refer to as Travis Bott. Although the lyrics didn’t fully make sense (the final song includes the bizarre bar: “I don’t really wanna f–k your party food”), the agency wanted to ensure everything the listener heard truly came from an AI model.
The subsequent Travis Bott song gained global press attention and even a review from popular YouTuber Anthony Fantano. It’s like something out of a Black Mirror episode. “Around the 2020 U.S. election there was all this unfounded fear that deepfake technology would be used for nefarious means,” says Josh Lundquist, the post-production director at space150. “We wanted to use AI and deepfake audio in a truly creative way, and not in that doomsday machine-kind of way that everyone was predicting. We wanted to show its potential [as a branding tool].”
Referring to the rise of hobbyists like 30 Hz, Lundquist claims deepfake audio has effectively become “online fan fiction, but for music.” He doesn’t personally see much of a moral issue with re-creating an artist’s voice, likening deepfake audio creation to the art of sampling in hip-hop. “I compare deepfake audio to when [artists like] Biz Markie started using a sampler in the mid-’80s for the first time. [They were] re-formulating something old into something completely new. Vocals will go the same way.”
Some could argue, however, that this process is an example of tech-obsessives playing God, effectively robbing artists of the one thing over which they should have true ownership: their own voice. “Everything I do is parody and clearly labelled as Artificial Intelligence vocals,” counters 30 Hz, who believes his work shouldn’t be treated any differently from a “Weird Al” Yankovic sendup.
That said, 30 Hz admits that in the past he’s taken the technology beyond the realm of good taste; an old 2Pac deepfake song that 30 Hz created manipulates the late rapper’s vocals to be about the death of George Floyd and reference the Black Lives Matter movement. This creation runs dangerously close to gaudy corpse reanimation, and feels particularly dodgy given 30 Hz is a white man altering a dead Black man’s vocals to make a political statement. “I regret the 2Pac song,” 30 Hz concedes. “It was the wrong thing to do. I wrote it early on before I had my rule set of only doing parody songs, rather than original work.” (The clip lives on as a “cautionary tale” on YouTube.)
You could also argue 30 Hz’s faux 2Pac song wasn’t ethically a million miles away from turning the late rapper into a hologram that shouted “Wassup Coachella!” at the 2012 festival, with ghoulish touring holograms perhaps one of the precedents for deepfakes. But the technology itself is a very different thing: Holograms are more about mirroring than AI, and although the perception was of a 3D likeness of Tupac Shakur, the Coachella hologram was actually a 2D image. Shakur’s likeness was projected onto an angled piece of glass on the ground, which in turn projected the image onto a Mylar screen on stage.
Whether people want to use deepfake audio as parody or create brand new songs referencing real-world events by artists who have been dead since the 1990s is ultimately irrelevant; right now, there’s very little the music industry can do to stop deepfake music. “The sound of the voice itself is not covered by copyright law,” says Professor Joe Bennett, who is a forensic musicologist at Berklee College of Music, and deals with issues of music copyright infringement and song similarity. “The reason you can impersonate someone with deepfake audio is because there are only two protected objects in copyright law — the musical work and the sound recording. The musical work refers to the song — notes, chords and lyrics. And the sound recording protection can only be applied to a specific track. This means deepfake audio is a grey area, as the voice isn’t considered [by law] as a part of the composition.”
This could explain why Jay-Z was unsuccessful in getting deepfake vocals taken off YouTube, with Bennett insisting that artists may simply have to come to grips with a new reality. “Perhaps the music industry might one day figure something out where Jay-Z’s publishers get X percent for the Jay-Z songs feeding the neural network that generates an AI song, but the fact is the genie is now out of the bottle,” he says. “Artists may just have to accept that thousands of people are now able to recreate their vocal sounds.”
Referring to the lawsuits in 1999 that attempted to punish Napster and its associates for spreading free downloadable music without permission, Bennett says he hopes deepfake creators aren’t treated in a similar fashion. “My sincerest hope is that we don’t see a repeat of the Napster lawsuits,” Bennett says. “Any technological advancement can be used for bad or good, but trying to un-invent it isn’t the solution. This should be embraced. One day I’m sure there will be a Kendrick Lamar voice plug-in — and it’ll be just another new creative tool for artists and songwriters.”
However, deepfake audio has been used as a tool by criminals in the finance world. Rob Volkert, a researcher at NISOS who helps financial firms detect deepfake audio fraud, recounts a story of criminals using deepfake audio to re-create the voice of the owner of a U.K. energy firm. It was so convincing that an employee was tricked by his so-called boss into sending €220,000 ($240,000) to a Hungarian supplier. Yet Volkert doesn’t believe deepfake audio poses anywhere near the same potential for fraudulent activity in the music industry.
“There isn’t a clear path to large financial returns from criminal audio deepfake use in the music industry right now, and we are seeing mostly parody and experimentation by music fans,” he says. “People who want to carry out a phishing attack are unlikely to use deepfake audio right now, as too much work goes into making it sound authentic.” Yet, he warns, “If there was a big criminal success story then maybe that could change and create a domino effect. It just seems very unlikely right now.”
“The technology just isn’t good enough to be completely convincing, as you cannot currently change deepfake vocals to express actual emotions,” confirms Jon Bateman, a fellow in the Cyber Policy Initiative of the Technology and International Affairs Program at the Carnegie Endowment for International Peace. “It is not a hopeless situation; it can be controlled, and the social media firms like YouTube can monitor it, so it’s obvious to users when something is deepfake and not a real song by a verified artist.”
Although U.S. law protects people from disinformation in video deepfakes, audio deepfakes remain a grey area. Bateman says a more obvious threat could be misinformation. “If you listen to a three-minute song by a deepfake Eminem, you will be able to tell it’s fake. But if you listened to a 10-second clip on TikTok where the fake Eminem disses a bunch of people, you might not notice the same technical defects, and libel could be spread.” According to 30 Hz, deepfake audio technology is only one to two years away from being completely life-like. With that in mind, space150’s executive creative director Ned Lampert believes the world is heading for a future where record labels will use deepfake audio as a tool to effectively create immortality for their artists.
“After Peter Cushing’s posthumous CGI performance in Star Wars: Rogue One, it’s likely we will see more actors sign off their likeness to film as CGI characters, so they can continue doing films long after their death,” he says. “It isn’t unfeasible to expect artists to commit their vocals via the same kind of technology that’s powering deepfake music. It would mean we could have a fairly convincing new album by Tupac or Frank Sinatra, with brand new vocals generated by AI. The AI would be working in partnership with a team of songwriters to create new albums.”
Deepfake prevention expert Volkert also speculates, “Maybe we’ve not seen deepfake audio be fully utilized from a marketing or media perspective yet. You can imagine celebrities will let ad agencies create deepfakes of their vocals, so someone like Kim Kardashian can appear in commercials without even having to do anything.”
Whatever the outcome, deepfake audio isn’t going away, and the online communities dedicated to re-creating their favorite artists’ vocals only seem to be growing each day. When 30 Hz first started out, he says, his Discord forum had only a couple of members. “It’s now a couple of thousand,” he laughs, something he says points towards an explosion of new deepfake audio creators popping up during lockdown. “What we do is primarily about parody and experimentation. Lots of us got bored during the pandemic and thought, ‘Why not try this?’
“We’re all competing to create the best possible model,” he continues. “It’s a real community now. Maybe big artists will try and push back to copyright their voices, but it would be insane if they punished everybody who sounded like Jay-Z. We’re not a threat to them. This isn’t about making money, we’re just making the most of a technological evolution.”