The training speech data contains a minimum of 15 hours of high-quality speech recordings sampled at 48 kHz. We segmented the speech into half-phones using forced alignment, i.e., an automatic speech recognition model aligns the known input phone sequence with acoustic features extracted from the speech signal. This segmentation process yields around 1–2 million half-phone units, depending on the amount of recorded speech.
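To make the segmentation step concrete, here is a minimal sketch of turning forced-alignment output into half-phone units. It assumes the aligner produces whole-phone intervals as `(phone, start_sec, end_sec)` tuples and splits each phone at its midpoint; the function name, the `(phone, start, end)` format, and the `_L`/`_R` labels are illustrative assumptions, not the article's actual tooling.

```python
# Illustrative sketch (not the article's pipeline): convert whole-phone
# intervals from a forced aligner into half-phone units by splitting each
# phone at its temporal midpoint.

def to_half_phones(phone_intervals):
    """Split each aligned (phone, start, end) interval into two half-phones."""
    units = []
    for phone, start, end in phone_intervals:
        mid = (start + end) / 2.0
        units.append((phone + "_L", start, mid))  # left half of the phone
        units.append((phone + "_R", mid, end))    # right half of the phone
    return units

# Example: a short alignment of three phones (times in seconds)
alignment = [("sil", 0.00, 0.10), ("h", 0.10, 0.18), ("e", 0.18, 0.30)]
print(to_half_phones(alignment))
```

In a real system the split point would typically come from the acoustic model's state-level alignment rather than the plain midpoint, but the data flow is the same: one aligned phone interval in, two half-phone units out.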
The entire methodology sounds very impressive.
While I've done some basic research on machine learning and deep learning in the recent past, it wasn't nearly enough to keep up with this entire article. This is something I do often: I read an article, and if there are large portions I simply cannot comprehend, I'll do research until I grok it. To that end, I plan to set aside some time in September to do enough research on DL and ML to understand posts like this at a basic level.