All About Vocaloid: Vocaloid

Vocaloid (ボーカロイド Bōkaroido) is a singing synthesizer application. Its signal processing part was developed through a joint research project between the Pompeu Fabra University in Spain and Japan's Yamaha Corporation, which backed the development financially and later developed the software into the commercial product "Vocaloid". The software enables users to synthesize singing by typing in lyrics and melody. It uses synthesizing technology with specially recorded vocals of voice actors or singers. To create a song, the user enters the melody in a piano roll type interface and types the lyrics onto each note. The software can change the stress of the pronunciations, add effects such as vibrato, or change the dynamics and tone of the voice. Each Vocaloid is sold as "a singer in a box" designed to act as a replacement for an actual singer. The software was originally available only in English and Japanese, but Spanish, Chinese and Korean are to be added with Vocaloid 3.
The software is intended for professional musicians as well as light computer music users and has so far been sold on the idea that the only limits are the users' own skills. Japanese musical groups Livetune of Victor Entertainment and Supercell of Sony Music Entertainment Japan have released songs featuring Vocaloid as vocals. Japanese record label Exit Tunes of Quake Inc. has also released compilation albums featuring Vocaloids. Artists such as Mike Oldfield have also used Vocaloids in their work for backup vocals and sound samples.

Technology

The Vocaloid singing synthesizer technology is categorized as concatenative synthesis, which splices and processes vocal fragments extracted from human singing voices in the frequency domain. In singing synthesis, the system produces realistic voices by adding vocal expression information, such as vibrato, to the score information. The Vocaloid synthesis technology was initially called "Frequency-domain Singing Articulation Splicing and Shaping" (周波数ドメイン歌唱アーティキュレーション接続法 Shūhasū-domain Kashō Articulation Setsuzoku-hō), although Yamaha no longer uses this name on its websites. "Singing Articulation" is explained as "vocal expressions" such as vibrato and the vocal fragments necessary for singing. The Vocaloid and Vocaloid 2 synthesis engines are designed for singing, not for reading text aloud. They also cannot naturally replicate singing expressions such as hoarse voices or shouts.

System architecture

The main parts of the Vocaloid 2 system are the Score Editor (Vocaloid 2 Editor), the Singer Library, and the Synthesis Engine. The Synthesis Engine receives score information from the Score Editor, selects appropriate samples from the Singer Library, and concatenates them to output synthesized voices. The Score Editor and the Synthesis Engine provided by Yamaha are essentially the same across different Vocaloid 2 products. If a Vocaloid 2 product is already installed, the user can enable another Vocaloid 2 product by adding its library. The system supports two languages, Japanese and English, although other languages may be added in the future. It works standalone (playback and export to WAV) and as a ReWire application or a VSTi accessible from a DAW.

Score Editor

The Score Editor is a piano roll style editor for entering notes, lyrics, and some expressions. For a Japanese Singer Library, the user can input gojūon lyrics in hiragana, katakana or romaji writing. For an English library, the Editor automatically converts the lyrics into IPA phonetic symbols using the built-in pronunciation dictionary. The user can directly edit the phonetic symbols of unregistered words. Japanese and English libraries differ in the lyrics input method but share the same platform, so the Japanese editor can load an English library and vice versa. Because the lyrics input method depends on the library, the Japanese and English editors differ only in their menus. The Score Editor offers various parameters for adding expression to singing voices. The user is expected to tune these parameters to best fit the synthesized tune when creating voices. The editor supports ReWire and can be synchronized with a DAW. Real-time "playback" of songs with predefined lyrics using a MIDI keyboard is also supported.
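The dictionary lookup for English lyrics can be pictured with a small sketch. This is only an illustration under assumptions: the table `PRONUNCIATIONS`, the function `to_phonemes`, and the `user_entries` argument are hypothetical names, not part of the Vocaloid software, and the symbols are the ones used in this article's own examples.

```python
# Hypothetical stand-in for a built-in pronunciation dictionary.
# The entries reuse the phonetic symbols from this article's examples.
PRONUNCIATIONS = {
    "sing": ["s", "I", "N"],
    "set": ["s", "e", "t"],
}


def to_phonemes(word, user_entries=None):
    """Look up the phonetic symbols for one lyric word.

    Symbols typed in by the user (user_entries) take priority, which
    mimics editing the symbols of an unregistered word by hand.
    Returns None when the word is found in neither table.
    """
    key = word.lower()
    if user_entries and key in user_entries:
        return user_entries[key]
    return PRONUNCIATIONS.get(key)


print(to_phonemes("Sing"))
print(to_phonemes("zing"))
print(to_phonemes("zing", {"zing": ["z", "I", "N"]}))
```

A word missing from both tables yields `None`, which corresponds to the case where the user has to supply the symbols directly.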

Singer Library

Each Vocaloid licensee develops its own Singer Library, a database of vocal fragments sampled from real people. The database must contain all possible combinations of phonemes of the target language, including diphones (a chain of two different phonemes) and sustained vowels, as well as polyphones with more than two phonemes if necessary. For example, the voice corresponding to the word "sing" ([sIN]) can be synthesized by concatenating the sequence of diphones "#-s, s-I, I-N, N-#" (# indicating a voiceless phoneme) with the sustained vowel ī. The Vocaloid system changes the pitch of these fragments to fit the melody. To get more natural sounds, the fragments must be stored in the library at three or four different pitch ranges. Japanese requires 500 diphones per pitch, whereas English requires 2,500. Japanese has fewer diphones because it has fewer phonemes and most syllabic sounds are open syllables ending in a vowel. In Japanese, there are basically three patterns of diphones containing a consonant: voiceless-consonant, vowel-consonant, and consonant-vowel. English, on the other hand, has many closed syllables ending in a consonant, and so also needs consonant-consonant and consonant-voiceless diphones. Thus, more diphones need to be recorded for an English library than for a Japanese one. Because of this linguistic difference, a Japanese library is not suitable for singing in English.
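The "sing" example and the library sizes can be reproduced with a short sketch. The function names `diphone_sequence` and `diphones_to_record` are illustrative assumptions, and the count covers diphones only, ignoring sustained vowels and polyphones.

```python
def diphone_sequence(phonemes, boundary="#"):
    """Return the diphones that join `phonemes`, padded at both ends
    with the boundary mark used in the article's example."""
    padded = [boundary, *phonemes, boundary]
    return [f"{a}-{b}" for a, b in zip(padded, padded[1:])]


def diphones_to_record(diphones_per_pitch, pitch_ranges):
    """Total diphone recordings when every diphone is stored once per pitch range."""
    return diphones_per_pitch * pitch_ranges


print(diphone_sequence(["s", "I", "N"]))  # ['#-s', 's-I', 'I-N', 'N-#']
print(diphones_to_record(500, 3))         # Japanese, three pitch ranges: 1500
print(diphones_to_record(2500, 3))        # English, three pitch ranges: 7500
```

With the figures quoted above, an English library needs five times as many diphone recordings as a Japanese one at the same number of pitch ranges.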

Synthesis Engine

The Synthesis Engine receives score information from the Score Editor in dedicated MIDI messages called Vocaloid MIDI, adjusts the pitch and timbre of the selected samples in the frequency domain, and splices them to synthesize singing voices. When Vocaloid runs as a VSTi accessible from a DAW, the bundled VST plug-in bypasses the Score Editor and sends these messages directly to the Synthesis Engine.

Timing adjustment

In singing voices, the consonant onset of a syllable is uttered before the vowel onset. The starting position of a note, called "Note-On", must coincide with the vowel onset, not with the start of the syllable. Vocaloid keeps the "synthesized score" in memory to adjust sample timing so that the vowel onset falls exactly on the "Note-On" position. Without this adjustment, the vowel would sound late.
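The arithmetic behind this adjustment is simple and can be sketched as follows. This is a minimal illustration, assuming each sample carries a known consonant length in milliseconds; the names `sample_start_ms` and `schedule` are hypothetical and do not describe Vocaloid's internal data.

```python
def sample_start_ms(note_on_ms, consonant_ms):
    """Start the sample early by the consonant length so that the
    vowel onset lands exactly on the Note-On position."""
    return note_on_ms - consonant_ms


def schedule(notes):
    """Map (note_on_ms, consonant_ms) pairs to sample start times."""
    return [sample_start_ms(note_on, consonant) for note_on, consonant in notes]


# Two syllables whose notes start at 1000 ms and 1500 ms,
# with consonants lasting 80 ms and 40 ms.
print(schedule([(1000, 80), (1500, 40)]))  # [920, 1460]
```

If playback instead began at the Note-On position, each vowel would arrive late by the length of its consonant, which is the delay described above.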

Pitch conversion

Since the samples are recorded at different pitches, pitch conversion is required when concatenating them. The engine calculates the desired pitch from the notes and the attack and vibrato parameters, and then selects the necessary samples from the library.
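One way to picture this step is the sketch below. It rests on assumptions that are not taken from Yamaha's implementation: pitches are MIDI note numbers, vibrato is a sinusoidal offset in semitones, the recording closest in pitch is chosen, and the conversion is an equal-temperament frequency ratio. All function names are hypothetical.

```python
import math


def desired_pitch(note, time_s, vibrato_depth, vibrato_rate_hz):
    """Target pitch in semitones: the note plus a sinusoidal vibrato offset."""
    return note + vibrato_depth * math.sin(2.0 * math.pi * vibrato_rate_hz * time_s)


def nearest_recorded_pitch(target, recorded_pitches):
    """Pick the recorded pitch closest to the target."""
    return min(recorded_pitches, key=lambda pitch: abs(pitch - target))


def pitch_ratio(source, target):
    """Frequency ratio needed to move a sample from source to target pitch."""
    return 2.0 ** ((target - source) / 12.0)


recorded = [55, 62, 69]                   # assumed pitches stored in the library
target = desired_pitch(64, 0.0, 0.5, 6.0)
source = nearest_recorded_pitch(target, recorded)
print(source, pitch_ratio(source, target))
```

Choosing the closest recording keeps the ratio near 1, which is one plausible reason for storing several pitch ranges in the library.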

Timbre manipulation

The engine smooths the timbre around the junction of the samples. The timbre of a sustained vowel is generated by interpolating the spectral envelopes of the surrounding samples. For example, when concatenating the sequence of diphones "s-e, e, e-t" for the English word "set", the spectral envelope of the sustained ē at each frame is generated by interpolating between ē at the end of "s-e" and ē at the beginning of "e-t".
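The interpolation for the "set" example can be sketched as below. Linear interpolation is an assumption made for illustration, since the text does not say which scheme the engine uses, and the three-bin envelopes and function names are hypothetical.

```python
def interpolate_envelope(env_a, env_b, t):
    """Linearly blend two spectral envelopes; t=0 gives env_a, t=1 gives env_b."""
    if len(env_a) != len(env_b):
        raise ValueError("envelopes must have the same number of bins")
    return [(1.0 - t) * a + t * b for a, b in zip(env_a, env_b)]


def sustained_vowel_frames(env_a, env_b, n_frames):
    """Envelopes for each frame of the sustained vowel between two diphones."""
    if n_frames == 1:
        return [interpolate_envelope(env_a, env_b, 0.0)]
    return [
        interpolate_envelope(env_a, env_b, i / (n_frames - 1))
        for i in range(n_frames)
    ]


end_of_s_e = [1.0, 0.5, 0.0]    # made-up envelope at the end of "s-e"
start_of_e_t = [0.0, 0.5, 1.0]  # made-up envelope at the start of "e-t"
for frame in sustained_vowel_frames(end_of_s_e, start_of_e_t, 3):
    print(frame)
```

The first and last frames match the neighboring diphones, so the vowel's timbre changes gradually instead of jumping at the junctions.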

Transforms

After pitch conversion and timbre manipulation, the engine applies transforms such as the inverse fast Fourier transform (IFFT) to output the synthesized voices.
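What the inverse transform does can be shown with a small sketch. It uses a naive inverse discrete Fourier transform, which yields the same result as an IFFT but is slower, and it is not a description of the engine's actual code. The function name `inverse_dft` is an illustrative choice.

```python
import cmath


def inverse_dft(spectrum):
    """Naive inverse DFT: turn frequency-domain bins into time-domain samples.

    An IFFT produces the same values with far fewer operations.
    """
    n = len(spectrum)
    return [
        sum(
            spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n)
            for k in range(n)
        ) / n
        for t in range(n)
    ]


# A spectrum with energy only in bin 0 corresponds to a constant signal.
samples = inverse_dft([4, 0, 0, 0])
print([round(s.real, 6) for s in samples])  # [1.0, 1.0, 1.0, 1.0]
```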

Derivative products

Software

Vocaloid-flex

Yamaha developed Vocaloid-flex, a singing software application that is based on the Vocaloid engine and contains a speech synthesizer. According to the official announcement, users can edit its phonological system more finely than in other Vocaloid products to get closer to actual spoken language; for example, it supports final devoicing, devoicing of vowel sounds, and weakening or strengthening of consonant sounds.^[33] It was used in the video game Metal Gear Solid: Peace Walker, released on April 28, 2010. It is still a corporate product, and a consumer version has not been announced.^[34] The software was also used for the robot model HRP-4C at CEATEC Japan 2009.^[35] Gachapoid has access to this engine, which it uses through the software V-Talk.

Saturday, 24 September 2011