The software is intended for professional musicians as well as casual computer music users, and has so far been sold on the idea that the only limits are the users' own skills. The Japanese musical groups Livetune of Victor Entertainment and Supercell of Sony Music Entertainment Japan have released songs featuring Vocaloid vocals. The Japanese record label Exit Tunes of Quake Inc. has also released compilation albums featuring Vocaloids. Artists such as Mike Oldfield have also used Vocaloids in their work for backing vocals and sound samples.
Technology
The Vocaloid singing synthesizer technology is categorized as concatenative synthesis: it splices and processes, in the frequency domain, vocal fragments extracted from human singing voices. In singing synthesis, the system produces realistic voices by adding vocal expression information, such as vibrato, to the score information. The Vocaloid synthesis technology was initially called "Frequency-domain Singing Articulation Splicing and Shaping" (周波数ドメイン歌唱アーティキュレーション接続法 Shūhasū-domain Kashō Articulation Setsuzoku-hō), although Yamaha no longer uses this name on its websites. "Singing Articulation" is explained as the "vocal expressions", such as vibrato, and the vocal fragments necessary for singing. The Vocaloid and Vocaloid 2 synthesis engines are designed for singing, not for reading text aloud. Nor can they naturally replicate singing expressions such as hoarse voices or shouts.
System architecture
The main parts of the Vocaloid 2 system are the Score Editor (Vocaloid 2 Editor), the Singer Library, and the Synthesis Engine. The Synthesis Engine receives score information from the Score Editor, selects appropriate samples from the Singer Library, and concatenates them to output synthesized voices. The Score Editor and the Synthesis Engine provided by Yamaha are essentially the same across different Vocaloid 2 products. If a Vocaloid 2 product is already installed, the user can enable another Vocaloid 2 product by adding its library. The system supports two languages, Japanese and English, although other languages may become available as options in the future. It works standalone (playback and export to WAV) and as a ReWire application or a VSTi accessible from a DAW.
Score Editor
The Score Editor is a piano-roll-style editor for entering notes, lyrics, and some expressions. For a Japanese Singer Library, the user can input gojūon lyrics in hiragana, katakana, or romaji. For an English library, the Editor automatically converts the lyrics into IPA phonetic symbols using the built-in pronunciation dictionary. The user can directly edit the phonetic symbols of unregistered words. A Japanese library and an English library differ in the lyrics input method but share the same platform, so the Japanese editor can load an English library and vice versa. Because the lyrics input method is library-dependent, the Japanese and English editors differ only in their menus. The Score Editor offers various parameters for adding expression to singing voices. The user is expected to tune these parameters to best fit the synthesized tune when creating voices. The editor supports ReWire and can be synchronized with a DAW. Real-time "playback" of songs with predefined lyrics using a MIDI keyboard is also supported.
Singer Library
Each Vocaloid licensee develops its own Singer Library, a database of vocal fragments sampled from real people. The database must contain all possible combinations of phonemes of the target language, including diphones (a chain of two different phonemes) and sustained vowels, as well as polyphones with more than two phonemes if necessary. For example, the voice corresponding to the word "sing" ([sIN]) can be synthesized by concatenating the sequence of diphones "#-s, s-I, I-N, N-#" (# indicating silence) with the sustained vowel ī. The Vocaloid system changes the pitch of these fragments to fit the melody. To obtain more natural sounds, samples in three or four different pitch ranges must be stored in the library. Japanese requires 500 diphones per pitch, whereas English requires 2,500. Japanese has fewer diphones because it has fewer phonemes and most of its syllables are open syllables ending in a vowel. In Japanese, there are basically three patterns of diphones containing a consonant: silence-consonant, vowel-consonant, and consonant-vowel. English, on the other hand, has many closed syllables ending in a consonant, and so also has consonant-consonant and consonant-silence diphones. Thus, more diphones need to be recorded for an English library than for a Japanese one. Because of this linguistic difference, a Japanese library is not suitable for singing in English.
Synthesis Engine
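The engine works on fragment sequences like the one described for the Singer Library. As a rough illustration only, the following Python sketch expands a phoneme string into diphones plus sustained vowels, using `#` as the boundary marker from the "sing" example. The function name `to_fragments`, the small vowel set, and the rule of inserting a sustained vowel after every diphone that ends in a vowel are assumptions made for this example, not a description of Yamaha's actual implementation.

```python
# Hypothetical sketch: expand phonemes into the fragments a diphone
# library would need. Names and the vowel set are illustrative only.
BOUNDARY = "#"                       # boundary marker used in the "sing" example
VOWELS = {"a", "e", "i", "o", "u", "I"}  # assumed, deliberately tiny vowel set


def to_fragments(phonemes, vowels=VOWELS):
    """Return diphones, with a sustained vowel after each diphone ending in a vowel."""
    padded = [BOUNDARY, *phonemes, BOUNDARY]
    fragments = []
    for left, right in zip(padded, padded[1:]):
        fragments.append(f"{left}-{right}")   # transition between two phonemes
        if right in vowels:
            fragments.append(right)           # sustained part of the vowel
    return fragments


print(to_fragments(["s", "I", "N"]))
# ['#-s', 's-I', 'I', 'I-N', 'N-#']
```

The same function applied to `["s", "e", "t"]` yields the sub-sequence "s-e, e, e-t" used in the timbre example for the word "set".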
The Synthesis Engine receives score information from the Score Editor in dedicated MIDI messages called Vocaloid MIDI, adjusts the pitch and timbre of the selected samples in the frequency domain, and splices them to synthesize singing voices. When Vocaloid runs as a VSTi accessible from a DAW, the bundled VST plug-in bypasses the Score Editor and sends these messages directly to the Synthesis Engine.
Timing adjustment
In singing, the consonant onset of a syllable is uttered before the vowel onset. The starting position of a note, called "Note-On", must coincide with the vowel onset, not with the start of the syllable. Vocaloid keeps the "synthesized score" in memory so that it can adjust sample timing and place the vowel onset exactly on the "Note-On" position. Without this adjustment, the vowel would sound late.
Pitch conversion
Since the samples are recorded at different pitches, pitch conversion is required when concatenating them. The engine calculates the desired pitch from the notes and from the attack and vibrato parameters, and then selects the necessary samples from the library.
Timbre manipulation
The engine smooths the timbre around the junctions between samples. The timbre of a sustained vowel is generated by interpolating the spectral envelopes of the surrounding samples. For example, when concatenating the diphone sequence "s-e, e, e-t" for the English word "set", the spectral envelope of the sustained ē at each frame is generated by interpolating between the ē at the end of "s-e" and the ē at the beginning of "e-t".
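The three engine steps described in this section (timing adjustment, pitch conversion, and timbre interpolation) can be sketched together in a few lines of Python. This is a simplified illustration under stated assumptions, not Yamaha's algorithm: all names are hypothetical, pitches are treated as MIDI note numbers in equal temperament with A4 = 440 Hz, the recording closest in pitch is the one selected, and the interpolation is taken to be linear, none of which the description above specifies.

```python
from dataclasses import dataclass


@dataclass
class Syllable:
    note_on: float        # Note-On position in the score, in seconds
    consonant_len: float  # duration of the consonant before the vowel onset


def sample_start(syl: Syllable) -> float:
    """Timing adjustment: start early so the vowel onset lands on Note-On."""
    return syl.note_on - syl.consonant_len


def midi_to_hz(note: float) -> float:
    """Equal-temperament frequency for a MIDI note number (assumes A4 = 440 Hz)."""
    return 440.0 * 2.0 ** ((note - 69) / 12.0)


def pick_recording(recorded_notes, target_note):
    """Assumed selection rule: use the recording closest in pitch to the target."""
    return min(recorded_notes, key=lambda n: abs(n - target_note))


def pitch_ratio(recorded_note, target_note, vibrato_cents=0.0):
    """Pitch conversion: factor by which the recorded pitch must be scaled."""
    target_hz = midi_to_hz(target_note) * 2.0 ** (vibrato_cents / 1200.0)
    return target_hz / midi_to_hz(recorded_note)


def interpolate_envelope(env_a, env_b, t):
    """Timbre: linear blend of two spectral envelopes, with t running from 0 to 1."""
    return [(1.0 - t) * a + t * b for a, b in zip(env_a, env_b)]


# Usage: a syllable whose consonant lasts 80 ms, sung on MIDI note 64.
syl = Syllable(note_on=1.0, consonant_len=0.08)
source = pick_recording([48, 60, 72], 64)           # picks 60
print(sample_start(syl), source, pitch_ratio(source, 64))
print(interpolate_envelope([0.0, 1.0], [1.0, 3.0], 0.5))  # [0.5, 2.0]
```

In the "set" example, `env_a` would stand for the envelope of ē at the end of "s-e" and `env_b` for the envelope of ē at the beginning of "e-t", with `t` advancing frame by frame across the sustained vowel.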