Overview #
This interface connects a UE client to the AnimaCore server and provides two-way communication between the client and the server. The client can send text, audio, and other messages, and the server returns the corresponding facial animation curve data for a Metahuman character.
You can use this interface to build a complete native client or embed it into your own workflow. For example, if you have already implemented LLM interaction or voice interaction, you only need to call the specific functions or modes you need.
You cannot obtain animations, live animations, or animation sequences that are directly playable in the UE engine through this API; the server only returns the raw animation, audio, and text data. If you need playable animations directly, you can use our UE plugin MHC Talker.
💡 This interface is intended for individuals or teams with development experience, and it requires a fairly in-depth understanding of the UE engine.
💡 This project is still under development and its logic and functionality are not yet complete. The API may change at any time; we will do our best to keep this document in sync.
If you want to be notified of API changes as soon as they happen, you can join our WeChat group.
Connection information #
The server and client communicate using the WebSocket protocol.
URL #
wss://yxintv.top:9876/ws/{session_id}
💡 session_id currently must be generated by the client as a random string of 12 characters; the same session_id is treated as the same session (the same WebSocket connection).
Query parameters #
| Parameter | Type | Value | Description |
| --- | --- | --- | --- |
| token | String | Authentication token, obtained from your account | |
| interact_type | String | 'chat': chat mode, the returned text is generated by the language model; 'read': read mode, the returned text is the user input itself | Defaults to 'chat'. Can be changed at any time in each subsequent round of dialog. |
| msg_type | String | 'text': text interaction; 'bytes': voice interaction | Defaults to 'text'. Can be changed at any time in each subsequent round of dialog. |
| ani | String | 'JESSE', 'SHIYAO', 'SHIYAO_V2', 'SHIYAO_EMO' | 'SHIYAO' includes head movement; 'JESSE' does not include head movement; 'SHIYAO_V2' is the original SHIYAO model retrained on a larger dataset; 'SHIYAO_EMO' is a beta version of the emotion model. |
| tts | String | 'AZURETTS': Microsoft speech synthesis service; 'COSYVOICE': AliCloud speech synthesis service; 'GPTSOVITS': server-side local speech synthesis | Defaults to 'GPTSOVITS'. The local TTS model is currently for testing only and is not suitable for commercial projects. Can be changed at any time in each subsequent round of dialog. |
| llm_model | String | 'qwen-turbo': Tongyi Qianwen large-model interface | Currently the only language model interface; more will be added later, and the default is sufficient. |
| role | String | A voice name from the Microsoft Azure or AliCloud DashScope speech synthesis service, e.g. 'zh-CN-XiaoxiaoNeural'; it must match the selected tts service. | When tts is 'AZURETTS' the default is 'zh-CN-XiaoxiaoNeural' (Microsoft "Xiaoxiao"); when tts is 'COSYVOICE' the default is 'longxiaochun' (AliCloud "Long Xiaochun"). Can be changed at any time in each subsequent round of dialog. For the full list of supported voices, see the Azure documentation (Azure AI services) and the DashScope documentation (AliCloud Help Center). |
| is_greeting | String | 'true', 'false' | Whether to return a welcome message when connecting to the server for the first time. |
| prmpt | String | The LLM system prompt, i.e. the character persona, base64 encoded | |
| emo | Int | -1: automatic; 0: dissatisfaction; 1: excitement; 2: calm | The facial emotion of the character animation. This field currently only takes effect with the 'SHIYAO_EMO' model. The default value is -1, in which case the server tries to infer the emotion category from the text and audio. |
| emo_s | Float | 0.0 - 1.0 | Emotional intensity; 1 is the highest. |
💡 Audio messages are currently only supported as 16 kHz, mono, 16-bit WAV.
Example: #
Connect to the server, setting the interaction mode to read mode and the message type to audio:
wss://yxintv.top:9876/ws/h6ssd23t34e8?token=123456&interact_type=read&msg_type=bytes
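As an illustration, here is a minimal connection sketch in Python using the third-party `websockets` package. The token value, query parameters, and helper names are placeholders of ours, not part of the API:

```python
import asyncio
import json
import random
import string

import websockets  # third-party package: pip install websockets


def make_session_id(length: int = 12) -> str:
    """Generate the 12-character random session_id the server expects."""
    return "".join(random.choices(string.ascii_lowercase + string.digits, k=length))


async def connect_and_greet(token: str) -> None:
    session_id = make_session_id()
    url = (
        f"wss://yxintv.top:9876/ws/{session_id}"
        f"?token={token}&interact_type=read&msg_type=bytes"
    )
    async with websockets.connect(url) as ws:
        # If is_greeting is enabled, the first server message is the welcome
        # message described in the next section.
        greeting = json.loads(await ws.recv())
        print(greeting.get("text"))


# asyncio.run(connect_and_greet("123456"))
```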
Receiving and sending messages #
Once the connection is established, the client receives a welcome message from the server ("Hello, ..."). See below for the full list of fields it contains.
After that, the client can send messages to the server.
Both client messages and server messages are in JSON format.
Client Message Sending #
Fields: #
| Field | Type | Value | Description |
| --- | --- | --- | --- |
| token | String | Authentication token | |
| message_id | String | Random string of 12 characters | Must be generated by the client. Messages with the same message_id are treated as the same round of dialog (one question, one answer). |
| interact_type | String | 'chat', 'read' | Overrides the corresponding value in the query parameters. |
| msg_type | String | 'text', 'bytes' | Overrides the corresponding value in the query parameters. |
| text | String | Text for interaction (when msg_type is 'text') | |
| bytes | String | Audio for interaction, base64 encoded (when msg_type is 'bytes') | |
| ani | String | 'JESSE', 'SHIYAO', 'SHIYAO_V2', 'SHIYAO_EMO' | 'SHIYAO' includes head movement; 'JESSE' does not include head movement; 'SHIYAO_V2' is the original SHIYAO model retrained on a larger dataset; 'SHIYAO_EMO' is a beta version of the emotion model. |
| tts | String | 'AZURETTS', 'GPTSOVITS', 'COSYVOICE' | Overrides the corresponding value in the query parameters. |
| role | String | | Overrides the corresponding value in the query parameters. |
| is_greeting | String | 'true', 'false' | Whether to return a welcome message when connecting to the server for the first time. |
| prmpt | String | The LLM system prompt, i.e. the character persona, base64 encoded | |
| emo | Int | -1: automatic; 0: dissatisfaction; 1: excitement; 2: calm | The facial emotion of the character animation. This field currently only takes effect with the 'SHIYAO_EMO' model. The default value is -1, in which case the server tries to infer the emotion category from the text and audio. |
| emo_s | Float | 0.0 - 1.0 | Emotional intensity; 1 is the highest. |
💡 Audio messages are currently only supported as 16 kHz, mono, 16-bit WAV.
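For 'bytes' messages, the audio therefore has to be 16 kHz, mono, 16-bit WAV before it is base64 encoded into the bytes field. Below is a minimal check-and-encode sketch; whether the server expects the full WAV file or only the PCM frames is not specified here, so this sketch simply sends the whole file:

```python
import base64
import wave


def encode_wav_for_message(path: str) -> str:
    """Check 16 kHz / mono / 16-bit and return base64 text for the 'bytes' field."""
    with wave.open(path, "rb") as wav:
        assert wav.getframerate() == 16000, "sample rate must be 16 kHz"
        assert wav.getnchannels() == 1, "audio must be mono"
        assert wav.getsampwidth() == 2, "samples must be 16-bit"
    with open(path, "rb") as f:
        # Assumption: the whole WAV file (header included) is base64 encoded.
        return base64.b64encode(f.read()).decode("ascii")
```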
Example: #
{
    "token": "123456",
    "message_id": "d8rt653gh8gh",
    "interact_type": "chat",
    "msg_type": "text",
    "text": "Now in text chat mode.",
    "tts": "AZURETTS",
    "role": "zh-CN-XiaoxiaoNeural"
}
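Sending a round of dialog is then just serializing these fields and pushing them over the socket. A minimal sketch, assuming an already-open `websockets` connection `ws` (the helper names are ours):

```python
import json
import random
import string


def make_message_id(length: int = 12) -> str:
    """Generate the 12-character message_id that identifies one round of dialog."""
    return "".join(random.choices(string.ascii_lowercase + string.digits, k=length))


async def send_text_round(ws, token: str, text: str) -> str:
    """Send one text-mode message like the example above and return its message_id."""
    message_id = make_message_id()
    payload = {
        "token": token,
        "message_id": message_id,
        "interact_type": "chat",
        "msg_type": "text",
        "text": text,
        "tts": "AZURETTS",
        "role": "zh-CN-XiaoxiaoNeural",
    }
    await ws.send(json.dumps(payload))
    return message_id
```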
Server-side message return #
Upon receiving a client request, the server first returns a JSON message indicating that the task has started processing. In low-latency environments this message can simply be ignored:
{
    "task_id": "<message_id>",
    "status": "pending"
}
The server then starts processing and sends results as they are generated. Currently, the server processes and returns results sentence by sentence.
Fields: #
| Field | Type | Value | Description |
| --- | --- | --- | --- |
| message_id | String | The message_id sent by the client | |
| text | String | Text content | |
| audio | String | Audio corresponding to the text content, base64 encoded | |
| animation | Json | Facial animation corresponding to the text and audio, in JSON format | |
| user_id | String | User ID | |
| emo | String | Emotion information; still in testing, the returned value currently has no specific meaning | |
| is_first | Bool | True, False | Whether this is the first sentence of the round of dialog. |
| is_end | Bool | True, False | Whether the round of dialog has ended. |
| duration | Float | The actual duration of the sentence's animation, at 30 frames per second | |
| X_time | Json | Server-side processing times | |
| status | String | 'pending', 'processing', 'done' | Message processing status. |
| remaining_time | Int | Remaining animation generation time for the user account, in seconds | |
| remaining_day | Int | Remaining validity of the user account, in days | |
💡 The audio currently returned by the server is 16 kHz, mono, 16-bit raw bytes with no format header (raw PCM). The client has to handle the audio format itself (see the sketch after these notes).
💡 Microsoft's East Asia servers can experience very high latency at peak times, which slows down the overall response. We are working on this and may add more TTS engines for testing in the near future.
💡 2024/08/08 Added AliCloud tts engine.
💡 2024/09/ Switched Microsoft TTS to domestic servers (previously Asian servers), drastically reducing inference latency.
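Since the returned audio is headerless PCM, a client that wants a playable file has to add the WAV header itself. A minimal sketch in Python (the function name and file handling are ours):

```python
import base64
import wave


def save_reply_audio(audio_b64: str, path: str) -> None:
    """Wrap the raw 16 kHz / mono / 16-bit PCM from the audio field into a WAV file."""
    pcm = base64.b64decode(audio_b64)
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)      # mono
        wav.setsampwidth(2)      # 16-bit samples
        wav.setframerate(16000)  # 16 kHz
        wav.writeframes(pcm)
```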
Animation Fields #
The data in the animation field is constructed to match the format of the Curve Table data type in the UE engine: each curve represents a Metahuman controller, and the curve's values represent the motion state of that controller. The attachment "UE_curve_table.json" can be converted directly into a Curve Table in the UE engine; you can download it and drag it into the UE editor.
UE's Curve Table requires each curve to have a "name" field identifying the curve. During transmission we replace the "name" field with an "ID" field, using an int to distinguish curves, and re-map "ID" back to "name" on the client. This was originally done to reduce the amount of transmitted data, because curves often have long names. However, as the project has progressed, the reduction in data size from using "ID" has become almost negligible, and we plan to change this in the future so that the server transmits curve names directly and no client-side mapping is needed.
For now, the client still needs to map "ID" to "name"; you can download the key-value pair JSON file here:
This JSON file contains the names and corresponding IDs of all 276 curves used to control head motion in Metahuman.
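As a sketch of the client-side mapping, assume the key-value file has been saved locally as curve_ids.json (the file name, and the assumption that it maps curve names to IDs, are ours):

```python
import json


def load_id_to_name(path: str = "curve_ids.json") -> dict:
    """Build an ID -> curve-name lookup from the downloaded key-value file.

    Assumption: the file maps curve names to integer IDs ({"CTRL_...": 0, ...});
    if it is the other way round, drop the inversion below.
    """
    with open(path, "r", encoding="utf-8") as f:
        name_to_id = json.load(f)
    return {curve_id: name for name, curve_id in name_to_id.items()}


def parse_animation(animation: dict, id_to_name: dict) -> dict:
    """Turn the AnimArray payload into {curve_name: [(time, value), ...]} keyframes."""
    curves = {}
    for entry in animation["AnimArray"]:
        name = id_to_name[entry["ID"]]
        # Every key other than "ID" is a timestamp in seconds mapped to the curve value.
        keys = [(float(t), v) for t, v in entry.items() if t != "ID"]
        curves[name] = sorted(keys)
    return curves
```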
Example: #
A complete example of the server-side return JSON is also available; the spoken text in the example is "Hello! How can I help you?".
Below is a simplified version of the server return JSON, with the audio and animation fields truncated:
{
    "text": "\u4f60\u597d\uff01\u6709\u4ec0\u4e48\u6211\u53ef\u4ee5\u5e2e\u52a9\u4f60\u7684\u5417\uff1f",
    "audio": "AQAAAAAAAAA....",
    "animation": {
        "AnimArray": [
            {
                "ID": 0,
                "0.0": 0.0,
                "0.0333": 0.0
            },
            {
                "ID": 1,
                "0.0": 0.0,
                "0.0333": 0.0
            }
        ]
    },
    "user_id": "Onlooker",
    "message_id": "123456aaa",
    "emo": [],
    "is_end": false,
    "is_first": true,
    "status": "processing",
    "remaining_time": 192273,
    "x_time": {
        "llm_infer": 0.5445709228515625,
        "asr_process": 0,
        "vc_process": null,
        "tts_process": 1.1153051853179932,
        "ani_process": 0.2299818992614746,
        "all_time": 1.6371128559112549
    }
}
The server returns the processing results sentence by sentence until the end of the round of dialog.
Conclusion of the dialogue #
At the end of a round of dialog, the server sends one additional JSON message that transitions the final animation state back to the default state. It has the same field format as the example above, with the status field set to 'done' and the is_end field set to True.
In this message, the text and audio fields are empty.
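Putting the pieces together, a minimal receive loop for one round of dialog might look like the sketch below. It reuses the hypothetical helpers from the earlier sketches (e.g. save_reply_audio) and stops when the closing message arrives:

```python
import json


async def receive_round(ws, message_id: str) -> None:
    """Consume server messages for one round until the 'done' / is_end message arrives."""
    sentence = 0
    while True:
        msg = json.loads(await ws.recv())
        if msg.get("status") == "pending":
            continue  # task acknowledgement, nothing to render yet
        if msg.get("message_id") != message_id:
            continue  # belongs to a different round
        if msg.get("text"):
            print(msg["text"])
        if msg.get("audio"):
            sentence += 1
            save_reply_audio(msg["audio"], f"reply_{sentence}.wav")  # sketch from above
        # msg.get("animation") holds the AnimArray curves for this sentence.
        if msg.get("is_end") or msg.get("status") == "done":
            break  # final message that transitions the animation back to default
```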
Update Log: #
2025/03/31
New model fields have been added
New query parameters have been added
New request fields have been added
Updated the Microsoft Azure and AliCloud Bailian documentation links
Updated some descriptions
2024/08/08
Added description of end of dialog json
Added description of new tts field
Added description of new role field
Added description of the new animation model
2024/08/05
Initial Document Submission