评估指标示例:正确性(由AI判断)
中级
这是一个Engineering、AI领域的自动化工作流,包含 13 个节点。主要使用 Set、Evaluation、Agent、OpenAi、EvaluationTrigger 等节点,结合人工智能技术实现智能自动化。 评估指标示例:正确性(由AI判断)
前置要求
- •OpenAI API Key
工作流预览
可视化展示节点连接关系,支持缩放和平移
导出工作流
复制以下 JSON 配置到 n8n 导入,即可使用此工作流
{
"meta": {
"instanceId": "bf40384a063e00f3b983f4f9bada22b57a8231a04c0fb48d363e26d7b0f2b7e7",
"templateCredsSetupCompleted": true
},
"nodes": [
{
"id": "4e2acf3b-3629-4719-b6dd-80e0efdd1cad",
"name": "便签1",
"type": "n8n-nodes-base.stickyNote",
"position": [
200,
20
],
"parameters": {
"color": 7,
"width": 300,
"height": 180,
"content": "检查答案是否与预期答案具有相同含义"
},
"typeVersion": 1
},
{
"id": "08f2b16f-766f-4d80-8a16-7b41ce4da472",
"name": "便签3",
"type": "n8n-nodes-base.stickyNote",
"position": [
-1200,
40
],
"parameters": {
"width": 200,
"height": 500,
"content": "## 工作原理"
},
"typeVersion": 1
},
{
"id": "e8674263-6cb6-49dc-9b93-3ce167b35608",
"name": "便签4",
"type": "n8n-nodes-base.stickyNote",
"position": [
-960,
280
],
"parameters": {
"color": 7,
"width": 220,
"height": 220,
"content": "读取[此测试数据集](https://docs.google.com/spreadsheets/d/1uuPS5cHtSNZ6HNLOi75A2m8nVWZrdBZ_Ivf58osDAS8/edit?gid=662663849#gid=662663849)中的问题"
},
"typeVersion": 1
},
{
"id": "edcd9964-51a1-49bd-8a9e-ebc9b4d0e963",
"name": "OpenAI 聊天模型",
"type": "@n8n/n8n-nodes-langchain.lmChatOpenAi",
"position": [
-440,
420
],
"parameters": {
"model": {
"__rl": true,
"mode": "list",
"value": "gpt-4o-mini"
},
"options": {}
},
"credentials": {
"openAiApi": {
"id": "Ag9qPAsY7lpIGkvC",
"name": "JPs n8n openAI key"
}
},
"typeVersion": 1.2
},
{
"id": "f5b9f75a-9a9c-48cf-93a6-16407c730340",
"name": "当获取数据集行时",
"type": "n8n-nodes-base.evaluationTrigger",
"position": [
-900,
340
],
"parameters": {
"sheetName": {
"__rl": true,
"mode": "url",
"value": "https://docs.google.com/spreadsheets/d/1uuPS5cHtSNZ6HNLOi75A2m8nVWZrdBZ_Ivf58osDAS8/edit?gid=662663849#gid=662663849"
},
"documentId": {
"__rl": true,
"mode": "url",
"value": "https://docs.google.com/spreadsheets/d/1uuPS5cHtSNZ6HNLOi75A2m8nVWZrdBZ_Ivf58osDAS8/edit?gid=662663849#gid=662663849"
}
},
"credentials": {
"googleSheetsOAuth2Api": {
"id": "bpr2LoSELMlxpwnN",
"name": "Google Sheets account David"
}
},
"typeVersion": 4.6
},
{
"id": "411fb522-c5d4-4c24-ba0f-cb830e1b63c4",
"name": "正在评估?",
"type": "n8n-nodes-base.evaluation",
"position": [
-60,
200
],
"parameters": {
"operation": "checkIfEvaluating"
},
"typeVersion": 4.6
},
{
"id": "01b7bd96-00e5-4618-9797-8477b41ad78b",
"name": "AI 代理",
"type": "@n8n/n8n-nodes-langchain.agent",
"position": [
-440,
200
],
"parameters": {
"text": "={{ $json.chatInput }}",
"options": {
"systemMessage": "You are a helpful assistant. Answer the user's questions, but be very concise (max one sentence)"
},
"promptType": "define"
},
"typeVersion": 1.9
},
{
"id": "886ee0aa-db8a-4b64-a9d6-ac4fc865a36b",
"name": "计算正确性指标",
"type": "@n8n/n8n-nodes-langchain.openAi",
"position": [
220,
80
],
"parameters": {
"modelId": {
"__rl": true,
"mode": "list",
"value": "gpt-4o-mini",
"cachedResultName": "GPT-4O-MINI"
},
"options": {},
"messages": {
"values": [
{
"role": "system",
"content": "=You are an expert factual evaluator assessing the accuracy of answers compared to established ground truths.\n\nEvaluate the factual correctness of a given output compared to the provided ground truth on a scale from 1 to 5. Use detailed reasoning to thoroughly analyze all claims before determining the final score.\n\n# Scoring Criteria\n\n- 5: Highly similar - The output and ground truth are nearly identical, with only minor, insignificant differences.\n- 4: Somewhat similar - The output is largely similar to the ground truth but has few noticeable differences.\n- 3: Moderately similar - There are some evident differences, but the core essence is captured in the output.\n- 2: Slightly similar - The output only captures a few elements of the ground truth and contains several differences.\n- 1: Not similar - The output is significantly different from the ground truth, with few or no matching elements.\n\n# Evaluation Steps\n\n1. Identify and list the key elements present in both the output and the ground truth.\n2. Compare these key elements to evaluate their similarities and differences, considering both content and structure.\n3. Analyze the semantic meaning conveyed by both the output and the ground truth, noting any significant deviations.\n4. Consider factual accuracy of specific details, including names, dates, numbers, and relationships.\n5. Assess whether the output maintains the factual integrity of the ground truth, even if phrased differently.\n6. Determine the overall level of similarity and accuracy according to the defined criteria.\n\n# Output Format\n\nProvide:\n- A detailed analysis of the comparison (extended reasoning)\n- A one-sentence summary highlighting key differences (not similarities)\n- The final similarity score as an integer (1, 2, 3, 4, or 5)\n\nAlways follow the JSON format below and return nothing else:\n{\n \"extended_reasoning\": \"<detailed step-by-step analysis of factual accuracy and similarity>\",\n \"reasoning_summary\": \"<one sentence summary focusing on key differences>\",\n \"score\": <number: integer from 1 to 5>\n}\n\n# Examples\n\n**Example 1:**\n\nInput:\n- Output: \"The cat sat on the mat.\"\n- Ground Truth: \"The feline is sitting on the rug.\"\n\nExpected Output:\n{\n \"extended_reasoning\": \"I need to compare 'The cat sat on the mat' with 'The feline is sitting on the rug.' First, let me identify the key elements: both describe an animal ('cat' vs 'feline') in a position ('sat' vs 'sitting') on a surface ('mat' vs 'rug'). The subject is semantically identical - 'cat' and 'feline' refer to the same animal. The action is also semantically equivalent - 'sat' and 'sitting' both describe the same position, though one is past tense and one is present continuous. The location differs in specific wording ('mat' vs 'rug') but both refer to floor coverings that serve the same function. The basic structure and meaning of both sentences are preserved, though they use different vocabulary and slightly different tense. The core information being conveyed is the same, but there are noticeable wording differences.\",\n \"reasoning_summary\": \"The sentences differ in vocabulary choice ('cat' vs 'feline', 'mat' vs 'rug') and verb tense ('sat' vs 'is sitting').\",\n \"score\": 3\n}\n\n**Example 2:**\n\nInput:\n- Output: \"The quick brown fox jumps over the lazy dog.\"\n- Ground Truth: \"A fast brown animal leaps over a sleeping canine.\"\n\nExpected Output:\n{\n \"extended_reasoning\": \"I need to compare 'The quick brown fox jumps over the lazy dog' with 'A fast brown animal leaps over a sleeping canine.' Starting with the subjects: 'quick brown fox' vs 'fast brown animal'. Both describe the same entity (a fox is a type of animal) with the same attributes (quick/fast and brown). The action is described as 'jumps' vs 'leaps', which are synonymous verbs describing the same motion. The object in both sentences is a dog, described as 'lazy' in one and 'sleeping' in the other, which are related concepts (a sleeping dog could be perceived as lazy). The structure follows the same pattern: subject + action + over + object. The sentences convey the same scene with slightly different word choices that maintain the core meaning. The level of specificity differs slightly ('fox' vs 'animal', 'dog' vs 'canine'), but the underlying information and imagery remain very similar.\",\n \"reasoning_summary\": \"The sentences use different but synonymous terminology ('quick' vs 'fast', 'jumps' vs 'leaps', 'lazy' vs 'sleeping') and varying levels of specificity ('fox' vs 'animal', 'dog' vs 'canine').\",\n \"score\": 4\n}\n\n# Notes\n\n- Focus primarily on factual accuracy and semantic similarity, not writing style or phrasing differences.\n- Identify specific differences rather than making general assessments.\n- Pay special attention to dates, numbers, names, locations, and causal relationships when present.\n- Consider the significance of each difference in the context of the overall information.\n- Be consistent in your scoring approach across different evaluations."
},
{
"content": "=Output: {{ $json.output }}\n\nGround truth: {{ $('When fetching a dataset row').item.json.reference_answer }}"
}
]
},
"jsonOutput": true
},
"credentials": {
"openAiApi": {
"id": "Ag9qPAsY7lpIGkvC",
"name": "JPs n8n openAI key"
}
},
"typeVersion": 1.8
},
{
"id": "6157d456-aa3c-4cca-9d9e-9f5fd19eae68",
"name": "当收到聊天消息时",
"type": "@n8n/n8n-nodes-langchain.chatTrigger",
"position": [
-900,
100
],
"webhookId": "aa00c171-d603-4373-90c2-f2c2b97e2273",
"parameters": {
"options": {}
},
"typeVersion": 1.1
},
{
"id": "75aec6a1-376a-489e-940c-4868e8d8bcbb",
"name": "匹配聊天格式",
"type": "n8n-nodes-base.set",
"position": [
-680,
340
],
"parameters": {
"options": {},
"assignments": {
"assignments": [
{
"id": "93f89095-7918-45ad-aa74-a0bbcf0d5788",
"name": "chatInput",
"type": "string",
"value": "={{ $json.question }}"
}
]
}
},
"typeVersion": 3.4
},
{
"id": "04548ab1-8644-47d3-9652-4552d798853a",
"name": "便签",
"type": "n8n-nodes-base.stickyNote",
"position": [
-80,
100
],
"parameters": {
"color": 7,
"width": 150,
"height": 260,
"content": "仅在评估时计算指标以降低成本"
},
"typeVersion": 1
},
{
"id": "792ccfd0-387a-46bc-b68b-948fcd2098dd",
"name": "返回聊天响应",
"type": "n8n-nodes-base.noOp",
"position": [
220,
340
],
"parameters": {},
"typeVersion": 1
},
{
"id": "1bb9466a-439a-41ff-a425-5550127786d4",
"name": "设置指标",
"type": "n8n-nodes-base.evaluation",
"position": [
580,
80
],
"parameters": {
"metrics": {
"assignments": [
{
"id": "230589eb-34c8-4d10-9296-4a78d673077a",
"name": "similarity",
"type": "number",
"value": "={{ $json.message.content.score }}"
}
]
},
"operation": "setMetrics"
},
"typeVersion": 4.6
}
],
"pinData": {},
"connections": {
"AI Agent": {
"main": [
[
{
"node": "Evaluating?",
"type": "main",
"index": 0
}
]
]
},
"Evaluating?": {
"main": [
[
{
"node": "Calculate correctness metric",
"type": "main",
"index": 0
}
],
[
{
"node": "Return chat response",
"type": "main",
"index": 0
}
]
]
},
"Match chat format": {
"main": [
[
{
"node": "AI Agent",
"type": "main",
"index": 0
}
]
]
},
"OpenAI Chat Model": {
"ai_languageModel": [
[
{
"node": "AI Agent",
"type": "ai_languageModel",
"index": 0
}
]
]
},
"When chat message received": {
"main": [
[
{
"node": "AI Agent",
"type": "main",
"index": 0
}
]
]
},
"When fetching a dataset row": {
"main": [
[
{
"node": "Match chat format",
"type": "main",
"index": 0
}
]
]
},
"Calculate correctness metric": {
"main": [
[
{
"node": "Set metrics",
"type": "main",
"index": 0
}
]
]
}
}
}常见问题
如何使用这个工作流?
复制上方的 JSON 配置代码,在您的 n8n 实例中创建新工作流并选择「从 JSON 导入」,粘贴配置后根据需要修改凭证设置即可。
这个工作流适合什么场景?
这是一个中级难度的工作流,适用于Engineering、AI等场景。适合有一定经验的用户,包含 6-15 个节点的中等复杂度工作流
需要付费吗?
本工作流完全免费,您可以直接导入使用。但请注意,工作流中使用的第三方服务(如 OpenAI API)可能需要您自行付费。
相关工作流推荐
评估指标示例:RAG文档相关性
评估指标示例:RAG文档相关性
Set
Evaluation
Google Sheets
+13
26 节点David Roberts
Engineering
评估指标示例:检查工具是否被调用
评估指标示例:检查工具是否被调用
Set
Evaluation
Agent
+7
15 节点David Roberts
Engineering
评估指标示例:分类
评估指标示例:分类
Set
Webhook
Evaluation
+6
13 节点David Roberts
Engineering
使用OpenAI和RAGAS方法评估AI代理响应正确性
使用OpenAI和RAGAS方法评估AI代理响应正确性
Set
Code
Merge
+12
27 节点Jimleuk
Engineering
使用OpenAI和余弦相似度评估AI代理响应相关性
使用OpenAI和余弦相似度评估AI代理响应相关性
Set
Code
Evaluation
+9
20 节点Jimleuk
Engineering
使用OpenAI评估RAG响应准确性:文档基础性指标
使用OpenAI评估RAG响应准确性:文档基础性指标
Set
Evaluation
Http Request
+13
25 节点Jimleuk
Engineering