WeijianQi1999 committed on
Commit 6820b1a · 1 Parent(s): 6eb0a96

update content.py
Files changed (1): content.py (+4 -0)
content.py CHANGED
@@ -19,11 +19,15 @@ Our goal is to conduct a rigorous assessment of the current state of web agents.
 
 When using our benchmark or submitting results, please first carefully review the important notes to ensure proper usage and obtain reliable evaluation results and follow the "Submission Guideline".
 
+**We usually need about one week to review the results. If your results require urgent verification, please let us know in advance. Thank you for your understanding.**
+
 ### ⚠ Important Notes for Reliable Evaluation:
 - **Start from the specified websites, not Google Search**: To enable fair comparisons, please ensure that each task starts from the specified website in our benchmark. Starting from Google Search or alternative websites can lead agents to use different websites to solve the task, resulting in varying difficulty levels and potentially skewed evaluation results.
 - **Include only factual actions, not agent outputs**: The action history should contain only the factual actions taken by the agent to complete the task (e.g., Clicking elements and Typing text). Do not include the final response or any other agent's outputs, as they may contain hallucinated content and result in a high rate of false positives.
 - **Use o4-mini for WebJudge**: WebJudge powered by o4-mini demonstrates a higher alignment with human judgment, achieving an average agreement rate of 85.7% and maintaining a narrow success rate gap of just 3.8%. Therefore, please use o4-mini as the backbone for automatic evaluation.
 
+To obtain more reliable automatic evaluation results, the action representation should be as detailed as possible, including only factual actions and excluding any agent outputs. Here is an example [script](https://github.com/OSU-NLP-Group/Online-Mind2Web/blob/main/src/clean_html.py) to process the element's HTML as the action representation. It can preserve valuable information while filtering out irrelevant attributes.
+
 **Please do not use it as training data for your agent.**
 """
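
The added note links to a `clean_html.py` script that keeps informative parts of an element's HTML while dropping noise. Below is a minimal stdlib sketch of that general idea — a whitelist of attributes worth keeping for element identification. The whitelist and helper names here are illustrative assumptions, not the actual contents of the linked script.

```python
from html.parser import HTMLParser

# Hypothetical whitelist: attributes that tend to identify an element.
# The real clean_html.py may use a different set.
KEEP_ATTRS = {"id", "class", "name", "type", "value",
              "aria-label", "placeholder", "title", "href", "alt"}

class AttributeFilter(HTMLParser):
    """Rebuilds an HTML snippet, keeping only whitelisted attributes."""

    def __init__(self):
        super().__init__()
        self.out = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; drop non-whitelisted ones.
        kept = [f'{k}="{v}"' for k, v in attrs if k in KEEP_ATTRS and v]
        self.out.append(f"<{tag}{' ' + ' '.join(kept) if kept else ''}>")

    def handle_endtag(self, tag):
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        # Keep visible text, trimmed of layout whitespace.
        if data.strip():
            self.out.append(data.strip())

def clean_element_html(html: str) -> str:
    """Return the snippet with framework/debris attributes removed."""
    parser = AttributeFilter()
    parser.feed(html)
    return "".join(parser.out)

print(clean_element_html(
    '<button data-v-123="x" class="btn primary" onclick="go()">Search</button>'
))
# <button class="btn primary">Search</button>
```

A cleaned representation like this keeps what a judge needs (tag, class, visible label) while discarding framework attributes and event handlers that add noise to the action history.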