RAG System Evaluation on RGB Dataset
Welcome to the RAG System Evaluation on the RGB Dataset! This tool evaluates and compares the performance of various Large Language Models (LLMs) using Retrieval-Augmented Generation (RAG) on the RGB benchmark. The evaluation focuses on four key metrics: Noise Robustness, Negative Rejection, Counterfactual Robustness, and Information Integration. Together these assess how well a model answers when the retrieved documents are noisy, declines to answer when the documents do not contain the answer, detects and corrects factual errors in the retrieved context, and integrates information spread across multiple documents.
Key Features:
- Compare Multiple LLMs: Evaluate and compare the performance of different LLMs side by side.
- Pre-calculated Metrics: View results from pre-computed evaluations for quick insights.
- Recalculate Metrics: Option to recalculate metrics for custom configurations.
- Interactive Controls: Adjust model parameters, noise rates, and query counts to explore model behavior under different conditions (see the configuration sketch after this list).
- Detailed Reports: Visualize results in clear, interactive tables for each evaluation metric.
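As a rough illustration, the interactive controls above can be thought of as a small configuration object. The field names below (`model_name`, `noise_rate`, `num_queries`, `use_precalculated`) are assumptions for the sketch, not the tool's actual API:

```python
# Hypothetical evaluation configuration -- field names are illustrative only.
from dataclasses import dataclass

@dataclass
class EvalConfig:
    model_name: str = "deepseek-r1-distill-llama-70b"  # LLM to evaluate
    noise_rate: float = 0.6         # fraction of retrieved documents that are noise
    num_queries: int = 50           # number of RGB queries to run per metric
    use_precalculated: bool = True  # load saved metrics instead of re-running

# Example: evaluate qwen-2.5-32b on 50 queries with 80% noise.
config = EvalConfig(model_name="qwen-2.5-32b", noise_rate=0.8, num_queries=50)
```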
How to Use:
- Select a Model: Choose from the available LLMs to evaluate.
- Configure Model Settings: Adjust the noise rate and set the number of queries.
- Choose Evaluation Mode: Use pre-calculated values for quick results or recalculate metrics for custom analysis.
- Compare Results: Review and compare the evaluation metrics across different models in the tables below.
- Logs: View live logs to monitor what's happening behind the scenes in real time.
If checked, the reports use pre-calculated metrics loaded from saved output files. If any report shows an N/A value, click the corresponding report-generation button to compute it for the current configuration. Uncheck the box to recalculate all metrics from scratch.
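A minimal sketch of how this load-or-recalculate behavior could be implemented; the file name and the `recalculate()` callback are assumptions, not the tool's actual file layout or API:

```python
# Sketch: reuse saved metrics when available, otherwise re-run and cache the result.
import json
from pathlib import Path

def load_or_recalculate(results_file: str, use_precalculated: bool, recalculate):
    path = Path(results_file)
    if use_precalculated and path.exists():
        with path.open() as f:
            return json.load(f)  # pre-calculated metrics for quick display
    results = recalculate()      # run the evaluation for the current configuration
    path.write_text(json.dumps(results, indent=2))  # save for the next run
    return results

# Usage (hypothetical file name):
# metrics = load_or_recalculate("noise_robustness_results.json", True, run_noise_eval)
```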
📊 Noise Robustness
Description: Question-answering accuracy (%) under different noise ratios, where the noise ratio is the fraction of retrieved documents that do not contain the answer.
Model | 0.2 | 0.4 | 0.6 | 0.8 | 1.0 |
---|---|---|---|---|---|
deepseek-r1-distill-llama-70b | 100.00 | 100.00 | 100.00 | 94.00 | 12.00 |
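The accuracy figures above are typically computed by mixing answer-bearing and noise documents at the chosen ratio and checking whether the ground-truth answer appears in the model's response. A minimal sketch, assuming RGB-style records with `positive` and `negative` document lists and a hypothetical `query_llm()` call:

```python
import random

def noise_robustness_accuracy(records, noise_rate, query_llm, num_docs=5):
    correct = 0
    for rec in records:
        num_noise = int(num_docs * noise_rate)
        # Mix answer-bearing ("positive") and noise ("negative") documents at the chosen ratio.
        docs = random.sample(rec["positive"], num_docs - num_noise) \
             + random.sample(rec["negative"], num_noise)
        random.shuffle(docs)
        answer = query_llm(rec["query"], docs)
        # Count as correct if any acceptable ground-truth string appears in the response.
        if any(gt.lower() in answer.lower() for gt in rec["answer"]):
            correct += 1
    return 100.0 * correct / len(records)
```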
🚫 Negative Rejection
Description: Measures the model's ability to decline to answer when the retrieved documents are all noise, i.e. none of them contain the answer.
Model | Rejection Rate % |
---|---|
deepseek-r1-distill-llama-70b | 58.00 |
Model | Rejection Rate % |
---|---|
llama3-8b-8192 | 50.00 |
qwen-2.5-32b | 80.00 |
mixtral-8x7b-32768 | 58.00 |
gemma2-9b-it | 68.00 |
deepseek-r1-distill-llama-70b | 60.00 |
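The rejection rate is typically the share of queries for which the model explicitly refuses when it is shown only noise documents. A minimal sketch; the refusal phrase follows the RGB convention of instructing the model to reply with a fixed sentence, and the exact wording used by this tool is an assumption:

```python
REJECTION_PHRASE = "I can not answer the question because of the insufficient information in documents."

def negative_rejection_rate(records, query_llm):
    rejected = 0
    for rec in records:
        answer = query_llm(rec["query"], rec["negative"])  # noise-only context
        # Count a refusal if the instructed phrase (or a clear variant) appears.
        if REJECTION_PHRASE.lower() in answer.lower() or "insufficient information" in answer.lower():
            rejected += 1
    return 100.0 * rejected / len(records)
```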
🔄 Counterfactual Robustness
Description: Evaluates the model's ability to detect and correct factual errors deliberately planted in the retrieved documents (counterfactual external knowledge).
Model | Accuracy (%) | Acc_doc (%) | Error Detection Rate (%) | Correction Rate (%) |
---|---|---|---|---|
deepseek-r1-distill-llama-70b | 100 | 14 | 34 | 29.41 |
Model | Accuracy (%) | Acc_doc (%) | Error Detection Rate (%) | Correction Rate (%) |
---|---|---|---|---|
llama3-8b-8192 | 96 | 14 | 34 | 29.41 |
qwen-2.5-32b | 100 | 44 | 68 | 61.76 |
mixtral-8x7b-32768 | 94 | 12 | 24 | 33.33 |
gemma2-9b-it | 94 | 22 | 34 | 52.94 |
deepseek-r1-distill-llama-70b | 100 | 72 | 50 | 84.00 |
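A minimal sketch of how these columns might be scored. It assumes Acc_doc is accuracy when the counterfactual documents are supplied, the error detection rate counts responses that explicitly flag the planted error, and the correction rate is measured over those flagged cases; the record fields and the warning phrase are assumptions, not the tool's exact prompt or data format:

```python
ERROR_WARNING = "there are factual errors"

def counterfactual_metrics(records, query_llm):
    detected, corrected, acc_doc_hits = 0, 0, 0
    for rec in records:
        answer = query_llm(rec["query"], rec["counterfactual_docs"]).lower()
        if any(gt.lower() in answer for gt in rec["answer"]):
            acc_doc_hits += 1          # correct despite the planted error
        if ERROR_WARNING in answer:
            detected += 1              # model flagged the factual error
            if any(gt.lower() in answer for gt in rec["answer"]):
                corrected += 1         # ...and still produced the correct answer
    n = len(records)
    return {
        "acc_doc": 100.0 * acc_doc_hits / n,
        "error_detection_rate": 100.0 * detected / n,
        "correction_rate": 100.0 * corrected / max(detected, 1),
    }
```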
🧠 Information Integration
Description: Accuracy (%) on questions that require combining information from multiple documents, measured under the same noise ratios as above.
Model | 0.2 | 0.4 | 0.6 | 0.8 | 1.0 |
---|---|---|---|---|---|
deepseek-r1-distill-llama-70b | 68.00 | 78.00 | 60.00 | 50.00 | 36.00 |
Model | 0.2 | 0.4 | 0.6 | 0.8 | 1.0 |
---|---|---|---|---|---|
llama3-8b-8192 | 68.00 | 78.00 | 60.00 | 50.00 | 36.00 |
qwen-2.5-32b | 86.00 | 66.00 | 68.00 | 52.00 | 70.00 |
mixtral-8x7b-32768 | 66.00 | 54.00 | 44.00 | 42.00 | 58.00 |
gemma2-9b-it | 76.00 | 62.00 | 56.00 | 64.00 | 68.00 |
deepseek-r1-distill-llama-70b | 92.00 | 80.00 | 80.00 | 74.00 | 82.00 |
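For information integration, each question bundles several facts, so a response typically counts as correct only if every required sub-answer appears. A minimal sketch; the nested-answer format is an assumption about how the RGB records are stored:

```python
def integration_correct(answer: str, sub_answers: list[list[str]]) -> bool:
    answer = answer.lower()
    # Each sub-question may accept several strings; require at least one hit per sub-question.
    return all(any(a.lower() in answer for a in alternatives) for alternatives in sub_answers)

# Example: a two-part question whose response must cover both facts.
print(integration_correct(
    "ACL 2023 was held in July 2023 and EMNLP 2023 in December 2023.",
    [["july 2023"], ["december 2023"]],
))  # True
```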