RAG System Evaluation on RGB Dataset
Welcome to the RAG System Evaluation on the RGB Dataset! This tool evaluates and compares the performance of various Large Language Models (LLMs) using Retrieval-Augmented Generation (RAG) on the RGB benchmark. The evaluation focuses on four key metrics: Noise Robustness, Negative Rejection, Counterfactual Robustness, and Information Integration. Together these assess how well a model answers when the retrieved documents are noisy, declines to answer when the documents do not contain the answer, detects and corrects factual errors in the retrieved context, and integrates information spread across multiple documents.
Key Features:
- Compare Multiple LLMs: Evaluate and compare the performance of different LLMs side by side.
- Pre-calculated Metrics: View results from pre-computed evaluations for quick insights.
- Recalculate Metrics: Option to recalculate metrics for custom configurations.
- Interactive Controls: Adjust model parameters, noise rates, and query counts to explore model behavior under different conditions (see the configuration sketch after this list).
- Detailed Reports: Visualize results in clear, interactive tables for each evaluation metric.
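As a rough illustration, the interactive controls above can be thought of as a small configuration object. The field names below (`model_name`, `noise_rate`, `num_queries`, `use_precalculated`) are assumptions for the sketch, not the tool's actual API:

```python
# Hypothetical evaluation configuration -- field names are illustrative only.
from dataclasses import dataclass

@dataclass
class EvalConfig:
    model_name: str = "deepseek-r1-distill-llama-70b"  # LLM to evaluate
    noise_rate: float = 0.6         # fraction of retrieved documents that are noise
    num_queries: int = 50           # number of RGB queries to run per metric
    use_precalculated: bool = True  # load saved metrics instead of re-running

# Example: evaluate qwen-2.5-32b on 50 queries with 80% noise.
config = EvalConfig(model_name="qwen-2.5-32b", noise_rate=0.8, num_queries=50)
```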
How to Use:
- Select a Model: Choose from the available LLMs to evaluate.
- Configure Model Settings: Adjust the noise rate and set the number of queries.
- Choose Evaluation Mode: Use pre-calculated values for quick results or recalculate metrics for custom analysis.
- Compare Results: Review and compare the evaluation metrics across different models in the tables below.
- Logs: View live logs to monitor what's happening behind the scenes in real time.
If checked, the reports use pre-calculated metrics loaded from saved output files. If any report shows an N/A value, click the corresponding report-generation button to compute it for the current configuration. Uncheck the box to recalculate all metrics from scratch.
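A minimal sketch of how this load-or-recalculate behavior could be implemented; the file name and the `recalculate()` callback are assumptions, not the tool's actual file layout or API:

```python
# Sketch: reuse saved metrics when available, otherwise re-run and cache the result.
import json
from pathlib import Path

def load_or_recalculate(results_file: str, use_precalculated: bool, recalculate):
    path = Path(results_file)
    if use_precalculated and path.exists():
        with path.open() as f:
            return json.load(f)  # pre-calculated metrics for quick display
    results = recalculate()      # run the evaluation for the current configuration
    path.write_text(json.dumps(results, indent=2))  # save for the next run
    return results

# Usage (hypothetical file name):
# metrics = load_or_recalculate("noise_robustness_results.json", True, run_noise_eval)
```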
📊 Noise Robustness
Description: Question-answering accuracy (%) under different noise ratios, where the noise ratio is the fraction of retrieved documents that do not contain the answer.
Model | 0.2 | 0.4 | 0.6 | 0.8 | 1.0 |
---|---|---|---|---|---|
deepseek-r1-distill-llama-70b | 100.00 | 100.00 | 100.00 | 94.00 | 12.00 |
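The accuracy figures above are typically computed by mixing answer-bearing and noise documents at the chosen ratio and checking whether the ground-truth answer appears in the model's response. A minimal sketch, assuming RGB-style records with `positive` and `negative` document lists and a hypothetical `query_llm()` call:

```python
import random

def noise_robustness_accuracy(records, noise_rate, query_llm, num_docs=5):
    correct = 0
    for rec in records:
        num_noise = int(num_docs * noise_rate)
        # Mix answer-bearing ("positive") and noise ("negative") documents at the chosen ratio.
        docs = random.sample(rec["positive"], num_docs - num_noise) \
             + random.sample(rec["negative"], num_noise)
        random.shuffle(docs)
        answer = query_llm(rec["query"], docs)
        # Count as correct if any acceptable ground-truth string appears in the response.
        if any(gt.lower() in answer.lower() for gt in rec["answer"]):
            correct += 1
    return 100.0 * correct / len(records)
```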
🚫 Negative Rejection
Description: Measures the model's ability to decline to answer when the retrieved documents are all noise, i.e. none of them contain the answer.
Model | Rejection Rate % |
---|---|
deepseek-r1-distill-llama-70b | 58.00 |
Model | Rejection Rate % |
---|---|
llama3-8b-8192 | 50.00 |
qwen-2.5-32b | 80.00 |
mixtral-8x7b-32768 | 58.00 |
gemma2-9b-it | 68.00 |
deepseek-r1-distill-llama-70b | 60.00 |
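The rejection rate is typically the share of queries for which the model explicitly refuses when it is shown only noise documents. A minimal sketch; the refusal phrase follows the RGB convention of instructing the model to reply with a fixed sentence, and the exact wording used by this tool is an assumption:

```python
REJECTION_PHRASE = "I can not answer the question because of the insufficient information in documents."

def negative_rejection_rate(records, query_llm):
    rejected = 0
    for rec in records:
        answer = query_llm(rec["query"], rec["negative"])  # noise-only context
        # Count a refusal if the instructed phrase (or a clear variant) appears.
        if REJECTION_PHRASE.lower() in answer.lower() or "insufficient information" in answer.lower():
            rejected += 1
    return 100.0 * rejected / len(records)
```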
🔄 Counterfactual Robustness
Description: Evaluates the model's ability to detect and correct factual errors deliberately planted in the retrieved documents (counterfactual external knowledge).
Model | Accuracy (%) | Acc_doc (%) | Error Detection Rate (%) | Correction Rate (%) |
---|---|---|---|---|
deepseek-r1-distill-llama-70b | 100 | 14 | 34 | 29.41 |
Model | Accuracy (%) | Acc_doc (%) | Error Detection Rate (%) | Correction Rate (%) |
---|---|---|---|---|
llama3-8b-8192 | 96 | 14 | 34 | 29.41 |
qwen-2.5-32b | 100 | 44 | 68 | 61.76 |
mixtral-8x7b-32768 | 94 | 12 | 24 | 33.33 |
gemma2-9b-it | 94 | 22 | 34 | 52.94 |
deepseek-r1-distill-llama-70b | 100 | 72 | 50 | 84.00 |
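A minimal sketch of how these columns might be scored. It assumes Acc_doc is accuracy when the counterfactual documents are supplied, the error detection rate counts responses that explicitly flag the planted error, and the correction rate is measured over those flagged cases; the record fields and the warning phrase are assumptions, not the tool's exact prompt or data format:

```python
ERROR_WARNING = "there are factual errors"

def counterfactual_metrics(records, query_llm):
    detected, corrected, acc_doc_hits = 0, 0, 0
    for rec in records:
        answer = query_llm(rec["query"], rec["counterfactual_docs"]).lower()
        if any(gt.lower() in answer for gt in rec["answer"]):
            acc_doc_hits += 1          # correct despite the planted error
        if ERROR_WARNING in answer:
            detected += 1              # model flagged the factual error
            if any(gt.lower() in answer for gt in rec["answer"]):
                corrected += 1         # ...and still produced the correct answer
    n = len(records)
    return {
        "acc_doc": 100.0 * acc_doc_hits / n,
        "error_detection_rate": 100.0 * detected / n,
        "correction_rate": 100.0 * corrected / max(detected, 1),
    }
```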
🧠 Information Integration
Description: Accuracy (%) on questions that require combining information from multiple documents, measured under the same noise ratios as above.
Model | 0.2 | 0.4 | 0.6 | 0.8 | 1.0 |
---|---|---|---|---|---|
deepseek-r1-distill-llama-70b | 68.00 | 78.00 | 60.00 | 50.00 | 36.00 |
Model | 0.2 | 0.4 | 0.6 | 0.8 | 1.0 |
---|---|---|---|---|---|
llama3-8b-8192 | 68.00 | 78.00 | 60.00 | 50.00 | 36.00 |
qwen-2.5-32b | 86.00 | 66.00 | 68.00 | 52.00 | 70.00 |
mixtral-8x7b-32768 | 66.00 | 54.00 | 44.00 | 42.00 | 58.00 |
gemma2-9b-it | 76.00 | 62.00 | 56.00 | 64.00 | 68.00 |
deepseek-r1-distill-llama-70b | 92.00 | 80.00 | 80.00 | 74.00 | 82.00 |
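For information integration, each question bundles several facts, so a response typically counts as correct only if every required sub-answer appears. A minimal sketch; the nested-answer format is an assumption about how the RGB records are stored:

```python
def integration_correct(answer: str, sub_answers: list[list[str]]) -> bool:
    answer = answer.lower()
    # Each sub-question may accept several strings; require at least one hit per sub-question.
    return all(any(a.lower() in answer for a in alternatives) for alternatives in sub_answers)

# Example: a two-part question whose response must cover both facts.
print(integration_correct(
    "ACL 2023 was held in July 2023 and EMNLP 2023 in December 2023.",
    [["july 2023"], ["december 2023"]],
))  # True
```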