Deepseek prover v2
DeepSeek-Prover-V2 Reward Functions
This document describes the reward functions for evaluating Lean theorem proofs based on the DeepSeek-Prover-V2 research paper from DeepSeek.
Overview
DeepSeek-Prover-V2 is an advanced large language model specifically designed for formal theorem proving in Lean 4. It incorporates both informal mathematical reasoning and formal theorem proving capabilities, with a particular focus on subgoal decomposition for tackling complex mathematical proofs.
The reward functions in this module evaluate how well model responses can formulate valid Lean proofs, with special attention to the recursive subgoal decomposition technique central to DeepSeek-Prover-V2’s approach.
Reward Functions
Three reward functions are provided:
lean_prover_reward
: A basic reward function for evaluating Lean proofsdeepseek_prover_v2_reward
: An enhanced reward function that also evaluates quality of subgoal decompositiondeepseek_huggingface_prover_benchmark
: A reward function for evaluating against the DeepSeek-ProverBench dataset from Hugging Face
Installation
Basic functionality requires only the standard reward-kit installation:
For using the HuggingFace dataset integration with deepseek_huggingface_prover_benchmark
, install the dependencies:
Usage
Basic Lean Prover Reward
DeepSeek Prover V2 Reward
HuggingFace ProverBench Evaluation
Scoring Methodology
The scoring uses the following criteria:
Basic Lean Proofs (lean_prover_reward)
- 0.0: No valid Lean proof attempt
- 0.1-0.4: Has definition but incomplete proof (with “sorry” or “admitted”)
- Score scales based on number of tactics used
- 0.5-0.9: Complete proof with varying complexity
- Base score of 0.5 for complete proofs
- Up to 0.4 additional points based on tactics complexity
- 1.0: Perfect match with expected answer if provided
DeepSeek-Prover-V2 Specific (deepseek_prover_v2_reward)
Builds on the basic scoring and adds:
- Subgoal Quality: Up to +0.3 based on effective use of subgoals
- Hierarchical Structure: Up to +0.2 based on proof’s hierarchical depth
Example Output
Verbose output with verbose=True
includes detailed breakdown of the evaluation:
The metrics include:
syntax
: Score for valid theorem definitioncompleteness
: Whether the proof is complete (not using “sorry” or “admitted”)tactics
: Evaluation of the tactics usedsubgoal_decomposition
: Quality of subgoal usage (DeepSeek-specific)hierarchical_structure
: Quality of hierarchical organization (DeepSeek-specific)
Using with the HuggingFace Dataset
For more extensive evaluation with a full dataset: