Datasets
Dataset types
Datasets for dense retrieval training usually come in one of two formats, depending on whether the relevance of a document is judged by human annotators or determined by exact answer matching.
1. Relevancy Judged Dataset
If the relevance of a passage is annotated (e.g. MS MARCO passage ranking), an instance in the dataset can usually be organized in the following format:
{
"query_id": "<query id>",
"query": "<query text>",
"positive_passages": [
{"docid": "<passage id>", "title": "<passage title>", "text": "<passage body>"}
],
"negative_passages": [
{"docid": "<passage id>", "title": "<passage title>", "text": "<passage body>"}
]
}
positive_passages are the annotated relevant passages for the query, and the passages in negative_passages are usually non-relevant passages taken from the top results of a retrieval system (e.g. BM25).
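The layout above can be produced with a few lines of Python. This is a minimal sketch, not Tevatron code: the helper names `make_instance` and `to_jsonl_line` and the toy passages are hypothetical, and it assumes one instance is stored per line of a jsonl file.

```python
import json


def make_instance(query_id, query, positives, negatives):
    """Build one instance in the Relevancy Judged layout shown above."""
    return {
        "query_id": query_id,
        "query": query,
        "positive_passages": positives,
        "negative_passages": negatives,
    }


def to_jsonl_line(instance):
    """Serialize one instance as a single jsonl line (no trailing newline)."""
    return json.dumps(instance, ensure_ascii=False)


# Toy data, invented for illustration only.
instance = make_instance(
    "q1",
    "what is dense retrieval",
    positives=[{"docid": "p1", "title": "Dense retrieval",
                "text": "Dense retrieval encodes text into vectors."}],
    negatives=[{"docid": "p2", "title": "Cooking",
                "text": "Boil the pasta for ten minutes."}],
)
line = to_jsonl_line(instance)
```

Writing each `line` followed by a newline to a file yields a training file in this format.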
2. Exactly Matched Dataset
If the relevance of a passage is judged by exact answer matching (e.g. Natural Questions), an instance in the dataset can usually be organized in the following format:
{
"query_id": "<query id>",
"query": "<query text>",
"answers": ["<answer>"],
"positive_passages": [
{"docid": "<passage id>", "title": "<passage title>", "text": "<passage body>"}
],
"negative_passages": [
{"docid": "<passage id>", "title": "<passage title>", "text": "<passage body>"}
]
}
Each passage in positive_passages contains a subsequence that exactly matches one of the answer strings in answers.
Passages in negative_passages are usually taken from the top results of a retrieval system but contain no subsequence that exactly matches any answer in answers.
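The matching rule can be sketched as follows. This is an illustration under simplifying assumptions, not the procedure used to build the datasets: it uses plain case-sensitive substring matching, whereas real pipelines often normalize or tokenize text first. The function names and sample passages are hypothetical.

```python
def contains_answer(text, answers):
    """Return True if any answer string occurs verbatim in the text."""
    return any(answer in text for answer in answers)


def split_by_answer_match(candidates, answers):
    """Split retrieved passages into (positives, negatives) by exact match."""
    positives, negatives = [], []
    for passage in candidates:
        if contains_answer(passage["text"], answers):
            positives.append(passage)
        else:
            negatives.append(passage)
    return positives, negatives


# Toy retrieval results, invented for illustration only.
candidates = [
    {"docid": "p1", "title": "France", "text": "Paris is the capital of France."},
    {"docid": "p2", "title": "Germany", "text": "Berlin is the capital of Germany."},
]
positives, negatives = split_by_answer_match(candidates, answers=["Paris"])
```

Here the first passage becomes a positive because it contains the answer string, and the second becomes a negative.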
Self-Contained Dataset
Tevatron provides the following commonly used dense retrieval datasets as self-contained datasets (via HuggingFace).
These datasets are downloaded and tokenized automatically during training and encoding when you set --dataset_name <hgf dataset name>.
dataset | dataset HuggingFace name | type |
---|---|---|
MS MARCO | Tevatron/msmarco-passage | Relevancy Judged |
SciFact | Tevatron/scifact | Relevancy Judged |
NQ | Tevatron/wikipedia-nq | Exactly Match |
TriviaQA | Tevatron/wikipedia-trivia | Exactly Match |
WebQuestions | Tevatron/wikipedia-wq | Exactly Match |
CuratedTREC | Tevatron/wikipedia-curated | Exactly Match |
SQuAD | Tevatron/wikipedia-squad | Exactly Match |
Note: the self-contained datasets come with BM25 negative passages by default.
Take SciFact as an example.
We can train directly on the self-contained dataset with:
python -m tevatron.driver.train \
--do_train \
--output_dir model_scifact \
--dataset_name Tevatron/scifact \
--model_name_or_path bert-base-uncased \
--per_device_train_batch_size 16 \
--learning_rate 1e-5 \
--num_train_epochs 5
Then we can encode the corresponding self-contained corpus with:
python -m tevatron.driver.encode \
--do_encode \
--output_dir=temp_out \
--model_name_or_path model_scifact \
--per_device_eval_batch_size 64 \
--dataset_name Tevatron/scifact-corpus \
--p_max_len 512 \
--encoded_save_path corpus_emb.pkl
And encode the corresponding self-contained topics with:
python -m tevatron.driver.encode \
--do_encode \
--output_dir=temp_out \
--model_name_or_path model_scifact \
--per_device_eval_batch_size 64 \
--dataset_name Tevatron/scifact/dev \
--encode_is_qry \
--q_max_len 64 \
--encoded_save_path queries_emb.pkl
Custom dataset
There are two ways to use a custom dataset with Tevatron:
1. Raw data
The first method is to prepare the dataset in the same format as one of the two dataset types above.
- If the dataset was prepared in the Relevancy Judged format, we can directly use the data loading process defined by Tevatron/msmarco-passage.
- If the dataset was prepared in the Exactly Match format, we can directly use the data loading process defined by Tevatron/wikipedia-nq.
For example, if we have prepared a dataset in the Exactly Match format (same as Tevatron/wikipedia-nq), with:
- train data: train_dir/train_data.jsonl
- dev data: dev_dir/dev_data.jsonl
- corpus: corpus_dir/corpus_jsonl
We can train by:
python -m tevatron.driver.train \
... \
--dataset_name Tevatron/wikipedia-nq \
--train_dir train_dir \
...
Then we can encode the corpus with:
python -m tevatron.driver.encode \
... \
--dataset_name Tevatron/wikipedia-nq-corpus \
--encode_in_path corpus_dir/corpus_jsonl \
...
And encode the queries with:
python -m tevatron.driver.encode \
... \
--dataset_name Tevatron/wikipedia-nq \
--encode_in_path dev_dir/dev_data.jsonl \
--encode_is_qry \
...
Note: we use ... here to hide the arguments that are irrelevant to the dataset settings, for a clearer comparison. Please see the training and encoding documents for the detailed arguments.
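Before training on a custom raw dataset, it can help to check that every line follows the expected layout. The sketch below checks the Exactly Match layout described earlier. It is not part of Tevatron: the helper names are hypothetical, and whether the Tevatron loader strictly requires every field listed here is an assumption.

```python
import json

# Field names taken from the Exactly Match layout described in this document.
REQUIRED_KEYS = {"query_id", "query", "answers",
                 "positive_passages", "negative_passages"}
PASSAGE_KEYS = {"docid", "title", "text"}


def check_instance(instance):
    """Return a list of problems found in one instance (empty if none)."""
    problems = []
    missing = REQUIRED_KEYS - instance.keys()
    if missing:
        problems.append(f"missing keys: {sorted(missing)}")
    for field in ("positive_passages", "negative_passages"):
        for i, passage in enumerate(instance.get(field, [])):
            gap = PASSAGE_KEYS - passage.keys()
            if gap:
                problems.append(f"{field}[{i}] missing: {sorted(gap)}")
    return problems


def check_jsonl_lines(lines):
    """Map 1-based line numbers to problems; lines without problems are omitted."""
    report = {}
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # skip blank lines
        problems = check_instance(json.loads(line))
        if problems:
            report[number] = problems
    return report


# Toy lines, invented for illustration: the second lacks "answers" and a title.
good = json.dumps({"query_id": "q1", "query": "capital of France",
                   "answers": ["Paris"],
                   "positive_passages": [{"docid": "p1", "title": "France",
                                          "text": "Paris is the capital."}],
                   "negative_passages": []})
bad = json.dumps({"query_id": "q2", "query": "capital of Germany",
                  "positive_passages": [{"docid": "p3", "text": "Berlin."}],
                  "negative_passages": []})
report = check_jsonl_lines([good, bad])
```

In practice the lines would come from iterating over an opened file such as train_dir/train_data.jsonl.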
2. Pre-tokenized data
Tevatron also accepts pre-tokenized custom datasets. With pre-tokenized data, Tevatron skips the tokenization step during training and encoding.
The datasets need to be prepared in the formats below:
- Training: a jsonl file in which each line is a training instance,
{'query': TEXT_TYPE, 'positives': List[TEXT_TYPE], 'negatives': List[TEXT_TYPE]}
- Encoding: a jsonl file in which each line is a piece of text to be encoded,
{'text_id': "xxx", 'text': TEXT_TYPE}
TEXT_TYPE here can be either List[int] (pre-tokenized) or string (not pre-tokenized).
We encourage users to use the pre-tokenized form (i.e. TEXT_TYPE=List[int]), as TEXT_TYPE=string is not supported for some tokenizers.
To use custom data in the pre-tokenized format, use --dataset_name json (or leave it empty) during training and encoding.
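The two pre-tokenized record shapes can be built as in the sketch below. The helper names are hypothetical, and `toy_tokenize` is a stand-in that assigns arbitrary integer ids to whitespace tokens. In real use the ids must come from the tokenizer of the model being trained; how special tokens should be handled is not covered here and should be checked against the training document.

```python
import json


def toy_tokenize(text, vocab):
    """Hypothetical stand-in tokenizer: one int id per lowercased whitespace token."""
    return [vocab.setdefault(token, len(vocab)) for token in text.lower().split()]


def pretokenize_train_instance(query, positives, negatives, tokenize):
    """Build one training record with TEXT_TYPE=List[int]."""
    return {
        "query": tokenize(query),
        "positives": [tokenize(p) for p in positives],
        "negatives": [tokenize(n) for n in negatives],
    }


def pretokenize_text(text_id, text, tokenize):
    """Build one encoding record with TEXT_TYPE=List[int]."""
    return {"text_id": text_id, "text": tokenize(text)}


# Toy data, invented for illustration only.
vocab = {}
tokenize = lambda s: toy_tokenize(s, vocab)

train_line = json.dumps(pretokenize_train_instance(
    "what is bm25",
    positives=["bm25 is a ranking function"],
    negatives=["dense retrieval uses embeddings"],
    tokenize=tokenize,
))
corpus_line = json.dumps(pretokenize_text("d1", "bm25 is a ranking function", tokenize))
```

Each resulting line goes into the corresponding jsonl file, one record per line.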
For example, if we have prepared a pre-tokenized dataset, with:
- train data: train_dir/train_data.jsonl
- dev data: dev_dir/dev_data.jsonl
- corpus: corpus_dir/corpus_jsonl
We can train by:
python -m tevatron.driver.train \
... \
--train_dir train_dir \
...
Then we can encode the corpus with:
python -m tevatron.driver.encode \
... \
--encode_in_path corpus_dir/corpus_jsonl \
...
And encode the queries with:
python -m tevatron.driver.encode \
... \
--encode_in_path dev_dir/dev_data.jsonl \
--encode_is_qry \
...
Note: we use ... here to hide the arguments that are irrelevant to the dataset settings, for a clearer comparison. Please see the training and encoding documents for the detailed arguments.