edsnlp.data.huggingface_dataset
from_huggingface_dataset
Load a dataset from the HuggingFace Hub as a Stream.
Example
import edsnlp
nlp = edsnlp.blank("eds")
nlp.add_pipe(...)
doc_iterator = edsnlp.data.from_huggingface_dataset(
    "lhoestq/conll2003",
    split="train",
    tag_order=[
        'O',
        'B-PER',
        'I-PER',
        'B-ORG',
        'I-ORG',
        'B-LOC',
        'I-LOC',
        'B-MISC',
        'I-MISC',
    ],
    converter="hf_ner",
)
annotated_docs = nlp.pipe(doc_iterator)
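To illustrate why `tag_order` matters, here is a minimal sketch of how integer NER tags can be mapped back to labelled token spans. It assumes the dataset stores each tag as an index into `tag_order` and uses the BIO scheme; `decode_bio` is a hypothetical helper written for this illustration, not the `hf_ner` converter itself, and the sample sentence is hand-written.

```python
from typing import List, Tuple

# Same label order as in the example above: tag id i stands for TAG_ORDER[i].
TAG_ORDER = [
    "O",
    "B-PER", "I-PER",
    "B-ORG", "I-ORG",
    "B-LOC", "I-LOC",
    "B-MISC", "I-MISC",
]


def decode_bio(tag_ids: List[int], tag_order: List[str]) -> List[Tuple[int, int, str]]:
    """Hypothetical helper: return (start, end, label) token spans, end exclusive."""
    spans = []
    start, label = None, None
    for i, tag_id in enumerate(tag_ids):
        tag = tag_order[tag_id]
        # An I- tag with the same label extends the currently open span.
        if tag.startswith("I-") and label == tag[2:]:
            continue
        # Any other tag closes the open span, if there is one.
        if label is not None:
            spans.append((start, i, label))
            start, label = None, None
        # A B- tag opens a new span; a stray I- tag is leniently treated the same way.
        if tag != "O":
            start, label = i, tag[2:]
    # Close a span that runs to the end of the sentence.
    if label is not None:
        spans.append((start, len(tag_ids), label))
    return spans


tokens = ["EU", "rejects", "German", "call", "to", "boycott", "British", "lamb", "."]
tag_ids = [3, 0, 7, 0, 0, 0, 7, 0, 0]
spans = decode_bio(tag_ids, TAG_ORDER)
print(spans)  # [(0, 1, 'ORG'), (2, 3, 'MISC'), (6, 7, 'MISC')]
```

If the order passed as `tag_order` does not match the order used when the dataset was built, every id decodes to the wrong label, so it is worth checking the list against the dataset card.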
Parameters
| PARAMETER | DESCRIPTION |
|---|---|
| dataset | Either a dataset identifier (e.g. "conll2003") or an already loaded TYPE: |
| split | Which split to load (e.g. "train"). If None, the default dataset split returned by TYPE: |
| name | Configuration name for datasets with multiple configs (e.g. "en" for a multilingual dataset). Also known as the subset name. TYPE: |
| converter | Converter(s) to transform dataset dicts to Doc objects. Recommended converters are TYPE: |
| shuffle | Whether to shuffle the dataset before yielding. If True or 'dataset', the whole dataset will be materialized and shuffled (may be expensive). TYPE: |
| seed | Random seed for shuffling. TYPE: |
| loop | Whether to loop over the dataset indefinitely. TYPE: |
| load_kwargs | Dictionary of additional kwargs that will be passed to the TYPE: |
| kwargs | Additional keyword arguments passed to the converter; these are documented on the Converters page. DEFAULT: |
| RETURNS | DESCRIPTION |
|---|---|
| Stream | |
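The interaction between `shuffle`, `seed` and `loop` can be pictured with a small stdlib-only sketch. `iterate` below is a hypothetical function written for this illustration; it assumes that shuffling requires materializing all records first, as described for `shuffle=True`, and it reshuffles on every pass when looping. It does not reproduce how edsnlp schedules its streams.

```python
import itertools
import random
from typing import Iterable, Iterator, Optional, TypeVar

T = TypeVar("T")


def iterate(
    records: Iterable[T],
    shuffle: bool = False,
    seed: Optional[int] = None,
    loop: bool = False,
) -> Iterator[T]:
    """Hypothetical sketch of shuffle / seed / loop semantics."""
    # Materialize everything: needed to shuffle, and to replay when looping.
    items = list(records)
    if not items:
        return
    # A dedicated generator makes the order reproducible for a given seed.
    rng = random.Random(seed)
    while True:
        if shuffle:
            rng.shuffle(items)
        yield from items
        if not loop:
            return


# Same seed, same order.
first = list(iterate(range(10), shuffle=True, seed=42))
second = list(iterate(range(10), shuffle=True, seed=42))

# With loop=True the iterator never ends, so the consumer must bound it.
looped = list(itertools.islice(iterate("abc", loop=True), 7))
print(looped)  # ['a', 'b', 'c', 'a', 'b', 'c', 'a']
```

The cost noted for `shuffle=True` shows up here as the `list(records)` call: the whole dataset must fit in memory before the first item is yielded.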
to_huggingface_dataset
Convert a collection/Stream of Doc objects (or already-converted dicts) into a datasets.IterableDataset.
Examples
1) Read HuggingFace NER examples as a Stream of Doc objects (reader), process them, and export the result as an IterableDataset of dictionaries using the hf_ner writer converter:
import edsnlp
# Pipeline applied to the stream below; add components as needed
nlp = edsnlp.blank("eds")
stream = edsnlp.data.from_huggingface_dataset(
    "lhoestq/conll2003",
    split="train",
    converter="hf_ner",
)
# Apply a pipeline or other processing
stream = stream.map_pipeline(nlp)
# Export as HF IterableDataset of dicts (no push)
hf_iter = edsnlp.data.to_huggingface_dataset(
    stream,
    converter="hf_ner",
)
2) Convert plain text Docs to HF text-format dicts:
# docs_stream: an existing Stream (or iterable) of Doc objects
edsnlp.data.to_huggingface_dataset(
    docs_stream,
    converter="hf_text",
    execute=True,
    # converter kwargs are validated and forwarded by
    # `get_doc2dict_converter` (e.g. `text_column`, `id_column`).
)
Parameters
| PARAMETER | DESCRIPTION |
|---|---|
| data | Iterable of TYPE: |
| converter | Converter name or callable used to transform TYPE: |
| execute | If False, return a transformed TYPE: |
| **kwargs | Extra kwargs forwarded to the converter factory. DEFAULT: |
| RETURNS | DESCRIPTION |
|---|---|
| Union[IterableDataset, Dataset] | An |
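As a rough picture of what a text-format writer converter does with kwargs such as `text_column` and `id_column`, the sketch below turns document-like objects into dictionaries, first lazily and then eagerly. Everything in it is an assumption made for illustration: `FakeDoc` stands in for a real Doc, `doc_to_text_dict` and `convert_lazily` are hypothetical helpers, and the default column names `"text"` and `"id"` are guesses rather than documented edsnlp defaults.

```python
from dataclasses import dataclass
from typing import Any, Dict, Iterable, Iterator


@dataclass
class FakeDoc:
    """Hypothetical stand-in for a Doc: an identifier plus raw text."""

    note_id: str
    text: str


def doc_to_text_dict(
    doc: FakeDoc,
    text_column: str = "text",  # assumed default, for illustration only
    id_column: str = "id",  # assumed default, for illustration only
) -> Dict[str, Any]:
    """Map one document to a flat dict with configurable column names."""
    return {id_column: doc.note_id, text_column: doc.text}


def convert_lazily(docs: Iterable[FakeDoc], **kwargs: Any) -> Iterator[Dict[str, Any]]:
    """Yield one dict per document; nothing is converted until iteration starts."""
    for doc in docs:
        yield doc_to_text_dict(doc, **kwargs)


docs = [
    FakeDoc("doc-1", "Patient admitted for chest pain."),
    FakeDoc("doc-2", "No fever."),
]

# Lazy: building the generator does no work yet.
lazy_rows = convert_lazily(docs, text_column="content", id_column="doc_id")

# Eager: consuming the generator performs the conversion.
rows = list(lazy_rows)
print(rows[0])  # {'doc_id': 'doc-1', 'content': 'Patient admitted for chest pain.'}
```

Renaming columns at export time keeps the resulting dictionaries aligned with whatever schema downstream training or evaluation code expects.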