
edsnlp.data.huggingface_dataset

from_huggingface_dataset

Load a dataset from the HuggingFace Hub as a Stream.

Example

import edsnlp

nlp = edsnlp.blank("eds")
nlp.add_pipe(...)
doc_iterator = edsnlp.data.from_huggingface_dataset(
    "lhoestq/conll2003",
    split="train",
    tag_order=[
        'O',
        'B-PER',
        'I-PER',
        'B-ORG',
        'I-ORG',
        'B-LOC',
        'I-LOC',
        'B-MISC',
        'I-MISC',
    ],
    converter="hf_ner",
)
annotated_docs = nlp.pipe(doc_iterator)
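
The tag_order argument above maps the dataset's integer ner_tags to string labels by position. The standalone sketch below illustrates that mapping; the record schema ("tokens" / "ner_tags") is an assumption about CoNLL-style datasets, and the code is illustrative only, not edsnlp internals.

```python
# Illustrative sketch: how integer NER tags relate to tag_order.
# The record layout is assumed, not taken from the edsnlp source.
tag_order = [
    "O", "B-PER", "I-PER", "B-ORG", "I-ORG",
    "B-LOC", "I-LOC", "B-MISC", "I-MISC",
]

# A hypothetical CoNLL-style record with integer-encoded tags.
record = {"tokens": ["EU", "rejects", "German", "call"], "ner_tags": [3, 0, 7, 0]}

# Each integer tag indexes into tag_order to recover its string label.
labelled = [
    (tok, tag_order[i])
    for tok, i in zip(record["tokens"], record["ner_tags"])
]
```

If tag_order does not match the label set the dataset was encoded with, entities will be silently mislabelled, so it is worth checking it against the dataset card.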

Parameters

PARAMETER DESCRIPTION
dataset

Either a dataset identifier (e.g. "conll2003") or an already loaded datasets.Dataset / datasets.IterableDataset object.

TYPE: Union[str, Any]

split

Which split to load (e.g. "train"). If None, the default dataset split returned by datasets.load_dataset is used.

TYPE: Optional[str] DEFAULT: None

name

Configuration name for datasets with multiple configs (e.g. "en" for a multilingual dataset). Also known as the subset name.

TYPE: Optional[str] DEFAULT: None

converter

Converter(s) to transform dataset dicts to Doc objects. Recommended converters are "hf_ner" and "hf_text". More information is available in the Converters page.

TYPE: Optional[Union[str, Callable]] DEFAULT: None

shuffle

Whether to shuffle the dataset before yielding. If True or 'dataset', the whole dataset is materialized and shuffled, which may be expensive in memory for large datasets.

TYPE: Union[Literal['dataset'], bool] DEFAULT: False
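
Dataset-level shuffling can be pictured as materializing everything, then shuffling once with the given seed. The helper below is a minimal sketch of those semantics in plain Python, not edsnlp's implementation.

```python
import random

def dataset_shuffle(rows, seed=None):
    # 'dataset' shuffle semantics (illustrative): materialize all rows first,
    # so memory usage grows with dataset size.
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    return rows

# Reusing the same seed yields the same order, which is what makes
# shuffled runs reproducible.
a = dataset_shuffle(range(5), seed=42)
b = dataset_shuffle(range(5), seed=42)
```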

seed

Random seed for shuffling.

TYPE: Optional[int] DEFAULT: None

loop

Whether to loop over the dataset indefinitely.

TYPE: bool DEFAULT: False

load_kwargs

Dictionary of additional kwargs that will be passed to datasets.load_dataset().

TYPE: Optional[Dict[str, Any]] DEFAULT: None

kwargs

Additional keyword arguments passed to the converter; these are documented in the Converters page.

DEFAULT: {}

RETURNS DESCRIPTION
Stream

to_huggingface_dataset

Convert a collection/Stream of Doc objects (or already-converted dicts) into a datasets.IterableDataset.

Examples

1) Convert a Stream of HuggingFace NER examples into Doc objects (reader), process them, and create an IterableDataset of dictionaries using the hf_ner writer converter:

import edsnlp

nlp = edsnlp.blank("eds")

stream = edsnlp.data.from_huggingface_dataset(
    "lhoestq/conll2003",
    split="train",
    converter="hf_ner",
)

# Apply a pipeline or other processing
stream = stream.map_pipeline(nlp)

# Export as HF IterableDataset of dicts (no push)
hf_iter = edsnlp.data.to_huggingface_dataset(
    stream,
    converter="hf_ner",
)

2) Convert plain text Docs to HF text-format dicts:

edsnlp.data.to_huggingface_dataset(
    docs_stream,
    converter="hf_text",
    execute=True,
    # converter kwargs are validated and forwarded by
    # `get_doc2dict_converter` (e.g. `text_column`, `id_column`).
)
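
As a sketch of what an NER writer converter must do, the hypothetical helper below turns token-level entity spans back into BIO tags. The function name and span layout are illustrative assumptions, not the actual hf_ner output schema.

```python
# Illustrative-only: reconstruct token-level BIO tags from entity spans.
# spans_to_bio and its (start, end, label) span format are hypothetical,
# standing in for the Doc -> dict step a writer converter performs.
def spans_to_bio(tokens, spans):
    # spans: list of (start_token, end_token_exclusive, label)
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = f"B-{label}"       # first token of the entity
        for i in range(start + 1, end):  # continuation tokens
            tags[i] = f"I-{label}"
    return tags

tags = spans_to_bio(
    ["Jean", "Dupont", "visited", "Paris"],
    [(0, 2, "PER"), (3, 4, "LOC")],
)
```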

Parameters

PARAMETER DESCRIPTION
data

Iterable of Doc objects or a Stream. If converter is provided, the stream items are expected to be Doc objects; otherwise, items should already be mapping-like dicts.

TYPE: Union[Any, Stream]

converter

Converter name or callable used to transform each Doc into a dict before creating the dataset. Typical values are "hf_ner" and "hf_text", matching the reader converters above. Converter kwargs may be passed via **kwargs.

TYPE: Optional[Union[str, Callable]] DEFAULT: None

execute

If False, return the transformed Stream without executing it. If True (the default), execute the stream and return a datasets.IterableDataset.

TYPE: bool DEFAULT: True

**kwargs

Extra kwargs forwarded to the converter factory.

DEFAULT: {}

RETURNS DESCRIPTION
Union[IterableDataset, Dataset]

An IterableDataset containing the converted data.