
edsnlp.data.huggingface_dataset

from_huggingface_dataset

Load a dataset from the HuggingFace Hub as a Stream.

Example

import edsnlp

nlp = edsnlp.blank("eds")
nlp.add_pipe(...)
doc_iterator = edsnlp.data.from_huggingface_dataset(
    "lhoestq/conll2003",
    split="train",
    tag_order=[
        'O',
        'B-PER',
        'I-PER',
        'B-ORG',
        'I-ORG',
        'B-LOC',
        'I-LOC',
        'B-MISC',
        'I-MISC',
    ],
    converter="hf_ner",
)
annotated_docs = nlp.pipe(doc_iterator)
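
The tag_order argument above maps the dataset's integer ner_tags to string labels by position. The standalone sketch below illustrates that mapping; the record schema ("tokens" / "ner_tags") is an assumption about CoNLL-style datasets, and the code is illustrative only, not edsnlp internals.

```python
# Illustrative sketch: how integer NER tags relate to tag_order.
# The record layout is assumed, not taken from the edsnlp source.
tag_order = [
    "O", "B-PER", "I-PER", "B-ORG", "I-ORG",
    "B-LOC", "I-LOC", "B-MISC", "I-MISC",
]

# A hypothetical CoNLL-style record with integer-encoded tags.
record = {"tokens": ["EU", "rejects", "German", "call"], "ner_tags": [3, 0, 7, 0]}

# Each integer tag indexes into tag_order to recover its string label.
labelled = [
    (tok, tag_order[i])
    for tok, i in zip(record["tokens"], record["ner_tags"])
]
```

If tag_order does not match the label set the dataset was encoded with, entities will be silently mislabelled, so it is worth checking it against the dataset card.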

Parameters

PARAMETER DESCRIPTION
dataset

Either a dataset identifier (e.g. "conll2003") or an already loaded datasets.Dataset / datasets.IterableDataset object.

TYPE: Union[str, Any]

split

Which split to load (e.g. "train"). If None, the default dataset split returned by datasets.load_dataset is used.

TYPE: Optional[str] DEFAULT: None

name

Configuration name for datasets with multiple configs (e.g. "en" for a multilingual dataset). Also known as the subset name.

TYPE: Optional[str] DEFAULT: None

converter

Converter(s) to transform dataset dicts to Doc objects. Recommended converters are "hf_ner" and "hf_text". More information is available in the Converters page.

TYPE: Optional[Union[str, Callable]] DEFAULT: None

shuffle

Whether to shuffle the dataset before yielding. If True or 'dataset', the whole dataset is materialized and shuffled, which may be expensive in memory for large datasets.

TYPE: Union[Literal['dataset'], bool] DEFAULT: False
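
Dataset-level shuffling can be pictured as materializing everything, then shuffling once with the given seed. The helper below is a minimal sketch of those semantics in plain Python, not edsnlp's implementation.

```python
import random

def dataset_shuffle(rows, seed=None):
    # 'dataset' shuffle semantics (illustrative): materialize all rows first,
    # so memory usage grows with dataset size.
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    return rows

# Reusing the same seed yields the same order, which is what makes
# shuffled runs reproducible.
a = dataset_shuffle(range(5), seed=42)
b = dataset_shuffle(range(5), seed=42)
```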

seed

Random seed for shuffling.

TYPE: Optional[int] DEFAULT: None

loop

Whether to loop over the dataset indefinitely.

TYPE: bool DEFAULT: False

load_kwargs

Dictionary of additional kwargs that will be passed to datasets.load_dataset().

TYPE: Optional[Dict[str, Any]] DEFAULT: None

kwargs

Additional keyword arguments passed to the converter; these are documented in the Converters page.

DEFAULT: {}

RETURNS DESCRIPTION
Stream

to_huggingface_dataset

Convert a collection/Stream of Doc objects (or already-converted dicts) into a datasets.IterableDataset.

Examples

1) Convert a Stream of HuggingFace NER examples into Doc objects (reader), process them, and create an IterableDataset of dictionaries using the hf_ner writer converter:

import edsnlp

nlp = edsnlp.blank("eds")

stream = edsnlp.data.from_huggingface_dataset(
    "lhoestq/conll2003",
    split="train",
    converter="hf_ner",
)

# Apply a pipeline or other processing
stream = stream.map_pipeline(nlp)

# Export as HF IterableDataset of dicts (no push)
hf_iter = edsnlp.data.to_huggingface_dataset(
    stream,
    converter="hf_ner",
)

2) Convert plain text Docs to HF text-format dicts:

edsnlp.data.to_huggingface_dataset(
    docs_stream,
    converter="hf_text",
    execute=True,
    # converter kwargs are validated and forwarded by
    # `get_doc2dict_converter` (e.g. `text_column`, `id_column`).
)
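
As a sketch of what an NER writer converter must do, the hypothetical helper below turns token-level entity spans back into BIO tags. The function name and span layout are illustrative assumptions, not the actual hf_ner output schema.

```python
# Illustrative-only: reconstruct token-level BIO tags from entity spans.
# spans_to_bio and its (start, end, label) span format are hypothetical,
# standing in for the Doc -> dict step a writer converter performs.
def spans_to_bio(tokens, spans):
    # spans: list of (start_token, end_token_exclusive, label)
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = f"B-{label}"       # first token of the entity
        for i in range(start + 1, end):  # continuation tokens
            tags[i] = f"I-{label}"
    return tags

tags = spans_to_bio(
    ["Jean", "Dupont", "visited", "Paris"],
    [(0, 2, "PER"), (3, 4, "LOC")],
)
```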

Parameters

PARAMETER DESCRIPTION
data

Iterable of Doc objects or a Stream. If converter is provided, the stream items are expected to be Doc objects; otherwise, items should already be mapping-like dicts.

TYPE: Union[Any, Stream]

converter

Converter name or callable used to transform each Doc into a dict before creating the dataset. Typical values are "hf_ner" and "hf_text", matching the reader converters above. Converter kwargs may be passed via **kwargs.

TYPE: Optional[Union[str, Callable]] DEFAULT: None

execute

If False, return the transformed Stream without executing it. If True (the default), execute the stream and return a datasets.IterableDataset.

TYPE: bool DEFAULT: True

**kwargs

Extra kwargs forwarded to the converter factory.

DEFAULT: {}

RETURNS DESCRIPTION
Union[IterableDataset, Dataset]

An IterableDataset containing the converted data.