Dataset Container

Interface

class deepparse.dataset_container.DatasetContainer(is_training_container: bool = True, data_cleaning_pre_processing_fn: None | Callable = None)[source]

Interface for the dataset. This interface defines most of the methods that a dataset needs to implement. If you define another dataset container, its __init__ must define the attribute data.

We also recommend using the validate_dataset method in your __init__ to validate some characteristics of your dataset.

For a training container, it validates the following:

  • no address is a None value,

  • no address is empty,

  • no address is composed of only whitespace,

  • no address includes consecutive whitespace (e.g. “An  address”),

  • no tags list is empty, if the data is a list of tuples ([('an address', ['a_tag', 'another_tag']), ...]), and

  • the addresses (whitespace-split) are the same length as their respective tags lists.

While for a predict container (unknown prediction tag), it validates the following:

  • no address is a None value,

  • no address is empty,

  • no address is composed of only whitespace, and

  • no address includes consecutive whitespace (e.g. “An  address”).

Parameters:
  • is_training_container (bool) – Whether or not the dataset container is a training container. This determines which validation tests we apply to the dataset; that is, a predict dataset doesn’t include tags. The default value is True.

  • data_cleaning_pre_processing_fn (Callable) – Function to apply as a data cleaning pre-processing step after loading the data, but before applying the validation steps. The default value is None.
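As an illustration, a data_cleaning_pre_processing_fn could normalize whitespace so that the validation tests pass. The helper below is a hypothetical example (its name and exact behaviour are assumptions, not part of the library):

```python
import re


def clean_addresses(data):
    """Hypothetical cleaning step: strip leading/trailing whitespace and
    collapse consecutive whitespace in each address.

    Operates on a training-style dataset: a list of (address, tags) tuples.
    """
    cleaned = []
    for address, tags in data:
        address = re.sub(r"\s+", " ", address.strip())
        cleaned.append((address, tags))
    return cleaned


# The double space in "305  rue" would otherwise fail the
# consecutive-whitespace validation step.
dirty = [("305  rue des Lilas ", ["StreetNumber", "StreetName", "StreetName"])]
print(clean_addresses(dirty))
```

Such a function would be passed as data_cleaning_pre_processing_fn so it runs after loading but before validation.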

Implementations

class deepparse.dataset_container.PickleDatasetContainer(data_path: str, is_training_container: bool = True, data_cleaning_pre_processing_fn: None | Callable = None)[source]

Pickle dataset container that imports a list of addresses in pickle format and does some validation on it.

The dataset needs to be a list of tuples where the first element of each tuple is the address (a string), and the second is a list of the expected tags to predict (e.g. [('an address', ['a_tag', 'another_tag']), ...]). The tags list must have the same length as the whitespace-split address.

For a training container, the validation tests applied on the dataset are the following:

  • no address is a None value,

  • no address is empty,

  • no address is composed of only whitespace,

  • no address includes consecutive whitespace (e.g. “An  address”),

  • no tags list is empty, if the data is a list of tuples ([('an address', ['a_tag', 'another_tag']), ...]), and

  • the addresses (whitespace-split) are the same length as their respective tags lists.

While for a predict container (unknown prediction tag), the validation tests applied on the dataset are the following:

  • no address is a None value,

  • no address is empty,

  • no address is composed of only whitespace, and

  • no address includes consecutive whitespace (e.g. “An  address”).

Parameters:
  • data_path (str) – The path to the pickle dataset file.

  • is_training_container (bool) – Whether or not the dataset container is a training container. This determines which validation tests we apply to the dataset; that is, a predict dataset doesn’t include tags. The default value is True.

  • data_cleaning_pre_processing_fn (Callable) – Function to apply as a data cleaning pre-processing step after loading the data, but before applying the validation steps. The default value is None.
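For example, a training dataset in the expected pickle format can be produced with the standard library alone (the file name here is arbitrary):

```python
import pickle
import tempfile
from pathlib import Path

# A training dataset: each element is (address, tags), with one tag per
# whitespace-separated token of the address.
dataset = [
    ("350 rue des Lilas Ouest Quebec city Quebec G1L 1B6",
     ["StreetNumber", "StreetName", "StreetName", "StreetName", "Orientation",
      "Municipality", "Municipality", "Province", "PostalCode", "PostalCode"]),
]

path = Path(tempfile.gettempdir()) / "training_dataset.p"
with open(path, "wb") as f:
    pickle.dump(dataset, f)

# PickleDatasetContainer(str(path)) could then load and validate this file.
with open(path, "rb") as f:
    loaded = pickle.load(f)
```

Note that each address's whitespace-split length matches its tags list, as the validation steps require.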

class deepparse.dataset_container.CSVDatasetContainer(data_path: str, column_names: List | str, is_training_container: bool = True, separator: str = '\t', tag_seperator_reformat_fn: None | Callable = None, csv_reader_kwargs: None | Dict = None, data_cleaning_pre_processing_fn: None | Callable = None)[source]

CSV dataset container that imports a CSV of addresses. If the dataset is a predict one, it must have at least one column with some addresses. If the dataset is a training one (with prediction tags), it must have at least two columns, one with some addresses and another with a list of tags for each address.

After loading the CSV dataset, some tests will be applied depending on its type.

For a training container, the validation tests applied on the dataset are the following:

  • no address is a None value,

  • no address is empty,

  • no address is composed of only whitespace,

  • no address includes consecutive whitespace (e.g. “An  address”),

  • no tags list is empty, if the data is a list of tuples ([('an address', ['a_tag', 'another_tag']), ...]), and

  • the addresses (whitespace-split) are the same length as their respective tags lists.

While for a predict container (unknown prediction tag), the validation tests applied on the dataset are the following:

  • no address is a None value,

  • no address is empty,

  • no address is composed of only whitespace, and

  • no address includes consecutive whitespace (e.g. “An  address”).

Parameters:
  • data_path (str) – The path to the CSV dataset file.

  • column_names (list) – A list of column names used to extract the dataset elements. If the dataset container is a “predict” one, the list must contain exactly one element (i.e. the address column). On the other hand, if the dataset container is a “training” one, the list must contain exactly two elements: addresses and tags.

  • is_training_container (bool) – Whether or not the dataset container is a training container. This determines which validation tests we apply to the dataset; that is, a predict dataset doesn’t include tags. The default value is True.

  • separator (str) – The CSV columns separator to use. By default, "\t".

  • tag_seperator_reformat_fn (Callable, optional) – A function to parse a tags string and return a list of address tags. For example, if the tag column is a former Python list saved with pandas, the characters [, ] and ' will be included as part of the tags’ elements. Thus, a parsing function will take a string as its parameter and output a Python list. The default function processes the string as a former Python list. That is, it removes the [ and ] characters and splits the sequence at each comma (",").

  • csv_reader_kwargs (dict, optional) – Keyword arguments to pass to the pandas read_csv used internally. By default, the data_path is passed along with our default sep value ("\t") and the "utf-8" encoding format. However, this can be overridden by using this argument.

  • data_cleaning_pre_processing_fn (Callable) – Function to apply as a data cleaning pre-processing step after loading the data, but before applying the validation steps. The default value is None.
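To illustrate the tag column format, a training CSV can hold the tags as a stringified Python list, the way pandas would save one. The parsing function below is a sketch of what a tag_seperator_reformat_fn could do, written from the documented default behaviour; it is an assumption, not the library's actual code:

```python
import csv
import tempfile
from pathlib import Path

# Write a tab-separated training CSV: an address column and a tags column
# holding a stringified Python list.
path = Path(tempfile.gettempdir()) / "training_dataset.csv"
with open(path, "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f, delimiter="\t")
    writer.writerow(["Address", "Tags"])
    writer.writerow(["350 rue des Lilas",
                     "['StreetNumber', 'StreetName', 'StreetName', 'StreetName']"])


def parse_tags(tag_string):
    """Sketch of a tag parsing function: strip the [ and ] characters and
    the quotes, then split the sequence at each comma."""
    return [tag.strip().strip("'") for tag in tag_string.strip("[]").split(",")]


with open(path, encoding="utf-8") as f:
    rows = list(csv.DictReader(f, delimiter="\t"))
tags = parse_tags(rows[0]["Tags"])
print(tags)  # ['StreetNumber', 'StreetName', 'StreetName', 'StreetName']
```

A CSVDatasetContainer pointed at such a file with column_names=["Address", "Tags"] would apply the same kind of parsing to recover the tags lists.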

class deepparse.dataset_container.ListDatasetContainer(data: List, is_training_container: bool = True, data_cleaning_pre_processing_fn: None | Callable = None)[source]

List dataset container that loads a list dataset into a DatasetContainer class. It also validates the dataset.

Parameters:
  • data (list) – The dataset in a list format. The list format (if a train or test container) is identical to that of the PickleDatasetContainer.

  • is_training_container (bool) – Whether or not the dataset container is a training container. This determines which validation tests we apply to the dataset; that is, a predict dataset doesn’t include tags. The default value is True.

  • data_cleaning_pre_processing_fn (Callable) – Function to apply as a data cleaning pre-processing step after loading the data, but before applying the validation steps. The default value is None.

Dataset Validation Steps

We also apply data validation to all data containers using the following three functions.

deepparse.data_validation.data_validation.validate_if_any_empty(string_elements: List) → bool[source]

Return True if one of the string elements is empty. For example, the second element in the following list is an empty address: ["An address", "", "Another address"]. Thus, it will return True.

Parameters:

string_elements (list) – A list of strings to validate.

deepparse.data_validation.data_validation.validate_if_any_whitespace_only(string_elements: List) → bool[source]

Return True if one of the string elements is only whitespace. For example, the second element in the following list is only whitespace: ["An address", " ", "Another address"]. Thus, it will return True.

Parameters:

string_elements (list) – A list of strings to validate.

deepparse.data_validation.data_validation.validate_if_any_none(string_elements: List) → bool[source]

Return True if one of the string elements is a None value. For example, the second element in the following list is a None value: ["An address", None, "Another address"]. Thus, it will return True.

Parameters:

string_elements (list) – A list of strings to validate.
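Under the assumption that these checks behave exactly as documented, minimal equivalent implementations would look like the following; the library's actual code may differ in details:

```python
from typing import List, Optional


def validate_if_any_empty(string_elements: List[Optional[str]]) -> bool:
    # True if at least one element is the empty string.
    return any(element == "" for element in string_elements)


def validate_if_any_whitespace_only(string_elements: List[Optional[str]]) -> bool:
    # True if at least one non-empty element contains only whitespace.
    return any(element is not None and element != "" and element.strip() == ""
               for element in string_elements)


def validate_if_any_none(string_elements: List[Optional[str]]) -> bool:
    # True if at least one element is a None value.
    return any(element is None for element in string_elements)


# The examples from the documentation above:
assert validate_if_any_empty(["An address", "", "Another address"])
assert validate_if_any_whitespace_only(["An address", " ", "Another address"])
assert validate_if_any_none(["An address", None, "Another address"])
```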