Parser

Pre-trained Complete Model

This is the complete pretrained address parser model. It uses pretrained weights to predict the tags of any address.

For now, we offer two pretrained models, FastText and BPEmb. The first relies on fastText French pretrained embeddings to parse addresses, and the second uses byte-pair multilingual subword pretrained embeddings. In both cases, the architecture and performance are similar; our results are available in this article.

Memory Usage and Time Performance

To assess memory usage and inference time performance, we have conducted an experiment using Linux OS, Python 3.11, Torch 2.0 and CUDA 11.7 (done March 21, 2023). The next two tables report the results. In each table, we report the RAM usage, and in the first table, we also report the GPU memory usage. Also, for both tables, we report the mean-time of execution that was obtained by processing ~183,000 addresses using different batch sizes (2^0, …, 2^9) (i.e. \(\frac{\text{Total time to process all addresses}}{~183,000} =\) time per address). In addition, we proposed a lighter version ("fasttext-light") of our fastText model using Magnitude embeddings mapping. For this lighter model, on average, results are a little bit lower for the trained country (around ~2%) but are similar for the zero-shot country (see our article for more details).

With a GPU

Model             | GPU memory usage (GB) | RAM usage (GB) | Mean time, batch of 1 (s) | Mean time, batch > 1 (s)
------------------|-----------------------|----------------|---------------------------|-------------------------
fastText [1]      | ~1                    | ~8             | ~0.0023                   | ~0.0004
fastTextAttention | ~1.1                  | ~8             | ~0.0043                   | ~0.0007
fastText-light    | ~1                    | ~1             | ~0.0028                   | ~0.0037
BPEmb             | ~1                    | ~1             | ~0.0055                   | ~0.0015
BPEmbAttention    | ~1.1                  | ~1             | ~0.0081                   | ~0.0019
Libpostal         | N/A                   | N/A            | N/A                       | ~0.00004

With a CPU

Model             | RAM usage (GB) | Mean time, batch of 1 (s) | Mean time, batch > 1 (s)
------------------|----------------|---------------------------|-------------------------
fastText [2]      | ~8             | ~0.0128                   | ~0.0026
fastTextAttention | ~8             | ~0.0230                   | ~0.0057
fastText-light    | ~1             | ~0.0170                   | ~0.0030
BPEmb             | ~1             | ~0.0179                   | ~0.0044
BPEmbAttention    | ~1             | ~0.0286                   | ~0.0075
Libpostal         | N/A            | N/A                       | ~0.00004

Thus, the more addresses there are, the faster each address can be processed on average. You can also improve performance by using more workers for the data loader created from your data within the call, but note that this improvement is not linear. Furthermore, as of version 0.9.6, we use Torch 2.0 and several other tricks to improve processing performance. For example, if the parser uses a GPU, it pins the memory in the DataLoader and avoids some unnecessary operations (e.g. useless .to(device) calls).
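As a rough illustration of the methodology above, here is a minimal timing sketch that measures the mean time per address for a few batch sizes (the address list is a placeholder; actual numbers depend on your hardware and data):

import time

from deepparse.parser import AddressParser

address_parser = AddressParser(model_type="fasttext", device=0)  # On GPU device 0

# Placeholder data: replace with your own list of addresses
addresses = ["350 rue des Lilas Ouest Quebec city Quebec G1L 1B6"] * 10000

for batch_size in (1, 32, 256):
    start = time.perf_counter()
    address_parser(addresses, batch_size=batch_size)
    elapsed = time.perf_counter() - start
    # Mean time per address = total time to process all addresses / number of addresses
    print(f"batch size {batch_size}: {elapsed / len(addresses):.6f} s per address")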

AddressParser

class deepparse.parser.AddressParser(model_type: str = 'best', attention_mechanism: bool = False, device: int | str | device = 0, rounding: int = 4, verbose: bool = True, path_to_retrained_model: S3Path | str | None = None, cache_dir: str | None = None, offline: bool = False)[source]

Address parser to parse an address or a list of addresses using one of the seq2seq pretrained networks, either with FastText or BPEmb embeddings. The default prediction tags are the following:

  • "StreetNumber": for the street number,

  • "StreetName": for the name of the street,

  • "Unit": for the unit (such as an apartment),

  • "Municipality": for the municipality,

  • "Province": for the province or local region,

  • "PostalCode": for the postal code,

  • "Orientation": for the street orientation (e.g. west, east),

  • "GeneralDelivery": for other delivery information,

  • "EOS": (End Of Sequence) since we use an EOS during training, sometimes the models return an EOS tag.

Parameters:
  • model_type (str) –

    The network name to use, can be either:

    • "fasttext" (need ~9 GO of RAM to be used),

    • "fasttext-light" (need ~2 GO of RAM to be used, but slower than fasttext version),

    • "bpemb" (need ~2 GO of RAM to be used),

    • "fastest" (quicker to process one address) (equivalent to "fasttext"),

    • "lightest" (the one using the less RAM and GPU usage) (equivalent to "fasttext-light"),

    • "best" (the best accuracy performance) (equivalent to "bpemb").

    The default value is "best" for the most accurate model. Ignored if path_to_model_weights is not None. To further improve performance, consider using the models (fasttext or BPEmb) with their counterparts using an attention mechanism with the attention_mechanism flag.

  • attention_mechanism (bool) – Whether to use the model with an attention mechanism. The attention mechanism takes an extra ~100 MB of GPU memory (see the documentation for more statistics). The default value is False.

  • device (Union[int, str, torch.device]) –

    The device to use can be either:

    • a GPU index in int format (e.g. 0),

    • a complete device name in a string format (e.g. "cuda:0"),

    • a device object,

    • "cpu" for a CPU use.

    The default value is 0, which is the GPU device with index 0 if it exists. Otherwise, the CPU is used.

  • rounding (int) – The number of decimal digits to round the tag probabilities to. The default value is 4, namely four digits.

  • verbose (bool) – Turn on/off the verbosity of the model weights download and loading. The default value is True.

  • path_to_retrained_model (Union[S3Path, str, None]) – The path to the retrained model to use for prediction. We will infer the model_type of the retrained model. The default value is None, meaning we use our pretrained model. If the retrained model uses an attention mechanism, attention_mechanism needs to be set to True. The path_to_retrained_model can also be an S3-like (Azure, AWS, Google) bucket URI string path (e.g. "s3://path/to/aws/s3/bucket.ckpt"), or an S3Path S3-like URI using cloudpathlib to handle S3-like buckets. See cloudpathlib <https://cloudpathlib.drivendata.org/stable/> for details on supported S3 bucket providers and URI conditions.

  • cache_dir (Union[str, None]) – The path to the cached directory to use for downloading (and loading) the embeddings model and the model pretrained weights.

  • offline (bool) – Whether or not the model is an offline one, meaning you have already downloaded the pretrained weights and the embeddings weights into either the default Deepparse cache directory ("~/.cache/deepparse") or the cache_dir directory. When offline, we do not verify whether the model is the latest. You can use our download_models CLI function to download all the requirements for a model. The default value is False (not an offline parsing model).

Note

For both networks, we will download the pretrained weights and embeddings into the .cache directory of the root user. The pretrained weights take at most 44 MB. The FastText embeddings take 6.8 GB, the FastText-light ("fasttext-light") embeddings take 3.3 GB, and the BPEmb embeddings take 116 MB (in ".cache/bpemb").

Also, one can download all the dependencies of our pretrained models using our CLI (e.g. download_model fasttext) before sending them to a node without Internet access.

Here are the URLs to download our pretrained models directly

Note

Since Windows uses spawn instead of fork for multiprocessing (for the data loading pre-processing when num_workers > 0), we use the Gensim model, which takes more RAM (~10 GB) than the fastText one (~8 GB). It also takes a longer time to load. See the issue here.

Note

You may observe a 100% CPU load the first time you call the fasttext-light model. We hypothesize that this is due to the SQLite database behind pymagnitude. This approach creates a cache to speed up processing, and since the memory mapping is saved between runs, the load is more intensive the first time the model is called and does not appear on subsequent calls.

Examples

address_parser = AddressParser(device=0) # On GPU device 0
parse_address = address_parser("350 rue des Lilas Ouest Quebec city Quebec G1L 1B6")

address_parser = AddressParser(model_type="fasttext", device="cpu") # fasttext model on cpu
parse_address = address_parser("350 rue des Lilas Ouest Quebec city Quebec G1L 1B6")

Using a model with an attention mechanism

# fastText model with an attention mechanism
address_parser = AddressParser(model_type="fasttext", attention_mechanism=True)
parse_address = address_parser("350 rue des Lilas Ouest Quebec city Quebec G1L 1B6")

Using a retrained model

address_parser = AddressParser(model_type="fasttext",
                               path_to_model_weights="/path_to_a_retrain_fasttext_model.ckpt")
parse_address = address_parser("350 rue des Lilas Ouest Quebec city Quebec G1L 1B6")

Using a retrained model trained on different tags

# We don't give the model_type since it's ignored when using path_to_retrained_model
address_parser = AddressParser(path_to_retrained_model="/path_to_a_retrain_fasttext_model.ckpt")
parse_address = address_parser("350 rue des Lilas Ouest Quebec city Quebec G1L 1B6")

Using a retrained model with attention

address_parser = AddressParser(model_type="fasttext",
                               path_to_model_weights="/path_to_a_retrain_fasttext_attention_model.ckpt",
                               attention_mechanism=True)
parse_address = address_parser("350 rue des Lilas Ouest Quebec city Quebec G1L 1B6")

Using Deepparse as an offline service (assuming all dependencies have been downloaded in the default cache dir or a specified dir using the cache_dir parameter).

   address_parser = AddressParser(model_type="fasttext",
                                  offline=True)
   parse_address = address_parser("350 rue des Lilas Ouest Quebec city Quebec G1L 1B6")

Using a retrained model in an S3-like bucket.
   address_parser = AddressParser(model_type="fasttext",
                                  path_to_model_weights="s3://path/to/bucket.ckpt")
   parse_address = address_parser("350 rue des Lilas Ouest Quebec city Quebec G1L 1B6")

Using a retrained model in an S3-like bucket using CloudPathLib.
address_parser = AddressParser(model_type="fasttext",
                               path_to_retrained_model=CloudPath("s3://path/to/bucket.ckpt"))
parse_address = address_parser("350 rue des Lilas Ouest Quebec city Quebec G1L 1B6")
__call__(addresses_to_parse: List[str] | str | DatasetContainer, with_prob: bool = False, batch_size: int = 32, num_workers: int = 0, with_hyphen_split: bool = False, pre_processors: None | List[Callable] = None) FormattedParsedAddress | List[FormattedParsedAddress][source]

Callable method to parse the components of an address or a list of addresses.

Parameters:
  • addresses_to_parse (Union[list[str], str, DatasetContainer]) –

    The addresses to be parsed; can be either a single address (a str), a list of addresses, or a DatasetContainer. If the data to parse is a string or a list of strings, we apply some validation tests before parsing to validate its content. We apply the following basic criteria:

    • no address is a None value,

    • no address is an empty string, and

    • no address is a whitespace-only string.

    The addresses are processed in batches when using a list of addresses, allowing faster processing. For example, using the FastText model, a single address takes around 0.0023 seconds to be parsed using a batch of 1 (one element at a time is processed). This time can be reduced to 0.00035 seconds per address when using a batch of 128 (128 elements at a time are processed).

  • with_prob (bool) – If True, returns the probabilities of all the tags with the specified rounding.

  • batch_size (int) – The batch size (by default, 32).

  • num_workers (int) – Number of workers for the data loader (default is 0, meaning the data will be loaded in the main process).

  • with_hyphen_split (bool) – Whether or not to use the hyphen-split whitespace replacement for countries that use a hyphen between the unit and the street number (e.g. Canada). For example, '3-305' will be replaced by '3 305' for the parsing, where '3' is the unit and '305' is the street number. We use a regular expression to replace alphanumerical characters separated by a hyphen at the start of the string. We do so since some cities use hyphens in their names. The default is False. If True, it adds the hyphen_cleaning() pre-processor at the end of the pre-processor list to apply.

  • pre_processors (Union[None, List[Callable]]) – A list of functions (callables) to apply as pre-processing on all the addresses before parsing. See Pre-Processors for examples of pre-processors. Since the models were trained on lowercase data, we always apply a lowercase pre-processor during parsing. If you pass a list of pre-processors, a lowercase pre-processor is added at the end of the list to apply. By default, None, meaning we use the default setup, which is (in order): comma removal, lowercasing, double-whitespace cleaning, and trailing-whitespace removal.

Returns:

Either a FormattedParsedAddress or a list of FormattedParsedAddress when given more than one address.

Note

Since the models were trained on lowercase data, we always apply a lowercase pre-processor during parsing.

Examples

address_parser = AddressParser(device=0)  # On GPU device 0
parse_address = address_parser("350 rue des Lilas Ouest Quebec city Quebec G1L 1B6")

# It also can be a list of addresses
parse_address = address_parser(["350 rue des Lilas Ouest Quebec city Quebec G1L 1B6",
                                "350 rue des Lilas Ouest Quebec city Quebec G1L 1B6"])

# It can also output the prob of the predictions
parse_address = address_parser("350 rue des Lilas Ouest Quebec city Quebec G1L 1B6",
                               with_prob=True)

# Print the parsed address
print(parse_address)

Using a larger batch size

address_parser = AddressParser(device=0) # On GPU device 0
parse_address = address_parser(a_large_list_dataset, batch_size=1024)

# You can also use more workers
parse_address = address_parser(a_large_list_dataset, batch_size=1024, num_workers=2)

Or using one of our dataset containers

addresses_to_parse = CSVDatasetContainer("./a_path.csv", column_names=["address_column_name"],
                                         is_training_container=False)
address_parser(addresses_to_parse)

Using a user-defined pre-processor

def strip_parenthesis(address):
    return address.strip("(").strip(")")

address_parser(addresses_to_parse, pre_processors=[strip_parenthesis])
# It will also use the default lower case pre-processor.
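Building on the examples above, here is a minimal sketch of the with_hyphen_split flag on a Canadian-style address with a hyphen between the unit and the street number (the address is illustrative):

# '3-305' is replaced by '3 305' before parsing, where '3' is the unit and '305' the street number
parse_address = address_parser("3-305 rue des Lilas Ouest Quebec city Quebec G1L 1B6",
                               with_hyphen_split=True)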
get_formatted_model_name() str[source]

Return the model type formatted name. For example, if the model type is "fasttext", the formatted name is "FastText".

retrain(train_dataset_container: DatasetContainer, val_dataset_container: DatasetContainer | None = None, train_ratio: float = 0.8, batch_size: int = 32, epochs: int = 5, num_workers: int = 1, learning_rate: float = 0.01, callbacks: List | None = None, seed: int = 42, logging_path: str = './checkpoints', disable_tensorboard: bool = True, prediction_tags: Dict | None = None, seq2seq_params: Dict | None = None, layers_to_freeze: str | None = None, name_of_the_retrain_parser: None | str = None, verbose: None | bool = None) List[Dict][source]

Method to retrain the address parser model using a dataset with the same tags. We train using the Experiment class from the Poutyne framework. The Experiment module allows us to save checkpoints (ckpt, in a pickle format) and a log.tsv file where the best epoch can be found (the best epoch is used for the test). The retrained model file name is formatted as retrained_{model_type}_address_parser.ckpt. For example, if you retrain a FastText model, the file name will be retrained_fasttext_address_parser.ckpt. The retrained saved model includes, in a dictionary format, the model weights, the model type, the new prediction tags if new prediction_tags were used, and the new seq2seq parameters if new seq2seq_params were used.

Parameters:
  • train_dataset_container (DatasetContainer) –

    The train dataset container of the training data to use, such as any PyTorch Dataset (Dataset) user-defined class or one of our DatasetContainer classes (PickleDatasetContainer, CSVDatasetContainer or ListDatasetContainer). The training dataset is used in two ways:

    1. As-is if a validating dataset is provided (val_dataset_container).

    2. Split into a training and a validation dataset if val_dataset_container is set to None.

    Thus, if val_dataset_container is left to its default value of None, we use the train_ratio argument to split the training dataset into a train and a validation dataset. See the examples for more details.

  • val_dataset_container (Union[DatasetContainer, None]) – The validation dataset container to use for validating the model (by default, None).

  • train_ratio (float) – The ratio of the train_dataset_container to use for the training procedure. The rest of the data is used for validation (e.g. a training ratio of 0.8 means an 80-20 train-valid split). The default value is 0.8. The argument is ignored if val_dataset_container is not None.

  • batch_size (int) – The size of the batch (by default, 32).

  • epochs (int) – The number of training epochs (by default, 5).

  • num_workers (int) – The number of workers to use for the data loader (by default, 1 worker).

  • learning_rate (float) – The learning rate (LR) to use for training (default 0.01).

  • callbacks (Union[list, None]) – List of callbacks to use during training. See Poutyne callback for more information. By default, we set no callback.

  • seed (int) – The seed to use (default 42).

  • logging_path (str) – The logging path for the checkpoints. Poutyne will use the best one and reload the state if any checkpoints are there. Thus, an error will be raised if you change the model type, for example, if you retrain a FastText model and then retrain a BPEmb model in the same logging path directory. The logging_path can also be an S3-like (Azure, AWS, Google) bucket URI string path (e.g. "s3://path/to/aws/s3/bucket.ckpt"), or an S3Path S3-like URI using cloudpathlib to handle S3-like buckets. See cloudpathlib <https://cloudpathlib.drivendata.org/stable/> for details on supported S3 bucket providers and URI conditions. If the logging_path is an S3 bucket, we will only save the best checkpoint to the S3 bucket at the end of training. By default, the path is ./checkpoints.

  • disable_tensorboard (bool) – To disable Poutyne's automatic TensorBoard monitoring. By default, we disable it (True).

  • prediction_tags (Union[dict, None]) – A dictionary where the keys are the address components (e.g. street name) and the values are the component indices (from 0 to N + 1) to use during the retraining of a model. The + 1 corresponds to the End Of Sequence (EOS) token that needs to be included in the dictionary. We use this dictionary's length for the prediction layer's output size. We also save the dictionary to be used later on when you load the model. The default value is None, meaning we use our pretrained model's prediction tags.

  • seq2seq_params (Union[dict, None]) –

    A dictionary of seq2seq parameters to modify the seq2seq architecture to train. Note that if you change the seq2seq parameters, a new model will be trained from scratch. Parameters that can be modified are:

    • The input_size of the encoder (i.e. the size of the embedding). The default value is 300.

    • The encoder_hidden_size of the encoder. The default value is 1024.

    • The encoder_num_layers (number of layers) of the encoder. The default value is 1.

    • The decoder_hidden_size of the decoder. The default value is 1024.

    • The decoder_num_layers (number of layers) of the decoder. The default value is 1.

    The default value is None, meaning we use the default seq2seq architecture.

  • layers_to_freeze (Union[str, None]) –

    Name of the portion of the seq2seq whose layers are to be frozen, which reduces the number of parameters to learn. It will be ignored if seq2seq_params is not None. A seq2seq is composed of three parts: an encoder, a decoder, and a prediction layer. The encoder is the part that encodes the address into a denser representation. The decoder is the part that decodes a dense address representation. Finally, the prediction layer is a fully-connected layer with an output size of the same length as the prediction tags. Available freezing settings are:

    • None: No layers are frozen.

    • "encoder": To freeze the encoder part of the seq2seq.

    • "decoder": To freeze the decoder part of the seq2seq.

    • "prediction_layer": To freeze the last layer that predicts a tag class .

    • "seq2seq": To freeze the encoder and decoder but not the prediction layer.

    The default value is None, meaning we do not freeze any layers.

  • name_of_the_retrain_parser (Union[str, None]) –

    Name to give to the retrained parser; it will be used as the printed name when reloaded, and as the saving file name (note that we will manually add the extension ".ckpt" for the file name). By default, None.

    By default, the parser name is built from the training settings using the following pattern:

    • the pretrained architecture ('fasttext' or 'bpemb', and whether an attention mechanism is used),

    • if prediction_tags is not None, the following tag: ModifiedPredictionTags,

    • if seq2seq_params is not None, the following tag: ModifiedSeq2SeqConfiguration, and

    • if layers_to_freeze is not None, the following tag: FreezedLayer{portion}.

  • verbose (Union[None, bool]) – To override the AddressParser verbosity for the test. When set to True or False, it will override (but it does not change the AddressParser verbosity) the test verbosity. If set to the default value None, the AddressParser verbosity is used as the test verbosity.

Returns:

A list of dictionaries with the best epoch stats (see the Experiment class for details). The retrained model is saved using either the default file name or the name_of_the_retrain_parser name. See the last note for more details.

Note

We recommend using a learning rate scheduler during retraining to reduce the chance of losing too much of the learned weights, which would increase retraining time. We personally use poutyne.StepLR(step_size=1, gamma=0.1). Also, the starting learning rate should be relatively low (i.e. 0.01 or lower).

Note

We use the SGD optimizer, NLL loss and accuracy as a metric; the data is shuffled, and we use teacher forcing during training (with a probability of 0.5), as in the article.

Note

Due to pymagnitude, we could not train using the Magnitude embeddings, meaning it is not possible to train using the fasttext-light model. However, since we do not update the embedding weights, one can retrain using the fasttext model and later use the weights with fasttext-light.

Note

When retraining a model, Poutyne will create checkpoints. After the training, we use the best checkpoint in the directory as the model to load. Thus, if you train two different models in the same directory, the second retraining will not work due to model differences.

Note

The default file name used to save the retrained model follows the pattern "retrained_{model_type}_address_parser.ckpt" if name_of_the_retrain_parser is set to None. Otherwise, the file name to save the retrained model will be name_of_the_retrain_parser plus the file extension ".ckpt".

Examples

address_parser = AddressParser(device=0) # On GPU device 0
data_path = "path_to_a_pickle_dataset.p"

container = PickleDatasetContainer(data_path)

# The validation dataset is created from the training dataset (container)
# 80% of the data is used for training and 20% as a validation dataset
address_parser.retrain(container, train_ratio=0.8, epochs=1, batch_size=128)

Using the freezing layer’s parameters to freeze layers during training

address_parser = AddressParser(device=0)

data_path = "path_to_a_csv_dataset.p"
container = CSVDatasetContainer(data_path)

val_data_path = "path_to_a_csv_val_dataset.p"
val_container = CSVDatasetContainer(val_data_path)

# We provide the training dataset (container) and the val dataset (val_container)
# Thus, the train_ratio argument is ignored, and we use the val_container instead
# as the validating dataset.
address_parser.retrain(container, val_container, epochs=5, batch_size=128,
                       layers_to_freeze="encoder")

Using learning rate scheduler callback.

import poutyne

address_parser = AddressParser(device=0)
data_path = "path_to_a_csv_dataset.p"

container = CSVDatasetContainer(data_path)

lr_scheduler = poutyne.StepLR(step_size=1, gamma=0.1) # reduce LR by a factor of 10 each epoch
address_parser.retrain(container, train_ratio=0.8, epochs=5, batch_size=128, callbacks=[lr_scheduler])

Using your own prediction tags dictionary.

address_components = {"ATag":0, "AnotherTag": 1, "EOS": 2}

address_parser = AddressParser(device=0) # On GPU device 0
data_path = "path_to_a_pickle_dataset.p"

container = PickleDatasetContainer(data_path)

address_parser.retrain(container, train_ratio=0.8, epochs=1, batch_size=128,
                       prediction_tags=address_components)

Using your own seq2seq parameters.

seq2seq_params = {"encoder_hidden_size": 512, "decoder_hidden_size": 512}

address_parser = AddressParser(device=0) # On GPU device 0
data_path = "path_to_a_pickle_dataset.p"

container = PickleDatasetContainer(data_path)

address_parser.retrain(container, train_ratio=0.8, epochs=1, batch_size=128,
                       seq2seq_params=seq2seq_params)

Using your own seq2seq parameters and prediction tags dictionary.

seq2seq_params = {"encoder_hidden_size": 512, "decoder_hidden_size": 512}
address_components = {"ATag":0, "AnotherTag": 1, "EOS": 2}

address_parser = AddressParser(device=0) # On GPU device 0
data_path = "path_to_a_pickle_dataset.p"

container = PickleDatasetContainer(data_path)

address_parser.retrain(container, train_ratio=0.8, epochs=1, batch_size=128,
                       seq2seq_params=seq2seq_params, prediction_tags=address_components)

Using a named retrain parser name.

address_parser = AddressParser(device=0) # On GPU device 0
data_path = "path_to_a_pickle_dataset.p"

container = PickleDatasetContainer(data_path)

address_parser.retrain(container, train_ratio=0.8, epochs=1, batch_size=128,
    name_of_the_retrain_parser="MyParserName")
test(test_dataset_container: DatasetContainer, batch_size: int = 32, num_workers: int = 1, callbacks: List | None = None, seed: int = 42, verbose: None | bool = None) Dict[source]

Method to test a retrained or a pretrained model using a dataset with the default tags. If you test a retrained model that uses different prediction tags, we will use those tags.

Parameters:
  • test_dataset_container (DatasetContainer) – The test dataset container of the data to use.

  • batch_size (int) – The batch size (by default, 32).

  • num_workers (int) – Number of workers to use for the data loader (by default, 1 worker).

  • callbacks (Union[list, None]) –

    List of callbacks to use during the test. See Poutyne callback for more information. By default, we set no callback.

  • seed (int) – Seed to use (by default, 42).

  • verbose (Union[None, bool]) – To override the AddressParser verbosity for the test. When set to True or False, it will override (but it does not change the AddressParser verbosity) the test verbosity. If set to the default value None, the AddressParser verbosity is used as the test verbosity.

Returns:

A dictionary with the stats (see Experiment class for details).

Note

We use NLL loss and accuracy as in the article.

Examples

address_parser = AddressParser(device=0, verbose=True) # On GPU device 0
data_path = "path_to_a_pickle_test_dataset.p"

test_container = PickleDatasetContainer(data_path, is_training_container=False)

# We test the model on the data, and we override the test verbosity
address_parser.test(test_container, verbose=False)

You can also test your fine-tuned model

address_components = {"ATag":0, "AnotherTag": 1, "EOS": 2}

address_parser = AddressParser(device=0) # On GPU device 0

# Train phase
data_path = "path_to_a_pickle_train_dataset.p"

train_container = PickleDatasetContainer(data_path)

address_parser.retrain(train_container, train_ratio=0.8, epochs=1, batch_size=128,
                       prediction_tags=address_components)

# Test phase
data_path = "path_to_a_pickle_test_dataset.p"

test_container = PickleDatasetContainer(data_path, is_training_container=False)

address_parser.test(test_container) # Test the retrained model
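The returned statistics can also be captured for later inspection; a minimal sketch (the exact dictionary keys come from the Poutyne Experiment class):

results = address_parser.test(test_container)
print(results)  # A dictionary of test statistics (e.g. loss and accuracy)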
save_model_weights(file_path: str | Path) None[source]

Method to save, in a Pickle format, the address parser model weights (PyTorch state dictionary).

file_path (Union[str, Path]): A complete file path with a pickle extension to save the model weights to.

It can either be a string (e.g. 'path/to/save.p') or a path-like object (e.g. Path('path/to/save.p')).

Examples

address_parser = AddressParser(device=0)

a_path = Path('some/path/to/save.p')
address_parser.save_model_weights(a_path)

address_parser = AddressParser(device=0)

a_path = 'some/path/to/save.p'
address_parser.save_model_weights(a_path)

Formatted Parsed Address

class deepparse.parser.FormattedParsedAddress(address: Dict)[source]

A parsed address, as returned by an address parser.

Parameters:

address (dict) – A dictionary where the key is an address, and the value is a list of tuples where the first elements are the address components and the second elements are the parsed values. The second element of each tuple can either be the tag of the component (e.g. StreetName) or a tuple (x, y) where x is the tag and y is the probability (e.g. 0.9981) of the model prediction.

raw_address

The raw address (not parsed).

address_parsed_components

The parsed address in a list of tuples where the first elements are the address components and the second elements are the tags.

<Address tag>

All the possible address tag elements of the model. For example, StreetName or StreetNumber.

Example

address_parser = AddressParser()
parse_address = address_parser("350 rue des Lilas Ouest Quebec city Quebec G1L 1B6")
print(parse_address.StreetNumber) # 350
print(parse_address.PostalCode) # G1L 1B6

# Print the parsed address
print(parse_address)

Note

Since an address component can be composed of multiple elements (e.g. Wolfe street), the address component attributes do not keep the probability values when they are requested from the address parser. The probabilities are only available through the address_parsed_components attribute.
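For example, a minimal sketch of retrieving the tags with their probabilities through address_parsed_components (the probability values shown are illustrative, not actual model output):

address_parser = AddressParser()
parse_address = address_parser("350 rue des Lilas Ouest Quebec city Quebec G1L 1B6",
                               with_prob=True)

print(parse_address.address_parsed_components)
# > [('350', ('StreetNumber', 0.9998)), ('rue', ('StreetName', 0.9981)), ...]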

format_address(fields: List | None = None, capitalize_fields: List[str] | None = None, upper_case_fields: List[str] | None = None, field_separator: str | None = None) str[source]

Method to format the address components in a specific order; the empty components (None) are filtered out. By default, the order is 'StreetNumber, Unit, StreetName, Orientation, Municipality, Province, PostalCode, GeneralDelivery'.

Parameters:
  • fields (Union[list, None]) – Optional argument to define the fields to order the address components of the address. If None, we will use the inferred order based on the address tags’ appearance. For example, if the parsed address is (305, StreetNumber), (rue, StreetName), (des, StreetName), (Lilas, StreetName), the inferred order will be StreetNumber, StreetName.

  • capitalize_fields (Union[list, None]) – Optional argument to define the capitalized fields for the formatted address. If None, no fields are capitalized.

  • upper_case_fields (Union[list, None]) – Optional argument to define the upper-cased fields for the formatted address. If None, no fields are upper-cased.

  • field_separator (Union[str, None]) – Optional argument to define the field separator between address components. If None, the default field separator is " ".

Returns:

A string of the formatted address in the fields order.

Examples

address_parser = AddressParser()
parse_address = address_parser("350 rue des Lilas Ouest Quebec city Quebec G1L 1B6")

parse_address.format_address(field_separator=", ")
# > 350, rue des lilas, ouest, quebec city, quebec, g1l 1b6

parse_address.format_address(field_separator=", ", capitalize_fields=["StreetName", "Orientation"])
# > 350, Rue des lilas, Ouest, quebec city, quebec, g1l 1b6

parse_address.format_address(field_separator=", ", upper_case_fields=["PostalCode"])
# > 350, rue des lilas, ouest, quebec city, quebec, G1L 1B6
to_dict(fields: List | None = None) dict[source]

Method to convert a parsed address into a dictionary where the keys are the address components, and the values are the value of those components. For example, the parsed address <StreetNumber> 305 <StreetName> rue des Lilas will be converted into the following dictionary: {'StreetNumber':'305', 'StreetName': 'rue des Lilas'}.

Parameters:

fields (Union[list, None]) – Optional argument to define the fields to extract from the address and their order. If None, we use the default order and values 'StreetNumber, Unit, StreetName, Orientation, Municipality, Province, PostalCode, GeneralDelivery'.

Returns:

A dictionary where the keys are the selected (or default) fields and the values are the corresponding value of the address components.
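For example, a short usage sketch (the values follow the format_address example above):

address_parser = AddressParser()
parse_address = address_parser("350 rue des Lilas Ouest Quebec city Quebec G1L 1B6")

print(parse_address.to_dict(fields=["StreetNumber", "StreetName", "PostalCode"]))
# > {'StreetNumber': '350', 'StreetName': 'rue des lilas', 'PostalCode': 'g1l 1b6'}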

to_list_of_tuples(fields: List | None = None) List[tuple][source]

Method to convert a parsed address into a list of tuples where the first element of each tuple is the value of the component, and the second element is the name of the component.

For example, the parsed address <StreetNumber> 305 <StreetName> rue des Lilas will be converted into the following list of tuples: [('305', 'StreetNumber'), ('rue des Lilas', 'StreetName')].

Parameters:

fields (Union[list, None]) – Optional argument to define the fields to extract from the address and their order. If None, it will use the default order and values 'StreetNumber, Unit, StreetName, Orientation, Municipality, Province, PostalCode, GeneralDelivery'.

Returns:

A list of tuples where the first element of each tuple is the value of the address component and the second is the name of the address component.

to_pandas() Dict[source]

Method to convert a parsed address into a dictionary for pandas where the first key is the raw address and the following keys are the address components, and the values are the values of those components. For example, the parsed address <StreetNumber> 305 <StreetName> rue des Lilas will be converted into the following dictionary: {'Address': '305 rue des Lilas', 'StreetNumber':'305', 'StreetName': 'rue des Lilas'}.

Returns:

A dictionary of the raw address and all its parsed components.
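For example, a minimal sketch of building a pandas DataFrame from parsed addresses (the pandas usage itself is not part of Deepparse):

import pandas as pd

address_parser = AddressParser()
parsed_addresses = address_parser(["350 rue des Lilas Ouest Quebec city Quebec G1L 1B6",
                                   "350 rue des Lilas Ouest Quebec city Quebec G1L 1B6"])

# One row per address: the raw address plus one column per parsed component
df = pd.DataFrame([parsed_address.to_pandas() for parsed_address in parsed_addresses])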

to_pickle() Tuple[str, List][source]

Method to convert a parsed address into a tuple for pickling, where the first element is the raw address and the second element is a list of tuples of the address components and their values. For example, the parsed address <StreetNumber> 305 <StreetName> rue des Lilas will be converted into ('305 rue des Lilas', [('305', 'StreetNumber'), ('rue des Lilas', 'StreetName')]).

Returns:

A tuple where the first element is the raw address (a string) and the second element is a list of tuples of the parsed address components. The first element of each tuple is the component value, and the second is the tag.