Parser
Pre-trained Complete Model
This is the complete pretrained address parser model. It uses pretrained weights to predict the tags of any address.
For now, we offer two pretrained models, FastText and BPEmb. The first relies on fastText French pretrained embeddings to parse the address, and the second uses byte-pair multilingual subword pretrained embeddings. In both cases, the architecture and performance are similar; our results are available in this article.
Memory Usage and Time Performance
To assess memory usage and inference time performance, we conducted an experiment using Linux, Python 3.11, Torch 2.0 and CUDA 11.7 (done March 21, 2023). The next two tables report the results. In each table, we report the RAM usage, and in the first table, we also report the GPU memory usage. For both tables, we also report the mean time of execution, obtained by processing ~183,000 addresses using different batch sizes (\(2^0, \ldots, 2^9\))
(i.e. \(\frac{\text{total time to process all addresses}}{\sim 183{,}000} = \text{time per address}\)).
In addition, we propose a lighter version ("fasttext-light") of our fastText model using Magnitude embeddings mapping. For this lighter model, on average, results are slightly lower (around 2%) for the trained countries but similar for the zero-shot countries (see our article for more details).
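This time-per-address measurement can be reproduced with a short script. A minimal sketch, assuming an illustrative address list, model choice, and batch sizes (none of these values come from the experiment above):

import time

from deepparse.parser import AddressParser

address_parser = AddressParser(model_type="fasttext", device=0)
addresses = ["350 rue des Lilas Ouest Quebec city Quebec G1L 1B6"] * 1024  # illustrative dataset

for batch_size in (1, 2, 4, 8, 16, 32, 64, 128, 256, 512):
    start = time.perf_counter()
    address_parser(addresses, batch_size=batch_size)
    elapsed = time.perf_counter() - start
    # Mean time per address = total time / number of addresses
    print(f"batch_size={batch_size}: {elapsed / len(addresses):.6f} s per address")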
| With a GPU | Memory usage GPU (GB) | Memory usage RAM (GB) | Mean time of execution (batch of 1) (s) | Mean time of execution (batch of more than 1) (s) |
|---|---|---|---|---|
| fastText [1] | ~1 | ~8 | ~0.0023 | ~0.0004 |
| fastTextAttention | ~1.1 | ~8 | ~0.0043 | ~0.0007 |
| fastText-light | ~1 | ~1 | ~0.0028 | ~0.0037 |
| BPEmb | ~1 | ~1 | ~0.0055 | ~0.0015 |
| BPEmbAttention | ~1.1 | ~1 | ~0.0081 | ~0.0019 |
| Libpostal | N/A | N/A | N/A | ~0.00004 |
| With a CPU | Memory usage RAM (GB) | Mean time of execution (batch of 1) (s) | Mean time of execution (batch of more than 1) (s) |
|---|---|---|---|
| fastText [2] | ~8 | ~0.0128 | ~0.0026 |
| fastTextAttention | ~8 | ~0.0230 | ~0.0057 |
| fastText-light | ~1 | ~0.0170 | ~0.0030 |
| BPEmb | ~1 | ~0.0179 | ~0.0044 |
| BPEmbAttention | ~1 | ~0.0286 | ~0.0075 |
| Libpostal | N/A | N/A | ~0.00004 |
Note that on Windows, we use the Gensim FastText models, which use ~10 GB of RAM with similar performance.
Thus, the more addresses there are, the faster each address can be processed. You can also improve performance by using more
workers for the data loader created with your data within the call, but note that this performance improvement is not linear.
Furthermore, as of version 0.9.6, we use Torch 2.0 and other tricks to improve
processing performance. Here are a few: if the parser uses a GPU, it pins the memory in the DataLoader and skips some
unnecessary operations (e.g. useless .to(device) calls).
AddressParser
- class deepparse.parser.AddressParser(model_type: str = 'best', attention_mechanism: bool = False, device: int | str | device = 0, rounding: int = 4, verbose: bool = True, path_to_retrained_model: S3Path | str | None = None, cache_dir: str | None = None, offline: bool = False)[source]
Address parser to parse an address or a list of addresses using one of the seq2seq pretrained networks, either with FastText or BPEmb. The default prediction tags are the following:
- "StreetNumber": for the street number,
- "StreetName": for the name of the street,
- "Unit": for the unit (such as an apartment),
- "Municipality": for the municipality,
- "Province": for the province or local region,
- "PostalCode": for the postal code,
- "Orientation": for the street orientation (e.g. west, east),
- "GeneralDelivery": for other delivery information,
- "EOS": (End Of Sequence) since we use an EOS token during training, sometimes the models return an EOS tag.
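For illustration, a sketch of how the default tags map onto the example address used throughout this page (the tag assignments are inferred from the examples below, not from a real model run):

# "350 rue des Lilas Ouest Quebec city Quebec G1L 1B6"
#   StreetNumber -> "350", StreetName -> "rue des Lilas", Orientation -> "Ouest",
#   Municipality -> "Quebec city", Province -> "Quebec", PostalCode -> "G1L 1B6"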
- Parameters:
  - model_type (str) – The network name to use, can be either:
    - "fasttext" (needs ~9 GB of RAM to be used),
    - "fasttext-light" (needs ~2 GB of RAM to be used, but is slower than the fasttext version),
    - "bpemb" (needs ~2 GB of RAM to be used),
    - "fastest" (the quickest to process one address) (equivalent to "fasttext"),
    - "lightest" (the one using the least RAM and GPU memory) (equivalent to "fasttext-light"),
    - "best" (the best accuracy performance) (equivalent to "bpemb").

    The default value is "best" for the most accurate model. Ignored if path_to_retrained_model is not None. To further improve performance, consider using the models (fasttext or BPEmb) with their counterparts using an attention mechanism with the attention_mechanism flag.
  - attention_mechanism (bool) – Whether to use the model with an attention mechanism. The attention mechanism takes an extra 100 MB of GPU memory (see the documentation for more statistics). The default value is False.
  - device (Union[int, str, torch.torch.device]) – The device to use, can be either:
    - a GPU index in int format (e.g. 0),
    - a complete device name in string format (e.g. "cuda:0"),
    - a device object,
    - "cpu" for CPU use.

    The default value is 0, which is the GPU device with index 0 if it exists. Otherwise, the value is CPU.
  - rounding (int) – The rounding to use when asking for the probability of the tags. The default value is 4, namely four digits.
  - verbose (bool) – Turn on/off the verbosity of the model weights download and loading. The default value is True.
  - path_to_retrained_model (Union[S3Path, str, None]) – The path to the retrained model to use for prediction. We will infer the model_type of the retrained model. The default value is None, meaning we use our pretrained model. If the retrained model uses an attention mechanism, attention_mechanism needs to be set to True. The path_to_retrained_model can also be an S3-like (Azure, AWS, Google) bucket URI string path (e.g. "s3://path/to/aws/s3/bucket.ckpt"), or it can be an S3Path S3-like URI using cloudpathlib to handle the S3-like bucket. See cloudpathlib (https://cloudpathlib.drivendata.org/stable/) for details on supported S3 bucket providers and URI conditions. The default value is None.
  - cache_dir (Union[str, None]) – The path to the cache directory to use for downloading (and loading) the embeddings model and the pretrained model weights.
  - offline (bool) – Whether or not the model is an offline one, meaning you have already downloaded the pretrained weights and embeddings weights in either the default Deepparse cache directory ("~/.cache/deepparse") or the cache_dir directory. When offline, we will not verify whether the model is the latest. You can use our download_models CLI function to download all the requirements for a model. The default value is False (not an offline parsing model).
Note
For both networks, we will download the pretrained weights and embeddings in the .cache directory for the root user. The pretrained weights take at most 44 MB. The FastText embeddings take 6.8 GB, the FastText-light ("fasttext-light") embeddings take 3.3 GB, and the BPEmb embeddings take 116 MB (in ".cache/bpemb"). Also, one can download all the dependencies of our pretrained models using our CLI (e.g. download_model fasttext) before sending them to a node without internet access.
Here are the URLs to download our pretrained models directly
Note
Since Windows uses spawn instead of fork during multiprocessing (for the data loading pre-processing when num_workers > 0), we use the Gensim model, which takes more RAM (~10 GB) than the fastText one (~8 GB). It also takes longer to load. See the issue here.
Note
You may observe a 100% CPU load the first time you call the fasttext-light model. We hypothesize that this is due to the SQLite database behind pymagnitude. This approach creates a cache to speed up processing, and since the memory mapping is saved between runs, the load is more intensive the first time you call it; on subsequent calls, it does not appear.
Examples
address_parser = AddressParser(device=0)  # On GPU device 0
parse_address = address_parser("350 rue des Lilas Ouest Quebec city Quebec G1L 1B6")

address_parser = AddressParser(model_type="fasttext", device="cpu")  # fasttext model on cpu
parse_address = address_parser("350 rue des Lilas Ouest Quebec city Quebec G1L 1B6")
Using a model with an attention mechanism
# FastText model with an attention mechanism
address_parser = AddressParser(model_type="fasttext", attention_mechanism=True)
parse_address = address_parser("350 rue des Lilas Ouest Quebec city Quebec G1L 1B6")
Using a retrained model
address_parser = AddressParser(model_type="fasttext",
                               path_to_retrained_model="/path_to_a_retrain_fasttext_model.ckpt")
parse_address = address_parser("350 rue des Lilas Ouest Quebec city Quebec G1L 1B6")
Using a retrained model trained on different tags
# We don't give the model_type since it's ignored when using path_to_retrained_model
address_parser = AddressParser(path_to_retrained_model="/path_to_a_retrain_fasttext_model.ckpt")
parse_address = address_parser("350 rue des Lilas Ouest Quebec city Quebec G1L 1B6")
Using a retrained model with attention
address_parser = AddressParser(model_type="fasttext",
                               path_to_retrained_model="/path_to_a_retrain_fasttext_attention_model.ckpt",
                               attention_mechanism=True)
parse_address = address_parser("350 rue des Lilas Ouest Quebec city Quebec G1L 1B6")
Using Deepparse as an offline service (assuming all dependencies have been downloaded in the default cache dir or a specified dir using the cache_dir parameter).
address_parser = AddressParser(model_type="fasttext", offline=True)
parse_address = address_parser("350 rue des Lilas Ouest Quebec city Quebec G1L 1B6")

Using a retrained model in an S3-like bucket.
address_parser = AddressParser(model_type="fasttext",
                               path_to_retrained_model="s3://path/to/bucket.ckpt")
parse_address = address_parser("350 rue des Lilas Ouest Quebec city Quebec G1L 1B6")

Using a retrained model in an S3-like bucket using CloudPathLib.
address_parser = AddressParser(model_type="fasttext",
                               path_to_retrained_model=CloudPath("s3://path/to/bucket.ckpt"))
parse_address = address_parser("350 rue des Lilas Ouest Quebec city Quebec G1L 1B6")
- __call__(addresses_to_parse: List[str] | str | DatasetContainer, with_prob: bool = False, batch_size: int = 32, num_workers: int = 0, with_hyphen_split: bool = False, pre_processors: None | List[Callable] = None) FormattedParsedAddress | List[FormattedParsedAddress] [source]
Callable method to parse the components of an address or a list of addresses.
- Parameters:
  - addresses_to_parse (Union[list[str], str, DatasetContainer]) – The addresses to be parsed; can be either a single address (when using a str), a list of addresses, or a DatasetContainer. If the data to parse is a string or a list of strings, we apply some validation tests on its content before parsing. We apply the following basic criteria:
    - no address is a None value,
    - no address is an empty string, and
    - no address is a whitespace-only string.

    The addresses are processed in batches when using a list of addresses, allowing a faster process. For example, using the FastText model, a single address takes around 0.0023 seconds to be parsed using a batch size of 1 (one element processed at a time). This time can be reduced to 0.00035 seconds per address when using a batch size of 128 (128 elements processed at a time).
  - with_prob (bool) – If true, return the probability of all the tags with the specified rounding.
  - batch_size (int) – The batch size (by default, 32).
  - num_workers (int) – Number of workers for the data loader (default is 0, meaning the data will be loaded in the main process).
  - with_hyphen_split (bool) – Whether or not to use the hyphen-split whitespace replacement for countries that use a hyphen between the unit and the street number (e.g. Canada). For example, '3-305' will be replaced by '3 305' for the parsing, where '3' is the unit and '305' is the street number. We use a regular expression to replace alphanumerical characters separated by a hyphen at the start of the string. We do so since some cities use hyphens in their names. The default is False. If True, it adds the hyphen_cleaning() pre-processor at the end of the pre-processor list to apply (an example is given after the examples below).
  - pre_processors (Union[None, List[Callable]]) – A list of functions (callables) to apply as pre-processing on all the addresses to parse before parsing. See Pre-Processors for examples of pre-processors. Since the models were trained on lowercase data, during the parsing, we always apply a lowercase pre-processor. If you pass a list of pre-processors, a lowercase pre-processor is added at the end of the pre-processor list to apply. By default, None, meaning we use the default setup, which is (in order) the comma-removal pre-processor, lowercase, double-whitespace cleaning, and trailing-whitespace removal.
- Returns:
  Either a FormattedParsedAddress or a list of FormattedParsedAddress when given more than one address.
Note
Since the models were trained on lowercase data, during the parsing, we always apply a lowercase pre-processor.
Examples
address_parser = AddressParser(device=0)  # On GPU device 0
parse_address = address_parser("350 rue des Lilas Ouest Quebec city Quebec G1L 1B6")

# It can also be a list of addresses
parse_address = address_parser(["350 rue des Lilas Ouest Quebec city Quebec G1L 1B6",
                                "350 rue des Lilas Ouest Quebec city Quebec G1L 1B6"])

# It can also output the prob of the predictions
parse_address = address_parser("350 rue des Lilas Ouest Quebec city Quebec G1L 1B6", with_prob=True)

# Print the parsed address
print(parse_address)
Using a larger batch size
address_parser = AddressParser(device=0)  # On GPU device 0
parse_address = address_parser(a_large_list_dataset, batch_size=1024)

# You can also use more workers
parse_address = address_parser(a_large_list_dataset, batch_size=1024, num_workers=2)
Or using one of our dataset containers
addresses_to_parse = CSVDatasetContainer("./a_path.csv", column_names=["address_column_name"],
                                         is_training_container=False)
address_parser(addresses_to_parse)
Using a user-defined pre-processor
def strip_parenthesis(address):
    return address.strip("(").strip(")")

address_parser(addresses_to_parse, pre_processors=[strip_parenthesis])
# It will also use the default lowercase pre-processor.
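Using the hyphen split option (with_hyphen_split) for addresses where the unit and the street number are hyphenated. A minimal sketch, reusing address_parser from the examples above; the address is illustrative:

parse_address = address_parser("3-350 rue des Lilas Ouest Quebec city Quebec G1L 1B6",
                               with_hyphen_split=True)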
- get_formatted_model_name() str [source]
Return the model type formatted name. For example, if the model type is "fasttext", the formatted name is "FastText".
- retrain(train_dataset_container: DatasetContainer, val_dataset_container: DatasetContainer | None = None, train_ratio: float = 0.8, batch_size: int = 32, epochs: int = 5, num_workers: int = 1, learning_rate: float = 0.01, callbacks: List | None = None, seed: int = 42, logging_path: str = './checkpoints', disable_tensorboard: bool = True, prediction_tags: Dict | None = None, seq2seq_params: Dict | None = None, layers_to_freeze: str | None = None, name_of_the_retrain_parser: None | str = None, verbose: None | bool = None) List[Dict] [source]
Method to retrain the address parser model using a dataset with the same tags. We train using the Experiment class from the Poutyne framework. The Experiment module allows us to save checkpoints (ckpt, in a pickle format) and a log.tsv file where the best epoch can be found (the best epoch is used for the test). The retrained model file name is formatted as retrained_{model_type}_address_parser.ckpt. For example, if you retrain a FastText model, the file name will be retrained_fasttext_address_parser.ckpt. The retrained saved model includes, in a dictionary format, the model weights, the model type, the new prediction tags if new prediction_tags were used, and the new seq2seq parameters if new seq2seq_params were used.
- Parameters:
  - train_dataset_container (DatasetContainer) – The train dataset container of the training data to use, such as any user-defined PyTorch Dataset (Dataset) class or one of our DatasetContainer classes (PickleDatasetContainer, CSVDatasetContainer or ListDatasetContainer). The training dataset is used in two ways:
    - as-is, if a validation dataset is provided (val_dataset_container), or
    - split into a training and a validation dataset if val_dataset_container is set to None.

    Thus, if val_dataset_container is left to its default None value, we use the train_ratio argument to split the training dataset into a train and a validation dataset. See the examples for more details.
argument to split the training dataset into a train and val dataset. See examples for more details.val_dataset_container (Union[DatasetContainer, None]) – The validation dataset container to use for validating the model (by default,
None
).train_ratio (float) – The ratio to use of the
train_dataset_container
for the training procedure. The rest of the data is used for the validation (e.g. a training ratio of 0.8 mean an 80-20 train-valid split) (by default,0.8
). The argument is ignored ifval_dataset_container
is not None.batch_size (int) – The size of the batch (by default,
32
).epochs (int) – The number of training epochs (by default,
5
).num_workers (int) – The number of workers to use for the data loader (by default,
1
worker).learning_rate (float) – The learning rate (LR) to use for training (default 0.01).
  - callbacks (Union[list, None]) – List of callbacks to use during training. See Poutyne callbacks for more information. By default, we set no callback.
  - seed (int) – The seed to use (default 42).
  - logging_path (str) – The logging path for the checkpoints. Poutyne will use the best one and reload the state if any checkpoints are there. Thus, an error will be raised if you change the model type; for example, if you retrain a FastText model and then retrain a BPEmb model in the same logging path directory. The logging_path can also be an S3-like (Azure, AWS, Google) bucket URI string path (e.g. "s3://path/to/aws/s3/bucket.ckpt"), or it can be an S3Path S3-like URI using cloudpathlib to handle the S3-like bucket. See cloudpathlib (https://cloudpathlib.drivendata.org/stable/) for details on supported S3 bucket providers and URI conditions. If the logging_path is an S3 bucket, we will only save the best checkpoint to the S3 bucket at the end of training. By default, the path is ./checkpoints.
  - disable_tensorboard (bool) – To disable Poutyne's automatic TensorBoard monitoring. By default, we disable it (True).
  - prediction_tags (Union[dict, None]) – A dictionary where the keys are the address components (e.g. street name) and the values are the components' indices (from 0 to N + 1) to use during the retraining of a model. The + 1 corresponds to the End Of Sequence (EOS) token that needs to be included in the dictionary. We will use this dictionary's length for the prediction layer's output size. We also save the dictionary to be used later on when you load the model. The default value is None, meaning we use our pretrained model prediction tags.
  - seq2seq_params (Union[dict, None]) –
A dictionary of seq2seq parameters to modify the seq2seq architecture to train. Note that if you change the seq2seq parameters, a new model will be trained from scratch. Parameters that can be modified are:
    - The input_size of the encoder (i.e. the size of the embedding). The default value is 300.
    - The encoder_hidden_size of the encoder. The default value is 1024.
    - The encoder_num_layers of the encoder. The default value is 1.
    - The decoder_hidden_size of the decoder. The default value is 1024.
    - The decoder_num_layers of the decoder. The default value is 1.

    The default value is None, meaning we use the default seq2seq architecture.
, meaning we use the default seq2seq architecture.layers_to_freeze (Union[str, None]) –
Name of the portion of the seq2seq to freeze layers. Thus, it reduces the number of parameters to learn. It Will be ignored if
seq2seq_params
is notNone
. A seq2seq is composed of three-part, an encoder, decoder, and prediction layer. The encoder is the part that encodes the address into a more dense representation. The decoder is the part that decodes a dense address representation. Finally, the prediction layer is a fully-connected with an output size of the same length as the prediction tags. Available freezing settings are:None
: No layers are frozen."encoder"
: To freeze the encoder part of the seq2seq."decoder"
: To freeze the decoder part of the seq2seq."prediction_layer"
: To freeze the last layer that predicts a tag class ."seq2seq"
: To freeze the encoder and decoder but not the prediction layer.
The default value is
None
, meaning we do not freeze any layers.name_of_the_retrain_parser (Union[str, None]) –
    Name to give to the retrained parser, used as the printed name when reloaded and for the saved file name (note that we will manually add the ".ckpt" extension to the name for the file name). By default, None.

    The default parser name uses the training settings, following this pattern:
    - the pretrained architecture ('fasttext' or 'bpemb', and whether an attention mechanism is used),
    - if prediction_tags is not None, the tag ModifiedPredictionTags,
    - if seq2seq_params is not None, the tag ModifiedSeq2SeqConfiguration, and
    - if layers_to_freeze is not None, the tag FreezedLayer{portion}.
  - verbose (Union[None, bool]) – To override the AddressParser verbosity for the test. When set to True or False, it will override the test verbosity (but it does not change the AddressParser verbosity). If set to the default value None, the AddressParser verbosity is used as the test verbosity.
- Returns:
  A list of dictionaries with the best epoch stats (see the Experiment class for details). The retrained model is saved using the default file name or the name_of_the_retrain_parser name. See the last note for more details.
Note
We recommend using a learning rate scheduler during retraining to reduce the chance of losing too much of the learned weights, which would increase the retraining time. We personally use poutyne.StepLR(step_size=1, gamma=0.1). Also, the starting learning rate should be relatively low (i.e. 0.01 or lower).
Note
We use the SGD optimizer, the NLL loss and accuracy as a metric; the data is shuffled, and we use teacher forcing during training (with a probability of 0.5), as in the article.
Note
Due to pymagnitude, we could not train using the Magnitude embeddings, meaning it is not possible to train the fasttext-light model. However, since we do not update the embedding weights, one can retrain using the fasttext model and later use those weights with fasttext-light.
Note
When retraining a model, Poutyne will create checkpoints. After the training, we use the best checkpoint in a directory as the model to load. Thus, if you train two different models in the same directory, the second retrain will not work due to model differences.
Note
The default file name used to save the retrained model follows the pattern "retrained_{model_type}_address_parser.ckpt" if name_of_the_retrain_parser is set to None. Otherwise, the file name will correspond to name_of_the_retrain_parser plus the file extension ".ckpt".
Examples
address_parser = AddressParser(device=0)  # On GPU device 0
data_path = "path_to_a_pickle_dataset.p"
container = PickleDatasetContainer(data_path)

# The validation dataset is created from the training dataset (container):
# 80% of the data is used for training and 20% as a validation dataset
address_parser.retrain(container, train_ratio=0.8, epochs=1, batch_size=128)
Using the layers_to_freeze parameter to freeze layers during training
address_parser = AddressParser(device=0)
data_path = "path_to_a_csv_dataset.csv"
container = CSVDatasetContainer(data_path)

val_data_path = "path_to_a_csv_val_dataset.csv"
val_container = CSVDatasetContainer(val_data_path)

# We provide the training dataset (container) and the validation dataset (val_container).
# Thus, the train_ratio argument is ignored, and we use val_container
# as the validation dataset.
address_parser.retrain(container, val_container, epochs=5, batch_size=128, layers_to_freeze="encoder")
Using a learning rate scheduler callback.
import poutyne

address_parser = AddressParser(device=0)
data_path = "path_to_a_csv_dataset.csv"
container = CSVDatasetContainer(data_path)

lr_scheduler = poutyne.StepLR(step_size=1, gamma=0.1)  # reduce LR by a factor of 10 each epoch
address_parser.retrain(container, train_ratio=0.8, epochs=5, batch_size=128, callbacks=[lr_scheduler])
Using your own prediction tags dictionary.
address_components = {"ATag":0, "AnotherTag": 1, "EOS": 2} address_parser = AddressParser(device=0) # On GPU device 0 data_path = "path_to_a_pickle_dataset.p" container = PickleDatasetContainer(data_path) address_parser.retrain(container, train_ratio=0.8, epochs=1, batch_size=128, prediction_tags=address_components)
Using your own seq2seq parameters.
seq2seq_params = {"encoder_hidden_size": 512, "decoder_hidden_size": 512} address_parser = AddressParser(device=0) # On GPU device 0 data_path = "path_to_a_pickle_dataset.p" container = PickleDatasetContainer(data_path) address_parser.retrain(container, train_ratio=0.8, epochs=1, batch_size=128, seq2seq_params=seq2seq_params)
Using your own seq2seq parameters and prediction tags dictionary.
seq2seq_params = {"encoder_hidden_size": 512, "decoder_hidden_size": 512} address_components = {"ATag":0, "AnotherTag": 1, "EOS": 2} address_parser = AddressParser(device=0) # On GPU device 0 data_path = "path_to_a_pickle_dataset.p" container = PickleDatasetContainer(data_path) address_parser.retrain(container, train_ratio=0.8, epochs=1, batch_size=128, seq2seq_params=seq2seq_params, prediction_tags=address_components)
Giving a name to the retrained parser.
address_parser = AddressParser(device=0)  # On GPU device 0
data_path = "path_to_a_pickle_dataset.p"
container = PickleDatasetContainer(data_path)

address_parser.retrain(container, train_ratio=0.8, epochs=1, batch_size=128,
                       name_of_the_retrain_parser="MyParserName")
- test(test_dataset_container: DatasetContainer, batch_size: int = 32, num_workers: int = 1, callbacks: List | None = None, seed: int = 42, verbose: None | bool = None) Dict [source]
Method to test a retrained or a pretrained model using a dataset with the default tags. If you test a retrained model with different prediction tags, we will use those tags.
- Parameters:
  - test_dataset_container (DatasetContainer) – The test dataset container of the data to use.
  - batch_size (int) – The batch size (by default, 32).
  - num_workers (int) – Number of workers to use for the data loader (by default, 1 worker).
  - callbacks (Union[list, None]) – List of callbacks to use during the test. See Poutyne callbacks for more information. By default, we set no callback.
  - seed (int) – Seed to use (by default, 42).
  - verbose (Union[None, bool]) – To override the AddressParser verbosity for the test. When set to True or False, it will override the test verbosity (but it does not change the AddressParser verbosity). If set to the default value None, the AddressParser verbosity is used as the test verbosity.
- Returns:
A dictionary with the stats (see Experiment class for details).
Note
We use NLL loss and accuracy as in the article.
Examples
address_parser = AddressParser(device=0, verbose=True)  # On GPU device 0
data_path = "path_to_a_pickle_test_dataset.p"
test_container = PickleDatasetContainer(data_path, is_training_container=False)

# We test the model on the data, and we override the test verbosity
address_parser.test(test_container, verbose=False)
You can also test your fine-tuned model
address_components = {"ATag":0, "AnotherTag": 1, "EOS": 2} address_parser = AddressParser(device=0) # On GPU device 0 # Train phase data_path = "path_to_a_pickle_train_dataset.p" train_container = PickleDatasetContainer(data_path) address_parser.retrain(container, train_ratio=0.8, epochs=1, batch_size=128, prediction_tags=address_components) # Test phase data_path = "path_to_a_pickle_test_dataset.p" test_container = PickleDatasetContainer(data_path, is_training_container=False) address_parser.test(test_container) # Test the retrained model
- save_model_weights(file_path: str | Path) None [source]
Method to save, in a Pickle format, the address parser model weights (PyTorch state dictionary).
- file_path (Union[str, Path]): A complete file path with a pickle extension to save the model weights.
  It can either be a string (e.g. 'path/to/save.p') or a path-like object (e.g. Path('path/to/save.p')).
Examples
from pathlib import Path

address_parser = AddressParser(device=0)
a_path = Path('some/path/to/save.p')
address_parser.save_model_weights(a_path)
address_parser = AddressParser(device=0)
a_path = 'some/path/to/save.p'
address_parser.save_model_weights(a_path)
Formatted Parsed Address
- class deepparse.parser.FormattedParsedAddress(address: Dict)[source]
A parsed address, as commonly returned by an address parser.
- Parameters:
  - address (dict) – A dictionary where the key is an address, and the value is a list of tuples where the first elements are the address components and the second elements are the parsed values. The second tuple elements can either be the tag of the component (e.g. StreetName) or a tuple (x, y) where x is the tag and y is the probability (e.g. 0.9981) of the model prediction.
- raw_address
The raw address (not parsed).
- address_parsed_components
The parsed address in a list of tuples where the first elements are the address components and the second elements are the tags.
- <Address tag>
All the possible address tag elements of the model. For example, StreetName or StreetNumber.
Example
address_parser = AddressParser()
parse_address = address_parser("350 rue des Lilas Ouest Quebec city Quebec G1L 1B6")

print(parse_address.StreetNumber)  # 350
print(parse_address.PostalCode)  # G1L 1B6

# Print the parsed address
print(parse_address)
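When probabilities are requested (with_prob=True), they are exposed through the address_parsed_components attribute (see the note below). A minimal sketch, reusing address_parser from the example above; the printed tuples are illustrative, not real model output:

parse_address = address_parser("350 rue des Lilas Ouest Quebec city Quebec G1L 1B6", with_prob=True)
print(parse_address.address_parsed_components)
# Illustrative output: [('350', ('StreetNumber', 0.9981)), ('rue', ('StreetName', 0.9974)), ...]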
Note
Since an address component can be composed of multiple elements (e.g. Wolfe street), when the probability values are requested from the address parser, the address components do not keep them. They are only available through the address_parsed_components attribute.
- format_address(fields: List | None = None, capitalize_fields: List[str] | None = None, upper_case_fields: List[str] | None = None, field_separator: str | None = None) str [source]
Method to format the address components in a specific order. We also filter out the empty components (None). By default, the order is 'StreetNumber, Unit, StreetName, Orientation, Municipality, Province, PostalCode, GeneralDelivery', and we filter out the empty components.
- Parameters:
  - fields (Union[list, None]) – Optional argument to define the fields and order of the address components in the formatted address. If None, we will use the inferred order based on the address tags' appearance. For example, if the parsed address is (305, StreetNumber), (rue, StreetName), (des, StreetName), (Lilas, StreetName), the inferred order will be StreetNumber, StreetName.
  - capitalize_fields (Union[list, None]) – Optional argument to define the fields to capitalize in the formatted address. If None, no fields are capitalized.
  - upper_case_fields (Union[list, None]) – Optional argument to define the fields to upper-case in the formatted address. If None, no fields are upper-cased.
  - field_separator (Union[str, None]) – Optional argument to define the field separator between address components. If None, the default field separator is " ".
- Returns:
A string of the formatted address in the fields order.
Examples
address_parser = AddressParser()
parse_address = address_parser("350 rue des Lilas Ouest Quebec city Quebec G1L 1B6")

parse_address.format_address(field_separator=", ")
# > 350, rue des lilas, ouest, quebec city, quebec, g1l 1b6

parse_address.format_address(field_separator=", ", capitalize_fields=["StreetName", "Orientation"])
# > 350, Rue des lilas, Ouest, quebec city, quebec, g1l 1b6

parse_address.format_address(field_separator=", ", upper_case_fields=["PostalCode"])
# > 350, rue des lilas, ouest, quebec city, quebec, G1L 1B6
- to_dict(fields: List | None = None) dict [source]
Method to convert a parsed address into a dictionary where the keys are the address components and the values are the values of those components. For example, the parsed address <StreetNumber> 305 <StreetName> rue des Lilas will be converted into the following dictionary: {'StreetNumber': '305', 'StreetName': 'rue des Lilas'}.
- Parameters:
  - fields (Union[list, None]) – Optional argument to define the fields to extract from the address and their order. If None, it will use the default order and values 'StreetNumber, Unit, StreetName, Orientation, Municipality, Province, PostalCode, GeneralDelivery'.
- Returns:
  A dictionary where the keys are the selected (or default) fields and the values are the corresponding values of the address components.
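A minimal usage sketch of to_dict, reusing address_parser from the examples above and the address from the description; the printed output is illustrative:

parse_address = address_parser("305 rue des Lilas")
print(parse_address.to_dict(fields=["StreetNumber", "StreetName"]))
# Illustrative output: {'StreetNumber': '305', 'StreetName': 'rue des Lilas'}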
- to_list_of_tuples(fields: List | None = None) List[tuple] [source]
Method to convert a parsed address into a list of tuples where the first element of each tuple is the value of the component, and the second element is the name of the component.
For example, the parsed address <StreetNumber> 305 <StreetName> rue des Lilas will be converted into the following list of tuples: [('305', 'StreetNumber'), ('rue des Lilas', 'StreetName')].
- Parameters:
  - fields (Union[list, None]) – Optional argument to define the fields to extract from the address and their order. If None, it will use the default order and values 'StreetNumber, Unit, StreetName, Orientation, Municipality, Province, PostalCode, GeneralDelivery'.
- Returns:
  A list of tuples where the first element of each tuple is the value of the address component and the second element is the name of the address component.
- to_pandas() Dict [source]
Method to convert a parsed address into a dictionary for pandas, where the first key is the raw address, the following keys are the address components, and the values are the values of those components. For example, the parsed address <StreetNumber> 305 <StreetName> rue des Lilas will be converted into the following dictionary: {'Address': '305 rue des Lilas', 'StreetNumber': '305', 'StreetName': 'rue des Lilas'}.
- Returns:
  A dictionary of the raw address and all its parsed components.
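A minimal sketch of building a pandas DataFrame from parsed addresses, reusing address_parser from the examples above (the address list and resulting columns are illustrative):

import pandas as pd

parsed_addresses = address_parser(["350 rue des Lilas Ouest Quebec city Quebec G1L 1B6",
                                   "350 rue des Lilas Ouest Quebec city Quebec G1L 1B6"])
df = pd.DataFrame([parsed_address.to_pandas() for parsed_address in parsed_addresses])
# One row per address; columns are 'Address' plus the parsed address components.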
- to_pickle() Tuple[str, List] [source]
Method to convert a parsed address into a tuple for pickling, where the first element is the raw address and the second element is a list of tuples of the address components and their values. For example, the parsed address <StreetNumber> 305 <StreetName> rue des Lilas will be converted into the following tuple: ('305 rue des Lilas', [('305', 'StreetNumber'), ('rue des Lilas', 'StreetName')]).
- Returns:
  A tuple where the first element is the raw address (a string) and the second element is a list of tuples of the parsed address. The first element of each tuple is the address component, and the second is the tag.