Parser

Pre-trained Complete Model

This is the complete pre-trained address parser model. It uses pre-trained weights to predict the tags of any address.

We offer, for now, only two pre-trained models, FastText and BPEmb. The first relies on fastText French pre-trained embeddings to parse the address, and the second uses byte-pair multilingual subword pre-trained embeddings. In both cases, the architecture is similar, and performance is comparable; our results are available in this article.

Memory Usage and Time Performance

We have conducted an experiment, and we report the results in the next two tables. In each table, we report the RAM usage, and in the first table, we also report the GPU memory usage. For both tables, we also report the mean execution time, obtained by processing ~183,000 addresses using different batch sizes (2^0, …, 2^9) (i.e. \(\frac{\text{Total time to process all addresses}}{183{,}000} =\) time per address). In addition, we propose a lighter version (fastText-light) of our fastText model using Magnitude embeddings mapping. For this lighter model, results are, on average, a little lower on trained countries (around 2%) but similar on zero-shot countries (see our article for more details).

| Model          | Memory usage GPU (GB) | Memory usage RAM (GB) | Mean time of execution (batch of 1) (s) | Mean time of execution (batch of more than 1) (s) |
|----------------|-----------------------|-----------------------|-----------------------------------------|---------------------------------------------------|
| fastText¹      | ~1                    | ~8                    | ~0.00236                                | ~0.0004                                           |
| fastText-light | ~1                    | ~1                    | ~0.0028                                 | ~0.0037                                           |
| BPEmb          | ~1                    | ~1                    | ~0.0053                                 | ~0.0019                                           |
| Libpostal      | N/A                   | N/A                   | <1                                      | ~0.00007                                          |

¹ Note that on Windows, we use the Gensim fastText models, which use ~10 GB of RAM with similar performance.

| Model          | Memory usage RAM (GB) | Mean time of execution (batch of 1) (s) | Mean time of execution (batch of more than 1) (s) |
|----------------|-----------------------|-----------------------------------------|---------------------------------------------------|
| fastText²      | ~8                    | ~0.0168                                 | ~0.0026                                           |
| fastText-light | ~1                    | ~0.0170                                 | ~0.0030                                           |
| BPEmb          | ~1                    | ~0.0219                                 | ~0.0059                                           |
| Libpostal      | N/A                   | <1                                      | ~0.00007                                          |

² Note that on Windows, we use the Gensim fastText models, which use ~10 GB of RAM with similar performance.

The two tables highlight that the batch size (the number of addresses in the list to be parsed) influences the processing time: the more addresses there are, the faster each address can be processed. You can also improve performance by using more workers for the data loader created with your data within the call. Note, however, that this performance improvement is not linear.
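As a rough sketch of how the per-address times above can be measured, one can amortize the total run time over the number of addresses. The `mean_time_per_address` helper and `fake_parser` stand-in below are illustrative only; to reproduce the tables' numbers, replace `fake_parser` with an `AddressParser` instance.

```python
import time

def mean_time_per_address(parse_fn, addresses, batch_size):
    # Time the whole run, then amortize the total over the number of
    # addresses, as described for the tables above.
    start = time.perf_counter()
    parse_fn(addresses, batch_size=batch_size)
    return (time.perf_counter() - start) / len(addresses)

# Stand-in parser for illustration; swap in an AddressParser instance,
# which is also callable with a batch_size argument.
def fake_parser(addresses, batch_size=32):
    return [address.split() for address in addresses]

addresses = ["350 rue des Lilas Ouest Quebec city Quebec G1L 1B6"] * 1000
per_address = mean_time_per_address(fake_parser, addresses, batch_size=128)
```

Comparing `per_address` across batch sizes (2^0, …, 2^9) reproduces the non-linear improvement noted above.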

AddressParser

class deepparse.parser.AddressParser(model_type: str = 'best', device: Union[int, str, torch.device] = 0, rounding: int = 4, verbose: bool = True, path_to_retrained_model: Optional[str] = None)[source]

Address parser to parse an address or a list of addresses using one of the seq2seq pre-trained networks, either with fastText or BPEmb. The default prediction tags are the following:

  • “StreetNumber”: for the street number,

  • “StreetName”: for the name of the street,

  • “Unit”: for the unit (such as apartment),

  • “Municipality”: for the municipality,

  • “Province”: for the province or local region,

  • “PostalCode”: for the postal code,

  • “Orientation”: for the street orientation (e.g. west, east),

  • “GeneralDelivery”: for other delivery information.

Parameters
  • model_type (str) –

    The network name to use, can be either:

    • fasttext (needs ~9 GB of RAM),

    • fasttext-light (needs ~2 GB of RAM, but is slower than the fasttext version),

    • bpemb (needs ~2 GB of RAM),

    • fastest (the quickest to process one address) (equivalent to fasttext),

    • lightest (the one using the least RAM and GPU memory) (equivalent to fasttext-light),

    • best (the best accuracy performance) (equivalent to bpemb).

    The default value is “best” for the most accurate model. Ignored if path_to_retrained_model is not None.

  • device (Union[int, str, torch.device]) –

    The device to use can be either:

    • a GPU index in int format (e.g. 0),

    • a complete device name in a string format (e.g. 'cuda:0'),

    • a device object,

    • 'cpu' for a CPU use.

    The default value is the GPU with index 0 if it exists; otherwise, the value is CPU.

  • rounding (int) – The rounding to use when asking the probability of the tags. The default value is 4 digits.

  • verbose (bool) – Turn on/off the verbosity of the model weights download and loading. The default value is True.

  • path_to_retrained_model (Union[str, None]) – The path to the retrained model to use for prediction. We will ‘infer’ the model_type of the retrained model. Default is None, meaning we use our pre-trained model.

Note

For both networks, we will download the pre-trained weights and embeddings in the .cache directory of the root user. The pre-trained weights take at most 44 MB. The fastText embeddings take 6.8 GB, the fastText-light embeddings take 3.3 GB, and the BPEmb embeddings take 116 MB (in .cache/bpemb).

Also, one can download all the dependencies of our pre-trained models using the deepparse.download module (e.g. python -m deepparse.download fasttext) before sending them to a node without Internet access.

Here are the URLs to download our pre-trained models directly

Note

Since Windows uses spawn instead of fork during multiprocessing (for the data loading pre-processing when num_workers > 0), we use the Gensim model, which takes more RAM (~10 GB) than the fastText one (~8 GB). It also takes longer to load. See the issue here.

Note

You may observe a 100% CPU load the first time you call the fasttext-light model. We hypothesize that this is due to the SQLite database behind pymagnitude. This approach creates a cache to speed up processing, and since the memory mapping is saved between runs, the first call is more intensive; the load does not appear on subsequent calls.

Examples

address_parser = AddressParser(device=0) #on gpu device 0
parse_address = address_parser("350 rue des Lilas Ouest Quebec city Quebec G1L 1B6")

address_parser = AddressParser(model_type="fasttext", device="cpu") # fasttext model on cpu
parse_address = address_parser("350 rue des Lilas Ouest Quebec city Quebec G1L 1B6")

Using a retrained model

address_parser = AddressParser(model_type="fasttext",
                               path_to_retrained_model='/path_to_a_retrain_fasttext_model')
parse_address = address_parser("350 rue des Lilas Ouest Quebec city Quebec G1L 1B6")

Using a retrained model trained on different tags

# We don't give the model_type since it's ignored when using path_to_retrained_model
address_parser = AddressParser(path_to_retrained_model='/path_to_a_retrain_fasttext_model')
parse_address = address_parser("350 rue des Lilas Ouest Quebec city Quebec G1L 1B6")
__call__(addresses_to_parse: Union[List[str], str], with_prob: bool = False, batch_size: int = 32, num_workers: int = 0) → Union[deepparse.parser.formated_parsed_address.FormattedParsedAddress, List[deepparse.parser.formated_parsed_address.FormattedParsedAddress]][source]

Callable method to parse the components of an address or a list of addresses.

Parameters
  • addresses_to_parse (Union[list[str], str]) – The addresses to be parsed; can be either a single address (when using a str) or a list of addresses. When using a list of addresses, the addresses are processed in batches, allowing faster processing. For example, using the fastText model, a single address takes around 0.003 seconds to be parsed using a batch of 1 (1 element at a time is processed). This time can be reduced to 0.00035 seconds per address when using a batch of 128 (128 elements at a time are processed).

  • with_prob (bool) – If True, return the probability of all the tags with the specified rounding.

  • batch_size (int) – The size of the batch (default is 32).

  • num_workers (int) – Number of workers to use for the data loader (default is 0, meaning the data will be loaded in the main process).

Returns

Either a FormattedParsedAddress or a list of FormattedParsedAddress when given more than one address.

Examples

address_parser = AddressParser(device=0) #on gpu device 0
parse_address = address_parser("350 rue des Lilas Ouest Quebec city Quebec G1L 1B6")
parse_address = address_parser("350 rue des Lilas Ouest Quebec city Quebec G1L 1B6",
                                with_prob=True)

Using a larger batch size

address_parser = AddressParser(device=0) #on gpu device 0
parse_address = address_parser(a_large_dataset, batch_size=1024)
# You can also use more workers
parse_address = address_parser(a_large_dataset, batch_size=1024, num_workers=2)
retrain(dataset_container: deepparse.dataset_container.dataset_container.DatasetContainer, train_ratio: float, batch_size: int, epochs: int, num_workers: int = 1, learning_rate: float = 0.01, callbacks: Optional[List] = None, seed: int = 42, logging_path: str = './checkpoints', prediction_tags: Optional[Dict] = None) → List[Dict][source]

Method to retrain the address parser model using a dataset with the same tags. We train using the Experiment module from the Poutyne framework. The Experiment module allows us to save checkpoints (ckpt, in a pickle format) and a log.tsv file where the best epochs can be found (the best epoch is used for the test). The retrained model file name is formatted as retrained_{model_type}_address_parser.ckpt. For example, if you retrain a fasttext model, the file name will be retrained_fasttext_address_parser.ckpt. The retrained saved model includes, in a dictionary format, the model weights, the model type and, if new prediction_tags were used, the new prediction tags.

Parameters
  • dataset_container (DatasetContainer) – The dataset container of the data to use.

  • train_ratio (float) – The ratio of the dataset to use for training. The rest of the data is used for validation (e.g. a train ratio of 0.8 means an 80-20 train-valid split).

  • batch_size (int) – The size of the batch.

  • epochs (int) – The number of training epochs.

  • num_workers (int) – Number of workers to use for the data loader (default is 1 worker).

  • learning_rate (float) – The learning rate (LR) to use for training (default 0.01).

  • callbacks (Union[list, None]) – List of callbacks to use during training. See Poutyne callbacks for more information. By default, we set no callbacks.

  • seed (int) – Seed to use (by default 42).

  • logging_path (str) – The logging path for the checkpoints. By default, the path is ./checkpoints.

  • prediction_tags (Union[dict, None]) – A dictionary where the keys are the address components (e.g. street name) and the values are the components indices (from 0 to N + 1) to use during retraining of a model. The + 1 corresponds to the End Of Sequence (EOS) token that needs to be included in the dictionary. We will use the length of this dictionary for the output size of the prediction layer. We also save the dictionary to be used later on when you load the model. Default is None, meaning we use our pre-trained model prediction tags.

Returns

A list of dictionaries with the best epoch stats (see the Experiment class for details).

Note

We use the SGD optimizer, NLL loss, and accuracy as a metric; the data is shuffled, and we use teacher forcing during training (with a probability of 0.5), as in the article.

Note

Due to pymagnitude, we could not train using the Magnitude embeddings, meaning it's not possible to train using the fasttext-light model. But since we don't update the embedding weights, one can retrain using the fasttext model and later use the weights with fasttext-light.

Examples

address_parser = AddressParser(device=0) #on gpu device 0
data_path = 'path_to_a_pickle_dataset.p'

container = PickleDatasetContainer(data_path)

address_parser.retrain(container, 0.8, epochs=1, batch_size=128)

Using learning rate scheduler callback.

import poutyne

address_parser = AddressParser(device=0)
data_path = 'path_to_a_pickle_dataset.p'

container = PickleDatasetContainer(data_path)

lr_scheduler = poutyne.StepLR(step_size=1, gamma=0.1) # reduce LR by a factor of 10 each epoch
address_parser.retrain(container, 0.8, epochs=5, batch_size=128, callbacks=[lr_scheduler])

Using your own prediction tags dictionary.

address_components = {"ATag":0, "AnotherTag": 1, "EOS": 2}

address_parser = AddressParser(device=0) #on gpu device 0
data_path = 'path_to_a_pickle_dataset.p'

container = PickleDatasetContainer(data_path)

address_parser.retrain(container, 0.8, epochs=1, batch_size=128, prediction_tags=address_components)
test(test_dataset_container: deepparse.dataset_container.dataset_container.DatasetContainer, batch_size: int, num_workers: int = 1, callbacks: Optional[List] = None, seed: int = 42) → Dict[source]

Method to test a retrained or a pre-trained model using a dataset with the default tags. If you test a retrained model with different prediction tags, we will use those tags.

Parameters
  • test_dataset_container (DatasetContainer) – The test dataset container of the data to use.

  • batch_size (int) – The size of the batch.

  • num_workers (int) – Number of workers to use for the data loader (default is 1 worker).

  • callbacks (Union[list, None]) – List of callbacks to use during the test. See Poutyne callbacks for more information. By default, we set no callbacks.

  • seed (int) – Seed to use (by default 42).

Returns

A dictionary with the stats (see Experiment class for details).

Note

We use NLL loss and accuracy as in the article.

Examples

address_parser = AddressParser(device=0) #on gpu device 0
data_path = 'path_to_a_pickle_test_dataset.p'

test_container = PickleDatasetContainer(data_path)

address_parser.test(test_container) # We test the model on the data

You can also test your fine-tuned model

address_components = {"ATag":0, "AnotherTag": 1, "EOS": 2}

address_parser = AddressParser(device=0) #on gpu device 0

# Train phase
data_path = 'path_to_a_pickle_train_dataset.p'

train_container = PickleDatasetContainer(data_path)

address_parser.retrain(train_container, 0.8, epochs=1, batch_size=128, prediction_tags=address_components)

# Test phase
data_path = 'path_to_a_pickle_test_dataset.p'

test_container = PickleDatasetContainer(data_path)

address_parser.test(test_container) # Test the retrained model

Formatted Parsed Address

class deepparse.parser.FormattedParsedAddress(address: Dict)[source]

A parsed address, as commonly returned by an address parser.

Parameters

address (dict) – A dictionary where the key is an address, and the value is a list of tuples where the first element is an address component (a word of the address) and the second element is its parsed value. That second element can either be the tag of the component (e.g. StreetName) or a tuple (x, y) where x is the tag and y is the probability (e.g. 0.9981) of the model prediction.

raw_address

The raw address (not parsed).

address_parsed_components

The parsed address in a list of tuples where the first elements are the address components and the second elements are the tags.

<Address tag>

All the possible address tag elements of the model. For example, StreetName or StreetNumber.

Example

address_parser = AddressParser()
parse_address = address_parser("350 rue des Lilas Ouest Quebec city Quebec G1L 1B6")
print(parse_address.StreetNumber) # 350
print(parse_address.PostalCode) # G1L 1B6

Note

Since an address component can be composed of multiple elements (e.g. Wolfe street), the address component attributes do not keep the probability values when they are requested from the address parser. The probabilities are only available through the address_parsed_components attribute.
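For illustration, a sketch of the shape of address_parsed_components when probabilities are requested; the list below is hand-built and its probability values are made up, not real model output.

```python
# Hypothetical content of parse_address.address_parsed_components when the
# parser is called with with_prob=True (values are illustrative only):
# each word maps to a (tag, probability) tuple.
address_parsed_components = [
    ("350", ("StreetNumber", 0.9981)),
    ("rue", ("StreetName", 0.9974)),
    ("des", ("StreetName", 0.9943)),
    ("Lilas", ("StreetName", 0.9952)),
]

# Recover the probability attached to each word's predicted tag.
probabilities = {word: tag_and_prob[1]
                 for word, tag_and_prob in address_parsed_components}
```

Note how the multi-word StreetName component is spread over several tuples, one per word, which is why the per-component attributes cannot carry a single probability.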

format_address(fields: Optional[List] = None, capitalize_fields: Optional[List[str]] = None, upper_case_fields: Optional[List[str]] = None, field_separator: Optional[str] = None) → str[source]

Method to format the address components in a specific order. We also filter the empty components (None). By default, the order is ‘StreetNumber, Unit, StreetName, Orientation, Municipality, Province, PostalCode, GeneralDelivery’ and we filter the empty components.

Parameters
  • fields (Union[list, None]) – Optional argument to define the fields to order the address components of the address. If None, we will use the inferred order based on the address tags' appearance. For example, if the parsed address is (305, StreetNumber), (rue, StreetName), (des, StreetName), (Lilas, StreetName), the inferred order will be StreetNumber, StreetName.

  • capitalize_fields (Union[list, None]) – Optional argument to define the fields to capitalize in the formatted address. If None, no fields are capitalized.

  • upper_case_fields (Union[list, None]) – Optional argument to define the fields to upper-case in the formatted address. If None, no fields are upper-cased.

  • field_separator (Union[str, None]) – Optional argument to define the field separator between address components. If None, the default field separator is " ".

Returns

A string of the formatted address in the fields order.

Examples

address_parser = AddressParser()
parse_address = address_parser("350 rue des Lilas Ouest Quebec city Quebec G1L 1B6")

parse_address.format_address(field_separator=", ")
# > 350, rue des lilas, ouest, quebec city, quebec, g1l 1b6

parse_address.format_address(field_separator=", ", capitalize_fields=["StreetName", "Orientation"])
# > 350, Rue des lilas, Ouest, quebec city, quebec, g1l 1b6

parse_address.format_address(upper_case_fields=["PostalCode"])
# > 350 rue des lilas ouest quebec city quebec G1L 1B6
to_dict(fields: Optional[List] = None) → dict[source]

Method to convert a parsed address into a dictionary where the keys are the address components, and the values are the value of those components. For example, the parsed address <StreetNumber> 305 <StreetName> rue des Lilas will be converted into the following dictionary: {'StreetNumber':'305', 'StreetName': 'rue des Lilas'}.

Parameters

fields (Union[list, None]) – Optional argument to define the fields to extract from the address and their order. If None, we will use the default order and values 'StreetNumber, Unit, StreetName, Orientation, Municipality, Province, PostalCode, GeneralDelivery'.

Returns

A dictionary where the keys are the selected (or default) fields and the values are the corresponding values of the address components.
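As a minimal sketch of the conversion described above, using a hand-built list of (word, tag) pairs instead of a real parse. The to_dict_sketch helper is illustrative, not the library's implementation, and this sketch keeps only the non-empty components.

```python
DEFAULT_FIELDS = ["StreetNumber", "Unit", "StreetName", "Orientation",
                  "Municipality", "Province", "PostalCode", "GeneralDelivery"]

def to_dict_sketch(parsed_pairs, fields=None):
    # Group the words sharing a tag, in the requested (or default) field
    # order, joining multi-word components (e.g. StreetName) with spaces.
    fields = DEFAULT_FIELDS if fields is None else fields
    result = {}
    for field in fields:
        words = [word for word, tag in parsed_pairs if tag == field]
        if words:
            result[field] = " ".join(words)
    return result

# The parsed address <StreetNumber> 305 <StreetName> rue des Lilas:
parsed = [("305", "StreetNumber"), ("rue", "StreetName"),
          ("des", "StreetName"), ("Lilas", "StreetName")]
```

Calling to_dict_sketch(parsed) yields {'StreetNumber': '305', 'StreetName': 'rue des Lilas'}, matching the example above.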