Parse Addresses

import pandas as pd

from deepparse import download_from_public_repository
from deepparse.dataset_container import PickleDatasetContainer
from deepparse.parser import AddressParser

Here is an example of how to parse multiple addresses. First, let’s download the dataset of addresses to parse from the public repository.

saving_dir = "./data"
file_extension = "p"
test_dataset_name = "predict"
download_from_public_repository(test_dataset_name, saving_dir, file_extension=file_extension)

Now let’s load the dataset using one of our dataset containers.

addresses_to_parse = PickleDatasetContainer("./data/predict.p", is_training_container=False)

Let’s use the BPEmb model on a GPU.

address_parser = AddressParser(model_type="bpemb", device=0)

# Parse the first 300 addresses of the dataset loaded above
parsed_addresses = address_parser(addresses_to_parse[0:300])

# Print one of the parsed addresses
print(parsed_addresses[0])

When parsing addresses, some data quality tests are applied to the dataset. First, it validates that no address to parse is empty. Second, it validates that no address is whitespace-only. Each of the next two lines raises a DataError.

address_parser("")  # Raises a DataError (empty address)
address_parser(" ")  # Raises a DataError (whitespace-only address)
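If the input may contain such entries, one option is to filter them out before calling the parser. The following is a minimal sketch in plain Python; `split_parseable` is a hypothetical helper written for this example and is not part of deepparse, and the sample address is illustrative only.

```python
from typing import Iterable, List, Tuple


def split_parseable(addresses: Iterable[str]) -> Tuple[List[str], List[int]]:
    """Return the addresses that are safe to parse and the indices of the rejected ones.

    Hypothetical helper, not a deepparse function.
    """
    kept: List[str] = []
    rejected: List[int] = []
    for index, address in enumerate(addresses):
        # str.strip() removes surrounding whitespace, so empty and
        # whitespace-only strings both become "" and are rejected.
        if address.strip():
            kept.append(address)
        else:
            rejected.append(index)
    return kept, rejected


raw_addresses = ["350 rue des Lilas Ouest Quebec city Quebec G1L 1B6", "", "   "]
kept, rejected = split_parseable(raw_addresses)
print(kept)      # the addresses that can be passed to the parser
print(rejected)  # positions of the empty or whitespace-only entries
```

Keeping the rejected indices makes it possible to report which rows were skipped instead of dropping them silently.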

We can also put our parsed addresses into a Pandas DataFrame for analysis. You can choose the fields to use or use the default ones.

fields = ['StreetNumber', 'StreetName', 'Municipality', 'Province', 'PostalCode']
parsed_address_data_frame = pd.DataFrame([parsed_address.to_dict(fields=fields) for parsed_address in parsed_addresses],
                                         columns=fields)
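To make the effect of the field selection concrete, the sketch below reproduces the idea with plain dictionaries instead of real parser output. Everything in it is an assumption made for illustration: `select_fields` is a hypothetical helper, the `Unit` key and the sample values are stand-ins, and none of it is taken from deepparse itself. Each selected field becomes one column, and a field with no value stays as `None`.

```python
from typing import Dict, List, Optional

# Stand-in for one parsed address (illustrative values, not real parser output).
parsed_like: Dict[str, Optional[str]] = {
    "StreetNumber": "350",
    "StreetName": "rue des lilas",
    "Unit": None,
    "Municipality": "quebec city",
    "Province": "quebec",
    "PostalCode": "g1l 1b6",
}


def select_fields(parsed: Dict[str, Optional[str]], fields: List[str]) -> Dict[str, Optional[str]]:
    """Keep only the requested fields, in the requested order (hypothetical helper)."""
    return {field: parsed.get(field) for field in fields}


fields = ["StreetNumber", "StreetName", "Municipality", "Province", "PostalCode"]
row = select_fields(parsed_like, fields)
print(row)  # one dictionary per address, i.e. one row of the resulting table
```

A list of such dictionaries is the shape that `pd.DataFrame(..., columns=fields)` receives in the example above.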