Pre-Processors

Here are the available pre-processors in Deepparse. The first four are used by default when parsing addresses.
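Since each pre-processor takes a string and returns a string, they can be chained one after the other. The following is a minimal sketch of that idea, not Deepparse's own code: the function names, their bodies, and the order of application are all assumptions made for illustration, based only on the descriptions on this page.

```python
import re
from typing import Callable, List


# Hypothetical stand-ins for the four default pre-processors described on this page.
def remove_commas(address: str) -> str:
    return address.replace(",", "")


def to_lowercase(address: str) -> str:
    return address.lower()


def strip_trailing_spaces(address: str) -> str:
    # Assumption: only spaces at the end of the string are removed.
    return address.rstrip(" ")


def collapse_spaces(address: str) -> str:
    return re.sub(r" {2,}", " ", address)


def apply_pre_processors(address: str, pre_processors: List[Callable[[str], str]]) -> str:
    """Apply each pre-processor in turn, feeding the output of one into the next."""
    for pre_processor in pre_processors:
        address = pre_processor(address)
    return address


# Assumed order: the order in which the pre-processors are listed on this page.
ASSUMED_DEFAULTS = [remove_commas, to_lowercase, strip_trailing_spaces, collapse_spaces]

cleaned = apply_pre_processors("350 Rue des Lilas,  Ouest Quebec ", ASSUMED_DEFAULTS)
print(cleaned)  # 350 rue des lilas ouest quebec
```

Any callable with the same `str -> str` shape can be added to the list, which is what makes this kind of chain easy to extend.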

deepparse.pre_processing.address_cleaner.coma_cleaning(address: str) -> str

Pre-processor to remove commas. It is based on issue 56.

Parameters:

address – The address to apply comma cleaning on.

Returns:

The comma-cleaned address.
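As an illustration of the behaviour described above, here is a minimal sketch using plain string replacement. The name `remove_commas` is hypothetical and the body is an assumption about what "remove commas" means here; it is not taken from the library's source.

```python
def remove_commas(address: str) -> str:
    """Hypothetical stand-in: drop every comma and leave everything else untouched."""
    return address.replace(",", "")


print(remove_commas("350 rue des Lilas Ouest, Quebec, G1L 1B6"))
# 350 rue des Lilas Ouest Quebec G1L 1B6
```

Only the comma itself is removed, so the space that followed each comma still separates the address components.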

deepparse.pre_processing.address_cleaner.lower_cleaning(address: str) -> str

Pre-processor to lowercase an address since the original training data was in lowercase.

Parameters:

address – The address to apply lowercase cleaning on.

Returns:

The lowercase address.

deepparse.pre_processing.address_cleaner.trailing_whitespace_cleaning(address: str) -> str

Pre-processor to remove trailing whitespace (" ").

Parameters:

address – The address to apply trailing whitespace (" ") cleaning on.

Returns:

The address with trailing whitespace removed.

deepparse.pre_processing.address_cleaner.double_whitespaces_cleaning(address: str) -> str

Pre-processor to replace double whitespace ("  ") with a single whitespace (" "). The regular expression used to clean multiple whitespaces is " {2,}".

Parameters:

address – The address to apply double whitespace cleaning on.

Returns:

The address with double whitespace collapsed.
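The effect of the " {2,}" pattern can be sketched with Python's `re` module. The function name `collapse_spaces` is hypothetical, and replacing each match with a single space is an assumption drawn from the description above rather than from the library's source.

```python
import re

# Two or more consecutive space characters (the pattern quoted above).
MULTIPLE_SPACES = re.compile(r" {2,}")


def collapse_spaces(address: str) -> str:
    """Hypothetical stand-in: replace each run of 2+ spaces with one space."""
    return MULTIPLE_SPACES.sub(" ", address)


print(collapse_spaces("350  rue des   Lilas"))  # 350 rue des Lilas
print(repr(collapse_spaces("350\t\true")))      # '350\t\true' (tabs are not matched)
```

Because the pattern contains a literal space rather than `\s`, runs of tabs or newlines are left unchanged.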

deepparse.pre_processing.address_cleaner.hyphen_cleaning(address: str) -> str

Pre-processor to clean the hyphen between the unit and the street number in an address. Since some addresses use a hyphen to separate the unit from the street number, we replace the hyphen with a whitespace to allow proper splitting of the address. For example, the proper parsing of the address "3-305 street name" is "Unit": "3", "StreetNumber": "305", "StreetName": "street name".

See issue 137 for more details.

The regular expression used to clean the hyphen is "^([0-9]*[a-z]?)-([0-9]*[a-z]?) ". The first group is the unit, and the second is the street number. Both groups allow a trailing letter since units and street numbers can include letters in some countries, for example unit 3a or street number 305a.

Note: the hyphen is also used in some cities’ names, such as "Saint-Jean"; thus, we use regex to detect the proper hyphen to replace.

Parameters:

address – The address to apply hyphen cleaning on.

Returns:

The hyphen-cleaned address.
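To make the regular expression above concrete, here is a minimal sketch that applies it with `re.sub`. The name `split_unit_hyphen` and the replacement string `r"\1 \2 "` are assumptions for illustration; only the pattern itself comes from this page.

```python
import re

# Pattern quoted above: unit, hyphen, street number, then a space, anchored at the start.
UNIT_HYPHEN = re.compile(r"^([0-9]*[a-z]?)-([0-9]*[a-z]?) ")


def split_unit_hyphen(address: str) -> str:
    """Hypothetical stand-in: swap the leading unit/street-number hyphen for a space."""
    # Assumed replacement: keep both groups and put a space between them.
    return UNIT_HYPHEN.sub(r"\1 \2 ", address)


print(split_unit_hyphen("3-305 street name"))     # 3 305 street name
print(split_unit_hyphen("3a-305b street name"))   # 3a 305b street name
print(split_unit_hyphen("305 rue saint-jean"))    # 305 rue saint-jean (unchanged)
print(split_unit_hyphen("3A-305 Street Name"))    # 3A-305 Street Name (unchanged)
```

The `^` anchor is what leaves "saint-jean" alone, since only a hyphen at the very start of the address can match. The pattern accepts lowercase letters only, so an address with an uppercase unit letter is not matched unless it has been lowercased first.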