CLI

You can use our CLI to parse addresses directly from the command line, retrain a parsing model or download a pretrained model.

Parse

The parsing of the addresses to parse dataset_path is done using the selected parsing_model. The exported parsed addresses are to be exported in the same directory as the addresses to parse but given the export_file_name using the encoding format of the address dataset file. For example, if the dataset is in a CSV format, the output file format will be a CSV. Moreover, by default, we log some information (--log) such as the parser model name, the parsed dataset path and the number of parsed addresses. Here is the list of the arguments, their descriptions and default values. One can use the command parse --help to output the same description in your command line.

  • parsing_model: The parsing module to use.

  • dataset_path: The path to the dataset file in a pickle (.p, .pickle or .pckl) or CSV format.

  • export_file_name: The filename to use to export the parsed addresses. We will infer the file format base on the file extension. That is, if the file is a pickle (.p or .pickle), we will export it into a pickle file. The supported formats are Pickle, CSV and JSON. The file will be exported in the same repositories as the dataset_path. See the doc for more details on the format exporting.

  • --device: The device to use. It can be ‘cpu’ or a GPU device index such as '0' or '1'. By default, '0'.

  • --batch_size: The batch size to use to process the dataset. By default, 32.

  • --path_to_retrained_model: A path to a retrained model to use for parsing. By default, None.

  • --csv_column_name: The column name to extract address in the CSV. Need to be specified if the provided dataset_path leads to a CSV file. By default, None.

  • --csv_column_separator: The column separator for the dataset container will only be used if the dataset is a CSV one. By default, '\t'.

  • --log: Either or not to log the parsing process into a .log file exported at the same place as the parsed data using the same name as the export file. The bool value can be (not case sensitive) 'true/false', 't/f', 'yes/no', 'y/n' or '0/1'. By default, True.

  • --cache_dir: To change the default cache directory (default to None, e.g. default path).

deepparse.cli.parse.main(args=None) None[source]

CLI function to rapidly parse an addresses dataset and output it in another file.

Examples of usage:

parse fasttext ./dataset_path.csv parsed_address.pickle

Using a gpu device

parse fasttext ./dataset_path.csv parsed_address.p --device 0

Using a CSV dataset

parse fasttext ./dataset.csv parsed_address.pckl --path_to_retrained_model ./path

Dataset Format

For the dataset format see our DatasetContainer.

Exporting Format

We support three types of export formats: CSV, Pickle and JSON.

The first export uses the following pattern column pattern: "Address", "First address components class", "Second class", .... Which means the address 305 rue des Lilas 0 app 2 will output the table bellow using our default tags:

Address

StreetNumber

Unit

StreetName

Orientation

Municipality

Province

Postal Code

GeneralDelivery

305 rue des Lilas 0 app 2

305

app 2

rue des lilas

o

None

None

None

None

The second export uses a similar approach but using tuples and list. Using the same example will return the following tuple ("305 rue des Lilas 0 app 2", [("305", "StreetNumber"), ("rue des lilas", "StreetName"), ...]).

The third export uses a similar approach to the CSV format but uses dictionary-like formatting. Using the same example will return the following dict {"Address": "305 rue des Lilas 0 app 2", "StreetNumber": "305", ...}.

Retrain

This command allows a user to retrain the base_parsing_model on the train_dataset_path dataset. For the training, the CSV or Pickle dataset is loader in a specific dataloader (see DatasetContainer for more details). We use Poutyne’s automatic logging functionalities during training. Thus, it creates an epoch checkpoint and outputs the epoch metrics in a TSV file. Moreover, we save the best epoch model under the retrain model name (either the default one or a given name using the name_of_the_retrain_parser argument). Here is the list of the arguments, their descriptions and default values. One can use the command parse --help to output the same description in your command line.

  • base_parsing_model: The parsing module to retrain.

  • train_dataset_path: The path to the dataset file in a pickle (.p, .pickle or .pckl) or CSV format.

  • --train_ratio: The ratio to use of the dataset for the training. The rest of the data is used for the validation (e.g. a training ratio of 0.8 mean an 80-20 train-valid split) (default is 0.8).

  • --batch_size: The size of the batch (default is 32).

  • --epochs: The number of training epochs (default is 5).

  • --num_workers: The number of workers to use for the data loader (default is 1 worker).

  • --learning_rate: The learning rate (LR) to use for training (default 0.01).

  • --seed: The seed to use (default 42).

  • --logging_path: The logging path for the checkpoints and the retrained model. Note that training creates checkpoints, and we use the Poutyne library that uses the best epoch model and reloads the state if any checkpoints are already there. Thus, an error will be raised if you change the model type. For example, you retrain a FastText model and then retrain a BPEmb in the same logging path directory. By default, the path is './checkpoints'.

  • --disable_tensorboard: To disable Poutyne automatic Tensorboard monitoring. By default, we disable them (True).

  • --layers_to_freeze: Name of the portion of the seq2seq to freeze layers, thus reducing the number of parameters to learn. Default to None.

  • --name_of_the_retrain_parser: Name to give to the retrained parser that will be used when reloaded as the printed name, and to the saving file name. By default, None, thus, the default name. See the complete parser retrain method for more details.

  • --device: The device to use. It can be 'cpu' or a GPU device index such as '0' or '1'. By default '0'.

  • --csv_column_names: The column names to extract address in the CSV. Need to be specified if the provided dataset_path leads to a CSV file. Column names have to be separated by whitespace. For example, --csv_column_names column1 column2.

  • --csv_column_separator: The column separator for the dataset container will only be used if the dataset is a CSV one. By default, '\t'.

  • --cache_dir: To change the default cache directory (default to None, e.g. default path).

  • prediction_tags: To change the prediction tags. The prediction_tags is a path leading to a JSON file of the new tags in a key-value style. For example, the path can be "a_path/file.json" and the content can be {"new_tag": 0, "other_tag": 1, "EOS": 2}

deepparse.cli.retrain.main(args=None) None[source]

CLI function to rapidly retrain an addresses parser and saves it. One can retrain a base pretrained model using most of the arguments as the retrain() method. By default, all the parameters have the same default value as the retrain() method. The supported parameters are the following:

  • train_ratio,

  • batch_size,

  • epochs,

  • num_workers,

  • learning_rate,

  • seed,

  • logging_path,

  • disable_tensorboard,

  • layers_to_freeze, and

  • name_of_the_retrain_parser.

Examples of usage:

retrain fasttext ./train_dataset_path.csv

Using a gpu device

retrain bpemb ./train_dataset_path.csv --device 0

Modifying training parameters

retrain bpemb ./train_dataset_path.csv --device 0 --batch_size 128 --learning_rate 0.001

We do not handle the seq2seq_params fine-tuning argument for now.

Test

This command allows a user to test the base_parsing_model (or the retrained one using the --path_to_retrained_model) on the train_dataset_path dataset. For the testing, the CSV or Pickle dataset is loader in a specific dataloader (see DatasetContainer for more details). Moreover, by default, we log some information (--log) such as the tested address parser model name and the parsed dataset path. Plus, we also log the testing results in a TSV file. The two files are exported at the same path as the testing dataset. Here is the list of the arguments, their descriptions and default values. One can use the command parse --help to output the same description in your command line.

  • base_parsing_model: The parsing module to test.

  • test_dataset_path: The path to the dataset file in a pickle (.p, .pickle or .pckl) or CSV format.

  • --device: The device to use. It can be ‘cpu’ or a GPU device index such as '0' or '1'. By default, '0'.

  • --path_to_retrained_model: A path to a retrained model to use test (need to be the same model type as base_parsing_model). By default, None.

  • --batch_size: The batch size to use to process the dataset. By default, 32.

  • --num_workers: The number of workers to use for the data loader (default is 1 worker).

  • --seed: The seed to use to make the sampling deterministic (default 42).

  • --csv_column_name: The column name to extract address in the CSV. Need to be specified if the provided dataset_path leads to a CSV file. By default, None.

  • --csv_column_separator: The column separator for the dataset container will only be used if the dataset is a CSV one. By default, '\t'.

  • --log: Either or not to log the parsing process into a .log file exported at the same place as the parsed data using the same name as the export file. The bool value can be (not case sensitive) 'true/false', 't/f', 'yes/no', 'y/n' or '0/1'. By default, True.

  • --cache_dir: To change the default cache directory (default to None, e.g. default path).

deepparse.cli.test.main(args=None) None[source]

CLI function to rapidly test an address parser on test data using the same argument as the test() method (with the same default values) except for the callbacks. The results will be logged in a CSV file next to the test dataset.

Examples of usage:

test fasttext ./test_dataset_path.csv

Modifying testing parameters

test bpemb ./test_dataset_path.csv --batch_size 128 --logging_path "./logging_test"

Download

Command to pre-download model weights and requirements. Here is the list of arguments. One can use the command parse --help to output the same description in your command line.

  • model_type: The parsing module to download. The possible choice are 'fasttext', 'fasttext-attention', 'fasttext-light', 'bpemb' and 'bpemb-attention'.

  • --saving_cache_dir: To change the default saving cache directory (default to None, e.g. default path).