CLI
You can use our CLI to parse addresses directly from the command line, retrain a parsing model or download a pretrained model.
Parse
The parsing of the addresses in the dataset_path dataset is done using the selected parsing_model.
The parsed addresses are exported in the same directory as the addresses to parse, under the given export_file_name, using the same format as the address dataset file. For example, if the dataset is in a CSV format, the output file format will be a CSV. Moreover, by default, we log some information (--log), such as the parser model name, the parsed dataset path and the number of parsed addresses. Here is the list of the arguments, their descriptions and default values. One can use the command parse --help to output the same description in your command line.
parsing_model: The parsing module to use.
dataset_path: The path to the dataset file in a pickle (.p, .pickle or .pckl) or CSV format.
export_file_name: The filename to use to export the parsed addresses. We will infer the file format based on the file extension. That is, if the file is a pickle (.p or .pickle), we will export it into a pickle file. The supported formats are Pickle, CSV and JSON. The file will be exported in the same directory as the dataset_path. See the documentation for more details on the exporting formats.
--device: The device to use. It can be 'cpu' or a GPU device index such as '0' or '1'. By default, '0'.
--batch_size: The batch size to use to process the dataset. By default, 32.
--path_to_retrained_model: A path to a retrained model to use for parsing. By default, None.
--csv_column_name: The column name from which to extract the addresses in the CSV file. It needs to be specified if the provided dataset_path leads to a CSV file. By default, None.
--csv_column_separator: The column separator for the dataset container. It is only used if the dataset is a CSV file. By default, '\t'.
--log: Whether or not to log the parsing process into a .log file exported at the same place as the parsed data, using the same name as the export file. The boolean value can be (case insensitive) 'true/false', 't/f', 'yes/no', 'y/n' or '0/1'. By default, True.
--cache_dir: To change the default cache directory (default is None, i.e. the default path).
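The extension-based export format inference described for export_file_name can be sketched as follows. This is an illustration, not Deepparse's actual implementation, and the helper name is hypothetical:

```python
# Illustrative sketch of how the export format can be inferred from the
# export_file_name extension (not Deepparse's actual code).
import os

EXTENSION_TO_FORMAT = {
    ".p": "pickle",
    ".pickle": "pickle",
    ".pckl": "pickle",
    ".csv": "csv",
    ".json": "json",
}

def infer_export_format(export_file_name: str) -> str:
    # Split off the extension and look it up case-insensitively.
    _, extension = os.path.splitext(export_file_name)
    try:
        return EXTENSION_TO_FORMAT[extension.lower()]
    except KeyError:
        raise ValueError(f"Unsupported export file extension: {extension!r}")

print(infer_export_format("parsed_address.pckl"))  # pickle
```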
- deepparse.cli.parse.main(args=None) → None
CLI function to easily parse an address dataset and output it in another file.
Examples of usage:
parse fasttext ./dataset_path.csv parsed_address.pickle
Using a GPU device
parse fasttext ./dataset_path.csv parsed_address.p --device 0
Using a retrained model
parse fasttext ./dataset.csv parsed_address.pckl --path_to_retrained_model ./path
Dataset Format
For the dataset format, see our DatasetContainer.
Exporting Format
We support three types of export formats: CSV, Pickle and JSON.
The first export format uses the following column pattern: "Address", "First address components class", "Second class", .... This means the address 305 rue des Lilas 0 app 2 will output the table below using our default tags:
Address | StreetNumber | Unit | StreetName | Orientation | Municipality | Province | PostalCode | GeneralDelivery
---|---|---|---|---|---|---|---|---
305 rue des Lilas 0 app 2 | 305 | app 2 | rue des lilas | o | None | None | None | None
The second export format uses a similar approach but uses tuples and lists. Using the same example will return the following tuple: ("305 rue des Lilas 0 app 2", [("305", "StreetNumber"), ("rue des lilas", "StreetName"), ...]).
The third export format uses a similar approach to the CSV format but uses dictionary-like formatting. Using the same example will return the following dictionary: {"Address": "305 rue des Lilas 0 app 2", "StreetNumber": "305", ...}.
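The three export shapes above can be illustrated with a short sketch. The parsed components are taken from the example table; this is not Deepparse's export code:

```python
# Sketch of the Pickle (tuple) and JSON (dict) export shapes described
# above, using the example address and its default-tag components.
address = "305 rue des Lilas 0 app 2"
parsed = [("305", "StreetNumber"), ("rue des lilas", "StreetName"),
          ("o", "Orientation"), ("app 2", "Unit")]

# Pickle export: a tuple of the raw address and its (component, tag) list.
pickle_export = (address, parsed)

# JSON export: a dictionary keyed by "Address" and the tag names.
json_export = {"Address": address}
json_export.update({tag: component for component, tag in parsed})

print(json_export["StreetNumber"])  # 305
```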
Retrain
This command allows a user to retrain the base_parsing_model on the train_dataset_path dataset.
For the training, the CSV or Pickle dataset is loaded in a specific dataloader (see DatasetContainer for more details). We use Poutyne's automatic logging functionalities during training. Thus, it creates epoch checkpoints and outputs the epoch metrics in a TSV file. Moreover, we save the best epoch model under the retrained model name (either the default one or a name given using the name_of_the_retrain_parser argument). Here is the list of the arguments, their descriptions and default values. One can use the command retrain --help to output the same description in your command line.
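As a rough illustration, the per-epoch metrics TSV written by the automatic logging could be post-processed like this. The file content and column names here are assumptions for the example, not Poutyne's exact output:

```python
# Hedged sketch: reading a per-epoch metrics TSV like the one Poutyne's
# logging writes during retraining; column names are assumed.
import csv
import io

metrics_tsv = "epoch\tloss\tval_loss\n1\t0.9\t0.8\n2\t0.5\t0.6\n"
rows = list(csv.DictReader(io.StringIO(metrics_tsv), delimiter="\t"))

# The "best epoch" kept by the checkpointing is the one with the lowest
# validation loss.
best_epoch = min(rows, key=lambda row: float(row["val_loss"]))
print(best_epoch["epoch"])  # 2
```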
base_parsing_model: The parsing module to retrain.
train_dataset_path: The path to the dataset file in a pickle (.p, .pickle or .pckl) or CSV format.
--train_ratio: The ratio of the dataset to use for training. The rest of the data is used for validation (e.g. a training ratio of 0.8 means an 80-20 train-valid split). By default, 0.8.
--batch_size: The size of the batch. By default, 32.
--epochs: The number of training epochs. By default, 5.
--num_workers: The number of workers to use for the data loader. By default, 1 worker.
--learning_rate: The learning rate (LR) to use for training. By default, 0.01.
--seed: The seed to use. By default, 42.
--logging_path: The logging path for the checkpoints and the retrained model. Note that training creates checkpoints, and we use the Poutyne library, which uses the best epoch model and reloads the state if any checkpoints are already there. Thus, an error will be raised if you change the model type, for example, if you retrain a FastText model and then retrain a BPEmb model in the same logging path directory. By default, the path is './checkpoints'.
--disable_tensorboard: To disable Poutyne's automatic TensorBoard monitoring. By default, we disable it (True).
--layers_to_freeze: The name of the portion of the seq2seq model whose layers are to be frozen, thus reducing the number of parameters to learn. By default, None.
--name_of_the_retrain_parser: The name to give to the retrained parser, used as the printed name when it is reloaded and as the saving file name. By default, None, i.e. the default name. See the complete parser retrain method for more details.
--device: The device to use. It can be 'cpu' or a GPU device index such as '0' or '1'. By default, '0'.
--csv_column_names: The column names from which to extract the addresses in the CSV file. They must be specified if the provided dataset_path leads to a CSV file. Column names have to be separated by whitespace, for example, --csv_column_names column1 column2.
--csv_column_separator: The column separator for the dataset container. It is only used if the dataset is a CSV file. By default, '\t'.
--cache_dir: To change the default cache directory (default is None, i.e. the default path).
prediction_tags: To change the prediction tags. The prediction_tags path leads to a JSON file of the new tags in a key-value style. For example, the path can be "a_path/file.json" and the content can be {"new_tag": 0, "other_tag": 1, "EOS": 2}.
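As a sketch, a prediction tags file in the key-value style described above could be created like this. The file path is arbitrary; the tag set mirrors the example content from the documentation:

```python
# Write a prediction tags JSON file (tag name -> index, ending with EOS)
# in the key-value style described for the prediction_tags argument.
import json
import os
import tempfile

new_tags = {"new_tag": 0, "other_tag": 1, "EOS": 2}

tags_path = os.path.join(tempfile.gettempdir(), "file.json")
with open(tags_path, "w", encoding="utf-8") as tags_file:
    json.dump(new_tags, tags_file)

# Read it back to check the content round-trips.
with open(tags_path, encoding="utf-8") as tags_file:
    print(json.load(tags_file)["EOS"])  # 2
```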
- deepparse.cli.retrain.main(args=None) → None
CLI function to easily retrain an address parser and save it. One can retrain a base pretrained model using most of the same arguments as the retrain() method. By default, all the parameters have the same default values as the retrain() method. The supported parameters are the following: train_ratio, batch_size, epochs, num_workers, learning_rate, seed, logging_path, disable_tensorboard, layers_to_freeze, and name_of_the_retrain_parser.
Examples of usage:
retrain fasttext ./train_dataset_path.csv
Using a GPU device
retrain bpemb ./train_dataset_path.csv --device 0
Modifying training parameters
retrain bpemb ./train_dataset_path.csv --device 0 --batch_size 128 --learning_rate 0.001
We do not handle the seq2seq_params fine-tuning argument for now.
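The train-validation split controlled by --train_ratio comes down to simple arithmetic, sketched here. This is not Deepparse's actual splitting code, and the helper name is hypothetical:

```python
# Sketch of the split implied by --train_ratio: a ratio of 0.8 means 80%
# of the addresses go to training and the remaining 20% to validation.
def split_sizes(dataset_size: int, train_ratio: float = 0.8):
    train_size = int(dataset_size * train_ratio)
    valid_size = dataset_size - train_size
    return train_size, valid_size

print(split_sizes(1000))  # (800, 200)
```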
Test
This command allows a user to test the base_parsing_model (or a retrained one, using the --path_to_retrained_model argument) on the test_dataset_path dataset.
For the testing, the CSV or Pickle dataset is loaded in a specific dataloader (see DatasetContainer for more details). Moreover, by default, we log some information (--log), such as the tested address parser model name and the parsed dataset path. Plus, we also log the testing results in a TSV file. The two files are exported at the same path as the testing dataset. Here is the list of the arguments, their descriptions and default values. One can use the command test --help to output the same description in your command line.
base_parsing_model: The parsing module to test.
test_dataset_path: The path to the dataset file in a pickle (.p, .pickle or .pckl) or CSV format.
--device: The device to use. It can be 'cpu' or a GPU device index such as '0' or '1'. By default, '0'.
--path_to_retrained_model: A path to a retrained model to use for testing (it needs to be the same model type as base_parsing_model). By default, None.
--batch_size: The batch size to use to process the dataset. By default, 32.
--num_workers: The number of workers to use for the data loader. By default, 1 worker.
--seed: The seed to use to make the sampling deterministic. By default, 42.
--csv_column_name: The column name from which to extract the addresses in the CSV file. It must be specified if the provided dataset_path leads to a CSV file. By default, None.
--csv_column_separator: The column separator for the dataset container. It is only used if the dataset is a CSV file. By default, '\t'.
--log: Whether or not to log the parsing process into a .log file exported at the same place as the parsed data, using the same name as the export file. The boolean value can be (case insensitive) 'true/false', 't/f', 'yes/no', 'y/n' or '0/1'. By default, True.
--cache_dir: To change the default cache directory (default is None, i.e. the default path).
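The flexible boolean values accepted by --log could be parsed as in this sketch. It is an illustration, not Deepparse's actual code:

```python
# Case-insensitive parsing of the boolean strings described for --log:
# 'true/false', 't/f', 'yes/no', 'y/n' or '0/1'.
TRUTHY = {"true", "t", "yes", "y", "1"}
FALSY = {"false", "f", "no", "n", "0"}

def parse_bool_flag(value: str) -> bool:
    normalized = value.strip().lower()
    if normalized in TRUTHY:
        return True
    if normalized in FALSY:
        return False
    raise ValueError(f"Cannot interpret {value!r} as a boolean")

print(parse_bool_flag("Yes"))  # True
```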
- deepparse.cli.test.main(args=None) → None
CLI function to rapidly test an address parser on test data using the same arguments as the test() method (with the same default values), except for the callbacks. The results will be logged in a CSV file next to the test dataset.
Examples of usage:
test fasttext ./test_dataset_path.csv
Modifying testing parameters
test bpemb ./test_dataset_path.csv --batch_size 128 --logging_path "./logging_test"
Download
Command to pre-download model weights and requirements. Here is the list of the arguments. One can use the command download --help to output the same description in your command line.
model_type: The parsing module to download. The possible choices are 'fasttext', 'fasttext-attention', 'fasttext-light', 'bpemb' and 'bpemb-attention'.
--saving_cache_dir: To change the default saving cache directory (default is None, i.e. the default path).