API Documentation
This API is used to authenticate and request jobs from the Vector Curator Cloud.
API URL: https://api.data2vector.ai
Authentication
Endpoint: /session/new
Headers:
    "username_or_email": (string)
    "password": (string)
Response:
    200: { "token": (string) }
    401: { "error": "Invalid username or password" }
    500: { "error": "Internal server error" }
Request Cloud Job
Endpoint: /jobs
Headers:
    "auth_token": (string)
Request JSON:
    {
        "job_type": (string),
        "inputs": (dict or None)
    }
Response:
    200: { "status_endpoint": (string) }
    401: { "error": "Not authenticated" }
    500: { "error": "Internal server error" }
Check Job Status
Endpoint: (status_endpoint)
Headers:
    "auth_token": (string)
Response:
    200: {
        "status": (string),
        "results": (dict or None),
        "error": (string or None)
    }
    401: { "error": "Not authenticated" }
    500: { "error": "Internal server error" }
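The authenticate, request-job, poll-status flow above can be sketched as a small client. This is a minimal sketch: the endpoint paths and field names come from this document, but the HTTP methods, the status strings ("queued", "running"), and the polling interval are assumptions, and the `requests` library is an external dependency.

```python
# Minimal client sketch for the authenticate -> request job -> poll flow.
# Endpoint paths and field names are from this documentation; HTTP methods
# and status strings are assumptions.
import time

API_URL = "https://api.data2vector.ai"

def auth_headers(username_or_email, password):
    """Headers for the /session/new authentication call."""
    return {"username_or_email": username_or_email, "password": password}

def job_payload(job_type, inputs=None):
    """Body for the /jobs call; `inputs` may be None for jobs without inputs."""
    return {"job_type": job_type, "inputs": inputs}

# Example usage (requires network access and valid credentials):
# import requests
# token = requests.post(API_URL + "/session/new",
#                       headers=auth_headers("me@example.com", "secret")).json()["token"]
# job = requests.post(API_URL + "/jobs", headers={"auth_token": token},
#                     json=job_payload("list_content")).json()
# while True:
#     status = requests.get(API_URL + job["status_endpoint"],
#                           headers={"auth_token": token}).json()
#     if status["status"] not in ("queued", "running"):  # assumed status values
#         break
#     time.sleep(5)
```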
FTP Uploading
For jobs with large data input requirements, you can connect to your account's cloud FTP folders with any FTP client (or use file URLs if the files are already hosted).
Host url:
Use the same username as you would in the authentication call
Use the same password as you would in the authentication call
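A batch-folder upload can be sketched with Python's standard-library `ftplib`. The host below is a placeholder (use the host URL from the FTP section above), the credentials are the same as for the authentication call, and the optional `remote_dir` argument is hypothetical, there only in case your account exposes a dedicated sub-folder.

```python
# Sketch of uploading local files to the account's cloud FTP folder.
# Host and remote_dir are placeholders, not values from this documentation.
from ftplib import FTP
from pathlib import Path

def upload_to_batch_folder(host, username, password, file_paths, remote_dir=None):
    """Upload each local file to the FTP folder, keeping its base name."""
    with FTP(host) as ftp:
        ftp.login(user=username, passwd=password)
        if remote_dir:  # hypothetical sub-folder; omit if not applicable
            ftp.cwd(remote_dir)
        for path in map(Path, file_paths):
            with open(path, "rb") as fh:
                ftp.storbinary(f"STOR {path.name}", fh)

# upload_to_batch_folder("ftp.example.com", "me@example.com", "secret",
#                        ["data/img_001.jpg", "data/img_002.jpg"])
```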
Account Management Jobs
create_archive
Required account privileges: ["write"]
Request JSON ["inputs"]:
    {
        "name": (string, 0 < len < 30, unique in account),
        "content_type": (string in ["Image", "Video", "Sound", "Text", "Point_Cloud"]),
        "description": (string, 0 < len < 150, or None),
        "use_default_similarity_calibration": (bool)
    }
Response JSON ["results"]: None
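Client-side validation of the constraints listed above can save a round trip. A sketch, with the helper name ours rather than part of the API:

```python
# Build and validate the "inputs" dict for a create_archive job,
# enforcing the length and enumeration constraints documented above.
CONTENT_TYPES = ["Image", "Video", "Sound", "Text", "Point_Cloud"]

def create_archive_inputs(name, content_type, description=None,
                          use_default_similarity_calibration=True):
    if not 0 < len(name) < 30:
        raise ValueError("name must be 1-29 characters")
    if content_type not in CONTENT_TYPES:
        raise ValueError(f"content_type must be one of {CONTENT_TYPES}")
    if description is not None and not 0 < len(description) < 150:
        raise ValueError("description must be 1-149 characters or None")
    return {
        "name": name,
        "content_type": content_type,
        "description": description,
        "use_default_similarity_calibration": bool(use_default_similarity_calibration),
    }

# Send as: {"job_type": "create_archive", "inputs": create_archive_inputs(...)}
```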
update_parameters
Required account privileges: ["write"]
Request JSON ["inputs"]:
    {
        "archive": (string),
        "description": (string, 0 < len < 150, or None),
        "nr_similar_allowed": (int >= 1 or None)
    }
Response JSON ["results"]: None
list_content
Required account privileges: ["read"]
Request JSON ["inputs"]: None
Response JSON ["results"]:
    {
        "archive_to_content_type_states_nr_of_contents": {
            "archive_name": {
                "description": (string),
                "content_type": (string in ["Image", "Video", "Sound", "Text", "Point_Cloud"]),
                "calibrated_for_similarity": (bool),
                "calibrated_for_relevance": (bool),
                "nr_of_contents": (int)
            }
        },
        "nr_files_in_batch_folder": (int)
    }
remove_content
Required account privileges: ["write"]
Request JSON ["inputs"]:
    {
        "archive": (string),
        "archive_content_ids_subset": (list of ints),
        "delete_archive": (bool)
    }
Response JSON ["results"]: None
get_archive_ids_and_urls
Required account privileges: ["read"]
Request JSON ["inputs"]:
    { "archive": (string) }
Response JSON ["results"]:
    { "id_to_download_url": { (int): (string) } }
get_vectors
Required account privileges: ["read"]
Request JSON ["inputs"]:
    {
        "archive": (string),
        "archive_content_ids_subset": (list of ints or None)
    }
Response JSON ["results"]:
    { "id_to_vector_url": { (int): (string) } }
add_urls_to_contents
Required account privileges: ["write"]
Request JSON ["inputs"]:
    {
        "archive": (string),
        "indexed_content_id_to_url": { (int): (string) }
    }
Response JSON ["results"]: None
clear_batch_folder
Required account privileges: ["write"]
Request JSON ["inputs"]: None
Response JSON ["results"]: None
Support Jobs
fine_tune_vectorizer
Required account privileges: ["write"]
Request JSON ["inputs"]:
    {
        "content_type": (string in ["Image", "Video", "Sound", "Text", "Point_Cloud"]),
        "custom_vectorizer_name_for_sampling": (string or None),
        "starting_custom_vectorizer_name": (string or None),
        "custom_vectorizer_name": (string),
        "file_urls": (list of strings or None),
        "download_from_batch_cloud_folder": (bool)
    }
Response JSON ["results"]: None
Requires files to be sent via FTP to the cloud batch folder or listed in "file_urls".
Fine-tuning a vectorizer is a classification-based training step that forces the model to pay attention to the important features and to ignore the irrelevant ones. We recommend at least 100 and no more than 10000 examples for fine-tuning. You can have up to 100 labels in your fine-tuning dataset. The labels can be any classes that describe the content, and each file can have multiple labels. Create a text file with the labels of each file following this format:
    {
        "file_name_1.ext": ["label_1", "label_2", ...],
        "file_name_2.ext": ["label_2"],
        "file_name_3.ext": ["label_1", "label_3"]
    }
Save the file as "example_to_labels.json" and place it with the dataset files.
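Rather than hand-editing the labels file, it can be generated. A sketch using the exact format described above; the file names and labels are illustrative:

```python
# Write "example_to_labels.json" in the format required by
# fine_tune_vectorizer; file names and labels are placeholders.
import json

example_to_labels = {
    "file_name_1.ext": ["label_1", "label_2"],
    "file_name_2.ext": ["label_2"],
    "file_name_3.ext": ["label_1", "label_3"],
}

# Place the output alongside the dataset files before uploading.
with open("example_to_labels.json", "w") as fh:
    json.dump(example_to_labels, fh, indent=2)
```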
fine_tune_translator
Required account privileges: ["write"]
Request JSON ["inputs"]:
    {
        "translator_name": (string),
        "input_content_type": (string in ["Image", "Video", "Sound", "Text", "Point_Cloud"] or None),
        "input_custom_vectorizer_name": (string or None),
        "output_content_type": (string in ["Image", "Video", "Sound", "Text", "Point_Cloud"] or None),
        "output_custom_vectorizer_name": (string or None),
        "file_urls": (list of strings or None),
        "download_from_batch_cloud_folder": (bool)
    }
Response JSON ["results"]: None
Requires files to be sent via FTP to the cloud batch folder or listed in "file_urls".
Fine-tuning a translator lets you search a vector archive of one content type with data of another content type, or of the same content type with initial processing. We recommend at least 100 and no more than 20000 examples for fine-tuning. Create a text file with the input and output name pairs following this format:
    [
        ["input_name_1", "output_name_1"],
        ["input_name_2", "output_name_2"],
        ["input_name_3", "output_name_3"],
        ...
    ]
Save the file as "training_mappings.json" and place it with the files. Optionally, you can also create a "validation_mappings.json" file with the same format. If no validation file is provided, 15% of the training data will be used for validation.
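The mappings file can likewise be generated. A sketch with placeholder names, using the pair-list format described above:

```python
# Write "training_mappings.json" in the format required by
# fine_tune_translator; input/output names are placeholders.
import json

training_mappings = [
    ["input_name_1", "output_name_1"],
    ["input_name_2", "output_name_2"],
]

# Place the output with the training files; a "validation_mappings.json"
# in the same format is optional.
with open("training_mappings.json", "w") as fh:
    json.dump(training_mappings, fh, indent=2)
```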
sample_data
Required account privileges: ["read"]
Request JSON ["inputs"]:
    {
        "source_content_type": (string in ["Video", "Sound", "Text"]),
        "content_type": (string in ["Image", "Video", "Sound", "Text"]),
        "time_intervals_or_highlights": (string in ["time_intervals", "highlights"]),
        "time_interval": (float or None),
        "nr_samples_per_file": (int or None),
        "custom_vectorizer_name": (string or None),
        "file_urls": (list of strings or None),
        "download_from_batch_cloud_folder": (bool)
    }
Response JSON ["results"]:
    { "samples_file_names_with_download_urls": (list of lists of str) }
Requires files to be sent via FTP to the cloud batch folder or listed in "file_urls".
trim_by_highlights
Required account privileges: ["read"]
Request JSON ["inputs"]:
    {
        "source_content_type": (string in ["Video", "Sound"]),
        "nr_trims_per_file": (int),
        "max_trim_size": (float),
        "min_trim_size": (float or None),
        "custom_vectorizer_name": (string or None),
        "file_urls": (list of strings or None),
        "download_from_batch_cloud_folder": (bool)
    }
Response JSON ["results"]:
    { "trims_file_names_with_download_urls": (list of lists of str) }
Requires files to be sent via FTP to the cloud batch folder or listed in "file_urls".
data_balance
Required account privileges: ["read"]
Request JSON ["inputs"]:
    {
        "clustered_content_ids_sorted_by_decreasing_diversity_with_contents_sorted_by_distance_to_centroid": (list of lists of int),
        "ids_sorted_from_inliers_to_outliers": (list of int or None),
        "ids_sorted_by_essential_examples": (list of int or None),
        "ids_sorted_by_forbidden_examples": (list of int or None)
    }
Response JSON ["results"]:
    {
        "prioritized_over_represented_ids_to_remove": (list of int),
        "prioritized_under_represented_ids_to_source": (list of int)
    }
pca_vector_dim_reduction
Required account privileges: ["read"]
Request JSON ["inputs"]:
    {
        "archive": (string or None),
        "archive_content_ids_subset": (list of ints or None),
        "file_urls": (list of str),
        "nr_of_dimensions": (int),
        "download_from_batch_cloud_folder": (bool)
    }
Response JSON ["results"]:
    { "identifier_to_pca_vector": (dict of int or string to list of float) }
extract_similarity_dataset
Required account privileges: ["read"]
Request JSON ["inputs"]:
    {
        "source_content_type": (str in ["Sound", "Video", "Text"]),
        "content_type": (str in ["Sound", "Video", "Text", "Image"]),
        "max_nr_of_pairs": (int >= 1),
        "custom_vectorizer_name": (string or None),
        "file_urls": (list of strings or None),
        "download_from_batch_cloud_folder": (bool)
    }
Response JSON ["results"]:
    { "similarity_calibration_pairs_download_urls": (list of str) }
Requires files to be sent via FTP to the cloud batch folder or listed in "file_urls".
send_feedback
Required account privileges: ["read"]
Request JSON ["inputs"]:
    { "feedback": (string, 0 < len < 1000) }
Response JSON ["results"]: None
Archive Jobs
calibrate_similarity
Required account privileges: ["write"]
Request JSON ["inputs"]:
    {
        "archive": (string),
        "custom_vectorizer_name": (string or None),
        "file_urls": (list of strings or None),
        "download_from_batch_cloud_folder": (bool)
    }
Response JSON ["results"]: None
Requires files to be sent via FTP to the cloud batch folder or listed in "file_urls". You can pass either data files or vectors (torch safetensors ".pt", one-dimensional, any length).
Similarity calibration trains the redundancy filter and the clustering by similarity. The similarity dataset must be composed of at least 200 and at most 10000 pairs of examples that are similar according to the client's criteria. To assemble the similarity dataset, we recommend you gather your data into clusters, one for each of the fine-tuning labels, then extract at least 2 pairs from each cluster. The file names inside the pairs must start with a prefix that is the id of the pair. Ex:
    1_file_1.ext  1_file_2.ext
    2_file_3.ext  2_file_4.ext
    3_file_5.ext  3_file_6.ext
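The pair-id prefix convention above can be applied programmatically. A sketch that maps original file names to their prefixed names, given a list of similar pairs; the helper name is ours, not part of the API:

```python
# Map original file names to "<pair_id>_<name>" following the naming
# convention required by the similarity dataset.
def pair_prefixed_names(similar_pairs):
    """Given [(file_a, file_b), ...] pairs of similar files, return a
    dict from each original name to its pair-prefixed name."""
    renames = {}
    for pair_id, (file_a, file_b) in enumerate(similar_pairs, start=1):
        for name in (file_a, file_b):
            renames[name] = f"{pair_id}_{name}"
    return renames

# pair_prefixed_names([("cat1.jpg", "cat2.jpg"), ("dog1.jpg", "dog2.jpg")])
# -> {"cat1.jpg": "1_cat1.jpg", "cat2.jpg": "1_cat2.jpg",
#     "dog1.jpg": "2_dog1.jpg", "dog2.jpg": "2_dog2.jpg"}
```

Renaming (or copying) the files with these names before upload keeps the pair structure recoverable on the server side.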
calibrate_relevance
Required account privileges: ["write"]
Request JSON ["inputs"]:
    {
        "archive": (string),
        "archive_content_ids_subset": (list of ints or None)
    }
Response JSON ["results"]: None
index
Required account privileges: ["write"]
Request JSON ["inputs"]:
    {
        "archive": (string),
        "check_for_redundancy_against_archived": (bool),
        "archive_content_ids_subset": (list of ints or None),
        "check_for_redundancy_within_batch": (bool),
        "check_for_relevance": (bool),
        "custom_vectorizer_name": (string or None),
        "file_urls": (list of strings or None),
        "download_from_batch_cloud_folder": (bool)
    }
Response JSON ["results"]:
    {
        "name_to_indexed_content_id": (dict of string to int),
        "exact_duplicate_file_names": (list of strings),
        "failed_vectorization_names": (list of strings),
        "redundant_content_names": (list of strings),
        "irrelevant_content_names": (list of strings)
    }
Requires files to be sent via the request, via FTP to the cloud batch folder, or listed in "file_urls". You can pass either data files or vectors (torch safetensors ".pt", one-dimensional, any length).
inliers_outliers
Required account privileges: ["read"]
Request JSON ["inputs"]:
    {
        "archive": (string),
        "archive_content_ids_subset": (list of ints or None),
        "id_to_preference_weight": (dict of int to int, or None)
    }
Response JSON ["results"]:
    {
        "content_ids_sorted_by_inliers": (list of ints),
        "content_ids_sorted_by_outliers": (list of ints),
        "mean_distance": (float)
    }
search
Required account privileges: ["read"]
Request JSON ["inputs"]:
    {
        "input_content_type": (string in ["Image", "Video", "Sound", "Text", "Point_Cloud"] or None),
        "custom_vectorizer_name": (string or None),
        "data_type_translator_name": (string or None),
        "archive": (string),
        "archive_content_ids_subset": (list of ints or None),
        "indexed_ids_references": (list of ints or None),
        "name_or_id_to_preference_weight": (dict of str or int to int, or None),
        "speed_up_with_hash_matching": (bool),
        "nr_results": (int > 0 or None),
        "get_mean_distance": (bool),
        "file_urls": (list of strings or None),
        "common_assets_archive_name": (string or None),
        "download_from_batch_cloud_folder": (bool)
    }
Response JSON ["results"]:
    {
        "sorted_indexed_ids_with_distances": (list of [int, float] pairs),
        "mean_distance": (float or None)
    }
Requires files to be sent via the request, via FTP to the cloud batch folder, or listed in "file_urls". You can pass either data files or vectors (torch safetensors ".pt", one-dimensional, any length).
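Because the search job takes many optional fields, a small builder keeps call sites readable. A sketch: the field names match the spec above, but the chosen defaults (hash matching off, 10 results, falling back to the batch folder when no URLs are given) are our illustrative choices, not documented API defaults.

```python
# Assemble the "inputs" dict for a search job; defaults are illustrative.
def search_inputs(archive, file_urls=None, nr_results=10,
                  input_content_type=None, custom_vectorizer_name=None):
    return {
        "input_content_type": input_content_type,
        "custom_vectorizer_name": custom_vectorizer_name,
        "data_type_translator_name": None,
        "archive": archive,
        "archive_content_ids_subset": None,
        "indexed_ids_references": None,
        "name_or_id_to_preference_weight": None,
        "speed_up_with_hash_matching": False,
        "nr_results": nr_results,
        "get_mean_distance": False,
        "file_urls": file_urls,
        "common_assets_archive_name": None,
        # Assumed fallback: query files come from the batch folder
        # whenever no file URLs are supplied.
        "download_from_batch_cloud_folder": file_urls is None,
    }

# Send as: {"job_type": "search",
#           "inputs": search_inputs("my_archive",
#                                   file_urls=["https://example.com/query.jpg"])}
```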
cluster_by_number_of_clusters
Required account privileges: ["read"]
Request JSON ["inputs"]:
    {
        "archive": (string),
        "archive_content_ids_subset": (list of ints or None),
        "nr_of_clusters": (int >= 1 or None; if None or 1, the number of clusters is determined from cluster quality metrics),
        "sort_content_ids_by_distance_to_cluster_center": (bool),
        "sort_clusters_by_descending_inner_diversity": (bool)
    }
Response JSON ["results"]:
    {
        "clustered_content_ids": (list of lists of ints),
        "average_distance_between_cluster_elements_and_cluster_center": (float or None)
    }
cluster_by_calibrated_similarity
Required account privileges: ["read"]
Request JSON ["inputs"]:
    {
        "archive": (string),
        "archive_content_ids_subset": (list of ints or None),
        "sort_content_ids_by_distance_to_cluster_center": (bool),
        "sort_clusters_by_descending_inner_diversity": (bool)
    }
Response JSON ["results"]:
    {
        "clustered_content_ids": (list of lists of ints),
        "average_distance_between_cluster_elements_and_cluster_center": (float or None)
    }