API Documentation
This API is used to authenticate and request jobs from the Vector Curator Cloud.
API URL: https://api.data2vector.ai
Authentication
Endpoint: /session/new
Headers:
    "username_or_email": (string)
    "password": (string)
Response:
    200: { "token": (string) }
    401: { "error": "Invalid username or password" }
    500: { "error": "Internal server error" }
Request Cloud Job
Endpoint: /jobs
Headers:
    "auth_token": (string)
Request JSON:
    {
        "job_type": (string),
        "inputs": (dict or None)
    }
Response:
    200: { "status_endpoint": (string) }
    401: { "error": "Not authenticated" }
    500: { "error": "Internal server error" }
Check Job Status
Endpoint: (status_endpoint)
Headers:
    "auth_token": (string)
Response:
    200: {
        "status": (string),
        "results": (dict or None),
        "error": (string or None)
    }
    401: { "error": "Not authenticated" }
    500: { "error": "Internal server error" }
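The authenticate, request-job, poll-status flow above can be sketched as a small client. This is a minimal sketch: the endpoint paths and field names come from this document, but the HTTP methods, the status strings ("queued", "running"), and the polling interval are assumptions, and the `requests` library is an external dependency.

```python
# Minimal client sketch for the authenticate -> request job -> poll flow.
# Endpoint paths and field names are from this documentation; HTTP methods
# and status strings are assumptions.
import time

API_URL = "https://api.data2vector.ai"

def auth_headers(username_or_email, password):
    """Headers for the /session/new authentication call."""
    return {"username_or_email": username_or_email, "password": password}

def job_payload(job_type, inputs=None):
    """Body for the /jobs call; `inputs` may be None for jobs without inputs."""
    return {"job_type": job_type, "inputs": inputs}

# Example usage (requires network access and valid credentials):
# import requests
# token = requests.post(API_URL + "/session/new",
#                       headers=auth_headers("me@example.com", "secret")).json()["token"]
# job = requests.post(API_URL + "/jobs", headers={"auth_token": token},
#                     json=job_payload("list_content")).json()
# while True:
#     status = requests.get(API_URL + job["status_endpoint"],
#                           headers={"auth_token": token}).json()
#     if status["status"] not in ("queued", "running"):  # assumed status values
#         break
#     time.sleep(5)
```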
FTP Uploading
For jobs with large data input requirements, you can connect to your account's cloud FTP folders with any FTP client (or use file URLs if the files are already hosted).
Host url:
Use the same username as you would in the authentication call
Use the same password as you would in the authentication call
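A batch-folder upload can be sketched with Python's standard-library `ftplib`. The host below is a placeholder (use the host URL from the FTP section above), the credentials are the same as for the authentication call, and the optional `remote_dir` argument is hypothetical, there only in case your account exposes a dedicated sub-folder.

```python
# Sketch of uploading local files to the account's cloud FTP folder.
# Host and remote_dir are placeholders, not values from this documentation.
from ftplib import FTP
from pathlib import Path

def upload_to_batch_folder(host, username, password, file_paths, remote_dir=None):
    """Upload each local file to the FTP folder, keeping its base name."""
    with FTP(host) as ftp:
        ftp.login(user=username, passwd=password)
        if remote_dir:  # hypothetical sub-folder; omit if not applicable
            ftp.cwd(remote_dir)
        for path in map(Path, file_paths):
            with open(path, "rb") as fh:
                ftp.storbinary(f"STOR {path.name}", fh)

# upload_to_batch_folder("ftp.example.com", "me@example.com", "secret",
#                        ["data/img_001.jpg", "data/img_002.jpg"])
```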
Account Management Jobs
create_archive
Required account privileges: ["write"]
Request JSON ["inputs"]:
    {
        "name": (string, 0 < len < 30, unique in account),
        "content_type": (string in ["Image", "Video", "Sound", "Text", "Point_Cloud"]),
        "description": (string, 0 < len < 150, or None),
        "use_default_similarity_calibration": (bool)
    }
Response JSON ["results"]: None
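Client-side validation of the constraints listed above can save a round trip. A sketch, with the helper name ours rather than part of the API:

```python
# Build and validate the "inputs" dict for a create_archive job,
# enforcing the length and enumeration constraints documented above.
CONTENT_TYPES = ["Image", "Video", "Sound", "Text", "Point_Cloud"]

def create_archive_inputs(name, content_type, description=None,
                          use_default_similarity_calibration=True):
    if not 0 < len(name) < 30:
        raise ValueError("name must be 1-29 characters")
    if content_type not in CONTENT_TYPES:
        raise ValueError(f"content_type must be one of {CONTENT_TYPES}")
    if description is not None and not 0 < len(description) < 150:
        raise ValueError("description must be 1-149 characters or None")
    return {
        "name": name,
        "content_type": content_type,
        "description": description,
        "use_default_similarity_calibration": bool(use_default_similarity_calibration),
    }

# Send as: {"job_type": "create_archive", "inputs": create_archive_inputs(...)}
```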
update_parameters
Required account privileges: ["write"]
Request JSON ["inputs"]:
    {
        "archive": (string),
        "description": (string, 0 < len < 150, or None),
        "nr_similar_allowed": (int >= 1 or None)
    }
Response JSON ["results"]: None
list_content
Required account privileges: ["read"]
Request JSON ["inputs"]: None
Response JSON ["results"]:
    {
        "archive_to_content_type_states_nr_of_contents": {
            "archive_name": {
                "description": (string),
                "content_type": (string in ["Image", "Video", "Sound", "Text", "Point_Cloud"]),
                "calibrated_for_similarity": (bool),
                "calibrated_for_relevance": (bool),
                "nr_of_contents": (int)
            }
        },
        "nr_files_in_batch_folder": (int)
    }
remove_content
Required account privileges: ["write"]
Request JSON ["inputs"]:
    {
        "archive": (string),
        "archive_content_ids_subset": (list of ints),
        "delete_archive": (bool)
    }
Response JSON ["results"]: None
get_archive_ids_and_urls
Required account privileges: ["read"]
Request JSON ["inputs"]:
    { "archive": (string) }
Response JSON ["results"]:
    { "id_to_download_url": { (int): (string) } }
get_vectors
Required account privileges: ["read"]
Request JSON ["inputs"]:
    {
        "archive": (string),
        "archive_content_ids_subset": (list of ints or None)
    }
Response JSON ["results"]:
    { "id_to_vector_url": { (int): (string) } }
add_urls_to_contents
Required account privileges: ["write"]
Request JSON ["inputs"]:
    {
        "archive": (string),
        "indexed_content_id_to_url": { (int): (string) }
    }
Response JSON ["results"]: None
clear_batch_folder
Required account privileges: ["write"]
Request JSON ["inputs"]: None
Response JSON ["results"]: None
Support Jobs
fine_tune_vectorizer
Required account privileges: ["write"]
Request JSON ["inputs"]:
    {
        "content_type": (string in ["Image", "Video", "Sound", "Text", "Point_Cloud"]),
        "custom_vectorizer_name_for_sampling": (string or None),
        "starting_custom_vectorizer_name": (string or None),
        "custom_vectorizer_name": (string),
        "file_urls": (list of strings or None),
        "download_from_batch_cloud_folder": (bool)
    }
Response JSON ["results"]: None
Requires files to be sent via FTP to the cloud batch folder or listed in "file_urls".
Fine-tuning a vectorizer is a classification-based training step that forces the model to pay attention to the important features and to ignore the irrelevant ones. We recommend at least 100 and no more than 10000 examples for fine-tuning. You can have up to 100 labels in your fine-tuning dataset. The labels can be any classes that describe the content, and each file can have multiple labels. Create a text file with the labels of each file following this format:
    {
        "file_name_1.ext": ["label_1", "label_2", ...],
        "file_name_2.ext": ["label_2"],
        "file_name_3.ext": ["label_1", "label_3"]
    }
Save the file as "example_to_labels.json" and place it with the dataset files.
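Rather than hand-editing the labels file, it can be generated. A sketch using the exact format described above; the file names and labels are illustrative:

```python
# Write "example_to_labels.json" in the format required by
# fine_tune_vectorizer; file names and labels are placeholders.
import json

example_to_labels = {
    "file_name_1.ext": ["label_1", "label_2"],
    "file_name_2.ext": ["label_2"],
    "file_name_3.ext": ["label_1", "label_3"],
}

# Place the output alongside the dataset files before uploading.
with open("example_to_labels.json", "w") as fh:
    json.dump(example_to_labels, fh, indent=2)
```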
fine_tune_translator
Required account privileges: ["write"]
Request JSON ["inputs"]:
    {
        "translator_name": (string),
        "input_content_type": (string in ["Image", "Video", "Sound", "Text", "Point_Cloud"] or None),
        "input_custom_vectorizer_name": (string or None),
        "output_content_type": (string in ["Image", "Video", "Sound", "Text", "Point_Cloud"] or None),
        "output_custom_vectorizer_name": (string or None),
        "file_urls": (list of strings or None),
        "download_from_batch_cloud_folder": (bool)
    }
Response JSON ["results"]: None
Requires files to be sent via FTP to the cloud batch folder or listed in "file_urls".
Fine-tuning a translator lets you search a vector archive of one content type with data of another content type, or of the same content type with initial processing. We recommend at least 100 and no more than 20000 examples for fine-tuning. Create a text file with the input and output name pairs following this format:
    [
        ["input_name_1", "output_name_1"],
        ["input_name_2", "output_name_2"],
        ["input_name_3", "output_name_3"],
        ...
    ]
Save the file as "training_mappings.json" and place it with the files. Optionally, you can also create a "validation_mappings.json" file with the same format. If no validation file is provided, 15% of the training data will be used for validation.
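The mappings file can likewise be generated. A sketch with placeholder names, using the pair-list format described above:

```python
# Write "training_mappings.json" in the format required by
# fine_tune_translator; input/output names are placeholders.
import json

training_mappings = [
    ["input_name_1", "output_name_1"],
    ["input_name_2", "output_name_2"],
]

# Place the output with the training files; a "validation_mappings.json"
# in the same format is optional.
with open("training_mappings.json", "w") as fh:
    json.dump(training_mappings, fh, indent=2)
```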
sample_data
Required account privileges: ["read"]
Request JSON ["inputs"]:
    {
        "source_content_type": (string in ["Video", "Sound", "Text"]),
        "content_type": (string in ["Image", "Video", "Sound", "Text"]),
        "time_intervals_or_highlights": (string in ["time_intervals", "highlights"]),
        "time_interval": (float or None),
        "nr_samples_per_file": (int or None),
        "custom_vectorizer_name": (string or None),
        "file_urls": (list of strings or None),
        "download_from_batch_cloud_folder": (bool)
    }
Response JSON ["results"]:
    { "samples_file_names_with_download_urls": (list of lists of str) }
Requires files to be sent via FTP to the cloud batch folder or listed in "file_urls".
trim_by_highlights
Required account privileges: ["read"]
Request JSON ["inputs"]:
    {
        "source_content_type": (string in ["Video", "Sound"]),
        "nr_trims_per_file": (int),
        "max_trim_size": (float),
        "min_trim_size": (float or None),
        "custom_vectorizer_name": (string or None),
        "file_urls": (list of strings or None),
        "download_from_batch_cloud_folder": (bool)
    }
Response JSON ["results"]:
    { "trims_file_names_with_download_urls": (list of lists of str) }
Requires files to be sent via FTP to the cloud batch folder or listed in "file_urls".
data_balance
Required account privileges: ["read"]
Request JSON ["inputs"]:
    {
        "clustered_content_ids_sorted_by_decreasing_diversity_with_contents_sorted_by_distance_to_centroid": (list of lists of int),
        "ids_sorted_from_inliers_to_outliers": (list of int or None),
        "ids_sorted_by_essential_examples": (list of int or None),
        "ids_sorted_by_forbidden_examples": (list of int or None)
    }
Response JSON ["results"]:
    {
        "prioritized_over_represented_ids_to_remove": (list of int),
        "prioritized_under_represented_ids_to_source": (list of int)
    }
pca_vector_dim_reduction
Required account privileges: ["read"]
Request JSON ["inputs"]:
    {
        "archive": (string or None),
        "archive_content_ids_subset": (list of ints or None),
        "file_urls": (list of str),
        "nr_of_dimensions": (int),
        "download_from_batch_cloud_folder": (bool)
    }
Response JSON ["results"]:
    { "identifier_to_pca_vector": (dict of int or string to list of float) }
extract_similarity_dataset
Required account privileges: ["read"]
Request JSON ["inputs"]:
    {
        "source_content_type": (str in ["Sound", "Video", "Text"]),
        "content_type": (str in ["Sound", "Video", "Text", "Image"]),
        "max_nr_of_pairs": (int >= 1),
        "custom_vectorizer_name": (string or None),
        "file_urls": (list of strings or None),
        "download_from_batch_cloud_folder": (bool)
    }
Response JSON ["results"]:
    { "similarity_calibration_pairs_download_urls": (list of str) }
Requires files to be sent via FTP to the cloud batch folder or listed in "file_urls".
send_feedback
Required account privileges: ["read"]
Request JSON ["inputs"]:
    { "feedback": (string, 0 < len < 1000) }
Response JSON ["results"]: None
Archive Jobs
calibrate_similarity
Required account privileges: ["write"]
Request JSON ["inputs"]:
    {
        "archive": (string),
        "custom_vectorizer_name": (string or None),
        "file_urls": (list of strings or None),
        "download_from_batch_cloud_folder": (bool)
    }
Response JSON ["results"]: None
Requires files to be sent via FTP to the cloud batch folder or listed in "file_urls". You can pass either data files or vectors (torch safetensors ".pt", one-dimensional, any length).
Similarity calibration trains the redundancy filter and the clustering by similarity. The similarity dataset must be composed of at least 200 and at most 10000 pairs of examples that are similar according to the client's criteria. To assemble the similarity dataset, we recommend you gather your data into clusters, one for each of the fine-tuning labels, then extract at least 2 pairs from each cluster. The file names inside the pairs must start with a prefix that is the id of the pair. Ex:
    1_file_1.ext  1_file_2.ext
    2_file_3.ext  2_file_4.ext
    3_file_5.ext  3_file_6.ext
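The pair-id prefix convention above can be applied programmatically. A sketch that maps original file names to their prefixed names, given a list of similar pairs; the helper name is ours, not part of the API:

```python
# Map original file names to "<pair_id>_<name>" following the naming
# convention required by the similarity dataset.
def pair_prefixed_names(similar_pairs):
    """Given [(file_a, file_b), ...] pairs of similar files, return a
    dict from each original name to its pair-prefixed name."""
    renames = {}
    for pair_id, (file_a, file_b) in enumerate(similar_pairs, start=1):
        for name in (file_a, file_b):
            renames[name] = f"{pair_id}_{name}"
    return renames

# pair_prefixed_names([("cat1.jpg", "cat2.jpg"), ("dog1.jpg", "dog2.jpg")])
# -> {"cat1.jpg": "1_cat1.jpg", "cat2.jpg": "1_cat2.jpg",
#     "dog1.jpg": "2_dog1.jpg", "dog2.jpg": "2_dog2.jpg"}
```

Renaming (or copying) the files with these names before upload keeps the pair structure recoverable on the server side.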
calibrate_relevance
Required account privileges: ["write"]
Request JSON ["inputs"]:
    {
        "archive": (string),
        "archive_content_ids_subset": (list of ints or None)
    }
Response JSON ["results"]: None
index
Required account privileges: ["write"]
Request JSON ["inputs"]:
    {
        "archive": (string),
        "check_for_redundancy_against_archived": (bool),
        "archive_content_ids_subset": (list of ints or None),
        "check_for_redundancy_within_batch": (bool),
        "check_for_relevance": (bool),
        "custom_vectorizer_name": (string or None),
        "file_urls": (list of strings or None),
        "download_from_batch_cloud_folder": (bool)
    }
Response JSON ["results"]:
    {
        "name_to_indexed_content_id": (dict of string to int),
        "exact_duplicate_file_names": (list of strings),
        "failed_vectorization_names": (list of strings),
        "redundant_content_names": (list of strings),
        "irrelevant_content_names": (list of strings)
    }
Requires files to be sent via the request, via FTP to the cloud batch folder, or listed in "file_urls". You can pass either data files or vectors (torch safetensors ".pt", one-dimensional, any length).
inliers_outliers
Required account privileges: ["read"]
Request JSON ["inputs"]:
    {
        "archive": (string),
        "archive_content_ids_subset": (list of ints or None),
        "id_to_preference_weight": (dict of int to int, or None)
    }
Response JSON ["results"]:
    {
        "content_ids_sorted_by_inliers": (list of ints),
        "content_ids_sorted_by_outliers": (list of ints),
        "mean_distance": (float)
    }
search
Required account privileges: ["read"]
Request JSON ["inputs"]:
    {
        "input_content_type": (string in ["Image", "Video", "Sound", "Text", "Point_Cloud"] or None),
        "custom_vectorizer_name": (string or None),
        "data_type_translator_name": (string or None),
        "archive": (string),
        "archive_content_ids_subset": (list of ints or None),
        "indexed_ids_references": (list of ints or None),
        "name_or_id_to_preference_weight": (dict of str or int to int, or None),
        "speed_up_with_hash_matching": (bool),
        "nr_results": (int > 0 or None),
        "get_mean_distance": (bool),
        "file_urls": (list of strings or None),
        "common_assets_archive_name": (string or None),
        "download_from_batch_cloud_folder": (bool)
    }
Response JSON ["results"]:
    {
        "sorted_indexed_ids_with_distances": (list of [int, float] pairs),
        "mean_distance": (float or None)
    }
Requires files to be sent via the request, via FTP to the cloud batch folder, or listed in "file_urls". You can pass either data files or vectors (torch safetensors ".pt", one-dimensional, any length).
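Because the search job takes many optional fields, a small builder keeps call sites readable. A sketch: the field names match the spec above, but the chosen defaults (hash matching off, 10 results, falling back to the batch folder when no URLs are given) are our illustrative choices, not documented API defaults.

```python
# Assemble the "inputs" dict for a search job; defaults are illustrative.
def search_inputs(archive, file_urls=None, nr_results=10,
                  input_content_type=None, custom_vectorizer_name=None):
    return {
        "input_content_type": input_content_type,
        "custom_vectorizer_name": custom_vectorizer_name,
        "data_type_translator_name": None,
        "archive": archive,
        "archive_content_ids_subset": None,
        "indexed_ids_references": None,
        "name_or_id_to_preference_weight": None,
        "speed_up_with_hash_matching": False,
        "nr_results": nr_results,
        "get_mean_distance": False,
        "file_urls": file_urls,
        "common_assets_archive_name": None,
        # Assumed fallback: query files come from the batch folder
        # whenever no file URLs are supplied.
        "download_from_batch_cloud_folder": file_urls is None,
    }

# Send as: {"job_type": "search",
#           "inputs": search_inputs("my_archive",
#                                   file_urls=["https://example.com/query.jpg"])}
```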
cluster_by_number_of_clusters
Required account privileges: ["read"]
Request JSON ["inputs"]:
    {
        "archive": (string),
        "archive_content_ids_subset": (list of ints or None),
        "nr_of_clusters": (int >= 1 or None; if None or 1, the number of clusters is determined from cluster quality metrics),
        "sort_content_ids_by_distance_to_cluster_center": (bool),
        "sort_clusters_by_descending_inner_diversity": (bool)
    }
Response JSON ["results"]:
    {
        "clustered_content_ids": (list of lists of ints),
        "average_distance_between_cluster_elements_and_cluster_center": (float or None)
    }
cluster_by_calibrated_similarity
Required account privileges: ["read"]
Request JSON ["inputs"]:
    {
        "archive": (string),
        "archive_content_ids_subset": (list of ints or None),
        "sort_content_ids_by_distance_to_cluster_center": (bool),
        "sort_clusters_by_descending_inner_diversity": (bool)
    }
Response JSON ["results"]:
    {
        "clustered_content_ids": (list of lists of ints),
        "average_distance_between_cluster_elements_and_cluster_center": (float or None)
    }