PostgresML: Open-source Python Library for Training and Deploying ML Models in PostgreSQL via SQL Queries

Comprehensive Guide

Vishnu Sivan
The Pythoneers


AI engineers face challenges in managing machine learning models because of the complex infrastructure involved. These setups require intricate configurations and microservices for both training and deploying models, making the process time- and resource-intensive. Existing solutions, though available, often demand extensive configuration and external dependencies, adding complexity to the overall machine learning workflow.

PostgresML is a PostgreSQL extension that enables users to run training and inference on both text and tabular data through SQL queries. It brings machine learning models directly into the PostgreSQL database, allowing advanced algorithms to process data efficiently without leaving the database.

In this article, we will go through the basics of PostgresML and use the extension through its hosted online notebooks.

Getting Started


What is PostgresML

PostgresML is an open-source library that enhances PostgreSQL, transforming it into a robust platform for classical machine learning and AI. Its notable feature is the ability to train and deploy machine learning models directly within the database using standard SQL queries. The library supports GPU-powered inference for fast predictions, offers streaming responses for large language models, and lets you manage open-source ML models from hubs like Hugging Face, simplifying the tracking of experiment results.

PostgresML supports training on over 50 algorithms, including popular ones like random forests and neural networks. It enables the creation and indexing of vector embeddings for text search and recommendations. It also scales horizontally, reaching millions of predictions per second while retaining PostgreSQL’s reliability.

The library streamlines the MLOps pipeline, allowing a seamless transition from training to deployment via simple SQL queries. Consolidating the model data pipeline in PostgreSQL boosts operational efficiency, providing faster insights by keeping models close to data and applications.

In summary, PostgresML simplifies machine learning infrastructure complexities, bringing models directly into the PostgreSQL database environment. This integration reduces the reliance on additional services, enhancing machine learning workflows efficiently. PostgresML marks a user-friendly future for machine learning operations.

Installation

PostgresML installation comprises a PostgreSQL database, a machine learning extension for Postgres, and a dashboard app. The extension provides comprehensive machine learning functionality, while the dashboard app offers an easy interface for SQL notebooks, ML experiments, and model management.

You have two options for installing PostgresML: a serverless cloud version and a Docker-based installation. If you prefer to explore the functionality without dealing with Docker, you can register for a free PostgresML account. Within seconds, you’ll have access to a free database, complete with GPU support and cutting-edge LLMs.

In this article, we will proceed with the serverless cloud version. If you are interested in setting up with Docker, refer to the official PostgresML documentation.

Serverless cloud setup

Follow the steps to perform the serverless cloud setup for PostgresML.

  • Visit the official site at https://postgresml.org/signin and log in. If you are a new user, sign up using the following link: https://postgresml.org/signup.
  • Choose the starter package, specifically opting for the serverless cloud option.
  • Once logged in, navigate to the dashboard, where you will find connectivity information including the host, port, and database URL. Use this information to connect to your PostgresML database remotely.
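
For example, connecting from a local psql client looks like this. The host, port, user, and password below are placeholders; substitute the actual values shown on your dashboard:

```shell
# Placeholder connection string — copy the real one from your PostgresML dashboard
psql "postgres://user:password@host.db.cloud.postgresml.org:6432/pgml"
```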

Explore PostgresML tutorials

After configuring the serverless cloud, navigate to the sidebar and click on the “Notebooks” option, where several tutorials are available. Choose “Tutorial 0: Welcome to PostgresML.” PostgresML notebooks resemble Jupyter notebooks, presenting content in separate cells. You can select a cell and execute it by clicking on “Run.”

Create your first PostgresML notebook

Go to the sidebar and select the Notebooks option. Then, click on + Create New Notebook to create a new PostgresML notebook. Provide a name for the notebook and double-click on it to start working on your first script.

1. Text Data

Let’s begin by conducting various textual data manipulations and executing Natural Language Processing (NLP) tasks such as translation, sentiment analysis, summarization, and text generation.

Text Translation

Click on + Create New Cell to create a new cell in your notebook. Add the following code to the cell and click on Run to execute it.

SELECT pgml.transform(
    'translation_en_to_fr',
    inputs => ARRAY[
        'Welcome to PostgresML!',
        'The next generation ML Database'
    ]
) AS french;

You will get the output as follows:

[{"translation_text":"Bienvenue à PostgresML!"},
{"translation_text":"Base de données sur le ML de la prochaine génération"}]

Sentiment Analysis

Sentiment analysis is a natural language processing technique used to determine the sentiment expressed in a piece of text. It involves analyzing the text to understand whether the sentiment conveyed is positive, negative, or neutral. It helps in understanding the opinions, attitudes, and emotions of individuals or groups towards a particular subject, product, service, or event.

Click on + Create New Cell to create a new cell in your notebook. Add the following code to the cell and click on Run to execute it.

SELECT pgml.transform(
    task => 'text-classification',
    inputs => ARRAY[
        'I love PostgresML!',
        'I am really worried about time. ☹️'
    ]
) AS positivity;

You will get the output as follows:

[{"label":"POSITIVE","score":0.9996743202209472},
{"label":"NEGATIVE","score":0.9936240911483764}]

Summarization

Summarization is the process of condensing a piece of text while retaining its key information and main ideas. It involves automatically generating a shorter version of the original text, often in the form of a summary or abstract.

Click on + Create New Cell to create a new cell in your notebook. Add the following code to the cell and click on Run to execute it.

SELECT pgml.transform(
    task => '{
        "task": "summarization",
        "model": "sshleifer/distilbart-cnn-12-6"
    }'::JSONB,
    inputs => ARRAY[
        'India, officially known as the Republic of India, is a country in South Asia. It is the seventh-largest country by land area, the second-most populous country, and the most populous democracy in the world. India is known for its rich history, diverse cultures, and ancient civilizations dating back thousands of years. It is home to several major religions, including Hinduism, Buddhism, Jainism, and Sikhism.'
    ]
);

You will get the output as follows:

[{"summary_text":" India is the seventh-largest country by land area, the second-most populous country in the world . It is home to several major religions, including Hinduism, Buddhism, Jainism, and Sikhism . India is known for its rich history, diverse cultures, and ancient civilizations dating back thousands of years ."}]

Text Generation

Text generation involves creating new text, such as completing sentences or rephrasing existing text. It serves multiple purposes like generating code or crafting stories. Completion generation models predict the next word in a sequence, while text-to-text models learn mappings between pairs of texts, like language translation. Well-known models for text generation include GPT-based ones, T5, T0, and BART.

Click on + Create New Cell to create a new cell in your notebook. Add the following code to the new cell and click on Run to execute it.

SELECT pgml.transform(
    task => 'text-generation',
    inputs => ARRAY[
        'The year is 2050, and humanity has made significant advancements in technology. Artificial intelligence now plays a central role in our daily lives, from self-driving cars to virtual assistants. '
    ]
) AS answer;

You will get the output as follows:

[[{"generated_text":"The year is 2050, and humanity has made significant advancements in technology. Artificial intelligence now plays a central role in our daily lives, from self-driving cars to virtual assistants.  This future technology is about to be even more compelling than before,"}]]

2. Token Classification

Token classification is a key task in natural language understanding, involving assigning labels to individual tokens within a text. Named Entity Recognition (NER) and Part-of-Speech (PoS) tagging are common subtasks. NER models identify entities like people, locations, and dates, while PoS tagging identifies parts of speech such as nouns and verbs.

Named Entity Recognition

Named Entity Recognition (NER) is the process of identifying and classifying named entities within a text. These entities typically include the names of people, locations, organizations, dates, numerical expressions, and other specific terms or phrases that refer to real-world objects or concepts.

Click on + Create New Cell to create a new cell in your notebook. Add the following code to the new cell and click on Run to execute it.

SELECT pgml.transform(
    inputs => ARRAY[
        'I am Vishnu Sivan and I live in Kerala, India.'
    ],
    task => 'token-classification'
) AS ner;

You will get the output as follows:

[[
{"end":11,"entity":"I-PER","index":3,"score":0.994789719581604,"start":5,"word":"Vishnu"},
{"end":32,"entity":"I-LOC","index":8,"score":0.9984140396118164,"start":26,"word":"Kerala"},
{"end":39,"entity":"I-LOC","index":10,"score":0.9995112419128418,"start":34,"word":"India"}
]]

Fill Mask

Fill-mask is a task where words in a sentence are masked, and the goal is to predict the missing words.

Click on + Create New Cell to create a new cell in your notebook. Add the following code to the new cell and click on Run to execute it.

SELECT pgml.transform(
    task => '{
        "task": "fill-mask"
    }'::JSONB,
    inputs => ARRAY[
        'Delhi is the <mask> of India.'
    ]
) AS answer;

You will get the output as follows:

[
{"score":0.735087513923645,"sequence":"Delhi is the capital of India.","token":812,"token_str":" capital"},
{"score":0.06639654189348221,"sequence":"Delhi is the Capital of India.","token":1867,"token_str":" Capital"},
{"score":0.02952105924487114,"sequence":"Delhi is the birthplace of India.","token":32357,"token_str":" birthplace"},
{"score":0.02774127759039402,"sequence":"Delhi is the Gateway of India.","token":20996,"token_str":" Gateway"},
{"score":0.012980145402252674,"sequence":"Delhi is the pride of India.","token":7040,"token_str":" pride"}
]

3. Vector Database

A vector database is designed to store and manage vectors, which represent data points in multi-dimensional space. Vectors can represent various data types like images, text, audio, and numerical data. The database supports efficient searching and retrieval of vectors through methods such as nearest neighbor search, clustering, and indexing. This facilitates tasks like image search, recommendation systems, and natural language processing.

PostgresML extends PostgreSQL databases to function as vector databases by generating embeddings from stored text. The pgml.embed function is used to create embeddings, taking a transformer name and text value as input. This function automatically downloads and caches the transformer for future use, saving time and resources.
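
As a minimal sketch of the call shape (using the same transformer name as later in this section; any text value works), pgml.embed can be invoked directly in a SELECT:

```sql
-- Embed a single piece of text; the result is an array of floats
-- whose length depends on the chosen model.
SELECT pgml.embed('distilbert-base-uncased', 'PostgresML turns Postgres into a vector database') AS embedding;
```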

Creating embeddings using transformers

In this section, we will be using PostgresML for generating embeddings from a dataset of tweets often employed in sentiment analysis. We will employ the pgml.embed function to produce an embedding for each tweet in the dataset. Subsequently, these embeddings will be inserted into a table named tweet_embeddings.

Click on + Create New Cell to create a new cell in your notebook. Add the following code to the new cell and click on Run to execute it.

SELECT pgml.load_dataset('tweet_eval', 'sentiment');

SELECT * FROM pgml.tweet_eval LIMIT 10;

CREATE TABLE tweet_embeddings AS
SELECT text, pgml.embed('distilbert-base-uncased', text) AS embedding
FROM pgml.tweet_eval;

SELECT * FROM tweet_embeddings LIMIT 2;

You will get the output as follows:

text                                                                                                                    | embedding
------------------------------------------------------------------------------------------------------------------------+----------------------------------------------
"QT @user In the original draft of the 7th book, Remus Lupin survived the Battle of Hogwarts. #HappyBirthdayRemusLupin"  | {-0.1567948312,-0.3149209619,0.2163394839,..}
"Ben Smith / Smith (concussion) remains out of the lineup Thursday, Curtis #NHL #SJ"                                     | {-0.0701668188,-0.012231146,0.1304316372,..}

Indexing your embeddings

When indexing embeddings, it’s vital to balance accuracy with speed. Exact nearest-neighbor search returns precise results but is slower than approximate nearest neighbor (ANN) techniques. Here, the index is created on the embedding column of the tweet_embeddings table, which holds the vector embeddings generated from the tweet dataset. The vector_cosine_ops argument specifies that cosine distance is used to compare embeddings. This index enables swift retrieval of records similar to a query vector, which is useful for machine learning tasks like similarity search and recommendation systems.

We’re establishing an index on the tweet_embeddings table using the ivfflat method. This method partitions vectors into lists (an inverted file, IVF) and stores them uncompressed (flat), trading a small loss in recall for much faster search.

CREATE INDEX ON tweet_embeddings USING ivfflat (embedding vector_cosine_ops);
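
As an optional tuning sketch — assuming the pgvector-style WITH (lists = …) option, with 100 as an arbitrary illustrative value — the number of inverted lists can be set explicitly; more lists make searches faster but less exhaustive:

```sql
-- Hypothetical tuning: partition the vectors into 100 inverted lists
CREATE INDEX ON tweet_embeddings
USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 100);
```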

Querying the index using embeddings for your queries

After indexing your embeddings, you can execute queries by providing a query embedding representing the desired search. The index will then retrieve the closest matching embeddings from the database, determined by the similarity between the query embedding and the stored embeddings.

Click on + Create New Cell to create a new cell in your notebook. Add the following code to the new cell and click on Run to execute it.

WITH query AS (
    SELECT pgml.embed('distilbert-base-uncased', 'Star Wars christmas special is on Disney')::vector AS embedding
)
SELECT text FROM tweet_embeddings, query
ORDER BY tweet_embeddings.embedding::vector <=> query.embedding
LIMIT 5;

You will get the output as follows:

Happy Friday with Batman animated Series 90S forever!
"Fri Oct 17, Sonic Highways is on HBO tonight, Also new episode of Girl Meets World on Disney"
tfw the 2nd The Hunger Games movie is on Amazon Prime but not the 1st one I didn't watch
5 RT's if you want the next episode of twilight princess tomorrow
Jurassic Park is BACK! New Trailer for the 4th Movie, Jurassic World -

Thanks for reading this article. I want to thank Gowri M Bhatt for reviewing the content.

If you enjoyed this article, please click on the clap button 👏 and share to help others find it!
