PostgreSQL Application in Image Search and Video and Image Deduplication

Background

Multiple image processing business scenarios, such as image search, video deduplication, face recognition, image beautification, and image deduplication, encounter the common challenge of managing the humongous volume of duplicate images and videos stored on the server.

For instance, users may upload many videos where specific video has different versions with different resolutions, audio tracks, or compression ratios. This leads to a large number of duplicate videos saved on the server. Another example is the identification of porn videos or images. It is crucial to identify such content but the challenge is how to identify porn videos and images.

On the other hand, image search is the second most common search engine after text search. Image search engines, such as Alibaba Cloud Image Search, allow you to upload a particular picture to the interface and search out a bunch of similar pictures.

This article describes how to screen out duplicate videos and how to implement an efficient image search using PostgreSQL. With the omnipotent API of PostgreSQL, it is easy to extend its image search function.

PostgreSQL Image Search Plug-in: Background Technology

The PostgreSQL image search plug-in uses the mainstream Haar wavelet technology to store images after conversion. Take a look at its Wikipedia page to learn more about this technology.

You can also refer to this document to know more about how to search similar images in PostgreSQL.

Following are some examples to illustrate the Haar wavelet technology application.

Introduction to the PostgreSQL Image Search Plug-in

Follow the steps below to install the PostgreSQL Image Search Plug-in:

Step 1. Add dependency on gd.h.

yum install -y gd-devel

Step 2. Download and install imgsmlr.

$ git clone https://github.com/postgrespro/imgsmlr
$ cd imgsmlr
$ export PGHOME=/home/digoal/pgsql9.5
$ export PATH=$PGHOME/bin:$PATH:.
$ make USE_PGXS=1
$ make USE_PGXS=1 install

Step 3. Install the plug-in.

$ psql
psql (9.5.3)
Type "help" for help.
postgres=# create extension imgsmlr;
CREATE EXTENSION

The following are two newly added data types in imgsmlr.

The KNN operator and gist index method that supports the pattern and signature types can be used for similar image search.

The following table shows several newly added functions. Now, you can convert binary images to the pattern type and convert the data stored in the pattern into the signature type.

PostgreSQL Image Search Plug-in: Testing

Follow the steps listed below to test the PostgreSQL Image Search Plug-in:

Step 1. Import some images as shown below (the more the better).

Step 2. Now, create the image table.

create table image (id serial, data bytea);

Step 3. Import the images to the database.

insert into image(data) select pg_read_binary_file('文件路径');

Step 4. Next, convert the images to the pattern and signature types.

CREATE TABLE pat AS (
SELECT
id,
shuffle_pattern(pattern) AS pattern,
pattern2signature(pattern) AS signature
FROM (
SELECT
id,
jpeg2pattern(data) AS pattern
FROM
image
) x
);

Step 5. Create indexes as shown below.

ALTER TABLE pat ADD PRIMARY KEY (id);CREATE INDEX pat_signature_idx ON pat USING gist (signature);

Step 6. Now you can run an approximation query, such as querying images that are similar to id = : id images and retrieve the top 10 items on the similarity ranking list.

SELECT
id,
smlr
FROM
(
SELECT
id,
pattern <-> (SELECT pattern FROM pat WHERE id = :id) AS smlr
FROM pat
WHERE id <> :id
ORDER BY
signature <-> (SELECT signature FROM pat WHERE id = :id)
LIMIT 100
) x
ORDER BY x.smlr ASC
LIMIT 10

Here, you can use KNN indexing to quickly get output based on the similarity rankings.

Let’s take a look at the following examples.

Video Deduplication

For video deduplication, extract keyframes in a video to generate the Cartesian product through self-correlation. Calculate the similarity between any two images of different videos, and when the similarity reaches a certain threshold value, the two videos are deemed the same.

Refer to the following example for better understanding.

创建图片表,并将所有视频的关键帧导入表中
create table image (id serial8 primary key, movie_id int, data bytea);
导入图片,假设为jpeg格式
... 略 ...
生成patten 和 signature
CREATE TABLE pat AS (
SELECT
id, movie_id,
shuffle_pattern(pattern) AS pattern,
pattern2signature(pattern) AS signature
FROM (
SELECT
id, movie_id,
jpeg2pattern(data) AS pattern
FROM
image
) x
);
计算不同视频的相似度
select t1.movie_id, t1.id, t1.signature<->t2.signature from
pat t1 join pat t2 on (t1.movie_id<>t2.movie_id)
order by t1.signature<->t2.signature desc
or select t1.movie_id, t1.id, t1.signature<->t2.signature from
pat t1 join pat t2 on (t1.movie_id<>t2.movie_id)
where t1.signature<->t2.signature > 0.9
order by t1.signature<->t2.signature desc

Conclusion

In short, PostgreSQL is a very powerful database with highly customizable functions and does not require you to make any changes to the PostgreSQL kernel. Therefore, it is safe and reliable. The image search technology is an example of PostgreSQL feature extension and offers extremely high speed.

Build your own PostgreSQL solution on Alibaba Cloud with ApsaraDB for RDS PostgreSQL.

Original Source:

Follow me to keep abreast with the latest technology news, industry insights, and developer trends.