Data De-duplication in Image Search Services

Alibaba Cloud
5 min readJun 8, 2018

--

Image processing technology, such as image search, has a multitude of applications in the real world. For example, Internet users may upload multiple versions of a video or image, each with different formatting, audio tracks, or compression ratios. This leads to a significant number of duplicate videos stored on the service end. However, this problem can be solved using data de-duplication. But how is this normally done?

When you use search engines to look for relevant images, the search engine will process the image and the tags related to the image. For example, when I search for a “snowman” image, a search engine may return me this result.

Pretty accurate right? Typically, PostgreSQL is behind the implementation of the image search and its Payment Gateway Application Programming Interface (API) extends the image search function.

PostgreSQL’s Image Search Plug-in Background Technology

PostgreSQL’s image search plug-in adopts the mainstream Haar wavelet technology to convert and store an image. The following figures briefly describe the Haar wavelet technology. For additional details, refer to the following Wikipedia link: https://en.wikipedia.org/wiki/Haar_wavelet

Steps to Install PostgreSQL Image Search Plug-in

Below are the steps to install PostgreSQL image search plug-in:

Dependent on gd.h.

# yum install -y gd-devel

Download and install imgsmlr.

  • $ git clone https://github.com/postgrespro/imgsmlr $ cd imgsmlr $ export PGHOME=/home/digoal/pgsql9.5 $ export PATH=$PGHOME/bin:$PATH:. $ make USE_PGXS=1 $ make USE_PGXS=1 install

Install the plug-in.

  • $ psql psql (9.5.3) Type "help" for help. postgres=# create extension imgsmlr; CREATE EXTENSION

Two data types now exist in imgsmlr.

For similar image searching, use the gist index method (supporting the pattern and signature types) and KNN operator.

This adds several functions.

You can convert the binary images into the pattern type and convert the data stored in pattern into the signature type.

Steps to Perform PostgreSQL Image Search Plug-in Test:

Once you are done installing, carry out these steps to perform PostgreSQL image search plug-in test:

  1. Import images, such as the following (the more the better).
  1. Create the image table (id serial, data bytea);
  2. Import the images to the database.
  3. Insert into image(data) select pg_read_binary_file;
  4. Convert the image to the pattern and signature type.
  • CREATE TABLE pat AS ( SELECT id, shuffle_pattern(pattern) AS pattern, pattern2signature(pattern) AS signature FROM ( SELECT id, jpeg2pattern(data) AS pattern FROM image ) x );
  1. Create an index.
  • ALTER TABLE pat ADD PRIMARY KEY (id); CREATE INDEX pat_signature_idx ON pat USING gist (signature);
  1. Perform an approximation query, such as querying images that are similar to id = :id images and retrieving the top 10 items on the similarity ranking list.
  • SELECT id, smlr FROM ( SELECT id, pattern <-> (SELECT pattern FROM pat WHERE id = :id) AS smlr FROM pat WHERE id <> :id ORDER BY signature <-> (SELECT signature FROM pat WHERE id = :id) LIMIT 100 ) x ORDER BY x.smlr ASC LIMIT 10
  1. K — Nearest Neighbour (KNN) indexing is an option here and the result is output quickly based on similarity rankings.

Testing Our Image Search Engine

For the most part, our search engine works as expected.

However, sometimes the image search does not work too well.

This is because the computer “sees” the images differently from humans. It processes an object as a 2D matrix, and transform it to a signature, which is readable for computers.

Video De-duplication Service

For video de-duplication, you can extract key frames in a video to generate the Cartesian product through self-correlation. Remember to calculate the similarity of two images of different videos. When the similarity reaches a certain threshold, the services deem the two videos the same.

Example:

  1. Create the image table and import the key frames of all videos into the table (id serial8 primary key, movie_id int, data bytea);
  2. Import the image (assume it is in jpeg format).
  3. Skipped
  4. Generate the pattern and signature types
  • CREATE TABLE pat AS ( SELECT id, movie_id, shuffle_pattern(pattern) AS pattern, pattern2signature(pattern) AS signature FROM ( SELECT id, movie_id, jpeg2pattern(data) AS pattern FROM image ) x );
  1. Calculate the similarity of different videos.
  • select t1.movie_id, t1.id, t1.signature<->t2.signature from pat t1 join pat t2 on (t1.movie_id<>t2.movie_id) order by t1.signature<->t2.signature desc or select t1.movie_id, t1.id, t1.signature<->t2.signature from pat t1 join pat t2 on (t1.movie_id<>t2.movie_id) where t1.signature<->t2.signature > 0.9 order by t1.signature<->t2.signature desc

Summary

Image de-duplication requires Postgres as their database and uses its API. PostgreSQLis a powerful database with customizable functions. It not only ensures image de-duplication effectively but is also safe and reliable. Video de-duplication is the additional feature that is possible using PostgreSQL. Haar wavelet algorithm adds to the possibility of searching images on popular search engines. The implementation of PostgreSQL and installation are aspects that are worth knowing.

Reference:

https://www.alibabacloud.com/blog/data-de-duplication-in-image-search-services_593726?spm=a2c41.11642830.0.0

--

--

Alibaba Cloud
Alibaba Cloud

Written by Alibaba Cloud

Follow me to keep abreast with the latest technology news, industry insights, and developer trends. Alibaba Cloud website:https://www.alibabacloud.com

No responses yet