PostgreSQL 具有的 NoSQL 特性

了解了 RDBMS 與 NoSQL 的差異之後,我們來了解 PostgreSQL 除了本是 RDBMS 之外,還包含了哪些 NoSQL 的特性:

PS: 可依下列步驟先在地端用 docker 啟一個 postgreSQL database

1. pull posgres image and run

docker run -d \ --name postgres-db \ -e POSTGRES_USER=postgres \ -e POSTGRES_PASSWORD=password \ -e POSTGRES_DB=postgres \ -p 5432:5432 \ postgres

2. 建立一個 demo database

psql -h localhost -p 5432 -U postgres -c "CREATE DATABASE nosql_demo;"

3. 連線

psql -h localhost -p 5432 -U postgres -W -d nosql_demo

4. 輸入密碼 password

Password:

Document types

1. XML documents

是一種 Hierarchical structured

1. 用 xml text 建立資料

Examples:

// create xmltest table CREATE TABLE xmltest (
data xml NOT NULL
); // 插入一筆 xmltest 資料
INSERT INTO xmltest (data) VALUES ('<attendee><bio>
<name>John Doe</name>
<birthYear>1986</birthYear></bio><languages>
<lang level="5">php</lang><lang level="4">python</lang>
<lang level="2">java</lang></languages></attendee>');
// 用 xpath 選取 data 欄位 tag 為 bio, name 的 文本 SELECT data FROM xmltest WHERE
CAST (xpath('//bio/name/text()', data) as text[]) = '{John Doe}'; // 用 EXPLAIN 查看效能 EXPLAIN SELECT data FROM xmltest WHERE CAST (xpath('//bio/name/text()', data) as text[]) = '{John Doe}'; QUERY PLAN ----------------------------------------------------------------------------------------------------- Seq Scan on xmltest (cost=0.00..30.40 rows=7 width=32) Filter: ((xpath('//bio/name/text()'::text, data, '{}'::text[]))::text[] = '{"John Doe"}'::text[]) (2 rows) // CREATE INDEX CREATE INDEX data_bio_name_idx ON xmltest USING btree (CAST (xpath('//bio/name/text()', data) as text[])); // 再次查看效能 EXPLAIN SELECT data FROM xmltest WHERE CAST (xpath('//bio/name/text()', data) as text[]) = '{John Doe}'; QUERY PLAN ----------------------------------------------------------------------------------------------------- Seq Scan on xmltest (cost=0.00..1.01 rows=1 width=32) Filter: ((xpath('//bio/name/text()'::text, data, '{}'::text[]))::text[] = '{"John Doe"}'::text[]) (2 rows)

設定 index 後變快了

2. posgresql 可以用 xmlparse function 去建立 XML 資料

XMLPARSE ( { DOCUMENT | CONTENT } value)

用 xmlparse function 的範例:

// create books table CREATE TABLE books ( id SERIAL PRIMARY KEY, content XML NOT NULL );

// 使用 XMLPARSE 將 XML 內容插入
INSERT INTO books (content) VALUES (
XMLPARSE (DOCUMENT '<?xml version="1.0"?><book><title>Manual</title><chapter>...</chapter></book>')
);

INSERT INTO books (content) VALUES ( XMLPARSE (CONTENT '<root><foo>bar</foo><bar>foo</bar></root>') );

xpath() 函數來查詢 XML 資料中的特定元素,例如提取 <title><foo> 的內容: 

-- 提取 <title> 元素內容
SELECT xpath('/book/title/text()', content) FROM books WHERE id = 1;

-- 提取 <foo> 元素內容
SELECT xpath('//foo/text()', content) FROM books WHERE id = 2;

3. 用 postgreSQL-specific syntax 插入 xml type 資料

-- 使用 XML 資料類型的方式插入 XML 資料
INSERT INTO books (content) VALUES (
xml '<foo>bar</foo>'
);

-- 使用文字轉換為 XML 型別的方式插入
INSERT INTO books (content) VALUES (
'<foo>bar</foo>'::xml
);

2. Key-Value Storage

HStore Data Type

  • Hierarchical (documents link to documents)

  • Key/value store (strongs or other hstore objects/values)

  • GiST or GIN indexing on hstore values

CREATE EXTENSION hstore;

-- postgresql 中使用 hstore 型別
SELECT 'a=>1,a=>2'::hstore;

CREATE TABLE products (
id serial PRIMARY KEY,
name text,
attributes hstore
);

INSERT INTO products (name, attributes)
VALUES ('Smartphone', 'color=>"black", storage=>"64GB"');

SELECT name FROM products WHERE attributes @> 'storage=>"64GB"';

2. Support for JSON (JavaScript Object Notation) and JSONB Data Types

User Profiles and Preferences Storage

1. JSON

Applications that manage user profiles and preferences can leverage PostgreSQL’s JSON capabilities. This allows for a flexible schema that can evolve as the application grows, accommodating various user attributes without the need for schema modifications.

-- 建立 users_json 表格
CREATE TABLE users_json (
id SERIAL PRIMARY KEY,
profile JSON
);

-- 插入使用者資料
INSERT INTO users_json (profile) VALUES
('{
"name": "John Doe",
"age": 30,
"email": "john@example.com",
"profile_info": {
"occupation": "Software Engineer",
"company": "Tech Corp",
"years_experience": 5
},
"contact": {
"phone": "123-456-7890",
"address": {
"street": "123 Main St",
"city": "San Francisco",
"state": "CA",
"country": "USA",
"zip": "94105"
}
},
"preferences": {
"language": "English",
"theme": "dark",
"notifications": {
"email": true,
"push": false,
"sms": true
}
},
"social_media": {
"linkedin": "john-doe",
"twitter": "@johndoe",
"github": "johndoe-dev"
},
"skills": ["Python", "JavaScript", "PostgreSQL", "Docker"]
}'),
('{
"name": "Jane Smith",
"age": 28,
"email": "jane@example.com",
"profile_info": {
"occupation": "Data Scientist",
"company": "Data Analytics Inc",
"years_experience": 3
},
"contact": {
"phone": "987-654-3210",
"address": {
"street": "456 Oak Street",
"city": "New York",
"state": "NY",
"country": "USA",
"zip": "10001"
}
},
"preferences": {
"language": "English",
"theme": "light",
"notifications": {
"email": true,
"push": true,
"sms": false
}
},
"social_media": {
"linkedin": "jane-smith",
"twitter": "@janesmith",
"github": "jane-data"
},
"skills": ["R", "Python", "Machine Learning", "Statistics"]
}');

-- 一些實用的查詢範例:

-- 1. 基本資料查詢
-- 取得所有使用者名稱和郵件
SELECT
profile->>'name' as name,
profile->>'email' as email
FROM users_json;

-- 2. 巢狀資料查詢
-- 取得使用者的職業和公司
SELECT
profile->>'name' as name,
profile->'profile_info'->>'occupation' as occupation,
profile->'profile_info'->>'company' as company
FROM users_json;

-- 3. 地址資訊查詢
SELECT
profile->>'name' as name,
profile->'contact'->'address'->>'city' as city,
profile->'contact'->'address'->>'country' as country
FROM users_json;

-- 4. 偏好設定查詢
-- 查找啟用電子郵件通知的使用者
SELECT profile->>'name' as name
FROM users_json
WHERE profile->'preferences'->'notifications'->>'email' = 'true';

-- 5. 技能查詢
-- 找出會使用 Python 的使用者
-- 使用 JSON_ARRAY_ELEMENTS 來查詢技能
SELECT DISTINCT profile->>'name' as name
FROM users_json,
JSON_ARRAY_ELEMENTS(profile->'skills') as skills
WHERE skills::text LIKE '"Python"';

-- 6. 年齡範圍查詢
SELECT
profile->>'name' as name,
profile->>'age' as age
FROM users_json
WHERE (profile->>'age')::int BETWEEN 25 AND 35;

-- 7. 特定城市的使用者
SELECT profile->>'name' as name
FROM users_json
WHERE profile->'contact'->'address'->>'city' = 'New York';

2. JSONB

Hstore + JSON = JSONB

  • Binary representation of JSON

  • The B stands for better

  • Big amount of operators and functions

  • GIST and GIN indexes

Creating a GIN index on the JSONB column can significantly improve query performance, especially for large datasets

1. 建立 users table

CREATE TABLE users (
id serial PRIMARY KEY,
profile JSONB
);
2. 插入資料
INSERT INTO users (profile) VALUES
('{
"name": "Marie",
"age": 45,
"city": "Paris",
"preferences": {
"language": "French",
"theme": "dark",
"notifications": true
}
}'::jsonb),
('{
"name": "John",
"age": 32,
"city": "New York",
"country": "USA",
"preferences": {
"language": "English",
"theme": "light",
"notifications": false
}
}'::jsonb),
('{
"name": "Yuki",
"age": 28,
"city": "Tokyo",
"country": "Japan",
"preferences": {
"language": "Japanese",
"theme": "auto",
"notifications": true
}}'::jsonb);

3. 建立索引

CREATE INDEX idx_user_profile ON users USING GIN (profile);
4. 查詢
-- 查詢所有美國用戶
SELECT profile->>'name' AS name
FROM users
WHERE profile @> '{"country": "USA"}';

-- 查詢年齡大於 30 的用戶
SELECT profile->>'name' AS name, profile->>'age' AS age
FROM users
WHERE (profile->>'age')::int > 30;

-- 查詢啟用通知的用戶
SELECT profile->>'name' AS name
FROM users
WHERE profile @> '{"preferences": {"notifications": true}}';

-- 查詢特定城市的用戶
SELECT profile->>'name' AS name, profile->>'city' AS city
FROM users
WHERE profile @> '{"city": "Paris"}';
JSON 保留原始格式(包括空格和重複的鍵),而 JSONB 是二進制格式。

JSON 和 JSONB 的主要差異:

  1. JSON 保留原始格式的空白和重複鍵
  2. JSONB 支援更多索引類型和操作符
  3. JSONB 查詢效能通常更好
  4. JSON 插入速度較快,因為不需要轉換

PostgreSQL as Data Warehouse

這邊稍微解釋一下 data warehouse
Data Pipeline 中的一環
Each platform offers a comprehensive suite of services that cover the entire lifecycle:
  • Ingestion: Collecting data from various sources
  • Data Lake: Storing raw data
  • Computation: Processing and analyzing data
  • Data Warehouse: Storing structured data
  • Presentation: Visualizing and reporting insights

AWS uses services like Kinesis for data streaming, S3 for storage, EMR for processing, RedShift for warehousing, and QuickSight for visualization.

There are also some PostgreSQL features that can help us to manage our data warehouse information…
  • Temporary Tables(臨時表格): It’s a short-lived table that exists for the duration of a database session. PostgreSQL automatically drops the temporary tables at the end of a session or a transaction.
-- 建立臨時表格來存儲分析過程中的中間結果
CREATE TEMPORARY TABLE temp_user_stats (
user_id INTEGER,
message_count INTEGER,
avg_message_length NUMERIC
) ON COMMIT DROP;

-- 插入分析數據
INSERT INTO temp_user_stats
SELECT
sender_id,
COUNT(*) as message_count,
AVG(LENGTH(content)) as avg_message_length
FROM messages
GROUP BY sender_id;

-- 使用臨時表格進行進一步分析
SELECT * FROM temp_user_stats
WHERE message_count > 100;
  • Stored Procedures(預存程序): You can use it to create procedures or function on multiple languages (PL/pgSQL, PL/Perl, PL/Python, etc).
-- 建立一個計算使用者訊息統計的預存程序
CREATE OR REPLACE PROCEDURE calculate_user_stats(
user_id_param INTEGER,
OUT total_messages INTEGER,
OUT avg_length NUMERIC
)
LANGUAGE plpgsql
AS $$
BEGIN
SELECT
COUNT(*),
AVG(LENGTH(content))::NUMERIC(10,2)
INTO
total_messages,
avg_length
FROM messages
WHERE sender_id = user_id_param;
END;
$$;

-- 使用預存程序
CALL calculate_user_stats(1, NULL, NULL);
  • Partitioning(分區表): This is really useful for database maintenance, queries using partition key and INSERT performance.
-- 建立按時間分區的訊息表格
CREATE TABLE messages (
id BIGSERIAL,
sender_id INTEGER,
content TEXT,
created_at TIMESTAMP
) PARTITION BY RANGE (created_at);

-- 建立月份分區
CREATE TABLE messages_202401
PARTITION OF messages
FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');

CREATE TABLE messages_202402
PARTITION OF messages
FOR VALUES FROM ('2024-02-01') TO ('2024-03-01');

-- 自動建立新分區的函數
CREATE OR REPLACE FUNCTION create_monthly_partition()
RETURNS TRIGGER AS $$
DECLARE
partition_date TEXT;
partition_name TEXT;
start_date TEXT;
end_date TEXT;
BEGIN
partition_date := to_char(NEW.created_at, 'YYYYMM');
partition_name := 'messages_' || partition_date;
start_date := to_char(date_trunc('month', NEW.created_at), 'YYYY-MM-DD');
end_date := to_char(date_trunc('month', NEW.created_at + interval '1 month'), 'YYYY-MM-DD');
EXECUTE format(
'CREATE TABLE IF NOT EXISTS %I PARTITION OF messages
FOR VALUES FROM (%L) TO (%L)',
partition_name, start_date, end_date
);
RETURN NULL;
END;
$$ LANGUAGE plpgsql;
  • Materialized Views(具體化視圖): The query results are shown as a table.
-- 建立一個儲存每日訊息統計的具體化視圖
CREATE MATERIALIZED VIEW daily_message_stats AS
SELECT
DATE(created_at) as message_date,
COUNT(*) as total_messages,
COUNT(DISTINCT sender_id) as unique_senders,
AVG(LENGTH(content))::INTEGER as avg_message_length
FROM messages
GROUP BY DATE(created_at)
WITH DATA;

-- 建立索引來加速查詢
CREATE INDEX idx_daily_message_stats_date
ON daily_message_stats(message_date);

-- 更新具體化視圖
REFRESH MATERIALIZED VIEW daily_message_stats;

-- 查詢具體化視圖
SELECT * FROM daily_message_stats
WHERE message_date >= CURRENT_DATE - INTERVAL '7 days';
  • Tablespaces(表空間): You can change the data location to a different disk. In this way, you’ll have parallelized disk access.
-- 建立新的表空間在不同的磁碟位置
CREATE TABLESPACE fast_storage
LOCATION '/path/to/fast/disk';

-- 將大型表格移到新的表空間
ALTER TABLE messages
SET TABLESPACE fast_storage;

-- 建立索引在不同的表空間
CREATE INDEX idx_messages_created_at
ON messages(created_at)
TABLESPACE fast_storage;
  • PITR (Point-In-Time Recovery) 備份: You can create backups Point-in-time-recovery compatible, so in case of failure, you can restore the database state on a specific period of time.
-- 在 postgresql.conf 中啟用 WAL 歸檔
wal_level = replica
archive_mode = on
archive_command = 'cp %p /path/to/archive/%f'

-- 建立基礎備份
pg_basebackup -D /backup/base -Ft -z -P

-- 恢復到特定時間點的指令(需要在 recovery.conf 中設定)
recovery_target_time = '2024-01-15 14:30:00'
restore_command = 'cp /path/to/archive/%f %p'
  • 擴展功能 Huge community: And last but not least, PostgreSQL has a huge community where you can find support on many different issues.
-- 安裝常用擴展
CREATE EXTENSION pg_stat_statements; -- 查詢統計
CREATE EXTENSION timescaledb; -- 時間序列數據
CREATE EXTENSION pg_partman; -- 分區管理

-- 使用 pg_stat_statements 監控查詢效能
SELECT
query,
calls,
total_time / calls as avg_time,
rows / calls as avg_rows
FROM pg_stat_statements
ORDER BY total_time DESC
LIMIT 10;

這些特性可以根據需求組合使用,例如:

  • 使用臨時表格存儲 ETL 過程的中間結果
  • 使用具體化視圖加速常用報表查詢
  • 使用分區表管理大量歷史數據
  • 使用表空間優化存儲效能

Reference:

  1. NoSQL Capabilities in PostgreSQL
  2. PostgreSQL as a Schemaless Database

  3. 8.13. XML Type

  4. Unstructured data types in PostgreSQL

  5. Big Data Pipeline Cheatsheet for AWS, Azure, and Google Cloud

  6. Running a Data Warehouse on PostgreSQL

Comments

Popular posts from this blog

《 Imgproxy 使用分析一:圖片下載速度優化分析:Akamai CDN vs Imgproxy 效能比較》

《 Akamai + S3 與 CloudFront + Imgproxy + S3 使用分析二:壓縮圖片設計流程:檔案大小 vs 載入時間的權衡》

程式語言初學者 Docker 入門第二章 —— 安裝與驗證 (Mac)