PostgreSQL Data Loading and FDW¶
PostgreSQL provides multiple methods for loading data, from simple COPY commands to Foreign Data Wrappers that make external data sources queryable as regular tables.
Loading Methods (Fast to Slow)¶
- COPY (server-side, superuser required) - fastest
- \copy (client-side via psql, no superuser) - fast
- Foreign Data Wrappers - query external data as tables
- INSERT statements - slowest, transactional
COPY Command¶
-- Server-side COPY (postgres daemon needs file access)
COPY staging FROM '/path/to/file.csv' WITH (FORMAT csv, HEADER true, DELIMITER ',');
-- From program output (PG 9.3+)
COPY staging FROM PROGRAM 'curl https://example.com/data.csv' WITH (FORMAT csv, HEADER true);
-- Client-side \copy (psql, no superuser needed)
\copy staging FROM 'local_file.csv' WITH (FORMAT csv, HEADER true)
Foreign Data Wrappers (FDW)¶
file_fdw - Read Flat Files¶
CREATE EXTENSION file_fdw;
CREATE SERVER svr_file FOREIGN DATA WRAPPER file_fdw;
CREATE FOREIGN TABLE fdt_data (col1 text, col2 int, col3 date)
SERVER svr_file
OPTIONS (format 'csv', header 'true', filename '/path/to/file.csv', delimiter ',');
postgres_fdw - Query Remote PostgreSQL¶
CREATE EXTENSION postgres_fdw;
CREATE SERVER remote_db FOREIGN DATA WRAPPER postgres_fdw
OPTIONS (host 'remote.host.com', dbname 'db', port '5432');
CREATE USER MAPPING FOR public SERVER remote_db
OPTIONS (user 'u', password 'p');
CREATE SCHEMA remote_public;
IMPORT FOREIGN SCHEMA public FROM SERVER remote_db INTO remote_public;
-- PG 10+: aggregate pushdown (COUNT, MAX) for cross-database queries
ogr_fdw - GDAL-Powered Universal Reader¶
CREATE EXTENSION ogr_fdw;
-- CSV files
CREATE SERVER svr_csv FOREIGN DATA WRAPPER ogr_fdw
OPTIONS (datasource '/data/csvs', format 'CSV');
IMPORT FOREIGN SCHEMA ogr_all FROM SERVER svr_csv INTO staging;
-- Excel spreadsheets (each sheet = table)
CREATE SERVER svr_rates FOREIGN DATA WRAPPER ogr_fdw
OPTIONS (datasource '/path/Rates.xlsx', format 'XLSX',
config_options 'OGR_XLSX_HEADERS=FORCE');
-- SQL Server via ODBC
CREATE SERVER svr_sqlserver FOREIGN DATA WRAPPER ogr_fdw
OPTIONS (datasource 'ODBC:user/pass@DSN', format 'ODBC');
Other FDWs: mysql_fdw, oracle_fdw, db2_fdw, tds_fdw (SQL Server), multicorn (Python).
Binary Files via Large Objects¶
-- Store files as binary in database
CREATE TABLE docs (file_name text PRIMARY KEY, doc bytea, doc_oid oid);
UPDATE docs SET doc_oid = lo_import(file_name); -- store in LO
UPDATE docs SET doc = lo_get(doc_oid); -- retrieve blob
SELECT lo_unlink(doc_oid) FROM docs; -- cleanup LO storage
Command-Line Tools¶
# shp2pgsql: load shapefiles
shp2pgsql -s 4269 -D -d data/file shapefile_table | psql
# ogr2ogr: universal loader (GDAL)
ogr2ogr -f "PostgreSQL" "PG:host=localhost user=postgres dbname=db" data.osm.pbf
ogr2ogr -f "PostgreSQL" "PG:host=localhost user=postgres dbname=db" /data_csv/
# pgloader: migrate from MySQL/SQLite/CSV to PostgreSQL
pgloader mysql://user:pass@host/db postgresql://user:pass@host/db
Pre-Loading Optimizations¶
Before bulk loading: 1. Disable autocommit 2. Drop indexes (recreate after load with CREATE INDEX CONCURRENTLY) 3. Drop foreign keys (recreate after) 4. Increase maintenance_work_mem (for index rebuild) 5. Increase max_wal_size (reduce checkpoint frequency)
Key Facts¶
- COPY is 10-100x faster than individual INSERT statements
- \copy works over network without server filesystem access
- FDW tables are queryable with regular SQL but performance depends on pushdown support
- postgres_fdw supports aggregate and join pushdown since PG 10
- ogr_fdw reads: spreadsheets, shapefiles, ODBC, dbase, OSM data, and more
Gotchas¶
- Server-side COPY requires superuser and file access by postgres daemon
- FDW queries may be slow if entire remote table is fetched before filtering
- Large Object API (lo_import/lo_get) has 2GB size limit per object
- pgloader auto-maps MySQL types to PostgreSQL but may need manual tuning for edge cases
- COPY FROM PROGRAM runs the command as the postgres OS user - security implications
See Also¶
- backup and recovery - pg_dump/pg_restore for data migration
- postgresql configuration tuning - maintenance_work_mem for bulk loads
- ddl schema management - CREATE TABLE for staging tables
- distributed databases - Citus for distributed data loading