We have a small Pivotal Hadoop (HAWQ) cluster. We have created an external table on it pointing to Hadoop (HDFS) files.
Environment:
Product Version: (HAWQ 1.3.0.2 build 14421) on x86_64-unknown-linux-gnu, compiled by GCC gcc (GCC) 4.4.2
What we tried:
Reading from the external table with:
test=# select count(*) from EXT_TAB;
fails with the following error:
ERROR: data line too long. likely due to invalid csv data (seg0 slice1 SEG0.HOSTNAME.COM:40000 pid=447247)
DETAIL: External table trcd_stg0, line 12059 of pxf://hostname/tmp/def_rcd/?profile=HdfsTextSimple: "2012-08-06 00:00:00.0^2012-08-06 00:00:00.0^6552^2016-01-09 03:15:43.427^0005567^COMPLAINTS ..." :
Additional Information:
The DDL of the external table is:
CREATE READABLE EXTERNAL TABLE sysprocompanyb.trcd_stg0
(
"DispDt" DATE,
"InvoiceDt" DATE,
"ID" INTEGER,
time timestamp without time zone,
"Customer" CHAR(7),
"CustomerName" CHARACTER VARYING(30),
"MasterAccount" CHAR(7),
"MasterAccName" CHAR(30),
"SalesOrder" CHAR(6),
"SalesOrderLine" NUMERIC(4, 0),
"OrderStatus" CHAR(200),
"MStockCode" CHAR(30),
"MStockDes" CHARACTER VARYING(500),
"MWarehouse" CHAR(200),
"MOrderQty" NUMERIC(10, 3),
"MShipQty" NUMERIC(10, 3),
"MBackOrderQty" NUMERIC(10, 3),
"MUnitCost" NUMERIC(15, 5),
"MPrice" NUMERIC(15, 5),
"MProductClass" CHAR(200),
"Salesperson" CHAR(200),
"CustomerPoNumber" CHAR(30),
"OrderDate" DATE,
"ReqShipDate" DATE,
"DispatchesMade" CHAR(1),
"NumDispatches" NUMERIC(4, 0),
"OrderValue" NUMERIC(26, 8),
"BOValue" NUMERIC(26, 8),
"OrdQtyInEaches" NUMERIC(21, 9),
"BOQtyInEaches" NUMERIC(21, 9),
"DispQty" NUMERIC(38, 3),
"DispQtyInEaches" NUMERIC(38, 9),
"CustomerClass" CHAR(200),
"MLineShipDate" DATE
)
LOCATION (
'pxf://HOSTNAME-HA/tmp/def_rcd/?profile=HdfsTextSimple'
)
FORMAT 'CSV' (delimiter '^' null '' escape '"' quote '"')
ENCODING 'UTF8';
Any help would be much appreciated.
Based on the source code: https://github.com/apache/incubator-hawq/blob/e48a07b0d8a5c8d41d2d4aaaa70254867b11ee11/src/backend/commands/copy.c
this error is raised when cstate->line_buf.len >= gp_max_csv_line_length evaluates to true.
According to http://hawq.docs.pivotal.io/docs-hawq/guc_config-gp_max_csv_line_length.html
the default maximum CSV line length is 1048576 bytes (1 MB). Have you checked the line lengths in your CSV files and tried increasing the value of this setting?
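One quick way to check is to measure the longest line in the source data. The sketch below uses a locally generated sample.csv with a deliberately long 2000-byte line as a stand-in; for the real data you would pipe the HDFS files through the same awk one-liner instead (e.g. hdfs dfs -cat /tmp/def_rcd/* | awk '...'):

```shell
# Create a stand-in file with one deliberately long (2000-byte) line.
printf 'short line\n%s\n' "$(head -c 2000 /dev/zero | tr '\0' 'x')" > sample.csv

# Print the length in bytes of the longest line in the file.
awk '{ if (length($0) > max) max = length($0) } END { print max }' sample.csv
```

If the result exceeds 1048576, the data really does contain lines longer than the default limit. You could then try raising it, e.g. SET gp_max_csv_line_length = 4194304; in the session before querying the external table (or set it in the server configuration); the documentation page linked above gives the allowed range, and larger values mean larger per-segment line buffers. Note that very long "lines" can also be a symptom of unbalanced quote characters in the CSV causing many physical lines to be read as one logical line, which is worth ruling out first.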