Tags: postgresql, hadoop, external-tables, hawq

How do I resolve the error when HAWQ complains: missing data for column "SoldToAddr2"?


We have a small Pivotal Hadoop-HAWQ cluster, and we need to read an external table, i.e.:

select * from ext_table

But when I issue the query, HAWQ complains with the following error:

missing data for column "SoldToAddr2"

We tried the following:

We tried different special characters in the FORMAT clause of the ext_table definition, for example:
CREATE READABLE EXTERNAL TABLE ext_table
(
 "ID" INTEGER,
 time timestamp,
 "Customer" char(7),
 "Name" varchar,
 "ShortName" char(10),
 "ExemptFinChg" char(1),
 "MaintHistory" char(1),
 "CustomerType" char(1),
 "MasterAccount" char(7),
 "StoreNumber" char(7),
 "PrtMasterAdd" char(1),
 "CreditStatus" char(1),
 "CreditLimit" decimal(14),
 "InvoiceCount" decimal(7),
 "Salesperson" char(3),
 "Salesperson1" char(3),
 "Salesperson2" char(3),
 "Salesperson3" char(3),
 "PriceCode" char(2),
 "CustomerClass" char(2),
 "Branch" char(2),
 "TermsCode" char(2),
 "InvDiscCode" char(2),
 "BalanceType" char(1),
 "Area" char(2),
 "LineDiscCode" char(2),
 "TaxStatus" char(1),
 "TaxExemptNumber" char(30),
 "SpecialInstrs" char(30),
 "PriceCategoryTable" char(52),
 "DateLastSale" date,
 "DateLastPay" date,
 "OutstOrdVal" decimal(16),
 "NumOutstOrd" decimal(6),
 "Telephone" char(20),
 "Contact" varchar,
 "AddTelephone" char(20),
 "Fax" char(20),
 "Telex" char(10),
 "TelephoneExtn" char(5),
 "Currency" text,
 "UserField1" char(10),
 "UserField2" decimal(16),
 "GstExemptFlag" char(1),
 "GstExemptNum" char(15),
 "GstLevel" char(1),
 "DetailMoveReqd" char(1),
 "InterfaceFlag" char(1),
 "ContractPrcReqd" char(1),
 "BuyingGroup1" char(2),
 "BuyingGroup2" char(2),
 "BuyingGroup3" char(2),
 "BuyingGroup4" char(2),
 "BuyingGroup5" char(2),
 "StatementReqd" char(1),
 "BackOrdReqd" char(1),
 "ShippingInstrs" char(30),
 "StateCode" char(3),
 "DateCustAdded" date,
 "StockInterchange" char(1),
 "MaintLastPrcPaid" char(1),
 "IbtCustomer" char(1),
 "SoDefaultDoc" char(1),
 "CounterSlsOnly" char(1),
 "PaymentStatus" char(1),
 "Nationality" char(3),
 "HighestBalance" decimal(16),
 "CustomerOnHold" char(1),
 "InvCommentCode" char(3),
 "EdiSenderCode" char(40),
 "RelOrdOsValue" decimal(16),
 "EdiFlag" char(1),
 "SoDefaultType" char(1),
 "Email" char(50),
 "ApplyOrdDisc" char(1),
 "ApplyLineDisc" char(1),
 "FaxInvoices" char(1),
 "FaxStatements" char(1),
 "HighInvDays" decimal(5),
 "HighInv" char(6),
 "DocFax" char(20),
 "DocFaxContact" char(40),
 "SoldToAddr1" char(150),
 "SoldToAddr2" char(80),
 "SoldToAddr3" char(40),
 "SoldToAddr4" char(40),
 "SoldToAddr5" char(40),
 "SoldPostalCode" char(9),
 "ShipToAddr1" char(40),
 "ShipToAddr2" char(40),
 "ShipToAddr3" char(40),
 "ShipToAddr4" char(40),
 "ShipToAddr5" char(40),
 "ShipPostalCode" char(9),
 "State" char(2),
 "CountyZip" char(5),
 "City" char(3),
 "State1" char(2),
 "CountyZip1" char(5),
 "City1" char(3),
 "DefaultOrdType" char(2),
 "PoNumberMandatory" char(1),
 "CreditCheckFlag" char(1),
 "CompanyTaxNumber" char(15),
 "DeliveryTerms" char(3),
 "TransactionNature" decimal(5),
 "DeliveryTermsC" char(3),
 "TransactionNatureC" decimal(5),
 "RouteCode" char(10),
 "FaxQuotes" char(1),
 "RouteDistance" decimal(6),
 "TpmCustomerFlag" char(1),
 "SalesWarehouse" text,
 "TpmPricingFlag" char(1),
 "ArStatementNo" char(2),
 "TpmCreditCheck" char(1),
 "WholeOrderShipFlag" char(1),
 "MinimumOrderValue" decimal(12),
 "MinimumOrderChgCod" char(6),
 "UkVatFlag" char(1),
 "UkCurrency" char(3),
 "TimeStamp" bytea
 )
LOCATION (
    'pxf://hostname/path/to/hdfs?profile=HdfsTextSimple')
FORMAT 'CSV' (delimiter '^' null 'null' quote '"')
ENCODING 'UTF8';

Error detail:

Bad rows are encountered at line 20:

DETAIL:  External table tablename, line 20 of pxf://hostname/path/to/hdfs?profile=HdfsTextSimple: "23020^2015-12-02 11:14:26.52^0023482^Carlos iglesias               ^          ^N^Y^ ^       ^       ..."

What is the best way to resolve these bad-row errors in a Pivotal Hadoop-HAWQ system?

Any help would be much appreciated.


Solution

  • "missing data for column" means the row parsed into fewer fields than the table defines, so the first step is to look at the raw lines. Try a PXF external table with a single text column, then use regular expressions to parse (and debug) the external data inside HAWQ; a query sketch follows the example below.
    An error table may also help. CSV (comma-separated values) is tricky. In your statement you have null 'null', so if there is an empty field, e.g. "", PXF will throw an error.
    My suggestion: use "" as the null string, then post-process the literal "null" values. For example:

    CREATE READABLE EXTERNAL TABLE ext_table_raw
    (raw_text text)
    LOCATION ('pxf://hostname/path/to/hdfs?profile=HdfsTextSimple')
    FORMAT 'TEXT' (delimiter 'OFF' null '')  -- 'OFF' reads each whole line into the single text column
    ENCODING 'UTF8'
    LOG ERRORS INTO pxf_errors SEGMENT REJECT LIMIT 1000 ROWS;
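
    Once the raw rows load, the error table shows which lines were rejected and why, and the good rows can be split on the '^' delimiter directly in SQL. A minimal sketch, assuming the ext_table_raw and pxf_errors names from the statement above (split_part, string_to_array, trim and NULLIF are HAWQ/PostgreSQL built-ins; the column positions follow the sample row in the error DETAIL):

    -- Inspect rows rejected during the scan (captured by LOG ERRORS INTO)
    SELECT linenum, errmsg, rawdata
    FROM pxf_errors
    ORDER BY linenum;

    -- Count fields per line: rows that trigger "missing data for column"
    -- have fewer '^'-separated fields than the table defines
    SELECT array_upper(string_to_array(raw_text, '^'), 1) AS field_count,
           count(*)
    FROM ext_table_raw
    GROUP BY 1
    ORDER BY 1;

    -- Spot-check individual fields; positions 1-4 correspond to
    -- "ID", time, "Customer", "Name" in the original definition
    SELECT split_part(raw_text, '^', 1)                        AS "ID",
           split_part(raw_text, '^', 2)                        AS "time",
           split_part(raw_text, '^', 3)                        AS "Customer",
           NULLIF(trim(split_part(raw_text, '^', 4)), 'null')  AS "Name"
    FROM ext_table_raw
    LIMIT 20;

    Once the field boundaries check out, the same idea carries back to the full CSV definition: declare null '' instead of null 'null', keep the LOG ERRORS INTO ... SEGMENT REJECT LIMIT clause so stray bad rows are quarantined instead of failing the whole query, and clean up any literal "null" strings with NULLIF afterwards.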