This question is probably another instance of
BigQuery UDF memory exceeded error on multiple rows but works fine on single row
However, it was suggested that I post this as a new question instead of as an answer there.
I'm using JavaScript to parse log files into a table. The JavaScript parse function is relatively simple. It works on 1M rows but fails at 3M rows. Log files can be a lot bigger than 3M rows, so the failure is a problem.
The function is below.
function parseLogRow(row, emit) {
  // Reassemble the (up to) three imported columns into a single log line.
  var r = (row.logrow ? row.logrow : "") + (row.l2 ? " " + row.l2 : "") + (row.l3 ? " " + row.l3 : "");
  var ts = null;
  var category = null;
  var user = null;
  var message = null;
  var db = null;
  var seconds = null;
  var found = false;
  if (r) {
    // Expected shape: timestamp [category|user|db] :: (optional seconds) message
    var m = r.match(/^(\d\d\d\d-\d\d-\d\d \d\d:\d\d:\d\d\.\d\d\d (\+|\-)\d\d\d\d) \[([^|]*)\|([^|]*)\|([^\]]*)\] ::( \(([\d\.]+)s\))? (.*)/);
    if (m) {
      ts = new Date(m[1]) * 1;  // epoch milliseconds
      category = m[3] || null;
      user = m[4] || null;
      db = m[5] || null;
      seconds = m[7] || null;
      message = m[8] || null;
      found = true;
    } else {
      message = r;  // keep unparsed lines verbatim
      found = false;
    }
  }
  emit({
    ts: ts,
    category: category,
    user: user,
    db: db,
    seconds: seconds * 1.0,
    message: message,
    found: found
  });
}
bigquery.defineFunction(
  'parseLogRow',              // Name of the function exported to SQL
  ['logrow', 'l2', 'l3'],     // Names of input columns
  [                           // Output schema
    {'name': 'ts', 'type': 'float'},
    {'name': 'category', 'type': 'string'},
    {'name': 'user', 'type': 'string'},
    {'name': 'db', 'type': 'string'},
    {'name': 'seconds', 'type': 'float'},
    {'name': 'message', 'type': 'string'},
    {'name': 'found', 'type': 'boolean'},
  ],
  parseLogRow                 // Reference to JavaScript UDF
);
I'm referencing the function with this query:
SELECT
  ROW_NUMBER() OVER() AS row_num,
  ts, category, user,
  db, seconds, message, found,
FROM parseLogRow((
  SELECT * FROM [#{dataset}.today]
  LIMIT 1000000
))
Some example data in the 'today' table looks like this (as a CSV):
logrow,l2,l3
# Logfile created on 2015-12-29 00:00:09 -0800 by logger.rb/v1.2.7,,
2015-12-29 00:00:09.262 -0800 [INFO|7aaa0|] :: Running scheduled job: confirm running gulp process,,
2015-12-29 00:00:09.277 -0800 [DEBUG|7aaa0|] :: Restarted gulp process,,
2015-12-29 00:00:09.278 -0800 [INFO|7aaa0|] :: Completed scheduled job: confirm running gulp process,,
2015-12-29 00:00:14.343 -0800 [DEBUG|7aaa2|scheduler] :: Polling for pending tasks (master: true),,
2015-12-29 00:00:19.396 -0800 [INFO|7aaa4|] :: Running scheduled job: confirm running gulp process,,
2015-12-29 00:00:19.409 -0800 [DEBUG|7aaa4|] :: Restarted gulp process,,
2015-12-29 00:00:19.410 -0800 [INFO|7aaa4|] :: Completed scheduled job: confirm running gulp process,,
2015-12-29 00:00:29.487 -0800 [INFO|7aaa6|] :: Running scheduled job: confirm running gulp process,,
2015-12-29 00:00:29.500 -0800 [DEBUG|7aaa6|] :: Restarted gulp process,,
2015-12-29 00:00:29.500 -0800 [INFO|7aaa6|] :: Completed scheduled job: confirm running gulp process,,
2015-12-29 00:00:39.597 -0800 [INFO|7aaa8|] :: Running scheduled job: confirm running gulp process,,
2015-12-29 00:00:39.610 -0800 [DEBUG|7aaa8|] :: Restarted gulp process,,
2015-12-29 00:00:39.611 -0800 [INFO|7aaa8|] :: Completed scheduled job: confirm running gulp process,,
2015-12-29 00:00:44.659 -0800 [DEBUG|7aaaa|scheduler] :: Polling for pending tasks (master: true),,
2015-12-29 00:00:49.687 -0800 [INFO|7aaac|] :: Running scheduled job: confirm running gulp process,,
2015-12-29 00:00:49.689 -0800 [DEBUG|7aaac|] :: Restarted gulp process,,
2015-12-29 00:00:49.689 -0800 [INFO|7aaac|] :: Completed scheduled job: confirm running gulp process,,
2015-12-29 00:00:59.869 -0800 [INFO|7aaae|] :: Running scheduled job: confirm running gulp process,,
2015-12-29 00:00:59.871 -0800 [DEBUG|7aaae|] :: Restarted gulp process,,
This is a bit of a hack: I import the log as a 3-column table (really, just one column) by loading it as a CSV with the delimiter set to tab (we usually don't have tabs in the log file), then use a query to turn it into the table I actually want.
I like this pattern because the parsing is fast and distributed (when it works).
A job that failed is: bigquery-looker:bquijob_260be029_153dd96cfdb. Please contact me if you need a reproducible case.
Any help or suggestion would be appreciated.
Point #1
I don't see an issue with the UDF itself - it worked for me even on 10M rows.
I think the issue is rather the use of ROW_NUMBER() OVER(): without a PARTITION BY, all rows have to be collected onto a single worker to be numbered, which is what runs out of resources as the table grows. Remove it and it should work:
SELECT
  ts, category, user,
  db, seconds, message, found,
FROM parseLogRow((
  SELECT * FROM [#{dataset}.today]
))
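If you still need some per-row identifier and it does not have to be a strict sequence, one workaround is to derive an id from the parsed fields instead of using a global window. This is only a sketch, assuming legacy SQL's HASH(), CONCAT() and STRING() functions and that the chosen combination of fields is unique enough for your data:
SELECT
  HASH(CONCAT(STRING(ts), IFNULL(message, ''), IFNULL(user, ''))) AS row_id,
  ts, category, user,
  db, seconds, message, found,
FROM parseLogRow((
  SELECT * FROM [#{dataset}.today]
))
Unlike ROW_NUMBER() OVER(), this keeps the computation distributed, since no single worker ever has to see all of the rows.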
Point #2
From a performance perspective, the query below should run faster (I think), and in general I would suggest avoiding UDFs in cases where "plain" BigQuery works just as well:
SELECT
REGEXP_EXTRACT(logrow, r'^(\d\d\d\d-\d\d-\d\d \d\d:\d\d:\d\d\.\d\d\d (?:\+|\-)\d\d\d\d) \[[^|]*\|[^|]*\|[^\]]*\] :: .*') AS ts,
REGEXP_EXTRACT(logrow, r'^(?:\d\d\d\d-\d\d-\d\d \d\d:\d\d:\d\d\.\d\d\d (?:\+|\-)\d\d\d\d) \[([^|]*)\|(?:[^|]*)\|(?:[^\]]*)\] :: (?:.*)') AS category,
REGEXP_EXTRACT(logrow, r'^(?:\d\d\d\d-\d\d-\d\d \d\d:\d\d:\d\d\.\d\d\d (?:\+|\-)\d\d\d\d) \[(?:[^|]*)\|([^|]*)\|(?:[^\]]*)\] :: (?:.*)') AS user,
REGEXP_EXTRACT(logrow, r'^(?:\d\d\d\d-\d\d-\d\d \d\d:\d\d:\d\d\.\d\d\d (?:\+|\-)\d\d\d\d) \[(?:[^|]*)\|(?:[^|]*)\|([^\]]*)\] :: (?:.*)') AS db,
REGEXP_EXTRACT(logrow, r'^(?:\d\d\d\d-\d\d-\d\d \d\d:\d\d:\d\d\.\d\d\d (?:\+|\-)\d\d\d\d) \[(?:[^|]*)\|(?:[^|]*)\|(?:[^\]]*)\] :: (.*)') AS message,
REGEXP_MATCH(logrow, r'^((?:\d\d\d\d-\d\d-\d\d \d\d:\d\d:\d\d\.\d\d\d (?:\+|\-)\d\d\d\d) \[(?:[^|]*)\|(?:[^|]*)\|(?:[^\]]*)\] :: (?:.*))') AS found
FROM (
  SELECT CONCAT(logrow, IFNULL(CONCAT(' ', l2), ''), IFNULL(CONCAT(' ', l3), '')) AS logrow
  FROM YourTable
)
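Note that the pure-SQL version above does not extract the seconds value from the UDF output. If it is needed, it could be pulled out the same way - a sketch, assuming the optional "(N.NNNs)" token always sits directly after the "::" when present:
  FLOAT(REGEXP_EXTRACT(logrow, r':: \(([\d\.]+)s\)')) AS seconds,
FLOAT() here is the legacy SQL cast; rows without the token simply yield NULL.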