Search code examples
postgresqlfunctionfor-loopcreate-table

PostgreSQL: Generate columns for each row in another table with value based on function result


I need some help.
I have table № 1:

id val1 val2
0 15 17
1 5 6
.... ... ...

I have table № 2:

id_another val1 val2 val3
0 15 17 100
1 5 6 200

I want to get:

id val1 val2 belongs_to_id_another_0 belongs_to_id_another_1 is_outlier
0 15 17 1 NULL NULL
1 5 6 NULL NULL 1
.... ... ... ... ...

Explanation: For each entity in Table 1 I should apply function-comparison with every entity in Table 2 (Size order for Table 1 and Table 2 is ~1000:1, so the time complexity would not be problem). The function is:

CREATE OR REPLACE FUNCTION is_loc_inside_cls (
    loc_longitude FLOAT,
    loc_latitude FLOAT,
    cls_longitude FLOAT,
    cls_latitude FLOAT,
    cls_radius FLOAT
)
RETURNS INT AS $$
BEGIN
  IF (2 * 6378137.0 * ASIN(SQRT(SIN((loc_latitude * PI() / 180.0 - cls_latitude * PI() / 180.0) / 2) * SIN((loc_latitude * PI() / 180.0 - cls_latitude * PI() / 180.0) / 2) + (COS(loc_latitude * PI() / 180.0 ) * COS(cls_latitude * PI() / 180.0)) * SIN((loc_longitude * PI() / 180.0 - cls_longitude * PI() / 180.0) / 2) * SIN((loc_longitude * PI() / 180.0 - cls_longitude * PI() / 180.0) / 2)))) * (2 * 6378137.0 * ASIN(SQRT(SIN((loc_latitude * PI() / 180.0 - cls_latitude * PI() / 180.0) / 2) * SIN((loc_latitude * PI() / 180.0 - cls_latitude * PI() / 180.0) / 2) + (COS(loc_latitude * PI() / 180.0 ) * COS(cls_latitude * PI() / 180.0)) * SIN((loc_longitude * PI() / 180.0 - cls_longitude * PI() / 180.0) / 2) * SIN((loc_longitude * PI() / 180.0 - cls_longitude * PI() / 180.0) / 2)))) <= cls_radius * cls_radius THEN 
    RETURN 1;
  ELSE
    RETURN NULL;
  END IF;
END; $$
LANGUAGE PLPGSQL;

I then create as many columns as entities in Table 2 and I place there the return. is_outlier value == 1 if in these new columns the NULL value is everywhere, == NULL otherwise. Question is how I can apply this logic? Couldn't find any valid examples.

I can't think of the part where I should implement FOR LOOP logic, I managed to implement only function


Solution

  • Before addressing the question, the following are a few points regarding general code improvement:

    1. Use meaningful column names. Latitude, longitude, and radius convey more than val1, val2, val3.
    2. Use the most natural data type for values. Since is_loc_inside_cls evaluates a logical condition, it should return a boolean.
    3. Avoid using NULL as a value. NULL should generally indicate lack of value or an unknown value. In the case of is_loc_inside_cls, the result is known so long as none of the arguments are NULL.
    4. Consider refactoring is_loc_inside_cls to move the numeric calculation to a function that is called by is_loc_inside_cls. Doing so allows the calculation to be used by other code.

    The following refactors is_loc_inside_cls to return a boolean. The IF statement has been removed from the original version of the function and the result is directly returned. This version of the function returns NULL to indicate that the condition is unknown if any of the arguments are NULL; otherwise, it returns either TRUE or FALSE.

    CREATE OR REPLACE FUNCTION is_loc_inside_cls (loc_longitude float, loc_latitude float, cls_longitude float, cls_latitude float, cls_radius float)
      RETURNS boolean
      AS $$
    BEGIN
      RETURN (2 * 6378137.0 * ASIN(SQRT(SIN((loc_latitude * PI() / 180.0 - cls_latitude * PI() / 180.0) / 2) * SIN((loc_latitude * PI()
        / 180.0 - cls_latitude * PI() / 180.0) / 2) + (COS(loc_latitude * PI() / 180.0) * COS(cls_latitude * PI() / 180.0)) *
        SIN((loc_longitude * PI() / 180.0 - cls_longitude * PI() / 180.0) / 2) * SIN((loc_longitude * PI() / 180.0 - cls_longitude *
        PI() / 180.0) / 2)))) * (2 * 6378137.0 * ASIN(SQRT(SIN((loc_latitude * PI() / 180.0 - cls_latitude * PI() / 180.0) / 2) *
        SIN((loc_latitude * PI() / 180.0 - cls_latitude * PI() / 180.0) / 2) + (COS(loc_latitude * PI() / 180.0) * COS(cls_latitude *
        PI() / 180.0)) * SIN((loc_longitude * PI() / 180.0 - cls_longitude * PI() / 180.0) / 2) * SIN((loc_longitude * PI() / 180.0 -
        cls_longitude * PI() / 180.0) / 2)))) <= cls_radius * cls_radius;
    END;
    $$
    LANGUAGE PLPGSQL;
    
    

    For the following query, I've changed the column names from val1, val2, and val3 to latitude, longitude, and radius, respectively. This query uses the refactored version of is_loc_inside_cls, so belongs_to_id_another_0 and belongs_to_id_another_1 might each be TRUE, FALSE, or NULL. If either belongs_to_id_another_0 or belongs_to_id_another_1 is TRUE, then is_outlier will be FALSE regardless of the other's value. If both are FALSE, then is_outlier will be TRUE. If neither are TRUE and either or both are NULL, then is_outlier will also be NULL, indicating that the state could not be determined.

    SELECT
      table_1.id,
      table_1.latitude,
      table_1.longitude,
      bool_and(
        CASE table_2.id
        WHEN 0 THEN
          t.is_inside_cls
        END) AS belongs_to_id_another_0,
      bool_and(
        CASE table_2.id
        WHEN 1 THEN
          t.is_inside_cls
        END) AS belongs_to_id_another_1,
      bool_and(NOT t.is_inside_cls)
        OR nullif (bool_or(t.is_inside_cls IS NULL), TRUE) AS is_outlier
      FROM
        table_1
      CROSS JOIN table_2
      JOIN LATERAL (
        VALUES (is_loc_inside_cls (table_1.latitude, table_1.longitude, table_2.latitude, table_2.longitude, table_2.radius))) t (is_inside_cls) ON TRUE
    GROUP BY
      table_1.id,
      table_1.latitude,
      table_1.longitude
    ORDER BY
      table_1.id;
    

    This query can easily be extended to support additional rows by copying the pattern for the belongs_to_id_another_* columns. The logic for is_outlier is not dependent on the number of rows in table_2.

    If it is available in the target environment, consider using the crosstab function that is provided by the tablefunc PostgreSQL extension. If the tablefunc extension is not available and there is a need to support an arbitrary number of rows from table_2, then create a function to dynamically generate the SQL and then execute the resulting query.