Search code examples
smalltalkgemstone

How is ETL done for GemStone Smalltalk?


I would like to (re)-start again with GemStone/S. I have done multiple ETL transformations for relational databases but I'm still fuzzy on how this is done at GemStone/S.

I would like to load data into GemStone from different sources. It could be files (csv, excel, xml, plain text, etc.) or other DBs like SQL Server, Postgres, Oracle, etc.

From what I saw at the pages there is GemConnect which connects to Oracle databases. How do you do it from other databases or files? Is there any option to connect via ODBC? Is there any data pump to do so or you "just" have to one yourself?

In the end I'm asking is how do you create a staging area, where you would clean-up, transform and then load the data into GemStone DB. Are there any examples or documentation how it is done?

Note: Only similar answer I have found is on SO - from Stephan Eggermont, but that was short and without any "real" information.


Solution

  • Staging

    I suspect that the reason that most environments have "ETL/staging" as a separate step is because the two endpoints are somewhat rigid and don't have a good programming language for data manipulation. That is, if you have TXT, CSV, XML, JSON, or SQL, and need it in another format/schema, then someone has to do the "transformation." But if you are working in GemStone, then you can do the transformation in Smalltalk--there is no need for a separate step.

    Files

    If you have files (TXT, CSV, XML, JSON, etc.), then use GsFile. In fact, if the other endpoint can deal with files, then just export from one source in an agreed format and then import in another (with GemStone doing the "heavy lifting" of transforming). Files are simpler, they avoid the communications layer, and they makes debugging trivial (if the source hasn't created the file, then it is the source's problem; if it is in the pending directory then haven't processed it yet (destination problem); if it is in the completed directory then the destination has processed it).

    With this approach you start (one or more) background jobs in GemStone to watch a directory, open a file for read, process the file, and then move it to another directory. Other than basic string manipulation, you only need to work with GsFile. Then you create and update your objects in the database.

    ODBC

    While it would be possible to make FFI calls from GemStone to an ODBC library (or to a database's native library as is done with GemConnect), this would probably be unnecessarily complex. Instead, I'd create another layer using tools that have better interaction with the foreign system. This layer could write text files (as described above), or, with the proper interface, could communicate with GemStone directly. My inclination would be to use Dolphin to extract the data (good ODBC support), then communicate directly to GemStone from Dolphin. You could do something similar with other client Smalltalk dialects (Pharo, VA, or VW), or even from another language (I have a student working on a Python interface to GemStone).

    O/R Mapping

    Here again you are left with needing a way to take data in one format and translate it to another. These tend to be highly domain-specific and we find it easier to just write Smalltalk code. Alternatively, you could use something like GLORP in Pharo, VA, VW, etc.

    Best Practices

    I think you haven't found any "best practices" for ETL in GemStone because it isn't something we think of as an external process or separate step. There is just how to communicate with a file (GsFile), a socket (GsSocket), a library (CLibrary), or a client (GCI). From here we can look at internal processing issues like multiple producers and one consumer (RcQueue), or one producer and multiple consumers (locking).

    So, it isn't that GemStone applications don't do ETL, they just do it internally and the situations are much more situation-specific.