I originally built a key-value database with Xodus Environments, which produced a small, 2GB database:
public static void main(String[] args) throws Exception {
    if (args.length != 2) {
        throw new Exception("Argument missing. Current number of arguments: " + args.length);
    }
    long offset = Long.parseLong(args[0]);
    long chunksize = Long.parseLong(args[1]);
    Path pathBabelNet = Paths.get("/mypath/BabelNet-API-3.7/config");
    BabelNetLexicalizationDataSource dataSource = new BabelNetLexicalizationDataSource(pathBabelNet);
    // Load one chunk of synset ID -> lexicalizations from BabelNet
    Map<String, List<String>> data = dataSource.getDataChunk(offset, chunksize);
    jetbrains.exodus.env.Environment env = Environments.newInstance(".myAppData");
    final Transaction txn = env.beginTransaction();
    Store store = env.openStore("xodus-lexicalizations", StoreConfig.WITHOUT_DUPLICATES, txn);
    for (Map.Entry<String, List<String>> entry : data.entrySet()) {
        String key = entry.getKey();
        String value = entry.getValue().get(0); // keep only the first lexicalization
        store.put(txn, StringBinding.stringToEntry(key), StringBinding.stringToEntry(value));
    }
    txn.commit(); // persist the chunk
    env.close();
}
I used a Bash script to load the data in chunks:
# chunksize must be set before the loop (value omitted here)
for ((offset=0; offset<165622128; offset+=chunksize))
do
    echo $offset
    java -Xmx10g -jar /path/to/jar.jar $offset $chunksize
done
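To make the chunking explicit, here is a minimal Java sketch of the same loop. It only computes the offsets that the script passes to the jar. The class name `ChunkPlanner` and the chunk size of 10,000,000 are hypothetical, since the real chunk size is not stated above.

```java
import java.util.ArrayList;
import java.util.List;

public final class ChunkPlanner {
    private ChunkPlanner() {}

    /** Returns the start offset of every chunk needed to cover [0, total). */
    public static List<Long> chunkOffsets(long total, long chunkSize) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        List<Long> offsets = new ArrayList<>();
        for (long offset = 0; offset < total; offset += chunkSize) {
            offsets.add(offset);
        }
        return offsets;
    }

    public static void main(String[] args) {
        long total = 165_622_128L;     // upper bound taken from the shell loop
        long chunkSize = 10_000_000L;  // hypothetical value, not from the original script
        for (long offset : chunkOffsets(total, chunkSize)) {
            // One JVM invocation per chunk, as in the shell loop
            System.out.println("java -Xmx10g -jar /path/to/jar.jar " + offset + " " + chunkSize);
        }
    }
}
```

The last chunk may extend past the upper bound, so the data source is assumed to tolerate a chunk that runs off the end of the data.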
I have now changed it to use the relational Entity Store API:
public static void main(String[] args) throws Exception {
    if (args.length != 2) {
        throw new Exception("Argument missing. Current number of arguments: " + args.length);
    }
    long offset = Long.parseLong(args[0]);
    long chunksize = Long.parseLong(args[1]);
    Path pathBabelNet = Paths.get("/mypath/BabelNet-API-3.7/config");
    BabelNetLexicalizationDataSource dataSource = new BabelNetLexicalizationDataSource(pathBabelNet);
    // Load one chunk of synset ID -> lexicalizations from BabelNet
    Map<String, List<String>> data = dataSource.getDataChunk(offset, chunksize);
    PersistentEntityStore store = PersistentEntityStores.newInstance("lexicalizations-test");
    final StoreTransaction txn = store.beginTransaction();
    for (Map.Entry<String, List<String>> entry : data.entrySet()) {
        String key = entry.getKey();
        String value = entry.getValue().get(0); // keep only the first lexicalization
        Entity synsetID = txn.newEntity("SynsetID");
        synsetID.setProperty("synsetID", key);
        Entity lexicalization = txn.newEntity("Lexicalization");
        lexicalization.setProperty("lexicalization", value);
        // Link both ways so either entity can reach the other
        lexicalization.addLink("synsetID", synsetID);
        synsetID.addLink("lexicalization", lexicalization);
    }
    txn.commit(); // persist the chunk
    store.close();
}
This created a database of over 17GB, and it only stopped growing because I ran out of memory. I understand that it will be larger because it has to store the links, among other things, but almost ten times bigger? What am I doing wrong?
For some reason, removing the txn.flush() call from my code fixes everything. The database is now just 5.5GB.
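A possible explanation, stated here as an assumption rather than a confirmed diagnosis: Xodus writes all changes to an append-only log, so every flush appends fresh copies of the modified tree pages, and flushing very often can leave a lot of garbage on disk until the background GC reclaims it. If intermediate flushes are still needed to bound memory use, one option is to flush once per batch of entries. The sketch below shows only that batching logic in plain Java. The class `BatchedFlush` and its method are hypothetical helpers, not part of Xodus.

```java
import java.util.function.Consumer;

public final class BatchedFlush {
    private BatchedFlush() {}

    /**
     * Applies work to each item and calls flush once every batchSize items,
     * plus once at the end if a partial batch is left over.
     * Returns the number of flush calls made.
     */
    public static <T> int forEachWithBatchedFlush(Iterable<T> items,
                                                  int batchSize,
                                                  Consumer<? super T> work,
                                                  Runnable flush) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        int pending = 0;
        int flushes = 0;
        for (T item : items) {
            work.accept(item);
            pending++;
            if (pending == batchSize) {
                flush.run();
                flushes++;
                pending = 0;
            }
        }
        if (pending > 0) { // flush the leftover partial batch
            flush.run();
            flushes++;
        }
        return flushes;
    }
}
```

With the Entity Store code above, `items` could be `data.entrySet()`, `work` the lambda that creates and links the two entities, and `flush` the method reference `txn::flush`. The final `txn.commit()` would still be needed to end the transaction. A suitable batch size is something to measure rather than assume.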