Difference between revisions of "Archive"

Latest revision as of 19:44, 30 September 2013

Matthias Troyer came to Poly to discuss his use of VisTrails and the problems he was facing with the persistence package.

Fernando, Juliana, Matthias and Remi met on 2013-09-24.

Summary

Persistence only used as a cache

Can’t delete stuff; he deletes and recreates the whole store
He wants to use it to archive correct results, without the intermediate files that resulted from bogus workflow OR module code
He wants to be able to find his files afterwards. Git revision hash + file reference = impractical
He doesn’t mind filenames being unreadable if he has some way of finding these from metadata (workflow name, vistrail query, or custom metadata from module code)

Conclusions:

Drop git. If we are only going to use it for storage, and keep a separate database to map from ref uuid/upstream hash to object hash, commits and branches are useless (and are a nuisance because we can’t rewrite history)
Use a flat object store with hashes (like git’s)
Use a database to associate each hash filename (upstream modules hash?) with metadata: VisTrails parameters, (execution info), custom metadata
Request: make this separate from VisTrails (and used by the new archive package) so that it can be used directly by other code, and used to find files in the store

Interface

Store a file: gets a file and a dictionary of metadata, hashes it, puts it in the store with the hash name, and creates a database entry.
Remove a file: gets a hash, removes that file and its associated metadata
Query: searches for hashes matching conditions on metadata from the database
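The three operations above can be sketched as follows. This is only an illustration under stated assumptions, not the actual file_archive API: the class name FileStore, the "objects" directory, the "metadata.sqlite3" file, the use of SHA-1 over the raw file contents, and the choice of SQLite are all hypothetical. Queries here support equality conditions only.

```python
import hashlib
import os
import shutil
import sqlite3


class FileStore:
    """Hypothetical sketch of the store/remove/query interface."""

    def __init__(self, root):
        # Assumed layout: <root>/objects/<hash> plus <root>/metadata.sqlite3
        self.objects = os.path.join(root, "objects")
        os.makedirs(self.objects, exist_ok=True)
        self.db = sqlite3.connect(os.path.join(root, "metadata.sqlite3"))
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS metadata ("
            "filehash TEXT, key TEXT, value TEXT, "
            "PRIMARY KEY (filehash, key))"
        )

    def store(self, path, metadata):
        """Hash the file, copy it in under its hash, record its metadata."""
        with open(path, "rb") as f:
            filehash = hashlib.sha1(f.read()).hexdigest()
        shutil.copyfile(path, os.path.join(self.objects, filehash))
        with self.db:  # commits on success
            self.db.executemany(
                "INSERT OR REPLACE INTO metadata VALUES (?, ?, ?)",
                [(filehash, k, str(v)) for k, v in metadata.items()],
            )
        return filehash

    def remove(self, filehash):
        """Delete the stored file and every metadata row attached to it."""
        os.remove(os.path.join(self.objects, filehash))
        with self.db:
            self.db.execute(
                "DELETE FROM metadata WHERE filehash = ?", (filehash,)
            )

    def query(self, conditions):
        """Return the hashes whose metadata equals every given key/value."""
        if not conditions:
            raise ValueError("at least one condition is required")
        clause = " OR ".join("(key = ? AND value = ?)" for _ in conditions)
        params = []
        for k, v in conditions.items():
            params += [k, str(v)]
        # Each (filehash, key) pair is unique, so a file matches all
        # conditions exactly when it has one matching row per condition.
        rows = self.db.execute(
            "SELECT filehash FROM metadata WHERE " + clause +
            " GROUP BY filehash HAVING COUNT(*) = ? ORDER BY filehash",
            params + [len(conditions)],
        )
        return [row[0] for row in rows]
```

All values are stored as text in this sketch, so range conditions on numbers would need a typed schema.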

On top of this file-storing package, the VisTrails package would be built. It would provide modules like the persistence package, to add a file or directory in storage or retrieve one, a graphical querying UI, and a way to get the file from a module in a specific workflow version.

The package would also have a command-line or graphical interface to add, remove or query files from the store.

The new project is at github.com/remram44/file_archive

File store

A plain file store like Git's seems fine. There would be no compression like Git's packs, but packs are designed for many small, similar files and are not well suited to large, unrelated ones. Sharing would still be possible with rsync (plus merging the databases) or any other tool, since it only involves copying files.

It was decided to store directories as-is without deduplication (see issue 1: https://github.com/remram44/file_archive/issues/1).
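As an illustration of a hash-named flat store for single files, the sketch below hashes a file in chunks (so that huge files need not fit in memory) and copies it to a path named after its hash. The function names, the "objects" directory and the use of SHA-1 over the raw contents are assumptions made for this example; Git itself hashes a header plus the contents and spreads loose objects over subdirectories, which this sketch does not do. Directories are not handled here.

```python
import hashlib
import os
import shutil


def hash_file(path, chunk_size=1 << 20):
    """Return the SHA-1 hex digest of a file, reading it chunk by chunk."""
    digest = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def add_object(store_root, path):
    """Copy a file into <store_root>/objects/<hash> (hypothetical layout)."""
    filehash = hash_file(path)
    objects = os.path.join(store_root, "objects")
    os.makedirs(objects, exist_ok=True)
    dest = os.path.join(objects, filehash)
    # Same content means same name, so an existing object is left alone.
    if not os.path.exists(dest):
        shutil.copyfile(path, dest)
    return filehash
```

Because the name is derived from the contents, adding the same file twice yields a single object, and copying the "objects" directory elsewhere cannot overwrite a file with different contents.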

Which database?

A database is used to store the metadata and to search for files matching given conditions on their metadata.

MongoDB

MongoDB was suggested by Fernando during the meeting. Its role is precisely to store key-value pairs associated with an id.

Pros:

Built for schema-less storage
Supports range queries and such

Cons:

MongoDB required: big installation, needs server to be running
Reliability?
An index would need to be created for each attribute (performance unknown)

find({'key1': 'value1', 'key2': 'value2'})

SQL

This would require a filename/key/value table (or even one table per value type). JOINing could be painful.

Pros:

SQLite is bundled with Python
SQL servers very common
SQLite allows saving the database in a file, alongside the store

Cons:

Join on key/value table and general performance

SELECT a.filehash, a.value AS value1, b.value AS value2
FROM files a
INNER JOIN files b ON a.filehash = b.filehash AND b.key = 'key2' AND b.value = 'value2'
WHERE a.key = 'key1' AND a.value = 'value1'
GROUP BY a.filehash

(see sqlfiddle: http://sqlfiddle.com/#!2/1e627/5/0)
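To get a feel for the key/value layout with the SQLite module bundled with Python, here is a minimal in-memory sketch. The table and column names follow the query above; the sample rows and the find helper are made up for illustration. It builds one self-join per extra condition, which is the cost listed under "Cons".

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE files (filehash TEXT, key TEXT, value TEXT)")
conn.executemany(
    "INSERT INTO files VALUES (?, ?, ?)",
    [
        ("aaa", "key1", "value1"),
        ("aaa", "key2", "value2"),
        ("bbb", "key1", "value1"),
        ("bbb", "key2", "other"),
        ("ccc", "key2", "value2"),
    ],
)


def find(conn, conditions):
    """Return file hashes matching every key/value pair (equality only)."""
    items = list(conditions.items())
    if not items:
        raise ValueError("at least one condition is required")
    sql = ["SELECT DISTINCT t0.filehash FROM files t0"]
    params = []
    # One INNER JOIN on the same table for each condition after the first.
    for i, (key, value) in enumerate(items[1:], start=1):
        sql.append(
            f"INNER JOIN files t{i} ON t{i}.filehash = t0.filehash "
            f"AND t{i}.key = ? AND t{i}.value = ?"
        )
        params += [key, value]
    # The first condition goes in the WHERE clause, as in the query above.
    sql.append("WHERE t0.key = ? AND t0.value = ?")
    params += list(items[0])
    sql.append("ORDER BY t0.filehash")
    return [row[0] for row in conn.execute(" ".join(sql), params)]


print(find(conn, {"key1": "value1", "key2": "value2"}))
```

With every value stored as TEXT, comparisons such as < or > would be lexicographic, which is one reason a table per value type might be needed.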

PostgreSQL's hstore type

http://www.postgresql.org/docs/9.3/static/hstore.html

hstore is a key-value store as a single value (i.e. in a column).

Pros:

Common SQL server

Cons:

PostgreSQL required: big installation, needs server to be running
Non-equality queries limited?

SELECT * FROM files WHERE metadata @> '"key1"=>"value1","key2"=>"value2"'::hstore;

(see sqlfiddle: http://sqlfiddle.com/#!1/15431/1/0)

Syncing repositories

Once files have been generated and stored, the user will probably want to copy them into another archive.

Moving the files is easy: it only requires adding the missing files to the destination. Tools like rsync could be used here (in a very dumb mode of operation).
Merging the databases is not complicated either: we just need to add the missing records to the other database (no update necessary).
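Assuming the metadata ends up in SQLite, both steps could look roughly like the sketch below. The layout ("objects" directory, "metadata.sqlite3" file, metadata table keyed on filehash and key) and the function name sync are hypothetical. INSERT OR IGNORE adds the missing records and leaves existing ones untouched, matching the "no update necessary" remark.

```python
import os
import shutil
import sqlite3

# Hypothetical schema: one row per (file hash, metadata key).
SCHEMA = (
    "CREATE TABLE IF NOT EXISTS metadata ("
    "filehash TEXT, key TEXT, value TEXT, "
    "PRIMARY KEY (filehash, key))"
)


def sync(src_root, dst_root):
    """Add objects and metadata records that are missing from dst_root."""
    src_objects = os.path.join(src_root, "objects")
    dst_objects = os.path.join(dst_root, "objects")
    os.makedirs(dst_objects, exist_ok=True)
    for name in os.listdir(src_objects):
        source = os.path.join(src_objects, name)
        target = os.path.join(dst_objects, name)
        if os.path.exists(target):
            continue  # same hash, same content: nothing to do
        if os.path.isdir(source):
            shutil.copytree(source, target)  # directories are stored as-is
        else:
            shutil.copyfile(source, target)

    db = sqlite3.connect(os.path.join(dst_root, "metadata.sqlite3"))
    try:
        db.execute(SCHEMA)
        db.commit()
        # ATTACH must run outside a transaction.
        db.execute(
            "ATTACH DATABASE ? AS src",
            (os.path.join(src_root, "metadata.sqlite3"),),
        )
        with db:  # commits the insert
            db.execute(
                "INSERT OR IGNORE INTO main.metadata "
                "SELECT filehash, key, value FROM src.metadata"
            )
        db.execute("DETACH DATABASE src")
    finally:
        db.close()
```

The file-copying loop does the same job as a plain rsync that skips existing files; it is included only to keep the sketch self-contained.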

Open design issues

See the GitHub issues marked 'question': https://github.com/remram44/file_archive/issues?labels=question&state=open

