Difference between revisions of "Archive"
Jump to navigation
Jump to search
(→SQL) |
(→Which database?: Pros and cons) |
||
Line 21: | Line 21: | ||
=== MongoDB === | === MongoDB === | ||
MongoDB was suggested by Fernando. Its role is precisely to store key-value pairs associated to an id. | MongoDB was suggested by Fernando. Its role is precisely to store key-value pairs associated to an id. | ||
Pros: | |||
* Built for schema-less storage | |||
* Supports range queries and such | |||
Cons: | Cons: | ||
* MongoDB required | * MongoDB required: big installation, needs server to be running | ||
* Reliability? | * Reliability? | ||
* An index would need to be created for each attribute | * An index would need to be created for each attribute (performance unknown) | ||
find({'key1': 'value1', 'key2': 'value2'}) | find({'key1': 'value1', 'key2': 'value2'}) | ||
Line 31: | Line 35: | ||
=== SQL === | === SQL === | ||
This would required a filename/key/value table (or even, one per value type). JOINing could be painful. | This would required a filename/key/value table (or even, one per value type). JOINing could be painful. | ||
Pros: | |||
* SQLite is bundled with Python | |||
* SQL servers very common | |||
Cons: | Cons: | ||
* Join on key/value table | * Join on key/value table | ||
* Values with different types will need different tables? | * Values with different types will need different tables? (e.g. for range queries to work) | ||
SELECT a.filehash, a.value AS value1, b.value AS value2 | SELECT a.filehash, a.value AS value1, b.value AS value2 | ||
Line 48: | Line 56: | ||
hstore is a key-value store as a single value (i.e. in a column). | hstore is a key-value store as a single value (i.e. in a column). | ||
Pros: | |||
* Common SQL server | |||
Cons: | |||
* PostgreSQL requried: big installation, needs server to be running | |||
* Non-equality queries limited? | |||
SELECT * FROM files WHERE metadata @> '"key1"=>"value1","key2"=>"value2"'::hstore; | SELECT * FROM files WHERE metadata @> '"key1"=>"value1","key2"=>"value2"'::hstore; |
Revision as of 16:06, 25 September 2013
Matthias Troyer came to Poly to discuss his use of VisTrails and the problems he was facing with the persistence package.
Fernando, Juliana, Matthias and Remi met on 2013-09-24.
Summary
Persistence only used as a cache
- Can’t delete stuff; he deletes and recreates the whole store
- He wants to use it to archive correct result, without the other intermediate files that resulted from bogus workflow OR module code
- He wants to be able to find his files afterwards. Git revision hash + file reference = impracticable
- He doesn’t mind filenames being unreadable if he has some way of finding these from metadata (workflow name, vistrail query, or custom metadata from module code)
Conclusions:
- Drop git. If we are only going to use it for storage, and keep a separate database to map from ref uuid/upstream hash to object hash, commits and branches are useless (and are a nuisance because we can’t rewrite history)
- Use a flat object store with hashes (like git’s)
- Use a database to associate hash filename (upstream modules hash?) to metadata: vistrails parameter, (execution info), custom metadata
- Request: make this separate from VisTrails (and used by the new archive package) so that it can be used directly by other code, and used to find files in the store
Which database?
MongoDB
MongoDB was suggested by Fernando. Its role is precisely to store key-value pairs associated to an id.
Pros:
- Built for schema-less storage
- Supports range queries and such
Cons:
- MongoDB required: big installation, needs server to be running
- Reliability?
- An index would need to be created for each attribute (performance unknown)
find({'key1': 'value1', 'key2': 'value2'})
SQL
This would required a filename/key/value table (or even, one per value type). JOINing could be painful.
Pros:
- SQLite is bundled with Python
- SQL servers very common
Cons:
- Join on key/value table
- Values with different types will need different tables? (e.g. for range queries to work)
SELECT a.filehash, a.value AS value1, b.value AS value2 FROM files a INNER JOIN files b ON a.filehash = b.filehash AND b.key = 'key2' AND b.value = 'value2' WHERE a.key = 'key1' AND a.value = 'value1' GROUP BY filehash
(see jsfiddle)
PostgreSQL's hstore type
http://www.postgresql.org/docs/9.3/static/hstore.html
hstore is a key-value store as a single value (i.e. in a column).
Pros:
- Common SQL server
Cons:
- PostgreSQL requried: big installation, needs server to be running
- Non-equality queries limited?
SELECT * FROM files WHERE metadata @> '"key1"=>"value1","key2"=>"value2"'::hstore;