Persistence Package
Persistence Package Issues
Currently, VisTrails uses one git repository per user. To support a repository that is shared among multiple users, we first need to take the metadata from the local sqlite database and also store it in git (or in a central repository).
We considered using git notes to store the metadata. Our hope was that git would transfer the notes together with regular pushes and pulls, but unfortunately that is not how it works: notes have to be pulled, pushed, and merged separately, which seems to add unnecessary complexity.
Instead of using git notes, David proposed storing the metadata as an ordinary file in git. For instance, if "input.csv" is one of the files tracked by the persistence package, we would also have a companion file such as "input.csv_md" that stores its metadata (name, tags, user, date created, id, ...). But for the user to manage this metadata (e.g., edit tags or search for a specific file in the repository), we still need the sqlite database. Thus, we have two possible solutions so far:
- Use the solution proposed by David: when VisTrails opens (with the persistence package enabled), we update the local repository (pull) and load all the metadata into a local sqlite database. After making modifications, the user can ask to push to the shared repository (and we can push automatically when VisTrails closes or the package is disabled). The user can also ask to update the local repository again at any time.
- Positive aspects:
- The solution allows users to work offline
- Negative aspects:
- Merging can be an issue when pushing: we might have conflicts not only in the data files but also in the metadata, which would probably not be easy to resolve
- If the user has 1,000 files in the repository, then they would also have 1,000 metadata files
- Along with the git repository, users would also have a centralized sqlite database; all users would access the same database, so all the metadata could be stored in and retrieved from it directly.
- Positive aspects:
- No need for additional files to store metadata
- No problems with merging metadata (the database would guarantee consistency)
- Negative aspects:
- Users could not work offline
The idea is to first implement one of the solutions for a single user, and then expand it to allow collaboration.
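To make the first solution more concrete, here is a minimal single-user sketch of the step that runs after a pull: it scans the working copy for "_md" companion files and rebuilds a local sqlite database from them. The JSON file format, the table layout, and the function names (load_metadata, find_by_tag) are assumptions made for illustration; nothing here reflects the actual persistence package code.

```python
import json
import sqlite3
from pathlib import Path

METADATA_SUFFIX = "_md"  # assumption: "input.csv" is paired with "input.csv_md"


def load_metadata(repo_dir, conn):
    """Rebuild the local metadata tables from the companion files in repo_dir.

    Assumes each companion file is a JSON object with at least an "id" key.
    Returns the number of metadata files loaded.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS file_metadata ("
        "id TEXT PRIMARY KEY, path TEXT, name TEXT, user TEXT, date_created TEXT)"
    )
    conn.execute("CREATE TABLE IF NOT EXISTS file_tags (file_id TEXT, tag TEXT)")
    # The files in git are the source of truth, so start from a clean slate.
    conn.execute("DELETE FROM file_metadata")
    conn.execute("DELETE FROM file_tags")

    count = 0
    root = Path(repo_dir)
    for sidecar in sorted(root.rglob("*" + METADATA_SUFFIX)):
        relative = sidecar.relative_to(root)
        # Skip git internals, directories, and a file named exactly "_md".
        if ".git" in relative.parts or not sidecar.is_file():
            continue
        if sidecar.name == METADATA_SUFFIX:
            continue
        record = json.loads(sidecar.read_text(encoding="utf-8"))
        data_name = sidecar.name[: -len(METADATA_SUFFIX)]
        data_path = relative.with_name(data_name).as_posix()
        conn.execute(
            "INSERT INTO file_metadata VALUES (?, ?, ?, ?, ?)",
            (
                record["id"],
                data_path,
                record.get("name"),
                record.get("user"),
                record.get("date_created"),
            ),
        )
        for tag in record.get("tags", []):
            conn.execute("INSERT INTO file_tags VALUES (?, ?)", (record["id"], tag))
        count += 1
    conn.commit()
    return count


def find_by_tag(conn, tag):
    """Return the paths of all tracked files carrying the given tag."""
    rows = conn.execute(
        "SELECT DISTINCT m.path FROM file_metadata m "
        "JOIN file_tags t ON t.file_id = m.id WHERE t.tag = ? ORDER BY m.path",
        (tag,),
    )
    return [row[0] for row in rows]
```

In this sketch the sqlite database is only a disposable index: edits made through the interface would have to be written back to the "_md" files before a push, and any merge conflicts would still have to be resolved on those files.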
- Tommy suggested using one branch for each user. This would mean every user could access all versions from all users. But this would also complicate the interface and require too much storage space.
- Matthias's requirement is for a user to be able to push data used in tagged workflows to a central server.
- Tommy's comments:
- This solution would simplify the interface and solve the storage space issue.
- I think we need at least a local and a global branch. The user works on the local branch and can choose to push all data used in tagged workflows to the global branch. This would mean cherry-picking the commits that correspond to those versions. The user can also choose to pull the global branch, which would mean merging the global branch into the local one. After that, it would be possible to execute a vistrail referencing those data items.
- I am not exactly sure whether it works to have a commit message correspond to a specific data item.
- A conflict should not arise since each commit id is unique and we don't care about the order of commit messages. The only thing that may change is the metadata (name and tags). But in those cases we can ask the user which version to keep. But would this work if each version has its own metadata?
- The latest version could also be pushed if a workflow references it. But the latest version should probably only be used during the exploration stage.
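The local/global branch idea can be sketched without touching git at all, just to check the logic. The sketch below assumes each tagged workflow carries a list of the commit ids of the data it references, and that metadata is a mapping from commit id to a record with a name and tags; these structures and the function names (commits_to_push, merge_metadata) are hypothetical. The first function picks the commits that would need to be cherry-picked onto the global branch, and the second merges metadata, calling back to the user only when the same commit id has different metadata on the two sides.

```python
def commits_to_push(tagged_workflows, global_commits):
    """Return commit ids used by tagged workflows that the global branch lacks.

    tagged_workflows: list of dicts with a "data_commits" list (assumed shape).
    global_commits: set of commit ids already on the global branch.
    Order of first appearance is preserved and duplicates are dropped.
    """
    wanted = []
    for workflow in tagged_workflows:
        for commit_id in workflow["data_commits"]:
            if commit_id not in global_commits and commit_id not in wanted:
                wanted.append(commit_id)
    return wanted


def merge_metadata(local, remote, choose):
    """Merge two {commit_id: record} maps into a new map.

    Records that exist on only one side, or are identical on both, are kept
    as-is. When the same commit id has different metadata (name or tags),
    choose(commit_id, local_record, remote_record) decides which one to keep,
    standing in for asking the user.
    """
    merged = dict(remote)
    for commit_id, record in local.items():
        other = remote.get(commit_id)
        if other is None or other == record:
            merged[commit_id] = record
        else:
            merged[commit_id] = choose(commit_id, record, other)
    return merged
```

Because records are keyed by commit id, two users adding different data items never conflict in this model; only edits to the name or tags of the same item do. It does not answer the open question of whether each version should carry its own metadata.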