Use random IDs instead of autoinc #772

Open
opened 1 year ago by aksdb · 9 comments
aksdb commented 1 year ago

Autoincrementing IDs give away usage information (how actively a system is used; how many tasks were created between two points in time, for example) and make it easier to exploit possible security problems (since you can easily guess IDs).

It would be better to use randomized IDs such as UUIDs, ObjectIDs, WUIDs, or NanoIDs.

They can easily be sharded in a database, they are unguessable and practically collision free.
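
As a rough illustration of what "randomized IDs" could mean in practice, here is a minimal Go sketch that draws 128 bits from the operating system's CSPRNG and hex-encodes them. The function name `NewRandomID` is hypothetical and not part of Vikunja; a real migration would more likely use an established UUID or NanoID library.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
)

// NewRandomID returns a 128-bit identifier read from the OS CSPRNG,
// hex-encoded to 32 characters. The name is hypothetical (an assumption
// for this sketch), not an existing Vikunja function.
func NewRandomID() (string, error) {
	buf := make([]byte, 16)
	if _, err := rand.Read(buf); err != nil {
		return "", err
	}
	return hex.EncodeToString(buf), nil
}

func main() {
	id, err := NewRandomID()
	if err != nil {
		panic(err)
	}
	// Unlike an autoincrement value, this reveals nothing about how many
	// rows exist and gives no hint about neighbouring IDs.
	fmt.Println(id)
}
```

With 128 random bits, guessing a valid ID or hitting a collision is practically impossible, which is the property the proposal relies on.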

Owner

I think I understand the concerns, but I'm not sure I understand the threat. Like, what advantage does a malicious actor have if they know an instance now has 43134 more tasks than a month ago?
Given that you need to be authenticated to do anything at all, I'm not sure what the advantage would be.

It would also be a massive breaking change, and changing it everywhere is not an easy task and has plenty of potential to go wrong.

What we could do, though, is add an API endpoint to retrieve tasks by their index, which only increments per list, much like Gitea/GitHub do. That would at least keep the ID out of the browser URL, but you could still get it from the API response itself.

Poster

For the theoretical background, see [The German Tank problem](https://en.wikipedia.org/wiki/German_tank_problem). A more IT-related take can be found in [this blog post](https://www.clever-cloud.com/blog/engineering/2015/05/20/why-auto-increment-is-a-terrible-idea/).

On the database side, [CockroachDB](https://www.cockroachlabs.com/docs/v20.2/sql-faqs#how-do-i-auto-generate-unique-row-ids-in-cockroachdb) has some information. Basically, random IDs enable more data storage options with easier horizontal scale-out. Restoring and merging data also gets easier.

Regarding the security aspect, there are two vectors:

  1. A future version could temporarily introduce a bug that allows unauthenticated API access to certain endpoints or ignores permission scopes (thereby giving authenticated users more access than they should have). With sequential numbers this can be exploited extremely easily. Of course you don't implement something like that willingly. But most security incidents are not due to bad intentions or neglect, but simply due to a mistake. They happen.

  2. A recent [famous attack was on a platform called Parler](https://cybernews.com/news/70tb-of-parler-users-messages-videos-and-posts-leaked-by-security-researchers/). After the attackers got access to an admin account which was supposed to have access to everything (that is fine), they could easily scrape all the data because all they had to do was increment IDs until they got everything. This attack would have been significantly harder to scale if the IDs had been random.

I am aware that it's not easy to migrate. But it also won't get easier the older the product gets. The earlier such a change is made, the less impact it has.

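
To make the German tank argument concrete, here is a small Go sketch of the standard frequentist estimator, N ≈ m + m/k − 1, where m is the largest ID observed and k is the number of IDs observed. It assumes IDs start at 1 and that the observed IDs are a uniform sample without replacement. The function name `EstimateTotal` is made up for this example.

```go
package main

import (
	"errors"
	"fmt"
)

// EstimateTotal applies the frequentist German tank estimator
// N ≈ m + m/k - 1, where m is the largest observed ID and k is the
// number of observed IDs. Hypothetical helper for illustration only.
func EstimateTotal(observed []int) (float64, error) {
	if len(observed) == 0 {
		return 0, errors.New("need at least one observed ID")
	}
	m := observed[0]
	for _, id := range observed[1:] {
		if id > m {
			m = id
		}
	}
	k := float64(len(observed))
	return float64(m) + float64(m)/k - 1, nil
}

func main() {
	// Four task IDs seen in URLs or API responses.
	est, err := EstimateTotal([]int{19, 40, 42, 60})
	if err != nil {
		panic(err)
	}
	fmt.Printf("%.0f\n", est) // prints 74
}
```

Running the same estimate at two points in time gives an approximate count of tasks created in between, which is exactly the usage information sequential IDs leak.
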
Poster

Oh, and another addition regarding

> Like, what advantage does a malicious actor have if they know an instance now has 43134 more tasks than a month ago?

Let's say Vikunja is used in a company that has a works council, and the council enforces that managers cannot track their employees. The numbers could give away whether an employee created many tasks or not enough. "Hey you only created two tasks the last month ... are you slacking off?!".

Meta-information can be dangerous and I would try to minimize them wherever possible.

Ticket numbers in ticket systems are a slightly different case, since you need to refer to them directly (via their number). I don't think Vikunja intends to do something like that for tasks, though, right?

Owner

Thanks for the detailed answer. I've only looked briefly into the articles but I'll probably come back with a few more comments once I've read them in full.

I've heard about Parler and I do remember thinking "well why are they using numeric ids at their scale?". I guess I never thought this could be a problem, since I'm not sure Vikunja will ever reach a scale like that. That being said, it's probably not impossible and, as you rightly pointed out, it would be a lot harder to change then than it is now.

> Meta-information can be dangerous and I would try to minimize them wherever possible.

That's a very valid reason, thanks for pointing that out.

> Ticket numbers in ticket systems are a slightly different case, since you need to refer to them directly (via their number). I don't think Vikunja intends to do something like that for tasks, though, right?

Tasks have an "index" value which is the number you see in the frontend when you open a task. This index value is individual per `(task, list)`-tuple, like GitHub/Gitea issues for example.
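
A per-list index of this kind can be sketched in Go as a counter kept separately for each list. The type `IndexAllocator` and its methods are hypothetical, and the sketch is in-memory only; how Vikunja actually assigns the index is not shown in this thread, and a real implementation would do this inside a database transaction.

```go
package main

import (
	"fmt"
	"sync"
)

// IndexAllocator hands out task indexes that count up per list
// (1, 2, 3, ...), independent of any global task ID.
// Hypothetical in-memory sketch, not Vikunja's implementation.
type IndexAllocator struct {
	mu   sync.Mutex
	last map[string]int // list ID -> last index handed out
}

// NewIndexAllocator returns an allocator with no lists registered yet.
func NewIndexAllocator() *IndexAllocator {
	return &IndexAllocator{last: make(map[string]int)}
}

// Next returns the next free index within the given list.
func (a *IndexAllocator) Next(listID string) int {
	a.mu.Lock()
	defer a.mu.Unlock()
	a.last[listID]++
	return a.last[listID]
}

func main() {
	alloc := NewIndexAllocator()
	fmt.Println(alloc.Next("list-a")) // first task in list-a
	fmt.Println(alloc.Next("list-a")) // second task in list-a
	fmt.Println(alloc.Next("list-b")) // first task in list-b
}
```

Because the counter restarts for every list, the index stays short and human-friendly while the list ID (which could itself be random) carries the uniqueness.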

In general I think it makes sense to make the switch; I'm just hesitant because it'll be a lot of work and there are other, more interesting things to do right now. But I've put it in the backlog, so it'll happen one day.

Poster

> Tasks have an "index" value which is the number you see in the frontend when you open a task. This index value is individual per `(task, list)`-tuple, like github/gitea issues for example.

Do you plan to use that somewhere besides that view? Maybe I just missed it, but so far it only seems to be a visual representation when opening the task.

I think when linking tasks, the search for a title (and/or content) is more important than a number (which the user may not even (want to?) remember). So unless you have something in mind (or already implemented and I just failed to notice so far :D), it might be easier to just remove that label and go with the titles only. The IDs would then only be shown in URLs.

Alternative idea (if it's only about the visual representation): show the number representing their order. Yes, that number changes when I remove, add or reorder tasks. But there is value in that number ... "task 2" tells you it's the next in line. "Task 50" tells you this is probably not relevant or very far in the future. If you decide to move "task 50" to the front, it's then better to see it as "task 1" (since it's now at the top) instead of having a weird order of "task 50, task 1, task 31, task 21, ...".

Owner

From my work experience, you need a reference to a task which does not change, for example to reference it in commits or in other places. It's just quicker to put a number somewhere than a title (which would be editable). For example, when referencing a task in an email or other means of conversation it's way easier to say "Task #123" instead of "You know that task about colouring the header - yeah no no not that one about the menu header, the other one".

This doesn't necessarily have to be a globally unique number (like an auto-incrementing ID); a per-list unique index is fine.

I admit you'd probably search for a task by its title or content instead of the number, but you'll still need a number imho.

Re: order: If you absolutely need the order, you can have that in kanban (at least for now). If you need a priority, I'd suggest using the priority field of tasks.

Poster

Hmm I see. In written communication one could probably get away with simply pasting a link, but spoken not so much.

In general I would be willing to take a shot at migrating the code towards another ID approach (no idea yet if UUID is the best fit or if nano-id is cleaner). However, it seems like a simple autoinc -> uuid swap will not work on a conceptual level. I'll think some more about whether I can come up with an idea for the task referencing. That basically all ticketing systems I know use a counter for their ticket numbers leads me to believe that this might simply be the best approach if referencing tickets/tasks is desired.


I agree that autoinc is not the best way to go about this. A possible solution, which I'm not sure has been proposed yet, is to do something like Jira does: each project gets a unique 2-4 character identifier, which is then combined with an auto-incrementing task number.

This achieves two things: a malicious actor first needs to know that specific identifier, and it still allows for easy relaying of information verbally.

I think forcing each list to have an identifier and incrementing on top of that would be a good solution to this issue. Thoughts? @aksdb
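
For illustration, a Jira-style reference of this shape could be formatted and parsed roughly as follows in Go. The type `TaskRef`, the function `ParseTaskRef`, and the `PREFIX-123` layout are assumptions made for this sketch, not an existing Vikunja API; the 2-4 character limit comes from the suggestion above.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// TaskRef is a human-friendly task reference such as "WEB-123":
// a short per-list prefix plus a per-list counter.
// Hypothetical type for illustration.
type TaskRef struct {
	Prefix string
	Index  int
}

// String renders the reference in PREFIX-INDEX form.
func (r TaskRef) String() string {
	return fmt.Sprintf("%s-%d", r.Prefix, r.Index)
}

// ParseTaskRef splits "PREFIX-INDEX" at the first hyphen and validates
// both parts: a 2-4 character prefix and a positive integer index.
func ParseTaskRef(s string) (TaskRef, error) {
	prefix, num, ok := strings.Cut(s, "-")
	if !ok || len(prefix) < 2 || len(prefix) > 4 {
		return TaskRef{}, fmt.Errorf("invalid task reference %q", s)
	}
	index, err := strconv.Atoi(num)
	if err != nil || index < 1 {
		return TaskRef{}, fmt.Errorf("invalid task index in %q", s)
	}
	return TaskRef{Prefix: strings.ToUpper(prefix), Index: index}, nil
}

func main() {
	ref, err := ParseTaskRef("web-123")
	if err != nil {
		panic(err)
	}
	fmt.Println(ref) // prints WEB-123
}
```

Note that a short prefix is easy to guess or enumerate, so this addresses verbal referencing more than it addresses the guessability concern.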

Owner

@danner26 The tasks already have these kinds of identifiers, at least the index. We could modify the API endpoint (and the frontend URLs as well) to query tasks by that index. That would still expose the task ID in the API response, though. And this extends to other things like lists, namespaces, teams etc. as well, so it would be quite an effort to change it everywhere.

I'd advise against forcing a prefix, though. In previous versions of Vikunja, one was created by default whenever you created a new list; this was slightly confusing to people, so I removed it again.

Tbh this whole thing feels a bit like premature optimization of a problem which does not really exist (yet).
