Taxonomy
DAG Taxonomy
Posted July 5th, 2008 by Mgccl
I have decided to create a taxonomy database library based on a directed acyclic graph (DAG); actually, it is based on a polytree. It is like a tree taxonomy, except that one node is allowed to have more than one direct parent node.
A structure like this can't be constructed in a tree-based taxonomy.
The database is a hash table storing nodes. Indices are integers.
Each node stores three sets of integers: the indices of its direct parent nodes, the indices of its direct child nodes, and its items.
The items are integers only; each one refers to an item to be classified. The items can be thought of as ids. When the database returns items, they can be used in a SQL query to request the items' content.
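To make the layout concrete, here is a minimal Python sketch of the node table. It is only an illustration under my own assumptions: the names `Node`, `nodes` and `add_node` are hypothetical, and the real library is being written in C++.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """One taxonomy node; every member is a set of integer ids."""
    parents: set = field(default_factory=set)   # indices of direct parent nodes
    children: set = field(default_factory=set)  # indices of direct child nodes
    items: set = field(default_factory=set)     # ids of the classified items

# The "hash table storing nodes": integer index -> Node.
nodes = {}

def add_node(index, parents=()):
    """Hypothetical helper: insert a node and link it to existing parents."""
    nodes[index] = Node(parents=set(parents))
    for p in parents:
        nodes[p].children.add(index)

# A node with two direct parents, which a plain tree cannot express.
add_node(1)              # say, "equilateral polygon"
add_node(2)              # say, "triangle"
add_node(3, (1, 2))      # say, "equilateral triangle"
nodes[3].items.add(42)   # item 42 is classified directly under node 3
```

Since only integers are stored, the names in the comments would live in a separate lookup.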
These are the specs of the database and all the possible operations:
- Return all items directly under a node.
- Return all items under a node or its child nodes.
- Return the relationship from node A to node B (parent, child or neither).
- Return all items matching a complex query. For example, items that are classified under node A but not node B.
- Insert a new node.
- Delete nodes.
- Find the smallest graph that contains some specific nodes as its child nodes. This is useful when things have the same name but different meanings: the process can return results like Graph (Graph Theory) and Graph (Plotting Functions).
- Move a node from one place to another (supply the old parents and the new parents, or supply an entirely new node).
- Save/load the database to/from a file.
- Normalize the database:
  - If a node has a direct item that is also a direct item of any of its child nodes, remove the item from the node.
  - If a node has direct parent nodes P1 and P2, and P1 is a child node of P2, remove P2 from its direct parent nodes.
  - If a node has direct child nodes C1 and C2, and C1 is a parent node of C2, remove C2 from its direct child nodes.
  - If a node refers to a node that doesn't exist, remove that reference.
  - Every time nodes or items are inserted, altered or deleted, normalization runs, so the database is always normalized.
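A few of these operations, plus the normalization rules, can be sketched in Python. This is a rough sketch and not the library's actual code: the dict-of-sets layout and every function name here are my assumptions, and "child node" is read as "any node below", not only direct children.

```python
def make_db(edges, items):
    """Build a node table from (parent, child) pairs and {node: item ids}."""
    db = {}
    blank = lambda: {"parents": set(), "children": set(), "items": set()}
    for p, c in edges:
        db.setdefault(p, blank())["children"].add(c)
        db.setdefault(c, blank())["parents"].add(p)
    for n, its in items.items():
        db.setdefault(n, blank())["items"] |= set(its)
    return db

def descendants(db, n):
    """All nodes reachable from n through child links (n itself excluded)."""
    seen, stack = set(), list(db[n]["children"])
    while stack:
        c = stack.pop()
        if c in db and c not in seen:
            seen.add(c)
            stack.extend(db[c]["children"])
    return seen

def items_under(db, n):
    """Items directly under n or under any node below it."""
    out = set(db[n]["items"])
    for d in descendants(db, n):
        out |= db[d]["items"]
    return out

def relationship(db, a, b):
    """'parent' if a is above b, 'child' if a is below b, else 'neither'."""
    if b in descendants(db, a):
        return "parent"
    if a in descendants(db, b):
        return "child"
    return "neither"

def normalize(db):
    """Apply the normalization rules in place."""
    for node in db.values():  # drop references to nodes that don't exist
        node["parents"] = {p for p in node["parents"] if p in db}
        node["children"] = {c for c in node["children"] if c in db}
    for u, node in db.items():  # drop links already implied by a longer path
        redundant = {v for v in node["children"] for c in node["children"]
                     if c != v and v in descendants(db, c)}
        for v in redundant:
            node["children"].discard(v)
            db[v]["parents"].discard(u)
    for n, node in db.items():  # drop items that also appear further down
        below = set()
        for d in descendants(db, n):
            below |= db[d]["items"]
        node["items"] -= below

# Node 1 is above 2, 2 is above 3, and 1 also links straight to 3.
db = make_db(edges=[(1, 2), (2, 3), (1, 3)], items={1: [7], 3: [7, 8]})
normalize(db)
```

After `normalize`, the direct link from node 1 to node 3 is gone because node 3 is already reachable through node 2, and item 7 stays only on node 3. The two parent/child rules collapse into one loop here because removing P2 from a node's parents is the same link as removing that node from P2's children.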
In the end the database looks like this:
A separate database will handle converting the integer ids into their respective names.
Slowly progressing. I only need to finish it by the time I start building a math problem database. C++ is hard.
What I'm always looking for is Ontology
Posted April 19th, 2008 by Mgccl
Bye taxonomy...
jk. There is not enough development in ontology yet to completely discard taxonomy. Even if ontology were more developed, taxonomy would still dominate because of its simplicity.
Below, I use "taxonomy" to mean taxonomy schemes unless I specifically say otherwise.
I'm interested in classifying data because I need to create a system that can find a particular math problem fast when every math problem comes with a description. Mathematics has concrete definitions for each of its individual elements, so it's the ideal model for testing any classification system.
My previous views on classification were in fact restricted to the taxonomy system. It's simple, but it can get really complicated. Some things are extremely difficult to address, for instance the problem of how exactly to describe a term.
My confusion with taxonomy is: are terms intended to be the only objects that describe terms?
In the most common taxonomies, only terms can be used to describe terms, because they only have terms and parent-child relationships. There is nothing else, so they are easy to implement and get the job done.
I tried to divide terms into quality and quantity, which is like creating attributes to extend the taxonomy. I'm not a genius, and there are shortcomings in the model, including the use of taxonomy to describe attributes, which I think is a problem in all models ever created by man.
An example:
Triangle is a child term of Polygon.
A Triangle has 3 sides, but a Polygon can have any integer number of sides from 3 upward.
So sides should be an attribute. The different quantities of the attribute create the difference between the terms.
In fact, even without manually setting Polygon as the parent of Triangle, the computer can examine all the possible attributes of the two terms and see that Triangle is a child of Polygon.
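Here is a small Python sketch of how that inference could work, assuming each attribute is stored as an inclusive range of allowed values. The range encoding and the function name `is_child_of` are my assumptions for illustration, not part of any existing system.

```python
import math

# Each term maps an attribute name to the inclusive range of values it allows.
terms = {
    "Polygon":  {"sides": (3, math.inf)},  # 3 or more sides
    "Triangle": {"sides": (3, 3)},         # exactly 3 sides
}

def is_child_of(child, parent):
    """True if child satisfies every attribute range of parent, more narrowly."""
    for attr, (lo, hi) in terms[parent].items():
        if attr not in terms[child]:
            return False
        c_lo, c_hi = terms[child][attr]
        if c_lo < lo or c_hi > hi:
            return False
    # Identical descriptions are the same term, not parent and child.
    return terms[child] != terms[parent]

triangle_under_polygon = is_child_of("Triangle", "Polygon")
polygon_under_triangle = is_child_of("Polygon", "Triangle")
```

The first check succeeds because the range 3..3 sits inside 3..infinity, and the second fails because it does not hold the other way around.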
So side is an attribute. But a side (a synonym of edge) is just a 1-face, so side can have the attribute n-face set to 1. An attribute is thus being described by another attribute. Replace the word "attribute" with "term" in my last sentence and it would be the standard model of taxonomy. The attribute+term model seems like the normal taxonomy system forced to draw an unnatural border between completely identical concepts. I get confused, and I really want to understand whether attribute and term can be used interchangeably or not.
But worry no more... I have started to focus my attention on ontology.
All the quotes are from Wikipedia.
Definition of ontology:
An ontology is a representation of a set of concepts within a domain and the relationships between those concepts.
Now review what taxonomy means:
Taxonomy is the practice and science of classification.
We can't compare ontology with taxonomy. The definitions show that comparing them is like comparing water with kittens [1].
We can compare ontology models and taxonomic schemes, though. Both can do the essential thing I want: show the relationships between items and their properties.
With unlimited system resources, a taxonomy can in fact classify everything by exhaustively converting all attributes and their associated values into terms, like the following one:
With unlimited computational and storage resources, I would tell you right now that the current multi-inheritance taxonomy scheme is perfect. There would be no need to develop a specific ontology model for anything. We could all sleep at night knowing another great challenge has been defeated.
But obviously no one is going to represent every single number as a separate term in a taxonomy. You can, if you have relatively few attribute and value combinations. Each combination requires 3 slots of storage: one refers to the attribute, one refers to the value and one refers to the term that is the combination of those two. This constructs a huge web and needs huge computational power.
An ontology model basically solves that problem. Ontology is almost like the math we see every day! It has three significant advancements compared to a normal taxonomy: attributes, axioms and restrictions. Refer to Wikipedia on these, because I'm not as expert as Wikipedia on this particular topic. Wikipedia is like living proof of a very loose ontology.
I'm looking forward to OWL becoming part of Drupal one day. Then, when it starts to have service APIs for distributed Drupal and data, Drupal will be the perfect CMS for everything.
What is beyond the current ontology?
Humans start to talk in a constructed language that is syntactically unambiguous (the only one known to me is Lojban), so machines can now, seriously, understand what we are talking about. Then all the ontology and taxonomy structures are built by machines automatically.
- 1. No, they are not both edible.
My newest idea on a classification system and its implementation
Posted April 3rd, 2008 by Mgccl
I have talked about taxonomy systems before. Now I finally believe I have envisioned the best taxonomy system and how to implement it in databases, or even how to create a specialized database for it.
This basically builds on OOP design. Tags are basically like classes.
All the classes considered are static classes: there are no operations, only descriptions. It's all data, and we want to keep the classes in the database.
Now, to make it look more like the new web classification, I will use "tag" instead of "class". (irony)
Each tag can have 3 different types of properties: sub-tag, super-tag and quantity.
Sub-tag: "Polygon" can be a sub-tag of "Polytope"; "Polytope" is the super-tag. A sub-tag automatically inherits all the qualities (super-tags) and quantities of its super-tag.
Super-tag: Super-tags describe their sub-tags. All qualities are super-tags. Say a tag "Polygon" has the super-tag (quality) "Polytope": all the qualities "Polytope" has, "Polygon" will have.
Quantity: numbers. Quantity is a form of super-tag, but since quantities are so special, because there are infinitely many of them, it's better to think of them as a special group. Say "Triangle" is a "Polygon" with "Vertex" equal to "3". ("Vertex" is a super-tag of "Polygon"; all the super-tags of "Vertex" still apply, and "Vertex" just has the number "3" associated with it.)
Quantity just describes how many of a tag exist. The tag can be a unit, like "g": "kg" is a sub-tag of "g", and "g" is assigned 1000 for the tag "kg".
I would not like to make this system too complicated, but just as an extension... Super-tag and sub-tag can be classified as relations. Only one is required to find the other. I personally believe it's good to have both, because quantity uses super-tags and doesn't use sub-tags.
Tags can be used to tag a node. Nodes can't be used to tag other nodes. That's the only difference between tags and nodes.
Implementation:
Table: tag_sup
Table: tag_quant
Table: node_tag
Table: node_quant
Great, the entire system in 4 tables. One might ask: if we pretend nodes are tags, there would be only 2 tables. But it will be easier to separate them into 2 tables for future searches.
A search is following the path.
Actually... the core is simple. The difficult part is making it user friendly and able to handle a few problems. But there are ways to solve the problems.
As you can see, I still haven't organized everything in a very proper manner yet.
A quantity can be set as the domain of all possible quantity values. Like "Positive Integers", "{x| 1
The theory... I say it's more advanced than Drupal's taxonomy! It introduces quantity and removes "related tags" (which doesn't show anything about the relation, just that there *is* a relation... well, everything has a relation with everything else... I still don't get why Drupal would not remove it from their source).
The database design
There will be only item ids, because this is only a classification system. The data associated with the ids, like names and descriptions, is stored elsewhere.
Table: tag_sup
Field: t_id, s_id
Links a tag to its super-tag.
Table: tag_quant
Field: t_id, s_id, q
Associates a quantity with a tag and one of its super-tags.
Table: node_tag
Field: n_id, t_id
Associates nodes with tags.
Table: node_quant
Field: n_id, t_id, q
Associates nodes with tags and a quantity.
That's all for the database.
Say someone wants all nodes tagged with "Polygon" or its sub-tags. The program goes through the tag_sup table and finds all the sub-tags of "Polygon". Then it goes to node_tag and finds all the n_id tagged with "Polygon" or those sub-tags. Done.
To search by quantity, we can even use numerical operators in the database, for example returning results with quantity mod 2 = 1.
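The four tables and both searches can be tried out with Python's built-in `sqlite3` module. Treat this as a sketch under assumptions: the column types, the sample ids and the helper `nodes_tagged` are made up for illustration, and the sub-tag walk uses a recursive SQL query where the post has the program do the walking.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE tag_sup    (t_id INTEGER, s_id INTEGER);             -- tag -> super-tag
CREATE TABLE tag_quant  (t_id INTEGER, s_id INTEGER, q INTEGER);  -- quantity on a super-tag
CREATE TABLE node_tag   (n_id INTEGER, t_id INTEGER);             -- node -> tag
CREATE TABLE node_quant (n_id INTEGER, t_id INTEGER, q INTEGER);  -- node -> tag + quantity
""")

# Made-up ids: 1 = Polytope, 2 = Polygon, 3 = Triangle, 4 = Vertex.
con.executemany("INSERT INTO tag_sup VALUES (?, ?)", [(2, 1), (3, 2), (2, 4)])
con.execute("INSERT INTO tag_quant VALUES (3, 4, 3)")  # Triangle: Vertex = 3
con.executemany("INSERT INTO node_tag VALUES (?, ?)",
                [(100, 3), (101, 2), (102, 1)])
con.executemany("INSERT INTO node_quant VALUES (?, ?, ?)",
                [(100, 4, 3), (101, 4, 6)])

def nodes_tagged(con, tag_id):
    """Nodes tagged with tag_id or with any of its direct or indirect sub-tags."""
    rows = con.execute("""
        WITH RECURSIVE sub(t_id) AS (
            SELECT ?
            UNION
            SELECT tag_sup.t_id FROM tag_sup JOIN sub ON tag_sup.s_id = sub.t_id
        )
        SELECT DISTINCT n_id FROM node_tag
        WHERE t_id IN (SELECT t_id FROM sub)
        ORDER BY n_id
    """, (tag_id,))
    return [r[0] for r in rows]

# All nodes tagged "Polygon" (2) or one of its sub-tags.
polygon_nodes = nodes_tagged(con, 2)

# Search by quantity: nodes whose "Vertex" (4) quantity is odd.
odd_vertex_nodes = [r[0] for r in con.execute(
    "SELECT n_id FROM node_quant WHERE t_id = ? AND q % 2 = 1 ORDER BY n_id", (4,))]
```

With this sample data, the node tagged only "Polytope" is left out of the first search, and only the node with 3 vertices passes the odd-quantity filter.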
The problems:
1. Users don't know a tag's id; they only know the tag's common names, like "Polygon" instead of "3442".
2. When users use common names, 2 words can mean the same thing, like "Film" and "Movie".
3. A word might mean different things under different topics, like "base".
4. One user might tag a newborn cat "kitten" while another tags it "cat". "kitten" is a sub-tag of "cat", and only the lowest tag is supposed to show up.
Ways to solve the problems:
1. Associate ids with common names.
table: tag_name
fields: t_id, n_id
2. Use the table created above and add more n_id, which are associated with names.
3. When 1 word has more than one meaning, it is associated with different n_id. The system should detect that the n_id have different meanings and, according to the settings, either ask the user to choose which meaning (super-tag) he wants, or select the one the user most likely meant.
4. Find an algorithm that removes super-tags, so that the tags remaining on a node are not super- or sub-tags of each other.
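The name lookup and the pruning in point 4 can be sketched roughly in Python. Everything here is an assumption for illustration: the ids are made up, names are stored directly next to tag ids instead of going through n_id, and `lowest_tags` is just one possible algorithm.

```python
# Simplified stand-in for tag_name: (t_id, common name) pairs.
TAG_NAMES = [(1, "Film"), (1, "Movie"),   # two names, one tag (synonyms)
             (2, "base"), (3, "base"),    # one name, two tags (ambiguous)
             (4, "cat"), (5, "kitten")]

# tag -> direct super-tags, as stored in tag_sup: "kitten" is a sub-tag of "cat".
TAG_SUP = {5: {4}}

def tags_for_name(name):
    """All tag ids a common name may refer to; more than one means ambiguity."""
    return sorted(t for t, n in TAG_NAMES if n.lower() == name.lower())

def super_tags(t):
    """All direct and indirect super-tags of tag t."""
    seen, stack = set(), list(TAG_SUP.get(t, ()))
    while stack:
        s = stack.pop()
        if s not in seen:
            seen.add(s)
            stack.extend(TAG_SUP.get(s, ()))
    return seen

def lowest_tags(tags):
    """Drop every tag that is a super-tag of another tag in the set."""
    tags = set(tags)
    covered = set()
    for t in tags:
        covered |= super_tags(t)
    return tags - covered

movie_ids = tags_for_name("Movie")   # same tag as "Film"
base_ids = tags_for_name("base")     # ambiguous: the user has to choose
kept = lowest_tags({4, 5})           # a node tagged both "cat" and "kitten"
```

When `tags_for_name` returns more than one id, the system would either ask the user or guess, depending on the settings.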
These are just some thoughts... I wish I could think of something better than the very messy quantity property... but it's not likely, since there is no way to make all numbers into tags. It's more likely that the quantity property will evolve into a logic property, like: under w/e condition, w/e will be classified as w/e. Please make suggestions if you can xD. The smartest way for most people is to just remove quantity and never address quantity in classification.
Taxonomy systems and their problems
Posted January 24th, 2008 by Mgccl
In this information age, it is becoming more and more difficult to classify data. So far, many approaches have been suggested, and each one of them has some flaws.
Trees
The tree is a classic taxonomy system. It's widely used for classifying living organisms, in file systems and for organizing most papers. Each parent term can have a few or no child terms. Each child term can have only one direct parent term.
Pros
1. easy to navigate
2. easy to manage
Problems
The importance of each term can be understood differently. For example, should equilateral triangle be equilateral polygon->triangle, or should it be triangle->equilateral?
Easy navigation requires some knowledge of how items are categorized.
Multiple inheritance
It's the same as the tree system, but each term can have more than one direct parent term. The equilateral triangle can be classified as a child term of both equilateral polygon and triangle.
Pros
1. easy to navigate
2. easy to manage, though more difficult than a tree
Problems
Since equilateral triangle belongs to both equilateral polygon and triangle, and both of those belong to polygon, should equilateral polygon and triangle be classified as the same kind of item? Triangle, pentagon and hexagon are surely more closely related to each other than equilateral polygon is to triangle. Should a new parent be created just to tell these two apart, like "relative length of the polygon's sides" and "number of sides"?
Naive Tagging
Pros
1. Easy to tag
2. Easy to navigate when there is a smart system
3. Social tagging power
Problems
There is no sense of parent and child. Someone can tag equilateral triangle with "math", "geometry" and "polygon". We know polygon is part of geometry, which is part of math. But if someone searches for "math" and the item was only tagged "geometry", it will not show up. To make an item fully accessible, it has to be tagged with a tag and all of its parent tags.
Several words can have the same meaning. Different users will tag the same thing differently, so a search for one tag does not turn up items tagged with the other. For example, "Triangle" and "Polygon with 3 sides" are identical by the definition of a triangle, but they are different tags.
A word can mean different things. "Python" can mean a programming language or a particular group of snakes. Searching for Python the language turns up a lot of snakes that were not intended.
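The first problem, the missing parent and child sense, can be shown in a few lines of Python. This is only a sketch: the `PARENT` table is exactly the knowledge a naive tagging system does not have, and all the names are made up for illustration.

```python
# tag -> parent tag; naive tagging has no such table.
PARENT = {"polygon": "geometry", "geometry": "math"}

# The item was only tagged "geometry".
items = {"equilateral triangle": {"geometry"}}

def with_parents(tags):
    """Expand a set of tags with all of their parent tags."""
    out = set()
    for t in tags:
        while t is not None and t not in out:
            out.add(t)
            t = PARENT.get(t)
    return out

def naive_search(tag):
    """Match only the tags that were literally applied."""
    return [name for name, tags in items.items() if tag in tags]

def search_with_parents(tag):
    """Match applied tags and everything above them."""
    return [name for name, tags in items.items() if tag in with_parents(tags)]

missed = naive_search("math")
found = search_with_parents("math")
```

The naive search for "math" misses the item, while the search that knows the hierarchy finds it without anyone having to add the parent tags by hand.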
Finally, after Drupal's Taxonomy system came out, I saw the best categorizing system yet.
Tagging with multiple inheritance and synonym support + more than one vocabulary
The most powerful taxonomy system Drupal can offer is tagging with multiple inheritance and synonym support, plus, as an extra, a "related tags" field. Using related tags is not formal, because there is no standard that defines what the relation between the tags is or how to act on different relations.
Vocabularies are completely different sets of terms; they have no relationship with one another. So two vocabularies can contain the same term ("python") but use it differently (a vocabulary of programming and a vocabulary of snakes). Basically, it's the same as categorizing one set of terms under one single parent, except that the set is defined to have no relation to any other set of terms.
Pros
1. Really powerful at classifying data
Problems
Besides the huge difficulty of constantly adding synonyms for a tag, formatting all the possible inheritance, and a lot of database queries, this system still can't fix the multi-meaning word problem. But if the taxonomy managers are careful, it will not happen. The basic rule: never let a term be ambiguous. For example, instead of using "Python" for both the snake and the programming language, use "Python(Programming Language)" and "Python(snake)". This problem does not exist in a tree-based system. If anyone can think of a better suggestion, please let me know :)
Also, a few categorization problems seem impossible to solve unless one knows how to define vocabularies perfectly. For a personal example, let's say there are two vocabularies, time and space. An item appeared in China in 2001 and also appeared in the US in 2003. So this item will be tagged:
Space: US, China
Time: 2001, 2003
But the space terms are not directly associated with the time terms, so someone with no knowledge of the item can infer that it also appeared in China in 2003, which is wrong. When I was using the Space and Time vocabularies, I found they made perfect sense: users can browse by time or by space. Then I finally met a problem like this and had to consider redefining the vocabularies. The only solution requires removing the 2 separate vocabularies and combining them: create terms like "China 2001" that inherit from "2001" and "China". If we think about it, in many cases there can be only 1 term that directly inherits from 2 given terms, and in those cases I like to think of it as a synonym for the 2 terms combined. The disadvantage is a huge amount of database usage. Say there are 200+ countries and humans have recorded history for more than 5000 years: if the original database contains about 5000 terms now, it would grow to over a million terms.
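Both the wrong inference and the size estimate can be checked with a short Python sketch. The variable names are mine, and 200 and 5000 are just the round figures used above.

```python
from itertools import product

# Two independent vocabularies attached to the same item.
space = {"US", "China"}
times = {2001, 2003}

# With separate vocabularies, a reader can only assume every pairing.
implied = set(product(space, times))

# What actually happened.
actual = {("China", 2001), ("US", 2003)}

# Pairings that the tags suggest but that never happened.
wrong = implied - actual

# Rough count of combined terms such as "China 2001".
countries, years = 200, 5000
combined_terms = countries * years
```

Half of the four implied pairings are false here, and one combined term per country and year already comes to a million.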
Future...
I want to hear suggestions on how these problems can be solved, because I have no idea.