#2423 HIGH Trial-3: Journal does not find substrings

Sat Aug 4 20:01:37 EDT 2007

#2423: Journal does not find substrings
-------------------------------+--------------------------------------------
  Reporter:  bert              |       Owner:  bcsaller
      Type:  defect            |      Status:  new     
  Priority:  high              |   Milestone:  Trial-3 
 Component:  interface-design  |     Version:          
Resolution:                    |    Keywords:          
  Verified:  0                 |  
-------------------------------+--------------------------------------------
Changes (by Eben):

 * cc: krstic, marco, tomeu (added)
  * owner:  Eben => bcsaller

Comment:

 Well, there's a lot to consider.  We have 3 main things to search: titles,
 tags (+ metadata), and text.

 '''Case-sensitivity:'''  I think that it's a safe bet to ignore case
 entirely in all aspects of the search.  Its use as a means of
 distinguishing proper nouns from arbitrary tags is small compared to the
 vast number of misses that it would likely cause.  Google has been doing
 pretty well ignoring case for years.
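
 A minimal sketch of what ignoring case could look like, assuming Python 3
 strings.  The helper names `normalize` and `same_term` are hypothetical and
 not part of the datastore; `str.casefold()` is used rather than `lower()`
 because it also folds characters such as the German sharp s.

```python
def normalize(text):
    """Fold case so that 'Write', 'WRITE' and 'write' compare equal."""
    return text.casefold()


def same_term(query, candidate):
    # Hypothetical helper: compare two terms with case ignored entirely.
    return normalize(query) == normalize(candidate)
```

 Applying the same normalization to titles, tags and text at index time and
 to the query at search time would keep the two sides consistent.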

 '''Boolean logic:''' We'd like to default to OR logic because this will
 always return a superset of the AND results, and we feel it's better to
 provide excess matches than few or none.  Refining search terms and
 filters is much more pleasant than removing them to broaden a search.  We
 would, however, also like to support boolean search terms and
 parenthetical grouping when they are explicitly entered.  It might be nice
 to allow use of '&', '|', and '!' as well as the localized strings for
 'and', 'or' and 'not'.
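
 To make the precedence concrete, here is a rough sketch of a query
 evaluator with OR as the default between adjacent terms.  Everything in it
 is an assumption for illustration: the names `tokenize` and `matches`, the
 choice that '&' binds tighter than '|', and the English-only word operators
 (localized strings would have to come from the translation catalog).  Terms
 are tested by exact membership in a set of lowercase words, with no partial
 or fuzzy matching.

```python
import re

# Operators and parentheses are single-character tokens; anything else is a word.
TOKEN = re.compile(r"[()&|!]|[^\s()&|!]+")
WORD_OPS = {"and": "&", "or": "|", "not": "!"}  # English only in this sketch


def tokenize(query):
    tokens = TOKEN.findall(query.lower())
    return [WORD_OPS.get(tok, tok) for tok in tokens]


def matches(query, words):
    """Return True if the set of lowercase `words` satisfies `query`."""
    tokens = tokenize(query)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        return tok

    def parse_or():
        result = parse_and()
        # Adjacent terms with no operator between them are OR'ed together.
        while peek() is not None and peek() != ")":
            if peek() == "|":
                take()
            rhs = parse_and()  # always parse, so tokens are consumed
            result = result or rhs
        return result

    def parse_and():
        result = parse_not()
        while peek() == "&":
            take()
            rhs = parse_not()
            result = result and rhs
        return result

    def parse_not():
        if peek() == "!":
            take()
            return not parse_not()
        return parse_atom()

    def parse_atom():
        if peek() is None:  # dangling operator at the end of the query
            return False
        tok = take()
        if tok == "(":
            result = parse_or()
            if peek() == ")":
                take()
            return result
        return tok in words

    return parse_or() if tokens else True
```

 With this sketch, "cat fish" matches an entry containing only "cat", while
 "cat & fish" does not.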

 '''Fuzzy search:''' I opened ticket #2645 regarding fuzzy matches for
 search strings.  Since the kids using these machines are going to be both
 a) learning how to spell and b) learning how to type, we should expect
 inaccuracies in their search queries, which some amount of fuzziness could
 overcome.  As mentioned there, I think titles and tags could use fuzzy
 search, but it's not needed for full text.
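
 One possible way to get "some amount of fuzziness" is a small edit-distance
 threshold.  This is only a sketch: the use of Levenshtein distance, the
 `max_edits=1` default, and the function names are assumptions, not something
 decided here or in #2645.

```python
def edit_distance(a, b):
    """Levenshtein distance: single-character inserts, deletes, substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete from a
                               current[j - 1] + 1,       # insert into a
                               previous[j - 1] + cost))  # substitute
        previous = current
    return previous[-1]


def fuzzy_equal(query, candidate, max_edits=1):
    # Hypothetical helper: case is ignored, and up to max_edits typos are allowed.
    return edit_distance(query.casefold(), candidate.casefold()) <= max_edits
```

 A misspelled query such as "washingtin" is one substitution away from
 "Washington", so it would match under this threshold.  Very short terms
 probably need a stricter limit, since one edit changes most of the word.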

 '''Partial matching:''' I think that partial matching is absolutely
 essential for titles, and I think it's probably good to use for tags as
 well.  Since activity name, participants, etc. should all be stored
 within the metadata, this would allow me to search for "Writ" and find all
 Write documents, or "Walt" and find all instances of activities I did with
 Walter.  Again, like fuzzy search, I think partial matching of full text
 is unnecessary and would provide far too many results.  If we really
 wanted to, we could allow explicit use of wildcards to force partial
 matches on text, but I think that's an edge case optimization.  Note that
 this partial match is not bidirectional:  a search for "ca" should match
 "cat", but a search for "cathedral" should not match "cat".  Also, phrases
 within single or double quotes should require a full match (though again,
 perhaps fuzzy).
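
 The one-directional rule can be sketched as a containment test in which the
 query must appear inside the stored value, never the other way round.  The
 function name `term_matches` is hypothetical, and reading "full match" for a
 quoted phrase as equality with the whole title or tag is an assumption.

```python
def term_matches(term, value):
    """Match one query term against one title or tag value, ignoring case."""
    value = value.casefold()
    if len(term) >= 2 and term[0] == term[-1] and term[0] in "'\"":
        # Quoted phrase: no partial matching (assumed to mean whole-value equality).
        return term[1:-1].casefold() == value
    # Unquoted: the query may be any part of the value, but not vice versa.
    return term.casefold() in value
```

 Here "Writ" finds "Write" and "ca" finds "cat", while "cathedral" does not
 find "cat".  If only word beginnings should match, `in` could be swapped for
 `str.startswith`.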

 '''Tags:''' Tags are explicitly entered by the user.  We'd like to make
 their input as format-agnostic as possible, so that we aren't forcing any
 given system upon the kids.  To do so, we'd like to lay down a base rule,
 stating that all tags are space delimited.  This, of course, means that
 you can't tag something as "white house" but only with the weaker pair of
 tags "white" and "house."  Again, we'd like to offer a few accommodations
 so that, while not required, advanced users can enter more accurate tags.
 For this reason, we can support 3 ways of entering a tag with spaces:

  1. Washington, white house, capitol
  2. washington "white house" capitol
  3. washington white_house capitol

 The first method assumes that, when commas are present, they are meant for
 delimiting a list.  Commas within quotes, if they exist, would be ignored.
 The second method uses quotes - single or double - to group two or more
 words into a single tag. Single quotes inside double quotes should be
 ignored. The last method replaces the space with an underscore in order to
 tie the words together.  Of course, since we allow partial matches,
 searching for "white" would return the entry regardless of how the tag was
 entered.  On the other hand, searching for "white house" or white_house
 would return only those with both terms in the tag, title, or text.
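
 The three input styles could be handled by one small parser along these
 lines.  This is a sketch under assumptions: the names `split_outside_quotes`
 and `parse_tags` are made up, tags are lowercased, and a comma inside quotes
 is kept as part of the tag rather than treated as a delimiter.  An
 apostrophe outside double quotes (as in an unquoted "bert's") would open a
 quoted group, which this sketch does not try to handle.

```python
def split_outside_quotes(field, is_delimiter):
    """Split on delimiter characters that are not inside a quoted group."""
    parts, current, quote = [], [], None
    for ch in field:
        if quote:
            if ch == quote:
                quote = None        # closing quote ends the group
            else:
                current.append(ch)  # other quote type and commas are literal
        elif ch in "'\"":
            quote = ch              # opening quote starts a group
        elif is_delimiter(ch):
            parts.append("".join(current))
            current = []
        else:
            current.append(ch)
    parts.append("".join(current))
    return parts


def parse_tags(field):
    # Method 1: if any comma appears outside quotes, commas delimit the list.
    parts = split_outside_quotes(field, lambda ch: ch == ",")
    if len(parts) == 1:
        # Methods 2 and 3: space delimited, with quotes grouping words.
        parts = split_outside_quotes(field, str.isspace)
    tags = []
    for part in parts:
        # Underscores tie words together; collapse runs of whitespace.
        tag = " ".join(part.replace("_", " ").split()).casefold()
        if tag:
            tags.append(tag)
    return tags
```

 All three example inputs above come out as the same list of tags:
 "washington", "white house" and "capitol".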

 '''Metadata:''' We associate any number of metadata fields with the objects
 in the Journal.  Most of the metadata is provided by the activity or by
 Sugar, but we also want to allow more advanced children to create their
 own.  Presently, the format for this is simply typing "key:value" pairs
 within the tag field.  Likewise, a search for a key:value pair should
 return results only for entries which have the same key:value pair (with
 fuzziness, perhaps).  To be clear, searching for a key:value pair should
 search all metadata, including that which wasn't created by the user
 within the tag field.
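
 As a rough illustration, user-entered "key:value" pairs could be separated
 from plain tags and then searched together with the system-provided
 metadata.  The function names, the dictionary layout, and the lowercase
 keys are all assumptions made for this sketch; the match here is exact
 apart from case, with no fuzziness.

```python
def split_tag_field(tokens):
    """Separate plain tags from user-entered key:value metadata."""
    tags, metadata = [], {}
    for token in tokens:
        key, sep, value = token.partition(":")
        if sep and key and value:
            metadata[key.casefold()] = value.casefold()
        else:
            tags.append(token.casefold())
    return tags, metadata


def metadata_matches(query, entry_metadata):
    """True if `query` is a key:value pair present in the entry's metadata.

    `entry_metadata` is assumed to hold every field with lowercase keys,
    whether it came from the activity, from Sugar, or from the tag field.
    """
    key, sep, value = query.partition(":")
    if not (sep and key and value):
        return False
    stored = entry_metadata.get(key.casefold())
    return stored is not None and str(stored).casefold() == value.casefold()
```

 Merging the user's pairs into the same dictionary as the system fields is
 what lets one query reach both kinds of metadata.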

 There is some more background and additional thoughts on this in the
 Journal section of the HIG:
 http://wiki.laptop.org/go/OLPC_Human_Interface_Guidelines/The_Laptop_Experience/The_Journal#The_Power_of_Metadata

 As an aside, it occurred to me that present Journal designs don't offer
 any means of showing results based upon relevance.  On the one hand, I
 think that the time based view is quite important, and can even be so
 within the results list.  On the other hand, I wonder if we need to allow
 the option, even though temporal sort will remain the default.  Can we
 improve the speed of the search if we don't attempt to provide relevance
 rankings at all?

 Oh, and I assume this is already the case, but we should be sure to apply
 any selected filters before considering the query string, to limit the
 number of entries we have to check for matches.
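
 In code, the ordering might look like the sketch below.  The function
 `search`, the representation of filters as predicates, and the dictionary
 entries are hypothetical; the real datastore may well do this inside its
 index instead.

```python
def search(entries, filters, query_matches):
    """Apply the selected filters first, then test the query on what is left."""
    candidates = [e for e in entries if all(f(e) for f in filters)]
    return [e for e in candidates if query_matches(e)]


# Usage: only entries from the Write activity are ever checked against the query.
entries = [
    {"activity": "Write", "title": "Cat story"},
    {"activity": "Paint", "title": "Cat picture"},
    {"activity": "Write", "title": "Dog story"},
]
checked = []


def title_has_cat(entry):
    checked.append(entry["title"])  # record which entries the query touched
    return "cat" in entry["title"].casefold()


result = search(entries, [lambda e: e["activity"] == "Write"], title_has_cat)
```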

-- 
Ticket URL: <http://dev.laptop.org/ticket/2423#comment:4>
One Laptop Per Child <http://laptop.org/>