Wednesday, July 31, 2013

Added terms component coverage to Early Access Release #5 for Solr 4.x Deep Dive

I just finished adding coverage of the terms component to Early Access Release #5 for my Solr 4.x Deep Dive book, both a short intro in the tutorial and a full chapter in the query part of the book. This eliminates another TBD from the book outline.
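For anyone who hasn't used the terms component, here is a minimal sketch of building a request URL for it. The host, the core name (collection1), the field name, and the helper function name are all assumptions for illustration; terms.fl, terms.prefix, and terms.limit are standard terms component parameters, and the sketch assumes a /terms request handler is configured, as it is in the stock Solr example.

```python
from urllib.parse import urlencode


def terms_url(base_url, field, prefix, limit=10):
    """Build a terms component request URL (hypothetical helper, not part of Solr)."""
    params = {
        "terms.fl": field,       # field whose indexed terms are listed
        "terms.prefix": prefix,  # only return terms starting with this prefix
        "terms.limit": limit,    # maximum number of terms to return
        "wt": "json",            # response format
    }
    return base_url + "/terms?" + urlencode(params)


# Assumed local Solr instance and core name; adjust for your own setup.
print(terms_url("http://localhost:8983/solr/collection1", "name", "so"))
```

The function only builds the URL; sending it with any HTTP client would return the matching terms and their document frequencies.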
 
Now, I am going to move on to the term vectors component.
 
Then, I'll have to decide between moving on to coverage of highlighting, core management, collection management, or SolrCloud.

-- Jack Krupansky

Monday, July 29, 2013

Added Solr real-time get coverage to Early Access Release #5 for Solr 4.x Deep Dive

I just finished adding coverage of Solr real-time get to Early Access Release #5 for my Solr 4.x Deep Dive book, both a short intro in the tutorial and a full chapter in the query part of the book. For some reason, it was not even on my lengthy TO DO list for the book. Not sure how I missed it.
 
Now, I am going to move on to the terms component.
 
Then, I'll have to decide between moving on to the term vectors component, SolrCloud, or highlighting.

-- Jack Krupansky

Saturday, July 27, 2013

Early Access Release #4 for Solr 4.x Deep Dive is now available for download on Lulu.com

Okay, it's hot off the e-presses: Solr 4.x Deep Dive, Early Access Release #4 is now available for purchase and download as an e-book for $9.99 on Lulu.com at:
 
 
(That link says "1", but it apparently correctly redirects to EAR #4.)
 
I completed updates for the recent release of Solr 4.4 just four days ago. That was the primary focus of this EAR.
 
Summary of changes:
  • Covered changes to NGramTokenizer and NGramFilter for Solr 4.4
  • Covered addition of lcmap for normalization mapping of language identifiers in language detection update processors
  • Coverage for the maxscore query parser for Solr 4.4
  • Coverage for the switch query parser for Solr 4.2
  • Covered changes to <mergePolicy> for Solr 4.4
  • Added detailed descriptions for other pre-4.4 merge policy classes
  • Added three appendices – intro to XML, intro to regular expressions, URL encodings
  • Added a new chapter for "Schemaless Discovery Mode"
  • Lucene infostream now can be sent to the Solr log file
  • ByteField and ShortField field types deprecated in schema
  • Added notes about order of evaluation for overlapping dynamic field patterns
  • Other minor Solr 4.4 changes
  • More formatting cleanup and indexing improvements
  • Summary comparison of the book to the new Apache Solr Reference Guide
Total of 69 pages of additional content.
 
The Solr 4.4 changes are all indexed. Look up "4.4" under "Solr release" in the index for a clickable list of pages with changes for Solr 4.4.
 
Although some of those changes are in fact documented in the Lucene and Solr Javadoc and Solr Reference Manual and Solr wiki (all available FOR FREE!), I focus on providing greater detail and a lot more examples – a deeper dive. For example, the book includes a full list of all of the tokenizers, all of the token filters, and all of the update processors, and full descriptions for all options in solrconfig.
 
This EAR was actually a week early (well, I wanted it to be published yesterday, but...). Solr 4.4 was released this week and I was essentially done with my 4.4 updates for the book, and the new Apache Solr Reference Guide is now available, so it seemed like good timing. And now I have almost three weeks to do some more significant coverage than the catch-up work of the latest EARs.
 
I wouldn't bill EAR#4 as a "major" release – if you already have EAR#1 or #2 or #3, you may want to hold off for another release or two, although there have been LOTS of improvements.
 
Please feel free to email or comment on this blog for any questions or issues related to the book.
 
Thanks!

-- Jack Krupansky

Almost there for Early Access Release #4 of Solr 4.x Deep Dive

Okay, I finally got a handle on the merge policy changes. The good news is that I had confused myself and I actually did have the description correct in the book in previous releases. The bad news is that I was missing some additional material, which I have now added. Just a few more hours to clean up some stuff and I can publish Early Access Release #4 of the Solr 4.x Deep Dive book later this afternoon.
 
Again, the primary focus of this EAR is simply to cover Solr 4.4 changes. No significant new areas of coverage.
 
All of the Solr 4.4 changes are covered, except for those related to areas of Solr that are not covered at all by the book yet (marked TBD, such as SolrCloud, core and collection management, admin, etc.)

-- Jack Krupansky

Friday, July 26, 2013

Progress on Early Access Release #4 for Solr 4.x Deep Dive

I'm making steady progress on Early Access Release #4 for my Solr 4.x Deep Dive book. I only have one bullet point left in the backlog of Solr 4.4 items: the change to merge policy in solrconfig. Unfortunately, while digging into that item, I realized that I made some mistakes in that area in the existing coverage, so I'm trying to sort that out now. Basically, I had parameters under <mergePolicy>, but in reality all of those mergePolicy-specific parameters are separate configuration elements within <indexConfig>. In fact, it looks like the standard Solr example solrconfig itself is wrong in that area as well. But I'm still sorting it out. Right now I'm looking at previous Solr releases to see how they implemented merge policy in solrconfig.
 
I had hoped that maybe I could do an early release of EAR#4 today with all the 4.4 changes since 4.4 is now out, but it all depends on when I get this merge policy confusion sorted out. Stay tuned. It is still a possibility for late afternoon today.
 
Besides more general cleanup, there isn't much else in EAR#4 other than 4.4 feature coverage (and a 4.2 feature as well.)

-- Jack Krupansky

Saturday, July 20, 2013

Starting on Early Access Release #4 for Solr 4.x Deep Dive book

I have started work on Early Access Release #4 for my Solr 4.x Deep Dive book. Expected to be published in two weeks, on Friday, August 2.
 
So far:
  • Added appendix for URL encodings for special characters
  • Covered changes to NGramTokenizer and NGramFilter for Solr 4.4
My intention is to focus at least a third or even half of my time on covering the Solr 4.4 features, and a third or so on background work for SolrCloud, and another third on whatever else I can squeeze in, including further cleanup, formatting, and indexing.

-- Jack Krupansky

Friday, July 19, 2013

Also in EAR#3, changes to EdgeNGramTokenizer and Filter

Oops, I neglected to mention another area where I had to do some updates for Solr 4.4 in my Solr 4.x Deep Dive book – the EdgeNGramTokenizerFactory and EdgeNGramFilterFactory no longer support the side="back" parameter, but there is a technique using the ReverseStringFilterFactory to achieve the same effect (for the filter). The book provides the complete example.
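To make the idea concrete, here is a rough sketch in plain Python (not Solr configuration, and not the example from the book) of how reversing can stand in for side="back". It assumes the technique amounts to sandwiching a front edge n-gram step between two reversals; the function names are made up for illustration.

```python
def edge_ngrams_front(token, min_gram, max_gram):
    """Front-anchored edge n-grams: prefixes of length min_gram..max_gram."""
    upper = min(max_gram, len(token))
    return [token[:n] for n in range(min_gram, upper + 1)]


def edge_ngrams_back(token, min_gram, max_gram):
    """Emulate back-anchored n-grams: reverse, take front n-grams, reverse again."""
    reversed_token = token[::-1]
    front_grams = edge_ngrams_front(reversed_token, min_gram, max_gram)
    return [gram[::-1] for gram in front_grams]


print(edge_ngrams_front("search", 2, 4))  # ['se', 'sea', 'sear']
print(edge_ngrams_back("search", 2, 4))   # ['ch', 'rch', 'arch']
```

In a Solr analyzer the same shape would presumably be a reverse filter, then the edge n-gram filter, then a second reverse filter, but treat that ordering as my assumption rather than a quote from the book.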

-- Jack Krupansky

Thursday, July 18, 2013

Early Access Release #3 for Solr 4.x Deep Dive is now available for download on Lulu.com

Okay, it's hot off the e-presses: Solr 4.x Deep Dive, Early Access Release #3 is now available for purchase and download as an e-book for $9.99 on Lulu.com at:
 
 
(That link says "1", but it apparently correctly redirects to EAR #3.)
 
I just added:
  • Changes to NorwegianLightStemFilterFactory and NorwegianMinimalStemFilterFactory for Solr 4.4
  • Added ScandinavianFoldingFilterFactory and ScandinavianNormalizationFilterFactory for Solr 4.4
  • Added description and examples for new rowid and rowidOffset parameters for Update CSV for Solr 4.4
  • Misc. random cleanup
Previously, I had added the following for EAR#3:
  • A new parameter for field selectors for the field mutating update processors to indicate whether fields must be in the schema or not.
  • Addition of the parse update processors for converting string values to numeric, date, and boolean values.
  • Addition of the Add Schema Fields Update processor.
  • The Min and Max Field Value Update processors now handle numeric values properly when using the JSON update format, but the new parse update processors are needed for numeric values when using the XML, CSV, or other non-JSON update formats.
  • Added the new Pattern Capture Group token filter (PatternCaptureGroupFilterFactory.)
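As a rough illustration of why the parse update processors matter for the Min and Max Field Value processors, here is a plain Python sketch (not Solr code). It assumes that values arriving via XML or CSV are strings and therefore compare lexicographically until something parses them; both function names are hypothetical stand-ins.

```python
def min_field_value(values):
    """Keep the smallest value, roughly what a min-value processor does."""
    return min(values)


def parse_ints(values):
    """Hypothetical stand-in for a parse processor: numeric strings become ints."""
    return [int(v) for v in values]


json_values = [10, 9, 200]       # JSON can carry real numbers
xml_values = ["10", "9", "200"]  # XML and CSV deliver strings

print(min_field_value(json_values))             # 9
print(min_field_value(xml_values))              # 10 (string comparison, not numeric)
print(min_field_value(parse_ints(xml_values)))  # 9
```

The middle result is the trap: "10" sorts before "9" as a string, so the minimum is wrong until the values are parsed.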
Although some of those changes are in fact documented in the Lucene and Solr Javadoc and Solr Reference Manual and Solr wiki (all available FOR FREE!), I focus on providing greater detail and a lot more examples – a deeper dive. For example, the book includes a full list of all of the tokenizers, all of the token filters, and all of the update processors.
 
Over the past two weeks I did a bunch of research and wrote up some notes on SolrCloud, but I didn't have anything of publication quality yet.
 
I wouldn't bill EAR#3 as a "major" release – if you already have EAR#1 or #2, you may want to hold off for another release or two, although there have been LOTS of improvements.
 
Please feel free to email or comment on this blog for any questions or issues related to the book.
 
Thanks!

-- Jack Krupansky

Progress on EAR#3 for Solr 4.x Deep Dive - maybe tonight, maybe not

I'm getting close, but not sure if I'll be finished with Early Access Release #3 of my book Solr 4.x Deep Dive tonight. We'll see. If not tonight, then likely by noon on Friday.
 
Right now I'm laboriously sifting through the details of the changes in Solr 4.4 for the Norwegian language token filters. There are changes to two of the existing filters and the addition of two new filters for Scandinavian languages in general.
 
After that, there is a new parameter for CSV Update in Solr 4.4 that needs to be described and needs an example.
 
There will still be a number of Solr 4.4 changes that won't make it into this EAR, but that's life on a schedule.
 
Solr 4.4 had an RC0 ready to go, but I found a problem and had to vote -1 on it. They were going to spin a new RC tonight, but then another problem was discovered. Sounds like maybe next week 4.4 will hit the streets.

-- Jack Krupansky

Tuesday, July 16, 2013

Slow progress on EAR#3 for Solr 4.x Deep Dive

My progress on Early Access Release #3 of my Solr 4.x Deep Dive book has been slower than I expected. I'll still have a release on Friday, but it will be more limited. I had hoped to have a first crack at SolrCloud, and I did spend a lot of time researching and writing preliminary notes for it, but I don't have anything publishable yet. Instead, I focused on 4.4 updates. Even there, progress has been slower than expected since there are so many nuances. I have a couple of days left, but there are still a lot of 4.4 features that I haven't covered.
 
I've tried to focus on new and changed Update Request Processors and Token Filters. Specifically:
  • A new parameter for field selectors for the field mutating update processors to indicate whether fields must be in the schema or not.
  • Addition of the parse update processors for converting string values to numeric, date, and boolean values.
  • Addition of the Add Schema Fields Update processor.
  • The Min and Max Field Value Update processors now handle numeric values properly when using the JSON update format, but the new parse update processors are needed for numeric values when using the XML, CSV, or other non-JSON update formats.
Also, in token filters:
  • Added the new Pattern Capture Group token filter (PatternCaptureGroupFilterFactory.)
All of that work is done.
 
And there were a modest number of edits here and there. As well as more indexing.
 
I'll get a few other items covered over the next two days as well.
 
I imagine that the priorities for EAR#4 will be more of the same – deeper research and notes for SolrCloud, and more 4.4 coverage.
 
-- Jack Krupansky

Tuesday, July 09, 2013

Focus for EAR#3 of Solr 4.x Deep Dive book - 4.4 Update Processors and SolrCloud

By default, I've set my priorities for the focus of Early Access Release #3 of my Solr 4.x Deep Dive book:
  1. Very preliminary coverage of SolrCloud. Mostly research and testing some initial examples, and a start on a glossary for SolrCloud.
  2. Adding the update processors and token filters for Solr 4.4, due out soon.
Whether I get to anything else remains to be seen. Mostly it will be a question of how far I get with SolrCloud. Most of my time gets spent reading code and trying to make sense of the more obscure corners of SolrCloud, so I'm not sure how much actual writing I'll get in over the next 10 days. We'll see.

-- Jack Krupansky

Friday, July 05, 2013

Possible priorities for Solr 4.x Deep Dive Early Access Release #3

I need to catch my breath, but here are some of the priorities I am thinking of for Early Access Release #3 of my Solr 4.x Deep Dive book over the next couple of weeks:
  • Catch up on Solr 4.4 features as soon as the Solr 4.4 RC1 is available.
  • Start work on SolrCloud coverage
  • Start work on Data Import Handler coverage
  • Start work on SolrJ coverage
Each of the latter three items is likely to be a multi-month effort, but I think pursuing them in parallel will allow me to manage frequent stumbling blocks more effectively. And, these informal priorities do not guarantee that any of these areas will definitely be covered in EAR #3.
 
Some other areas that may get attention:
  • Highlighting
  • Autocomplete
  • Grouping
  • Core management
  • Collection management
But absolutely no promises on that front.
 
There will probably be incremental work on formatting and indexing, but as a lower priority since that was such a high priority for EAR #2.
 
Comments?

-- Jack Krupansky

Thursday, July 04, 2013

Early Access Release #2 for Solr 4.x Deep Dive is now available for download on Lulu.com

Okay, it's hot off the e-presses: Solr 4.x Deep Dive, Early Access Release #2 is now available for purchase and download as an e-book for $9.99 on Lulu.com at:
 
http://www.lulu.com/shop/jack-krupansky/solr-4x-deep-dive-early-access-release-1/ebook/product-21079719.html
 
(That link says "1", but it apparently correctly redirects to EAR #2.)
 
My blog posts over the past two weeks detailed the changes from EAR#1. A lot of them were formatting and indexing, but there are also a couple more scripting update processor examples and a new "Solr Hot Spots" preface section that points the reader to interesting sections worth checking out, such as the grammars for the various query parsers, a complete list of functions, and complete lists of char filters, tokenizers, token filters, and update processors.
 
The next EAR will be in approximately two weeks, contents TBD.
 
If you have already purchased EAR#1, there is no need to rush out and pick up EAR#2. I mean, the technical content changes were only modest, and EAR#3 will be out in another two weeks anyway. That said, EAR#2 is a significant improvement over EAR#1.

-- Jack Krupansky

EAR#2 for the book is almost ready to go!

Okay, I'm basically done with the edits to the book, Solr 4.x Deep Dive, for Early Access Release #2. I'm just doing a few final checks for nuances I missed here and there. It's still nowhere near final, but basically ready for this release.
 
I'm toying with whether I should release it at midnight tonight (Thursday) so the guys in Europe will have it at 6 AM, or at 8 AM ET for the Yanks, or noon, or EOD Friday – 5 PM ET. I haven't decided yet. I may go with midnight since it is basically ready and that would get it out of the way, and Friday can be spent starting on EAR #3. I am a little worried about setting a precedent or expectation for the release time for future releases, but... I think I'll sidestep that issue by just saying that there is a 24-hour release window for Friday – anywhere from midnight (okay, 12:01 AM ET Friday) to midnight on Friday. Then, generally I will shoot for early in the day on Friday as a nominal goal, but that whole 24-hour window is fair game.
 
The other agonizing decision for this second release is whether to release as a new title or simply update the existing title. The latter has the advantage that existing links remain valid and will always point to the latest release, and sales numbers are cumulative. So, I'm leaning towards the latter – update existing title. Actually... now when I look at the URL, it does have the EAR number in it, so... I may have no choice but to "retire" the old EAR and publish a new one – we'll see, soon enough.
 
What's in this release? Read my earlier blog posts – it's all there.

-- Jack Krupansky

Wednesday, July 03, 2013

Split out plugin, request handler, and update handler config from solrconfig chapter of book

Some important topics were buried in the Solrconfig chapter of the book, so I broke them out into separate chapters:
  • Request Handler Configuration
  • Plugin Configuration
  • Update Handler Configuration
Also, Solr Named Lists are broken out into an appendix.

-- Jack Krupansky

Split Text Analysis chapter into four pieces for book

As I had suggested, I went ahead and split the "Text Analysis" chapter of the book into four chapters:
  • Analyzers Overview
  • Character Filters
  • Tokenizers
  • Token Filters
I also did a fair amount of indexing for those chapters, and made formatting improvements as well. Still not complete, but much improved.
 
The new "Solr Hot Spots" preface page links directly to the tables of character filters, tokenizers, and token filters, among other areas of interest.
 
I'm still on track to publish EAR #2 on Friday.

-- Jack Krupansky

Monday, July 01, 2013

Split Text Analysis chapter into several chapters for book?

The Text Analysis chapter of the book feels a bit too large and monolithic, so I was thinking of splitting it into multiple chapters:
  • Introduction to Text Analysis
  • Character Filters
  • Tokenizers
  • Token Filters
Anybody have any objections or encouragement?
 
I used the term "Text Analysis", but I'm also thinking that maybe I should stick with the traditional Solr/Lucene term, "Analyzers" or "Analysis". I'm torn between trying to use generic language for chapter names vs. Solr-specific terminology. At this point, I'm thinking that "Analyzers" is the way to go.

-- Jack Krupansky