New search test feedback thread

Dan Murphy · Aug 26, 2007

jfulcer said:
....Patience, grasshopper.

I just noticed we missed the one year anniversary last week of the new BT search engine. :surfweb:

http://www.disboards.com/showthread.php?t=1200933

I do not think it is caught up as of yet in its indexing though.

flyinglizard · Aug 30, 2007

BoardTracker is not working right now for "search all posts" by ID name. I don't know what "California Gold" is... I just know BoardTracker doesn't work!

agnes! · Aug 30, 2007

flyinglizard said:
BoardTracker is not working right now for "search all posts" by ID name. I don't know what "California Gold" is... I just know BoardTracker doesn't work!

DIS'ers can choose which color scheme they want to DIS in. I like the Swan & Dolphin pastel coral & aqua/green myself.

Go to the bottom right hand of this page. There will be a drop-down menu that in your case probably says "Default". Click on the drop-down & you will see the following choices after Default...
Orlando Blue, Orlando Blue Night, Serenity Green, California Gold, DIS unplugged, etc., etc.
Highlight California Gold.
Click on that.
Now you can do a Search the BoardTracker way.

All available BT Search features will work in the California Gold 'skin', but be aware that the BoardTracker system is still apparently in Beta(testing) mode, so features sometimes come and go. One of the reasons the California Gold color scheme was chosen for the testing phase is exactly because it is one of the less-favorite DIS 'skins' & therefore wouldn't cause as much of a load on the servers.

agnes!

bicker · Aug 31, 2007

I bet part of the problem, once folks figure out the California Gold skin thing, is that some forums override the user's choice of skin, and force their own skin, thereby denying folks reading that forum direct access to the new search.

agnes! · Aug 31, 2007

And then there is always the ever-popular Yahoo or Google domain-name site-search, if the BoardTracker Search isn't working.

agnes!

Dan Murphy · Aug 31, 2007

I use Google to try to search best I can.

jfulcer · Sep 1, 2007

Ok geek in me again. Why is this indexing so hard?

1) Query your index to find what the index has as a last message (Message X).

2) Query the board database to find the 'last message' in the system (Message Y).

3) Cycle through from Message X to Message Y and view each post using the view post tool: http://www.disboards.com/showpost.php?p=20594150 (or even a hybrid of this page that shows nothing but the post)

4) Add to Index.

5) Repeat every 5 minutes.
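A minimal sketch of steps 1 to 4, for illustration only. It assumes numeric, roughly sequential post IDs (an assumption that is questioned later in this thread), and it stands in a plain dict for the board and another for the index. The names `catch_up`, `board`, and `index` are hypothetical and do not come from vBulletin or BoardTracker.

```python
# Hypothetical sketch: the "board" is an in-memory dict of post_id -> text,
# and the "index" is a dict of word -> set of post IDs.
def catch_up(index, board, last_indexed_id):
    """Index every post after last_indexed_id; return the new high-water mark."""
    # Step 2: find the newest message on the board (Message Y).
    newest_id = max([last_indexed_id, *board])
    # Step 3: walk every ID from Message X + 1 to Message Y.
    for post_id in range(last_indexed_id + 1, newest_id + 1):
        text = board.get(post_id)
        if text is None:
            continue  # gap in the ID range: this lookup was wasted work
        # Step 4: add each word of the post to the index.
        for word in text.lower().split():
            index.setdefault(word, set()).add(post_id)
    return newest_id


board = {1: "new search test", 2: "search by user name", 5: "California Gold skin"}
index = {}
mark = catch_up(index, board, 0)  # step 1 would normally read this from the index
```

Step 5 would simply call `catch_up` again on a timer, passing `mark` as the new starting point.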

Am I just too big of a geek?

bicker · Sep 1, 2007

One hard part about indexing is the load the ETL process has on the database server. An efficient indexer, such as you propose, would do a great job at keeping an index up to date, but it would seriously degrade the database server's ability to serve end-user needs associated with data retrieval (reading messages) and storage (posting messages), which are the two primary purposes for a message board like the DIS. Indexing needs to be an activity with secondary priority, respecting the end-users' lack of patience with server-swap times rather than worrying as much as you suggest about index latency affecting search.

Beyond that, your approach only addresses keeping the index up to date. Board Tracker's challenge here was to populate the index in the first place. There were about 18 million messages already posted on the DIS when Board Tracker first started populating its index. I bet there was a specific reason why they populated the index from start to finish (rather than vice versa) so given what I mentioned above about indexing being a secondary priority activity, it will take a good bit of time to "catch-up" with 18 million messages, given limited processing resources on the database server.

How's that for geeky?

BoardTracker · Sep 1, 2007

pretty geeky indeed

Let me turn the notch a bit to uber-geeky though..

jfulcer... who said posts IDs are sequential?

What if the current PostID is 20594802 but the next one is actually 20594902? Should the crawler try to read 100 non-existing posts?
Also, what about boards that use non-numerical post IDs? Or ones like the flickr forums that have a combination of IDs from different ranges, where some new posts have IDs of 1003232 and some new ones have IDs like 700000323412? Certainly the system cannot read everything from 1003232 to 700000323412 (when we know for a fact that almost nothing exists in the middle).

Furthermore, do you read a post at a time? Or rather full threads and full pages? Why should the crawler do differently? Instead of reading one post at a time, it can do 25 at a time.

The crawler indeed needs to "hammer" the board as little as possible. A board that doesn't respond to its members just because the crawler is inefficient? Now that's a very bad crawler. So the crawler not only needs to be a secondary priority (which naturally is one of the reasons it takes a while to get to all the data on Disboards) but also must not needlessly try to "read" things that either don't exist or are not urgent. BoardTracker, for example, doesn't read a thread again and again in the hope that there will be new data in it. It "knows" when there are new posts in a thread.
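The post does not say how BoardTracker "knows" that a thread has new posts. One way this could work, sketched here purely as an assumption, is to compare the post count shown in a thread listing with the count seen on the previous visit, and then fetch only the pages that can contain new posts, 25 posts at a time. The function `pages_to_fetch` and the sample data are hypothetical.

```python
POSTS_PER_PAGE = 25  # page size mentioned in the post above


def pages_to_fetch(thread_listing, seen_counts):
    """Return {thread_id: [page numbers]} for threads whose post count grew.

    thread_listing: thread_id -> post count currently shown by the board
    seen_counts:    thread_id -> post count recorded on the last crawl
    """
    todo = {}
    for thread_id, post_count in thread_listing.items():
        already = seen_counts.get(thread_id, 0)
        if post_count <= already:
            continue  # nothing new, so the thread is not read again
        # The first unseen post is number already + 1; find the page it is on.
        first_page = already // POSTS_PER_PAGE + 1
        last_page = (post_count - 1) // POSTS_PER_PAGE + 1
        todo[thread_id] = list(range(first_page, last_page + 1))
    return todo


listing = {"t1": 60, "t2": 10, "t3": 26}
seen = {"t1": 50, "t2": 10}
plan = pages_to_fetch(listing, seen)
```

Here the unchanged thread `t2` is skipped entirely, and no request is ever made for a post ID that does not exist.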

Part of the reason you don't see all the posts in BoardTracker (the newer version) yet is technical issues on our side. It takes time to prepare all the data we have for searching. But the main reason is DisBoards related: we can't get to all the data as fast as we would have wanted. If we did that, the DisBoards servers would be brought to their knees and the board would slow to a crawl. So the koalas at BoardTracker have to be patient... and so should we.

We are working hard to get the system out by September 5th. That is very soon (in dog's years). Even then, when we launch the new version for public Beta, not all data in DisBoards will be there. But it will eventually get to all of it.
"So say we all" [now, that's an UberGeek ending to a post]

bicker · Sep 1, 2007

Just playing devil's advocate for a second...

It isn't clear to me that a crawler is the most efficient tool for the job. I'm sure you'd agree that it would be better to access the data natively, rather than via a crawler. Unless I'm mistaken, that's how the built-in vBulletin search works.

Also, it seems to me that another approach would have been to work with a backup of The DIS from a specific date, on an app server within your local network -- a copy you could window-dress to look like the real thing, but one you could "hammer" all you'd like because there are no other users accessing that data set. Then, once you've rifled through the three or four years' worth of data, you can turn your crawler onto the "real" site, looking for changes since the time the copy was made.

I'm absolutely sure there are good reasons why neither of these were attempted.

I think it is important, though, to keep in mind that what we're seeing, with these various threads of complaints, is frustration that the process has taken so long, and there has been so little pay-off evident so far, and it isn't clear what pay-off there will be forthcoming. Arguably, timeline expectations were set (for good or ill, with or without merit) and were, again arguably, not met. There's got to be acknowledgement of that fact.

Personally, I feel we're getting a good value for our contributions. I contribute only $60 a year to The DIS, and feel that I get sufficient value for my contribution, but I could understand people feeling different from me.

BoardTracker · Sep 1, 2007

Indeed, working directly with the data is a better way to get to the data. As for the approach, I understand that most boards and board owners do not have an easy way to supply safe and secure access to the raw data, or may not have the technical means to enable that.
However, this approach is something we DO support, technically. But as I said, it is not something that is being used in almost all cases.

I would agree that frustration may have been caused by delays and unmet expectations. I must note, however, that we did not set any timeline (certainly not intentionally), and the most we did was state that we hope the search will be ready at a certain point in time. The new search was released in Beta state and was emphasized to be so.
Having said that, while we are working on the new search, an existing one which works for many of the search needs is fully available to all.

jfulcer · Sep 1, 2007

BoardTracker said:
jfulcer... who said posts IDs are sequential? What if the current PostID is 20594802 but the next one is actually 20594902? should the crawler try to read 100 non-existing posts?
Also, what about boards that use non-numerical post IDs? Or ones like the flickr forums that have a combination of IDs from different ranges, where some new posts have IDs of 1003232 and some new ones have IDs like 700000323412? Certainly the system cannot read everything from 1003232 to 700000323412 (when we know for a fact that almost nothing exists in the middle).

Every database has a unique identifier on every post. Referential integrity and all. So don't use post ID, use that.

BoardTracker said:
Furthermore, do you read a post at a time? Or rather full threads and full pages? Why should the crawler do differently? Instead of reading one post at a time, it can do 25 at a time.

I agree. So using that Unique Identifier above, ask for 200 posts at a time. Index that way. I just mentioned that one method because that's the one that I know of. It would be trivial for a basic developer to create you a page for each and every database you deal with to do this.
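A rough sketch of what "ask for 200 posts at a time" by unique identifier could look like, assuming direct read access to the database, which the thread notes most board owners cannot offer. It uses an in-memory SQLite table with a made-up schema (`posts`, `post_key`, `body`); none of this reflects the actual vBulletin tables. Because each batch asks only for keys greater than the last one seen, the keys do not need to be numeric or gap-free.

```python
import sqlite3


def fetch_batches(conn, batch_size=200):
    """Yield lists of (post_key, body) rows, batch_size rows at a time."""
    last_key = ""  # every non-empty text key sorts after the empty string
    while True:
        rows = conn.execute(
            "SELECT post_key, body FROM posts "
            "WHERE post_key > ? ORDER BY post_key LIMIT ?",
            (last_key, batch_size),
        ).fetchall()
        if not rows:
            return  # no rows left after last_key
        yield rows
        last_key = rows[-1][0]  # resume after the last key of this batch


# Hypothetical sample data: 450 posts with non-numeric text keys.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (post_key TEXT PRIMARY KEY, body TEXT)")
conn.executemany(
    "INSERT INTO posts VALUES (?, ?)",
    [(f"p{i:04d}", f"post number {i}") for i in range(450)],
)
batch_sizes = [len(batch) for batch in fetch_batches(conn)]
```

Note that this walks posts in key order, which is only the same as posting order if the keys happen to sort that way.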

BoardTracker said:
The crawler indeed needs to "hammer" the board as little as possible.

Query the boards at night. Add to index during the day. You could easily get tens of thousands of records this way.

BoardTracker said:
Part of the reason you don't see all the posts in BoardTracker (the newer version) yet is technical issues

Again?

BoardTracker said:
We are working hard to get the system out by September 5th. That is very soon

Boardtracker has repeatedly made promises (coming soon. Shortly. Couple of weeks.) I feel bad that the DISboards has been let down over and over. I certainly hope that all of this is worth it.

I won't hold my breath on September 5th. I do programming on the side (should be doing that and not be on the DISboards but...) and am giving my client a release of their software on the 6th. I know what THEY will do if I don't deliver. They'll find someone else.

bicker said:
Just playing devil's advocate for a second...

It isn't clear to me that a crawler is the most efficient tool for the job.

I agree, it's not. But a modified crawler would work - they just have to do it right.

bicker said:
Also, it seems to me that another approach would have been to work with a backup of The DIS from a specific date

I wouldn't even begin to guess how much space 18 million records would take up.

bicker said:
I'm absolutely sure there are good reasons why neither of these were attempted.

Poorly paid or poorly fed koalas. You take your pick

BoardTracker · Sep 2, 2007

jfulcer said:
Every database has a unique identifier on every post. Referential integrity and all. So don't use post ID, use that.

The point here is that you mentioned a sequential scan. This doesn't work with a non-numerical or non-sequential index. That is the point.

jfulcer said:
I agree. So using that Unique Identifier above, ask for 200 posts at a time. Index that way. I just mentioned that one method because that's the one that I know of. It would be trivial for a basic developer to create you a page for each and every database you deal with to do this.

Most board owners are not developers at all. That is why this method is not popular or not trivial.

jfulcer said:
Query the boards at night. Add to index during the day. You could easily get tens of thousands of records this way.

Scanning has to be ongoing and live, otherwise people will see their posts in the search results with a delay of sometimes 24 hours or more.

jfulcer said:
Again?

Nothing new here. It's the same ol' same ol'. Scanning and indexing hundreds of millions (soon billions) of posts into a new system takes time. A technical fact. Disboards is not the only board in BoardTracker and even Disboards alone has 16 million posts.

jfulcer said:
Boardtracker has repeatedly made promises (coming soon. Shortly. Couple of weeks.) I feel bad that the DISboards has been let down over and over. I certainly hope that all of this is worth it.

Disboards can use search now which works. For those that want the added features that the new version offers, more patience is needed. But both your empathy and frustration are duly noted and understood.

jfulcer said:
I won't hold my breath on September 5th. I do programming on the side (should be doing that and not be on the DISboards but...) and am giving my client a release of their software on the 6th. I know what THEY will do if I don't deliver. They'll find someone else.

We didn't give a release date, and Boardtracker is not software but a service, and maybe a somewhat more complicated one than you might imagine. As for your clients, they are probably paying you for your software and for your delivery promises. This is not the case here, nor did we give delivery dates and promises. Even the 5th that we stated is simply a milestone we are trying very hard to meet.

jfulcer said:
I agree, it's not. But a modified crawler would work - they just have to do it right.

If you have suggestions for what's 'right', we are happy to hear them.

jfulcer said:
I wouldn't even begin to guess how much space 18 million records would take up.

The reason why we don't work with a backup of disboards has nothing to do with us.

bicker · Sep 2, 2007

jfulcer said:
It would be trivial for a basic developer to ...

I bet you're wrong about that. That's really a safe bet, whenever someone starts a sentence with "It would be trivial..." and especially when they start a sentence with "It would be trivial to develop..." :rotfl:

Unless you've done the system engineering analysis, there is no way to know how much time and effort will be involved. There are almost always challenges that are not readily apparent to the casual observer. As a developer yourself, you should know that.

jfulcer said:
Query the boards at night. Add to index during the day. You could easily get tens of thousands of records this way.

18,749,296 / 30,000 = 625 days to index the data.
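The arithmetic can be checked in a couple of lines. The 30,000 posts per night is one reading of "tens of thousands" and the post count is the figure quoted above; both are inputs to the estimate, not measured crawl rates.

```python
import math

total_posts = 18_749_296   # post count quoted above
posts_per_night = 30_000   # assumed reading of "tens of thousands" per night

# Round up, because a partly used final night still counts as a night.
nights_needed = math.ceil(total_posts / posts_per_night)
years_needed = nights_needed / 365
```

That is roughly a year and eight months of nightly runs before the index would even reach the posts that already existed, with new posts arriving the whole time.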

BoardTracker · Sep 2, 2007

bicker said:
I bet you're wrong about that. That's really a safe bet, whenever someone starts a sentence with "It would be trivial..." and especially when they start a sentence with "It would be trivial to develop..." Unless you've done the system engineering analysis, there is no way to know how much time and effort will be involved. There are almost always challenges that are not readily apparent to the casual observer. As a developer yourself, you should know that.

18,749,296 / 30,000 = 625 days to index the data.

Right on both accounts :cool2:

As for the 625 days: just think how much more than that we are actually scanning each day, without interrupting Disboards, to be able to get to all the data in 60 days and not 600+ days.
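To put a number on that, here is the daily rate implied by a 60-day target, using the post count from the calculation quoted above. This is back-of-the-envelope arithmetic only; it says nothing about the rate BoardTracker actually achieves.

```python
import math

total_posts = 18_749_296     # post count from the quoted calculation
target_days = 60             # target mentioned in the post above
nightly_estimate = 30_000    # rate used in the 625-day estimate

required_per_day = math.ceil(total_posts / target_days)
speedup = required_per_day / nightly_estimate
```

In other words, finishing in 60 days means reading a bit over 312,000 posts a day, a little more than ten times the nightly figure used in the 625-day estimate.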

bicker · Sep 2, 2007

BoardTracker said:
Scanning has to be ongoing and live, otherwise people will get their posts in the search results in sometimes 24 hours delay or more.

I can attest to that. We've had some push-back from our customers about some of our ETL activities being off-loaded to the evening, to even-out the load on their application servers. They want to create a record and have it show up in the search results minutes later.

annie1995 · Sep 2, 2007

I have not been able to get the search under a person's user name to ever work. Am I doing something wrong??!! :confused3

agnes! · Sep 2, 2007

annie1995 said:
I have not been able to get the search under a person's user name to ever work. Am I doing something wrong??!!

I'm not sure what is working these days with the Search or what isn't...apparently the DIS/BoardTracker is still in Beta/testing mode, so features come...features go. All that aside, are you using the California Gold 'skin'?
Go down to the bottom left corner of this page.
Your 'skin' menu will probably say 'Default'.
Click on the drop-down, choose 'California Gold'.
You will now be able to access the available Search features.

Other than using BoardTracker, many have reported success using Yahoo or Google to do a domain-name/user-name search.

hth,
agnes!

PlutoAddict517 · Sep 4, 2007

Agnes thanks for that tip. Since I switched to the California Gold screen I'm able to use the search section.

jfulcer · Sep 4, 2007

bicker said:
I bet you're wrong about that. That's really a safe bet, whenever someone starts a sentence with "It would be trivial..." and especially when they start a sentence with "It would be trivial to develop..." Unless you've done the system engineering analysis, there is no way to know how much time and effort will be involved. There are almost always challenges that are not readily apparent to the casual observer. As a developer yourself, you should know that.

18,749,296 / 30,000 = 625 days to index the data.

Oh, I know what's involved. I know that something that seems like 'one easy page' never really is. But for as long as Boardtracker has been working on this, and the number of boards they do index, this would really be something that *could* have been written. Flexible enough to deal with different board configurations. And yes, I know board owners are not developers. That would be why Boardtracker develops something and offers it to the board owners as a plug in or value added service.

I agree that people would most likely like to have searches that show new information immediately. But given the choice between NOT having a decent working search for over a year and having delayed searching, I know what I would choose.

Of course I'm on the outside looking in so I don't know everything that is involved.

Whatever. It's not worth it.
