
There is no web app. It's just a Visual C# Windows app and a DLL.
Project Source. Visual Studio Community is free, so you should be able to build it yourself.
There is no prebuilt executable yet, and there won't be until we have a package verification strategy.
It’s more than just a little messy. I didn’t originally intend to showcase it. I just wanted to stop cutting and pasting, so I built something for the specific purpose of extracting data from HTML. And I ended up needing to build an entire damn parser.
CrawlerCommon.dll is the brains of the operation. It’s a lot of cookie-cutter code, intended to make HTML parsing as simple as possible.
There are 5 major parts:
1. The tokenizer
2. Tag identification
3. The document tree builder
4. The node test, an XPath-like node selector (think XPath and CSS selector operations)
5. The “hpath” expression parser. Hpath generally follows the same logic as a CSS selector but uses XPath syntax, with no special emphasis on the element id or CSS class.
Each successive step depends on the previous step correctly doing its job.
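To make that dependency chain concrete, here is a minimal sketch of the first four stages. It is written in Python purely for brevity. It is not the CrawlerCommon.dll code or its API, and every name in it (`tokenize`, `identify`, `Node`, `build_tree`, `select`) is hypothetical. It assumes well-formed markup and ignores attributes, comments, and void elements such as `<br>`. The hpath expression parser is left out, since its syntax isn't covered here.

```python
import re
from dataclasses import dataclass, field

# Stage 1: tokenizer. Split raw HTML into tag tokens and text tokens.
TOKEN_RE = re.compile(r"<[^>]+>|[^<]+")

def tokenize(html):
    return TOKEN_RE.findall(html)

# Stage 2: tag identification. Classify a token as open, close, or text.
def identify(token):
    if token.startswith("</"):
        return ("close", token[2:-1].strip().lower())
    if token.startswith("<"):
        parts = token[1:-1].strip("/ ").split()
        return ("open", parts[0].lower() if parts else "")
    return ("text", token)

@dataclass
class Node:
    name: str
    text: str = ""
    children: list = field(default_factory=list)

# Stage 3: tree builder. A stack tracks the currently open elements.
def build_tree(html):
    root = Node("#root")
    stack = [root]
    for token in tokenize(html):
        kind, value = identify(token)
        if kind == "open":
            node = Node(value)
            stack[-1].children.append(node)
            stack.append(node)
        elif kind == "close":
            # Only pop when the close tag matches the innermost open element.
            if len(stack) > 1 and stack[-1].name == value:
                stack.pop()
        elif value.strip():
            stack[-1].children.append(Node("#text", value.strip()))
    return root

# Stage 4: node test. Walk the tree depth-first and yield matching nodes.
def select(node, test):
    for child in node.children:
        if test(child):
            yield child
        yield from select(child, test)

tree = build_tree("<ul><li>one</li><li>two</li></ul>")
items = [n.children[0].text for n in select(tree, lambda n: n.name == "li")]
print(items)  # ['one', 'two']
```

Notice that each stage only consumes the output of the one before it. If the tokenizer splits a tag badly, tag identification mislabels it, the tree builder nests it wrong, and the node test then returns the wrong nodes.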
There will be additional posts to:
1. Explain how to use it in different use cases.
2. Explain how it is organized, in case you wish to modify it for your own purposes.
It isn’t a finished product. Its own unit tests are failing because I wrote many of them in anticipation of completing a feature, then decided I wouldn’t use that feature and left it incomplete. But I consider it an excellent teaching case. If you understand the concise code, you will understand why a web browser generally works the way it does.