Invelos Forums->Posts by mediadogg
Message Details
Quoting AiAustria:
Quote:
...
It is one of DVD Profiler's legacies that an American programmer can't think of anything else but UPCs. Most of the other numbering schemes are compatible, some are not.

Awww man. Poor guy is not here to defend himself ... I wonder what he would say? I remember he used to take some pretty tough stances in the contributions Forum.
Posted:
Topic Replies: 202, Topic Views: 15345
Quoting GSyren:
Quote:
Yep, they all have the same UPC.   So blame the Media Company, not the user who submitted them.

Oh, so I think I am getting it. Because of the overlap, the user was forced to use a variant profile ID or Disc ID to submit the profile. I think I got it. Actually, I'll bet that situation was the motivation for Ken to implement UPC Variants.
Posted:
Topic Replies: 202, Topic Views: 15345
Thank you, but I wish I understood that thread better. Does that mean that it is legitimate? Avoidable by the contributor, or was "he/she" forced by the technology or whatever?
Posted:
Topic Replies: 202, Topic Views: 15345
If you are interested, here is an example of some of the weird stuff I am finding.

As you know, I am using the base UPC to call the online to get all variants to search for credits.

At the moment, messing with Clint Eastwood. Somebody using the Belgium Locality has used this UPC 5-413660-909616, which includes a profile for Dirty Harry, to create over 50 variants!!!    Of course, none of them (edit: actually one does) has any credits for Clint Eastwood, as they are all from a wide range of unrelated movies. But in order to find out, I have to scan them.

Who would do such a thing and why?
Posted:
Topic Replies: 202, Topic Views: 15345
Sorry, one more. But you will love it.

Something you might not have noticed is that CLTBoss will load the CLTPlus XML. If one trusts CLTPlus scraping (when it doesn't crash on a corrupt profile), then go ahead and load it into CLTBoss. You will have a set of credited profiles that you trust. Simply ask CLTBoss to download the Invelos XML into a collection and then search / sort / dance to your heart's content. Don't worry about how many spaces I squeeze out!

Oh yes, it might not be your Birthday, but you can go ahead and celebrate anyway. 

(Edit: If you want the current CLTBoss just for the purpose of using the XML download, I can post you a link in your PM. But it will still be unreleased for testing of the scraping and XML scan)
Posted:
Topic Replies: 202, Topic Views: 15345
Actually, what I am leaning towards is a plugin that has NO search smarts. It would be a solid, reliable, as fast as possible scrape of the CLT - get the list of profileIDs, grab the Invelos XML and sayonara.

Then all the search smarts, multiple views, etc., "SuperCLTPlus" or whatever, would be in an external tool(s).

Actually CLTBoss will deliver that, along with its extra baggage. Since you can now scrape the profiles and immediately dump an Invelos Collection based on the profiles, you can completely ignore the CLTBoss XML scan and not worry about my interpretation of the search. I scrape, you search. Already possible! (Yes, I was thinking ahead.)
Posted:
Topic Replies: 202, Topic Views: 15345
Quoting GSyren:
Quote:
Quoting AiAustria:
Quote:
And yes, it was an idea that popped up a while ago, to scan for all known name variants at once

Sorry if there is something I have missed, but does this mean that CLTBoss will now only scan one single name at a time?

Regardless, it would still be advantageous to have the actual search argument as part of the output.

The current design and implementation of CLTBoss is to allow for scanning as many variants as it finds in the variants table. The resulting set of profile IDs collected and the resulting XML scan will be a collection that includes the results for all variants. The user has the option to use this feature or not.
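
As a rough illustration of that "all variants in one collection" idea, here is a minimal Python sketch (not the actual CLTBoss code). The function `collect_profiles`, the `scrape_variant` callback, and the profile IDs are all made up for the example; the only thing taken from the description above is that the result is the combined set of profile IDs, with duplicates dropped.

```python
from typing import Callable, Iterable


def collect_profiles(variants: Iterable[str],
                     scrape_variant: Callable[[str], Iterable[str]]) -> list[str]:
    """Return profile IDs found for ANY variant, each ID listed once."""
    seen: set[str] = set()
    combined: list[str] = []
    for variant in variants:
        for profile_id in scrape_variant(variant):
            if profile_id not in seen:      # duplicate detection across variants
                seen.add(profile_id)
                combined.append(profile_id)
    return combined


# Made-up data standing in for real scrape results.
fake_results = {
    "ziyi zhang": ["P1", "P2", "P3"],
    "zhang ziyi": ["P3", "P4"],
}
print(collect_profiles(fake_results, lambda name: fake_results[name]))
# prints ['P1', 'P2', 'P3', 'P4']
```

Scanning a single variant is just the same call with a one-item list, so the design does not change between "1" and "more than 1".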

That being said, given the issues I am having stabilizing the scraping operation, I obviously am focusing on getting 1 right. Then I will worry about more than 1. But the design remains the same. As we all are (or have been) programmers, I am sure you understand that it is very difficult to rip out features and change a program design after the fact. I thought the idea of multiple variants was a good one, so I took a shot at it.

In hindsight, it might have been better to go with the idea of an automatic generation of variants from a single search field. If I live long enough, and if people continue to care, I have a list of things to try in "son / daughter of CLTBoss" which includes:

- separate plugin
- use of Chrome instead of IE
- single search field with auto variants

And BTW, I am not going to spend time proving anything about the Profiler database. If you have a point to make, YOU prove it with a profile example. You show me, and I will write code to handle it.
Posted:
Topic Replies: 202, Topic Views: 15345
Still waiting for Tom Cruise to finish, so I will post a bit of a preview of what I am trying to perfect:

I have given up trying to choose a set of "delays" that always work with an unpredictable network.

So, here is what I will provide, three ways to scrape:

(1) Click-scrape-next: this is very fast and very accurate if you have fewer than 10 to 20 pages. Takes 10 min.

(2) An auto-scrape that uses AutoIt to press a "Scrape Displayed Page" button, in combination with JavaScript to click on the page, in a loop. I scrape the start and end pages of the loop off the CLT screen. After each page, if I get fewer than 25 profiles, I scrape again, and if the page is still short, I add it to an error list. Once all pages have been scraped, the error report is presented and the user can manually click on the few pages that had errors before running the XML scan.

(3) A pop-up scrape, that interrupts the user after each page that somehow triggers the complete download of the data. This is annoying, but you can be "sure" to get all the profiles scraped before running the XML scan.
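
The retry logic in (2) could be sketched roughly like this in Python. This is only an illustration of the loop, not the AutoIt/JavaScript code itself: `auto_scrape`, `scrape_page`, and `fake_scrape` are hypothetical names, and the only numbers taken from the description are the 25 profiles per page and the single retry.

```python
from typing import Callable

PROFILES_PER_PAGE = 25  # a full CLT page, per the description above


def auto_scrape(first_page: int, last_page: int,
                scrape_page: Callable[[int], list[str]]) -> tuple[list[str], list[int]]:
    """Scrape each page, retry once if it looks short, and log pages that stay short."""
    profiles: list[str] = []
    error_pages: list[int] = []
    for page in range(first_page, last_page + 1):
        found = scrape_page(page)
        if len(found) < PROFILES_PER_PAGE:
            found = scrape_page(page)          # second attempt
            if len(found) < PROFILES_PER_PAGE:
                error_pages.append(page)       # leave for a manual click-scrape
        profiles.extend(found)
    return profiles, error_pages


# Fake scraper: page 2 is empty on the first try, page 3 is always short.
attempts: dict[int, int] = {}


def fake_scrape(page: int) -> list[str]:
    attempts[page] = attempts.get(page, 0) + 1
    if page == 2 and attempts[page] == 1:
        return []                              # page not fully loaded yet
    size = 10 if page == 3 else PROFILES_PER_PAGE
    return [f"p{page}-{i}" for i in range(size)]


profiles, errors = auto_scrape(1, 3, fake_scrape)
print(len(profiles), errors)  # prints 60 [3]
```

Note that a genuinely short last page would also land on the error list in this sketch, which is harmless since the user only has to re-check the listed pages.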
Posted:
Topic Replies: 202, Topic Views: 15345
That being said, please remember that the CLT tool is code, written by a human, many years ago. If anybody thinks that code cannot have bugs, then they are mistaken. Not saying it does, just making the point that the only way I could guarantee identical results would be if I duplicated their program logic identically, including bugs.

So, it is entirely possible, that I could have code that matches CLT 999 times, and then for some weird case, have a difference for some reason I could not predict (example, people have talked about CLTPlus crashing on corrupt profiles, or profiles that won't even download into DVD Profiler). So, please don't try to hold CLTBoss to a higher standard than it is possible to reach.

Once it is (ever?) released, if you detect an error, all I need is the profile ID(s) that are included in the result that does not match the search, or the profile ID(s) that should have been included, but were not. Period. I can figure out the rest. A lecture on why the profile was poorly created, or the messy database, or tutoring on how to write code, does not matter in terms of the goals of CLTBoss. Match the search. Period. And BTW, if the results are different from the CLT results, but the matches are accurate, then as far as I am concerned, there is no bug.
Posted:
Topic Replies: 202, Topic Views: 15345
Quoting GSyren:
Quote:
Ok, that's not how I initially understood that it should work, but if that's what you guys agree on, then that's fine with me.

We are actually all on the same page. Remember I considered the 366 a bug and asked for help figuring it out. I did finally discover a bug in my duplicate detection code, which allowed the extras to slip through and somehow get allocated as having a match when they didn't. And I also stated up front that my goal was to match the CLT exactly.

So, guess what, we are all saying the same thing. The biggest limitations for me are my poor programming skills and frustration with the plague that has been cast upon us all.
Posted:
Topic Replies: 202, Topic Views: 15345
I have found a way to analyze the results using the CookTop XPath tool. Not sure what your point is about the meta code. We have had the discussion before, unless you think somehow I didn't understand. Aside from the fact that your code will often crash when run with the plugin API, it would not return the same results as the CLT tool, which is what I am trying to match. The CLT results DO NOT ignore multiple spaces. That is part of the problem.

99% of the problems I am having have nothing to do with the search code - it is mostly the tricks I am inventing to get around the timing dependencies of screen scraping sequential web pages with no notification of when the pages are completely downloaded.
Posted:
Topic Replies: 202, Topic Views: 15345
Quoting GSyren:
Quote:
I don't think we need to go into per-profile. Let's keep it simple, at least for now.

Cool. It can always be added later.
Posted:
Topic Replies: 202, Topic Views: 15345
Ok, I am just discovering a fast and powerful XPath tool inside the CookTop XML editor.

Using this statement, "nodes: /CLTInfo/DVD/CLTCredits/CLTCredit[@CreditedAs!='' and @FirstName!='' ]",

I confirmed that the file in question had exactly 36 credits where F/M/L = "Zhang Ziyi", but CreditedAs was "Ziyi Zhang". That's why we got the same "36" from the same profiles. There are 3 others returned by the same XPath query where the exact opposite is true!!! F/M/L is "Ziyi Zhang", but CreditedAs is "Zhang Ziyi".
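
For anyone without CookTop, roughly the same check can be done with Python's standard library. This is a sketch on a tiny made-up sample, not the real export: the element path and the CreditedAs and FirstName attributes come from the XPath statement above, while the LastName attribute and the sample data are assumptions. The filter is written as plain Python rather than an XPath predicate so it does not depend on which predicates the library supports.

```python
import xml.etree.ElementTree as ET

# Made-up sample in the shape implied by the XPath statement above.
SAMPLE = """
<CLTInfo>
  <DVD>
    <CLTCredits>
      <CLTCredit FirstName="Zhang" LastName="Ziyi" CreditedAs="Ziyi Zhang"/>
      <CLTCredit FirstName="Ziyi" LastName="Zhang" CreditedAs=""/>
    </CLTCredits>
  </DVD>
  <DVD>
    <CLTCredits>
      <CLTCredit FirstName="Ziyi" LastName="Zhang" CreditedAs="Zhang Ziyi"/>
    </CLTCredits>
  </DVD>
</CLTInfo>
"""


def credits_with_both(root: ET.Element) -> list[ET.Element]:
    """Credits where CreditedAs and FirstName are both non-empty."""
    return [
        credit
        for credit in root.iterfind("./DVD/CLTCredits/CLTCredit")
        if credit.get("CreditedAs", "") != "" and credit.get("FirstName", "") != ""
    ]


hits = credits_with_both(ET.fromstring(SAMPLE))
for credit in hits:
    print(credit.get("FirstName"), credit.get("LastName"), "->", credit.get("CreditedAs"))
# prints:
# Zhang Ziyi -> Ziyi Zhang
# Ziyi Zhang -> Zhang Ziyi
```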

This is just way too much for CLTBoss. I want to just spit out the results (in either CLTBoss or Invelos formats), consistent with CLT, and make it clear where the profiles came from, and then hopefully other tools will allow more sophisticated filters to be applied. Above my pay grade and intentions.
Posted:
Topic Replies: 202, Topic Views: 15345
Quoting GSyren:
Quote:
Ok, so basically <Variants> would be the search arguments used. That sounds like it might be useful in some cases.

Thanks. I hope so. I will move along this path. So if somebody uses CLTBoss to scrape multiple variants (I can't tell whether they are really the same person in the code), then the resulting XML will always be an "OR" of the credits, with the combined set of profiles. My profile list grid actually contains the "hashname" version of the name, as part of my duplicates detection, so it makes me think ...

Your question about variants per profile is now coming to mind, but I am still fuzzy. Do I need to reveal how each profile made it into the list? I think so, since my XML contains only the matched credits. If instead, one is looking at the "Invelos Export" output, you will of course get all credits, and all XML contents of each profile.

What should I do? I'm a bit confused. I hate to clutter up the XML, but how else to avoid the issue we hit earlier, by not knowing what filter was used to create the XML?
Posted:
Topic Replies: 202, Topic Views: 15345
Quoting GSyren:
Quote:
Ok, I'm still not clear on how you intended to use the <Variants> field. Would it contain all name variants found, or just the variant(s) used in the search? Or something else? Would it be a single field for the entire export, or a field for every profile?

Well it is just an idea at this point, not implemented. The problem just became apparent, and once again it demonstrates the value of collaboration. It hadn't occurred to me that it was a problem.

It should reflect all the variants that were used to produce the XML credits collection in question. In the case we are just discussing, it would have "ziyi zhang". That would indicate that only that variant was searched, even though other variations might be found in the profile XML (or even in the same credit, because by definition CreditedAs and F/M/L can coexist; it's just that this is inconsistently used). One entry for the entire collection.

So, you can see this by typing into the CLT tool:

"ziyi zhang", " ziyi zhang", "ziyi zhang " or " ziyi zhang "

all will yield 364 profiles (as of today).

Note that "ziyi  zhang" yields 0 profiles!!!!  (double space between the names)

This match can come  from either CreditedAs, or from any concatenation of F/M/L.

I have seen the credit completely contained in the firstname with middle and last and creditedas blank, or as first+last, or as middle+last. And in this case you have the possibility of getting double blanks, and it is stuff like this that is driving me nuts. In order to be consistent with the CLT, I have to actually ignore a possible intended match that doesn't work just because somebody typed "ziyi  zhang" instead of "ziyi zhang" into the CreditedAs field.

I already have a "CLTName" class in my code, that I think I will map over into the XML. It will be something like:

<CLTSearchNames>    (or "<Variants>")
  <CLTSearchName firstname="ziyi" middlename="" lastname="zhang" birthyear="" hashname="zhang_ziyi_">ziyi zhang</CLTSearchName>
</CLTSearchNames>

My instructions from AiAustria were to match "search name" first on CreditedAs if found in the credit entry, otherwise match on the concatenation of F/M/L, case insensitive and with leading and trailing blanks removed.
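
To make that rule concrete, here is a minimal Python sketch of how I read it. It is not the CLTBoss code or the CLT's actual logic (which nobody outside Invelos has seen). The names `normalize` and `credit_matches` are made up, and joining the non-empty F/M/L parts with a single blank is an assumption about what "concatenation" means.

```python
def normalize(text: str) -> str:
    """Trim leading/trailing blanks and ignore case; inner blanks stay as typed."""
    return text.strip().lower()


def credit_matches(search_name: str, credited_as: str,
                   first: str, middle: str, last: str) -> bool:
    """Try CreditedAs first, then fall back to the joined F/M/L."""
    target = normalize(search_name)
    if not target:
        return False                      # an empty search matches nothing
    if normalize(credited_as) == target:
        return True
    # Assumption: non-empty parts are joined with exactly one blank.
    full_name = " ".join(part for part in (first, middle, last) if part)
    return normalize(full_name) == target


print(credit_matches(" ziyi zhang ", "", "Ziyi", "", "Zhang"))           # True
print(credit_matches("ziyi  zhang", "", "Ziyi", "", "Zhang"))            # False (double blank)
print(credit_matches("ziyi zhang", "Ziyi Zhang", "Zhang", "", "Ziyi"))   # True (via CreditedAs)
print(credit_matches("ziyi zhang", "ziyi  zhang", "", "", ""))           # False (typo in CreditedAs)
```

The second and fourth calls show the behavior described earlier: a double blank on either side prevents the match, because only the ends are trimmed.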

So, that's where I am. I appreciate your patience, and I am open to suggestions.
Posted:
Topic Replies: 202, Topic Views: 15345
Quoting GSyren:
Quote:
Quoting mediadogg:
Quote:
So basically, I am asking if you agree that it would be a good idea for me to include a "<Variants>" element in my export, to make it clear how the included credits were found. That way, CLTInfo (for example), would know up front how the data was gathered, then of course offer other ways to view it.

Does that make sense?

Not sure what you mean by variants. Are those the (1), (2), (3) you mentioned before? Or are we talking about something else?

The whole point of the CLT and the notion of "common names" is that people are often known by different variants or aliases or variations in their names. It happens a lot with Asian names, as they are often reversed from what the actor uses in their native country. I apologize if "variant" is incorrect terminology.

In the contributions threads, it is often referred to as "name variant."

So, "ziyi zhang", "zhang ziyi", and "zhang zhi" are three variants of the credited name for the same "Crouching Tiger" actress.

But I know you knew all that, so maybe I misunderstood the question?
Posted:
Topic Replies: 202, Topic Views: 15345
So basically, I am asking if you agree that it would be a good idea for me to include a "<Variants>" element in my export, to make it clear how the included credits were found. That way, CLTInfo (for example), would know up front how the data was gathered, then of course offer other ways to view it.

Does that make sense?
Posted:
Topic Replies: 202, Topic Views: 15345
Quoting GSyren:
Quote:
There is no "search" in CLTinfo. It takes all of the info in your output and presents it in a structured way.

You certainly didn't offend me. I was just trying to clarify why my results correspond to yours.

Ok, but if you could just read my prior post (I mean the technical parts) and see if it adds anything to your thoughts, I would appreciate it.
Posted:
Topic Replies: 202, Topic Views: 15345
Quoting GSyren:
Quote:
Why would the correspondence be a coincidence? CLTinfo takes your output from CltBoss and formats it according to how AiAustria wanted it. If you get 366, I get 366.

Again, please please please, I am honestly trying not to offend you in any way - just trying to understand how to fix a bug if I have one. Please excuse me if it comes across any other way.

I understand that you were reading my dataset, that's why I was trying to understand what appeared to be a difference in the numbers. If you are not including "CreditedAs" in your search criteria, but I created the data that way, then I was simply postulating that perhaps the 366 was a coincidence. The two searches were based on different criteria is all I was saying, not right or wrong, just different.

This is what I think is going on: the matches that you found using F/M/L with "Zhang Ziyi" must also have CreditedAs "Ziyi Zhang", hence we both pick up the credit. The difference is that I attribute all of the credits to the variant "Ziyi Zhang," whereas your method splits the credits.

That tells me that there is a flaw in my export process, for there is no way, a priori, for anyone to know that my search was ONLY on "ziyi zhang". So, I am thinking I need to include an XML element that describes the search criteria, otherwise, I will need to include an output for all possible variants, which would be very difficult.
Posted:
Topic Replies: 202, Topic Views: 15345
Quoting GSyren:
Quote:
Quoting mediadogg:
Quote:
CLTInfo gets 330

No, CLTinfo gets 366, same as you, only divided into two separate groups based on F/M/L.
If you feel that that's the wrong way to do it, take it up with AiAustria. I have no personal stake in how the data is presented.

Ok, I wasn't taking issue, I was just trying to take advantage of your help, but pointing out that the correspondence was likely a coincidence because my search did not (intentionally) include Zhang Ziyi, even though the profiles do include that variant.

But thanks for your quick response. I'll figure it out I guess.
Posted:
Topic Replies: 202, Topic Views: 15345
Quoting GSyren:
Quote:
Quoting mediadogg:
Quote:
I am counting a match on either (1), (2) or (3)

Sounds right to me. And I am presenting the data based on F/M/L since that is the data that the Common Names are based on. Or at least that's how I have understood it. Personally I have no opinion on this, I am trusting that AiAustria will tell me if I'm doing it wrong.

That was my point. Given the same criteria, all programs should get the CLT result of 364. At the moment, I am getting 366, CLTInfo gets 330 and we don't know what CLTPlus gets (yet).

I am only trying to clean up my own bugs and looking for validation that I am counting correctly.
Posted:
Topic Replies: 202, Topic Views: 15345
Quoting GSyren:
Quote:
My program shows how they are actually credited in Profiler.

So the result means that there are 330 profiles where the credit is Ziyi Zhang and 36 profiles with Zhang Ziyi [Ziyi Zhang].

Correct, but what does "credited" mean? The CLT says 364, and I believe that corresponds to AiAustria's definition, which includes "CreditedAs". And remember, the XML prepared by CLTBoss in this case did not include a search for Zhang Ziyi, but of course you can find those variants in the profile. I actually think the 330 + is a coincidence in this case.

There are multiple cases:
(1) - credited only in the creditedAs field
(2) - credited in both creditedAs and F/M/L
(3) - credited only in F/M/L
(4) - neither

I am counting a match on either (1), (2) or (3), as a single (need to check for a double count bug here) match for a specific credit entry in the profile. A profile is counted (once) when there are 1 or more credit matches. At least that's what I am attempting to code.
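
Here is a small Python sketch of that counting rule, just to pin down what "counted once" means. It is an illustration, not the CLTBoss code: `classify` and `count_profiles` are made-up names, the attribute names mirror the XML discussed in this thread but MiddleName and LastName are assumptions, and the profiles are invented.

```python
def classify(credit: dict[str, str], name: str) -> int:
    """Return case 1-4 from the list above for one credit entry."""
    target = name.strip().lower()
    in_credited_as = credit.get("CreditedAs", "").strip().lower() == target
    # Assumption: F/M/L is the non-empty parts joined by a single blank.
    fml = " ".join(part for part in (credit.get("FirstName", ""),
                                     credit.get("MiddleName", ""),
                                     credit.get("LastName", "")) if part)
    in_fml = fml.strip().lower() == target
    if in_credited_as and in_fml:
        return 2
    if in_credited_as:
        return 1
    if in_fml:
        return 3
    return 4


def count_profiles(profiles: dict[str, list[dict[str, str]]], name: str) -> int:
    """A profile counts once if at least one of its credits is case 1, 2 or 3."""
    return sum(
        1 for credits in profiles.values()
        if any(classify(credit, name) != 4 for credit in credits)
    )


# Invented profiles: "A" has two matching credits but must count only once.
sample = {
    "A": [{"FirstName": "Ziyi", "LastName": "Zhang"},
          {"FirstName": "Zhang", "LastName": "Ziyi", "CreditedAs": "Ziyi Zhang"}],
    "B": [{"FirstName": "Clint", "LastName": "Eastwood"}],
    "C": [{"FirstName": "Ziyi", "LastName": "Zhang", "CreditedAs": "Ziyi Zhang"}],
}
print(count_profiles(sample, "ziyi zhang"))  # prints 2
```

Using `any()` per profile is what guards against the double count: it does not matter how many credits in a profile match, the profile contributes at most 1.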
Posted:
Topic Replies: 202, Topic Views: 15345
Aha, I just found something. The CLT search seems to only trim blanks from the beginning and end of the strings to be compared.

I think my code does it two ways, inconsistently, sometimes compressing multiple blanks down to 1 blank between tokens. That might cause me to pick up extra matches if the profile has multiple blanks inside a "creditedAs" due to a typing error. I will check for that.
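
The difference between the two behaviors is easy to see in a few lines of Python. This is only a demonstration of the two normalizations, with made-up function names, not the code from CLTBoss or the CLT.

```python
import re


def trim_only(text: str) -> str:
    """What the CLT search appears to do: trim the ends, keep inner blanks."""
    return text.strip().lower()


def trim_and_collapse(text: str) -> str:
    """The looser variant: also squeeze runs of blanks down to one."""
    return re.sub(r" +", " ", text.strip()).lower()


credited_as = "Ziyi  Zhang"   # typing error: two blanks between the names
search = "ziyi zhang"

print(trim_only(credited_as) == search)          # False
print(trim_and_collapse(credited_as) == search)  # True, i.e. an extra match
```

If any code path uses the second form, a profile with a double blank would be counted even though the CLT skips it, which is one way to end up a couple of profiles over.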
Posted:
Topic Replies: 202, Topic Views: 15345
Quoting GSyren:
Quote:
Just counting them in Editpad Pro I get 330 Ziyi Zhang and 36 Zhang Ziyi. That would seem to account for all 366 profiles.
And my CltInfo program (not yet released) gave the same numbers.

Thank you. What I am trying to determine is under what circumstances I get a different result from the CLT, which people consider to be the gold standard.

So, yes, while any profile can contain any variant, I want to understand why when I search for "ziyi zhang" (case independent), I get 366, when the CLT says 364. So in this case, I "don't care" that other variants are in the profiles. I can also search for those, and the value of a program like CLTInfo, is that it takes the raw XML and squeezes out all variants, so great, I am not attempting to duplicate that (but maybe I should?).

Most of the time, CLTBoss returns dead on the same number as the CLT for a specific variant ... just when I think I'm ready to release, I hit an exception, and when I do, I try to figure out why. My search attempts to be exactly what AiAustria has suggested: for any given profile, first accept a case-independent match on creditedAs, and failing that accept a case-independent match on the concatenation of first/middle/last, ignoring birthyear (unless optionally chosen by the user).

The fact that your counts add up to 366 is interesting, but in fact, my code (and I will double check) was actually ONLY searching for "ziyi zhang" - that specific string. So, just trying to figure that out. And then why would CLTInfo not get 364 for "ziyi zhang" - wouldn't we expect the same number as the CLT (assuming that the set of profiles has been correctly collected)?

Since we don't actually know what the search parameter for the CLT is, and how it is coded, I am taking the description by AiAustria as "gospel."
Posted:
Topic Replies: 202, Topic Views: 15345
Man I hope you guys are not totally sick of me, but I am determined to get this damned thing right. If I were smarter, it would be faster. Sorry.

Anyways, is there anybody that can confirm how many profiles in this XML file have credits for "ziyi zhang" spelled exactly that way? (I know there are 366 profiles in the file. But do I really have two profiles with no valid credits???? If so, which two?)

I appreciate the help in advance. I am so dizzy with code variations and watching progress bars ...
Posted:
Topic Replies: 202, Topic Views: 15345