Dealing with web addresses that use a delimiter other than ?


SUBMITTED BY: Guest

DATE: Oct. 29, 2019, 2:12 p.m.

  1. Dealing with web addresses that use a delimiter other than ?
  2. I've been removing tracking IDs from website addresses that users post to my message boards and classifieds, which has admittedly gotten WAY more complicated than I intended. But I've recently run across a new one, so I'm curious how you guys and gals would suggest dealing with it.
  12. In this example, the link looked like:
  13. https://example.com/foo/bar|pcrid|391022977133|pkw||pmt||pdv|m|slid||product||pgrid|78378217177|ptaid||&pgrid=78378217177&ptaid=&source=WFP2019-DD-NATL-GD-US-BCON&subsource=78378217177---391022977133&refcode=WFP2019-DD-NATL-GD-US-BCON&refcode2=78378217177---391022977133&utm_source=Google&utm_campaign=WFP2019-DD-NATL-GD-US-BCON&utm_term=-391022977133&utm_medium=Display&gclid=EAIaIQobChMIpMLsjry95QIVQqFRCh3slA_eEAEYASAAEgLLc_D_BwE
  14. I use Perl's URI::Find to find links in the text and convert them to <a href=...>...</a> tags, but it doesn't recognize the | delimiter, so I end up with:
  15. <a href="https://example.com/foo/bar">https://example.com/foo/bar</a>|pcrid|391022977133|pkw||pmt||pdv|m|slid||product||pgrid|78378217177|ptaid||&pgrid=78378217177&ptaid=&source=WFP2019-DD-NATL-GD-US-BCON&subsource=78378217177---391022977133&refcode=WFP2019-DD-NATL-GD-US-BCON&refcode2=78378217177---391022977133&utm_source=Google&utm_campaign=WFP2019-DD-NATL-GD-US-BCON&utm_term=-391022977133&utm_medium=Display&gclid=EAIaIQobChMIpMLsjry95QIVQqFRCh3slA_eEAEYASAAEgLLc_D_BwE
  16. And since the rest of that isn't recognized as part of the link, my system doesn't remove any of those parameters, including the parts that are actually delimited by & (and it would usually remove all of them).
  17. I'm kind of at a loss on how to handle this one. I could use a regex to find http, followed by anything that's not a space, until it gets to a |, and then remove everything after and including that |. That's a bit dangerous since someone could realistically use a | in the parameter value that I wouldn't want to remove, though.
  18. I guess that the regex would look something like:
  19. $text =~ s#\b(https?://[^\s])\|[^\s]*\b#$1#i;
  20. What do you all think?
  21. I say: Crikey.
  22. [^\s]
  23. ==
  24. \S
  25. unless there's something I am overlooking.
  26. Did you mean [^\s]+ (i.e. \S+) ? That was my own question mark, but in fact you'd need to say
  27. \S+?\|
  28. in order to stop as soon as possible, i.e. before the first | character if there's more than one of them.
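The correction discussed above can be sketched as a complete substitution. This is only a sketch under assumptions: strip_pipe_tail is a hypothetical helper name, the sample text is made up, and it assumes that everything from the first | to the next whitespace is unwanted, which is exactly the risk raised in the original question.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): cut each URL at its first '|'
# and drop the rest of the non-whitespace run that follows it.
sub strip_pipe_tail {
    my ($text) = @_;
    # \S+? is lazy, so the capture stops just before the FIRST '|'.
    # \S*  then swallows the remainder of the address, '&' parameters included.
    $text =~ s{\b(https?://\S+?)\|\S*}{$1}gi;
    return $text;
}

# Made-up sample input, shortened from the address in the question.
my $in = 'See https://example.com/foo/bar|pcrid|391022977133|pkw||&utm_source=Google for details';
print strip_pipe_tail($in), "\n";
```

Addresses that contain no | are left untouched, so this could run as a pre-pass before the text is handed to the link finder.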
  39. Seems like you'd want something like
  40. [\w/.,~-]+
  41. for the path part, to constrain it to things that can reasonably occur. If you could be certain that the | is the only no-no that will ever show up, the pattern would be more like
  42. [^\s?|]+([?|]blahblah)?
  43. where ? and | don't need to be escaped inside a character class, i.e. between square brackets (but it does no harm if you do escape them).
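As a rough illustration of the second pattern, the sketch below splits an address into the part before the first ? or | and whatever follows it, so the caller can decide what to do with the tail. The name split_url and the sample addresses are assumptions made for this example, and the "blahblah" placeholder is filled in with \S* purely as a guess at what was meant.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): split an address into
#   base  - scheme plus everything up to the first '?', '|' or whitespace
#   delim - the delimiter that ended the base ('?' or '|'), or '' if none
#   tail  - the rest of the non-whitespace run after the delimiter
# A stricter whitelist such as [\w/.,~-]+ could be swapped in for [^\s?|]+.
sub split_url {
    my ($url) = @_;
    if ($url =~ m{^(https?://[^\s?|]+)(?:([?|])(\S*))?}i) {
        # Groups 2 and 3 are undef when there is no delimiter at all.
        return ($1, $2 // '', $3 // '');
    }
    return;    # empty list: not an http(s) address
}

for my $url ('https://example.com/foo/bar|pcrid|1&utm_source=Google',
             'https://example.com/a?x=1') {
    my ($base, $delim, $tail) = split_url($url);
    print "base=$base delim=$delim tail=$tail\n";
}
```

Keeping the delimiter and tail separate, rather than deleting them outright, leaves room to treat a | tail differently from an ordinary ? query string.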
  44. I suppose you have already considered the possibility of telling your code to disregard (i.e. not make links from) URLs that don't follow the rules :( Your forum members really are an unruly bunch, aren't they?
  45. Disclaimer: I am about to disappear for at least 24 hours, possibly longer (thank you very much, PG&E), so if I said something hopelessly misleading you will have to remain misled.
  46. Yikes! 24 hours without lucy24!
  47. (get those folks out there to solve the power problems!)
